How does one write efficient Dynamic Programming algorithms in Haskell?

Question

I've been playing around with dynamic programming in Haskell. Practically every tutorial I've seen on the subject gives the same, very elegant algorithm based on memoization and the laziness of the Array type. Inspired by those examples, I wrote the following algorithm as a test:

import Data.Array (listArray, (!))

-- pascal n returns the nth entry on the main diagonal of Pascal's triangle
-- (mod a million for efficiency)
pascal :: Int -> Int
pascal n  = p ! (n,n) where
           p = listArray ((0,0),(n,n)) [f (i,j) | i <- [0 .. n], j <- [0 .. n]]

           f :: (Int,Int) -> Int
           f (_,0) = 1
           f (0,_) = 1
           f (i,j) = (p ! (i, j-1) + p ! (i-1, j)) `mod` 1000000

My only problem is efficiency. Even using GHC's -O2, this program takes 1.6 seconds to compute pascal 1000, which is about 160 times slower than an equivalent unoptimized C++ program. And the gap only widens with larger inputs.

It seems like I've tried every possible permutation of the above code, along with suggested alternatives like the data-memocombinators library, and they all had the same or worse performance. The one thing I haven't tried is the ST monad, which I'm sure could be made to run the program only slightly slower than the C version. But I'd really like to write it in idiomatic Haskell, and I don't understand why the idiomatic version is so inefficient. I have two questions:

1. Why is the above code so inefficient? It seems like a straightforward iteration through a matrix, with an arithmetic operation at each entry. Clearly Haskell is doing something behind the scenes I don't understand.
2. Is there a way to make it much more efficient (at most 10-15 times the runtime of a C program) without sacrificing its stateless, recursive formulation (vis-a-vis an implementation using mutable arrays in the ST monad)?

Thanks a lot.

Edit: The array module used is the standard Data.Array

How does the performance compare if you just use "f (i,j) = (f (i, j-1) + f (i-1,j))" and ditch p entirely? I don't understand how going through p is supposed to help, though I admit I'm not very experienced with Haskell.
@DGH: The point of the array is to only calculate each result once. Without the array the algorithm would be brute force - not DP.
The first thing I notice is the gratuitous tuples, instead of a multi-argument function.

Louis Wasserman · Accepted Answer · 2012-06-06 00:14:48Z


Well, the algorithm could be designed a little better. Using the vector package and being smart about only keeping one row in memory at a time, we can get something that's idiomatic in a different way:

{-# LANGUAGE BangPatterns #-}
import Data.Vector.Unboxed
import Prelude hiding (replicate, tail, scanl)

pascal :: Int -> Int
pascal !n = go 1 ((replicate (n+1) 1) :: Vector Int) where
  go !i !prevRow
    | i <= n    = go (i+1) (scanl f 1 (tail prevRow))
    | otherwise = prevRow ! n
  f x y = (x + y) `rem` 1000000

This optimizes down very tightly, especially because the vector package includes some rather ingenious tricks to transparently optimize array operations written in an idiomatic style.

edited Jun 6 at 0:14

answered Jun 6 at 0:06


	Don't forget the modulus, that's what takes the most time in this. – Daniel Fischer Jun 6 at 0:12
	Hmmmm. I am not convinced that the modulus took more time than the lazy thunk overhead in the original implementation, but I'll grant that it'll be the bottleneck in this implementation. – Louis Wasserman Jun 6 at 0:15
	In the original, the modulus is not a big deal. But when dealing with fairly optimised vector/STUArray algorithms, it is. Your code ran (for n = 4000) in 0.04s here without the modulus, in 0.26s with. – Daniel Fischer Jun 6 at 0:17
	That matches my assessment as well; sorry for the confusion. – Louis Wasserman Jun 6 at 0:32
	Well, I can't claim that "in this" was particularly unambiguous. – Daniel Fischer Jun 6 at 0:37


Daniel Wagner · Answer 2 · 2012-06-06 01:23:40Z

The trick is to think about how to write the whole damn algorithm at once, and then use unboxed vectors as your backing data type. For example, the following runs about 20 times faster on my machine than your code:

import qualified Data.Vector.Unboxed as V

combine :: Int -> Int -> Int
combine x y = (x+y) `mod` 1000000

pascal :: Int -> Int
pascal n = V.last $ go n where
    go 0 = V.replicate (n+1) 1
    go m = V.scanl1 combine (go (m-1))

I then wrote two main functions that called out to yours and mine with an argument of 4000; these ran in 10.42s and 0.54s respectively. Of course, as I'm sure you know, they both get blown out of the water (0.00s) by the version that uses a better algorithm:

pascal' :: Integer -> Integer
pascal' n = product [n+1..n*2] `div` product [2..n]

pascal :: Int -> Int
pascal = fromIntegral . (`mod` 1000000) . pascal' . fromIntegral
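As a sanity check on the closed form, here is a minimal sketch that compares it against a one-row-at-a-time recurrence written with plain lists (so it needs only base, not the vector package). The names pascalRows, central and agree are made up for this illustration, and the list version is meant for small inputs only, not for speed.

```haskell
-- One row at a time, using plain lists and exact Integer arithmetic.
-- Row 0 is all ones; each later row is the running sum of the previous row.
pascalRows :: Int -> Integer
pascalRows n = last (iterate (scanl1 (+)) (replicate (n + 1) 1) !! n)

-- Closed form: the central binomial coefficient (2n)! / (n! * n!).
central :: Integer -> Integer
central n = product [n + 1 .. 2 * n] `div` product [2 .. n]

-- Both versions should agree once reduced modulo a million.
agree :: Int -> Bool
agree n = pascalRows n `mod` 1000000 == central (fromIntegral n) `mod` 1000000
```

Note that the closed form has to do the division on exact Integers and reduce modulo a million only at the end, which is why it goes through Integer rather than Int.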

Daniel Fischer · Answer 3 · 2012-06-06 01:21:35Z

1. Why is the above code so inefficient? It seems like a straightforward iteration through a matrix, with an arithmetic operation at each entry. Clearly Haskell is doing something behind the scenes I don't understand.

The problem is that the code writes thunks to the array. Then when entry (n,n) is read, the evaluation of the thunks jumps all over the array again, recurring until finally a value not needing further recursion is found. That causes a lot of unnecessary allocation and inefficiency.
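To make the "thunks in the array" point concrete, here is a small sketch using the standard Data.Array (the names lazyCells and selfRef are invented for this example). A boxed array stores unevaluated expressions, so a cell that is never read is never computed, and a cell may even refer to other cells of the array it lives in. That is exactly what makes the memoization trick work, and also what it costs.

```haskell
import Data.Array (Array, listArray, (!))

-- The middle cell would crash if it were ever evaluated, but building the
-- array and reading the other cells is fine: the cell is stored as a thunk.
lazyCells :: Array Int Int
lazyCells = listArray (0, 2) [1, error "never forced", 3]

-- Cells defined in terms of other cells of the same array. Reading
-- selfRef ! 10 forces a chain of thunks back down to the base cases.
selfRef :: Array Int Integer
selfRef = listArray (0, 10) [cell i | i <- [0 .. 10]]
  where
    cell 0 = 0
    cell 1 = 1
    cell i = selfRef ! (i - 1) + selfRef ! (i - 2)
```

An unboxed array (UArray, STUArray, or an unboxed Vector) cannot hold thunks at all, so neither definition would work there; every element is a plain machine value.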

The C++ code doesn't have that problem: the values are written and read directly, without requiring further evaluation. The same would happen with an STUArray. Does

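-- needs: import Control.Monad (forM_)
--        import Data.Array.ST (runSTUArray, newArray, readArray, writeArray)
p = runSTUArray $ do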
    arr <- newArray ((0,0),(n,n)) 1
    forM_ [1 .. n] $ \i ->
        forM_ [1 .. n] $ \j -> do
            a <- readArray arr (i,j-1)
            b <- readArray arr (i-1,j)
            writeArray arr (i,j) $! (a+b) `rem` 1000000
    return arr

really look so bad?
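For anyone who wants to try it, here is one way the fragment could be wrapped into a complete function. This is a sketch under assumptions: the name pascalST and the local name table are invented here, the imports are the ones I believe the fragment needs, and n is assumed to be non-negative.

```haskell
import Control.Monad (forM_)
import Data.Array.ST (newArray, readArray, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, (!))

-- Fill the table with mutable, unboxed writes, then freeze it and read (n,n).
-- Assumes n >= 0; a negative n gives empty bounds and an index error.
pascalST :: Int -> Int
pascalST n = table ! (n, n)
  where
    table :: UArray (Int, Int) Int
    table = runSTUArray $ do
      -- Every cell starts at 1, which covers row 0 and column 0.
      arr <- newArray ((0, 0), (n, n)) 1
      forM_ [1 .. n] $ \i ->
        forM_ [1 .. n] $ \j -> do
          a <- readArray arr (i, j - 1)
          b <- readArray arr (i - 1, j)
          writeArray arr (i, j) $! (a + b) `rem` 1000000
      return arr
```

From the outside pascalST is still a pure function of type Int -> Int; the mutation is confined to the runSTUArray block.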

2. Is there a way to make it much more efficient (at most 10-15 times the runtime of a C program) without sacrificing its stateless, recursive formulation (vis-a-vis an implementation using mutable arrays in the ST Monad)?

I don't know of one. But there might be.

Addendum:

Once one uses STUArrays or unboxed Vectors, there's still a significant difference to the equivalent C implementation. The reason is that gcc replaces the % by a combination of multiplications, shifts and subtractions (even without optimisations), since the modulus is known. Doing the same by hand in Haskell (since GHC doesn't [yet] do that),

-- needs: import Data.Bits (shiftR)
-- fast modulo 1000000
-- for nonnegative Ints < 2^31
-- requires 64-bit Ints
fastMod :: Int -> Int
fastMod n = n - 1000000*((n*1125899907) `shiftR` 50)

gets the Haskell versions on par with C.
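A quick way to convince yourself that the trick is safe is to compare it with rem over a spread of inputs. The sketch below repeats fastMod so that it stands alone; magic, agreesWithRem and samples are names invented for this check, and it assumes a 64-bit Int as stated above.

```haskell
import Data.Bits (shiftR)

-- fast modulo 1000000, for nonnegative Ints < 2^31 (requires 64-bit Ints)
fastMod :: Int -> Int
fastMod n = n - 1000000 * ((n * 1125899907) `shiftR` 50)

-- The multiplier is 2^50 / 10^6 rounded up, so multiplying by it and
-- shifting right by 50 bits approximates division by a million.
magic :: Int
magic = (2 ^ (50 :: Int) + 999999) `div` 1000000

agreesWithRem :: Int -> Bool
agreesWithRem n = fastMod n == n `rem` 1000000

-- Edge cases around multiples of a million, plus a coarse sweep of the range.
samples :: [Int]
samples =
  [0, 1, 999999, 1000000, 1000001, 1999998, 2 ^ (31 :: Int) - 1]
    ++ [0, 999983 .. 2 ^ (31 :: Int) - 1]
```

Evaluating all agreesWithRem samples in GHCi is a spot check, not a proof. In the DP itself the argument is a sum of two residues, so it stays below 2000000 and well inside the stated range.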

I don't think this is a really helpful answer. The questioner stated that they knew an STUArray approach would be more efficient, but wanted to know if an approach commonly used in tutorials could ever be made efficient. This answer didn't answer either of their questions. I think it is an interesting question, as the program does run very slowly. It doesn't give much credit to the technique they showed if it runs as slowly as it does. For comparison, I wrote a Ruby version with the same algorithm that is only twice as slow as the GHC version compiled with -O2!
The answer explains why the approach is slow. I think that's important to understand.
Yes true. I suppose the real answer to this question is quite possibly "The technique shown using listArray is inherently inefficient", which is an important observation (since it makes the technique useless for most problems it is used on).
+1 + 29,999 = 30,009. :) "Does" (...runSTU...) "really look so bad?" Yes. I would much rather write in C/pythonic notation and have Haskell figure out its monadic translation by itself. It does figure out the types, why not monads? See e.g. this mutable map-based python primes generator for a clear syntax. Imagine what the Haskell monadic code translation would look like.
Well, @Will, it sure would be convenient if the compiler could figure out how to make the code efficient by itself. But that ain't gonna happen anytime soon. So for the time being, you have to help. The Python thingy suffers from the same problem, it's nice and short, but when it comes down to production, it's dog slow (maybe the dog is much smaller with CPython or PyPy or so, dunno what they can do). If you want it fast, you have to tell the compiler how to do it in more detail, in any language.

asked	5 months ago
viewed	831 times
active	4 months ago
