Tell me more ×
Programmers Stack Exchange is a question and answer site for professional programmers interested in conceptual questions about software development. It's 100% free, no registration required.

I have a list of objects that all require the exact same filtering, basically a set of conditionals in a function which outputs if the object is "good" or "bad". I want to keep all of my "good" objects in the list. Lets say I have 1000 objects in this list and the function I want to filter on in written. What languages/functionality would maximize how parallel this operation could be? I have played with python's multiprocessing, which does improve performance, but still left me unsatisfied. I suppose this is an "embarrassingly parallel" problem as no state needs to be shared.

share|improve this question
What was the performance improvement when using multiprocessing? Why were you unsatisfied? What are your performance constraints? What sort of performance characteristics does your filter function have? 1000 items isn't very many, so either you have tight performance constraints or your filter function isn't very fast. Is your filter function I/O bound or CPU bound? – Greg Hewgill Apr 30 at 0:38
I have tight performance constraints, around 20 milliseconds. I would say it is CPU bound because everything is in the list that I need to process (maybe that isn't a good assumption?). On my computer, which is 4 cores 2.1 Gz, I saw a performance gain of about 3x, which is a great improvement, but I just wonder why it cant be many x faster if I'm executing the same operation on all of these items. – tonyl7126 Apr 30 at 0:49
2  
Your CPU still has to do the work, I don't know why you would expect "many x faster". With 4 cores and a CPU-bound task, a 4x performance improvement would be the maximum possible, and I would expect something like 3x as more of a real-world improvement because of the extra work involved in coordinating the parallelism. – Greg Hewgill Apr 30 at 1:00
you can also be memory bound (this happens with a lot of cache misses), for some speedup you can profile the filter so the more common early return are done first – ratchet freak Apr 30 at 11:03

1 Answer

up vote 2 down vote accepted

I'm not entirely sure as to what to answer to this question, but two immediate possibilities come to my mind:

Scala

As you are already describing the problem in a functional way and explicitly state that you have no shared state, the parallel collection API of Scala seems like a perfect match for the problem. It may not provide top performance though (see below for that), but will certainly require only a tiny amount of effort to setup. Something like the following is all you need really:

myList.par.filter(goodBadCondition)

The .par turns any collection into a parallelized one, which executes operations like filter in parallel using the JVM's fork/join-support. Note that the parallel collection operations are stable, i.e. the order of the original list elements remains the same in the result (for those elements that passed the condition).

GPU based

If you really want to squeeze out a maximum of performance, then letting the GPU perform the filtering seems the perfect way to go. In this case, most of your speed loss comes from having to send data to/from the GPU. While the operation you want to perform is a perfect match for something like Cuda, I am not entirely sure that it is worthwhile. For example, if your condition is very simple and extremely fast to check, then the overhead of having to transfer the whole list to the GPU may not pay off against a quad core CPU which can process the whole list in its memory.

On the other hand, if your filter condition is costly to check, then you can check way more elements at the same time as compared to a CPU. Depending on the card you use, you can get from several dozens up to hundreds of GPU cores to perform the work.

However, the implementational effort of this approach is far higher than the one above. You also need to have a lot more knowledge about GPU programming and you will need a lot of code for such a supposedly trivial task (newer frameworks to simplify this are starting to appear though).

share|improve this answer
Frank, I actually just downloaded the cuda toolkit last night because it does seem like it could be a good fit for the task (never having done any GPU programming, though). As for Scala, could this also be applied to Clojure as well, I'm already familiar with Clojure's "pmap" function. Is there any reason Scala should be a better fit than Clojure? – tonyl7126 Apr 30 at 16:55
No reason at all. Scala is just my personal preference, but the above could be replaced by any collection library that supports transparent parallel higher-order functions. – Frank May 2 at 5:09

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.