
Could someone tell me why arrayfun is much faster than a for loop on the GPU? (Not on the CPU; a for loop is actually faster on the CPU.)

Arrayfun:

x = parallel.gpu.GPUArray(rand(512,512,64));
count = arrayfun(@(x) x^2, x);

And equivalent For loop:

for i=1:size(x,1)*size(x,2)*size(x,3)
  z(i)=x(i).^2;        
end

Is it because a for loop is not multithreaded on the GPU? Thanks.
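(A side note on the preallocation question raised in the comments below: growing z inside the loop forces a reallocation on every iteration, which can dominate the loop's runtime on its own. A sketch of the preallocated CPU version, using the same array shape as above:)

```matlab
% Sketch (not from the original post): preallocate z so the loop
% only writes into existing memory instead of growing the array.
x = rand(512, 512, 64);
z = zeros(size(x));        % preallocation
for i = 1:numel(x)
    z(i) = x(i)^2;
end
```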

Is the z(i) array preallocated? Also, just curious, what GPU are you using (e.g. NVIDIA GTX680, or some other model number)? – solvingPuzzles Feb 9 '13 at 19:30

I don't think your loops are equivalent. It seems you're squaring every element in an array with your CPU implementation, but performing some sort of count for arrayfun.

Regardless, I think the explanation you're looking for is as follows:

When run on the GPU, your code can be functionally decomposed -- into individual array cells, in this case -- and each cell squared separately. This works because for a given i, the value of x(i)^2 doesn't depend on the values in any of the other cells. What most likely happens is that the array gets decomposed into S buffers, where S is the number of stream processing units your GPU has. Each unit then computes the square of the data in each cell of its buffer. The results are copied back to the original array and returned to count.

Now don't worry: if you're actually counting things, as the variable name count suggests, a similar thing happens. The algorithm most likely partitions the array into similar buffers and, instead of squaring each cell, adds the values together. You can think of the result of this first step as a smaller array to which the same process can be applied recursively to combine the new sums.
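The buffered-reduction idea described above can be sketched in plain MATLAB (the buffer count S here is illustrative, not from the post):

```matlab
% Sketch: partition the data into S buffers, reduce each buffer
% independently (the parallelizable step), then combine the partial
% results in a final, much smaller reduction.
data = rand(1, 4096);
S = 16;                          % e.g. number of stream processing units
buf = reshape(data, [], S);      % one column per buffer
partial = sum(buf, 1);           % each column reduced independently
total = sum(partial);            % final reduction over the partial sums
```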

Thanks David, very clear. This is probably part of what's happening, but I guess there must be something else too. On the GPU, the speed gain of arrayfun over a for loop is >100x, and the GPU I am using only has 16 compute units (with 32 cores each) ... I will do some more testing. – Maiss Apr 14 '12 at 5:03
    
Keep in mind, your GPU is not identical to your processor either. In short, your GPU is optimized for certain types of calculations (such as floating point operations and very quick integer arithmetic). Further, the memory timing is faster on most current GPUs. The speed increase may not only be attributed to the parallelism, but to the relative locality of the GPU memory. You can do horrible things to the running time of array walking algorithms on a CPU if you access the elements in a bad (cache exploding) order. – dcow Apr 14 '12 at 5:13
@Maiss: I've heard that 100 to 1 number a lot for GPU/CPU difference, in the event that the GPU can be properly used. This is definitely one of those times. – PearsonArtPhoto Apr 14 '12 at 12:50
    
Actually it is ~3000-4000x (with an i7 950 and a GTX 580), which I doubt is the true GPU/CPU difference. The problem must come from the way for/arrayfun operate. I can tell that arrayfun (the GPU implementation) may distribute the work equally across all the multiprocessing units (16) and then across the cores (32 for each MP unit). But that still doesn't explain the difference. Perhaps arrayfun uses some low-level C code. – Maiss Apr 14 '12 at 17:34
    
@Maiss The functions are almost certainly compiled to C code first. What we're saying is you shouldn't be so surprised at the difference. That's why there's so much buzz about GPU computing right now. – dcow Apr 14 '12 at 17:47

As per the reference page here http://www.mathworks.co.uk/help/toolbox/distcomp/arrayfun.html, "the MATLAB function passed in for evaluation is compiled for the GPU, and then executed on the GPU". In the explicit for loop version, each operation is executed separately on the GPU, and this incurs overhead - the arrayfun version is one single GPU kernel invocation.
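A rough way to see the single-kernel effect is to time the fused arrayfun version against a built-in vectorized operation, both of which compile to one GPU kernel invocation (a sketch; gputimeit requires a newer MATLAB release than the one in the original post, and actual timings will vary by hardware):

```matlab
% Sketch: both expressions below run as a single GPU kernel, so they
% avoid the per-element launch overhead the explicit for loop incurs.
x = gpuArray(rand(512, 512, 64));

tArrayfun   = gputimeit(@() arrayfun(@(v) v^2, x));  % one compiled kernel
tVectorized = gputimeit(@() x.^2);                   % also a single kernel
```

An element-by-element loop over the same gpuArray would instead launch a tiny GPU operation per iteration, which is where the large slowdown comes from.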
