How much faster can an algorithm written in CUDA or OpenCL run compared to a single general-purpose processor core (assuming the algorithm is written and optimized for both the CPU and GPU targets)?
I know it depends on both the graphics card and the CPU, but say we compare one of NVIDIA's fastest GPUs with (a single core of) an Intel i7 processor.
And I know it also depends on the type of algorithm.
I do not need a strict answer, just examples from experience, e.g.: an image manipulation algorithm using double-precision floating point and 10 operations per pixel previously took 5 minutes on the CPU and now runs in x seconds on this hardware.
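To make the question concrete, here is a minimal sketch (in CUDA, as one example) of the kind of per-pixel, double-precision workload I have in mind. The `perPixel` function, image size, and launch configuration are just placeholders I made up for illustration, not an actual algorithm I need benchmarked:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Placeholder workload: roughly 10 double-precision operations per pixel.
__host__ __device__ inline double perPixel(double v) {
    double r = v * 0.5 + 0.25;
    r = r * r - v;
    r = r / 3.0 + v * 0.125;
    r = (r - 1.0) * 2.0;
    return r;
}

// GPU version: one thread per pixel.
__global__ void gpuKernel(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = perPixel(in[i]);
}

// CPU version: plain single-threaded loop over all pixels.
void cpuVersion(const double* in, double* out, int n) {
    for (int i = 0; i < n; ++i) out[i] = perPixel(in[i]);
}

int main() {
    const int n = 4096 * 4096;              // e.g. a 16-megapixel image
    const size_t bytes = n * sizeof(double);
    double* hIn  = (double*)malloc(bytes);
    double* hOut = (double*)malloc(bytes);
    for (int i = 0; i < n; ++i) hIn[i] = i * 1e-6;

    // CPU reference run (this is what I would time on a single i7 core).
    cpuVersion(hIn, hOut, n);

    // GPU run (this is what I would time on the graphics card,
    // with or without the transfer time included).
    double *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);
    gpuKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaDeviceSynchronize();
    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);

    printf("sample output: %f\n", hOut[n / 2]);
    cudaFree(dIn); cudaFree(dOut);
    free(hIn); free(hOut);
    return 0;
}
```

So for something along these lines, what order of magnitude of speedup have people actually seen in practice?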