It's a little baffling to me why the CUDA code runs about twice as slow as the CPU version. I'm just counting all the primes from 0 to 512 * 512 * 512. The CPU version finished in about 97 seconds, whereas the GPU version took 182 seconds.
- CPU: Intel Core i7 @ 4 GHz
- GPU: NVIDIA GTX 960
Any ideas why?
```cpp
#include <cuda.h>
#include <iostream>
#include <cstdint>
#include <stdio.h>
#include <ctime>
#include <cmath>
#include <vector>
#include <cstdlib>
#include <climits>

using namespace std;

// Trial division up to sqrt(n); shared by the CPU and GPU versions.
__host__ __device__ bool is_prime(uint32_t n)
{
    if(n < 2)          // 0 and 1 are not prime
        return false;
    if(n == 2)
        return true;
    if(n % 2 == 0)
        return false;
    uint32_t sr = sqrtf(n);
    for(uint32_t i = 3; i <= sr; i += 2)
        if(n % i == 0)
            return false;
    return true;
}

// One thread per candidate n; each prime increments a single global counter.
__global__ void prime_sum(unsigned int* count)
{
    // flat index over a 512 x 512 grid of 512-thread blocks
    uint32_t n = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
    if(is_prime(n))
        atomicAdd(count, 1);
}
```
CPU version
```cpp
int main()
{
    time_t start = time(0);

    int pcount = 0;
    for(uint32_t i = 0; i < (512 * 512 * 512); i++)
    {
        if(is_prime(i)) pcount++;
    }

    start = time(0) - start;    // elapsed wall-clock seconds
    std::cout << pcount << "\t" << start << std::endl;
    return 0;
}
```
CUDA version
```cpp
int main()
{
    time_t start = time(0);

    unsigned int* sum_d;
    cudaMalloc(&sum_d, sizeof(unsigned int));
    cudaMemset(sum_d, 0, sizeof(unsigned int));

    // 512 x 512 blocks of 512 threads = 512^3 candidates, one per thread
    prime_sum<<< dim3(512, 512), 512 >>>(sum_d);

    // cudaMemcpy blocks until the kernel has finished
    unsigned int sum = 0;
    cudaMemcpy(&sum, sum_d, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    start = time(0) - start;    // elapsed wall-clock seconds
    std::cout << sum << "\t" << start << std::endl;

    cudaFree(sum_d);
    return 0;
}
```
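
For what it's worth, here is a rough sketch (I haven't run this yet) of how I think just the kernel could be timed with CUDA events, to take the one-second resolution of time(0) and the CUDA context-creation cost out of the GPU measurement. It would replace the bare kernel launch in the CUDA main above:

```cpp
// Sketch: time only the prime_sum kernel with CUDA events.
// Assumes sum_d has already been allocated and zeroed as above.
cudaEvent_t start_ev, stop_ev;
cudaEventCreate(&start_ev);
cudaEventCreate(&stop_ev);

cudaEventRecord(start_ev);
prime_sum<<< dim3(512, 512), 512 >>>(sum_d);
cudaEventRecord(stop_ev);
cudaEventSynchronize(stop_ev);                  // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start_ev, stop_ev);   // elapsed time in milliseconds
std::cout << "kernel time: " << ms << " ms" << std::endl;

cudaEventDestroy(start_ev);
cudaEventDestroy(stop_ev);
```

I doubt setup overhead accounts for the whole 85-second gap, though, so any pointers on the kernel itself would also be appreciated.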