CUDA is a parallel computing platform and programming model for Nvidia GPUs (Graphics Processing Units). CUDA provides an interface to Nvidia GPUs through a variety of programming languages, libraries, and APIs.
15
votes
2 answers
2k views
Generating prime numbers using Sieve of Eratosthenes with CUDA
I'm learning CUDA and wrote a little program which generates prime numbers using the Sieve of Eratosthenes. (I know the limitations of CUDA, especially with memory sizes and limits, but this program is ...
15
votes
1 answer
248 views
C and CUDA: circular buffer implementation
I have a programme which uses many circular buffers in an identical fashion on a CPU and GPU (C and C/C++ CUDA). I essentially require many queues; however, due to this being run on a GPU, I have ...
7
votes
1 answer
295 views
3D vector CUDA kernel
I designed this CUDA kernel to compute a function on a 3D domain:
p and Ap are 3D vectors that are actually implemented as a ...
6
votes
2 answers
437 views
Convert a 24bit bitmap to grayscale
I wrote this so I can learn CUDA. This is coded to work on my laptop's Nvidia GeForce GT 540M.
Main points I need reviewed:
CUDA programming conventions
Performance, especially kernel speed
C ...
5
votes
1 answer
255 views
Implementation of AES using CUDA
I am trying to implement AES on a GPU using CUDA. I use 4 TBoxes in my implementation that requires 4 KB of GPU memory. I have used a 1 KB array for 1 KB of plaintext. First, all of the plaintext would ...
5
votes
1 answer
154 views
CUDA Kernel - Neural Net
I'm building a spiking neural net (recurrent, integrate and fire), and I'm curious about how to reduce the warp divergence (and other problems) I may have.
Here's an example with a few hand-placed ...
4
votes
2 answers
298 views
Calculating neurons and derivatives
This function runs very often. cudaMemcpy is at the start and works very slowly. How can I change this function to avoid this? I already have ...
4
votes
2 answers
230 views
Calculating sum of primes using the CPU and GPU
It is a little baffling to me why the CUDA code runs about twice as slow as the CPU version. I am just counting all the primes from 0 to (512 * 512 * 512). The CPU version executed in about 97 ...
4
votes
1 answer
47 views
A “policy-based” design for a generic CUDA kernel
I am faced with a design issue that has been discussed on SO several times, the most similar question being this one. Essentially, I want polymorphism for a CUDA kernel in the form of a "generic" ...
3
votes
0 answers
53 views
Parallel reduction by key implementations
I have an implementation of the reduction approach used in this document. Furthermore, I (crudely) extended this so that I can reduce by key.
In my setup I can assume that a ...
2
votes
0 answers
58 views
Calculating the distance between several spatial points
I am developing a CUDA program and I want to improve its performance. I have a kernel function which is consuming more than 70% of the execution time. The kernel calculates the distance between several ...
1
vote
1 answer
60 views
CUDA brute force 48-bit key
I have a cryptographic function with two 24-bit keys.
I have two blocks of input and two blocks of output, and want to brute force the keys using CUDA.
Overview:
The function is composed of two ...
1
vote
0 answers
69 views
Unwrapping multiple inner loops in CUDA for 4D nonlocal filter
I'm working on some sort of non-local means filtering in 4D space (x,y,z + time). The idea is to pass a chunk of a large 4D array to the GPU in order to process it and return a filtered 3D slice (then ...
1
vote
0 answers
36 views
CUDA C Matrix Compression
I am using CUDA to learn and implement a CSR matrix compression algorithm. What can I do better with regard to C best practices?
main.c:
...