19
votes
2 answers
3k views

How do I make an already written concurrent program run on a GPU array?

I have a neural network written in Erlang, and I just bought a GeForce GTX 260 card with a 240 core GPU on it. Is it trivial to use CUDA as glue to run this on the graphics card?
2
votes
1 answer
722 views

For nested loops with CUDA

I'm having a problem with some nested for loops that I have to convert from C/C++ to CUDA. Basically, I have 4 nested for loops that share the same array and perform bit-shift operations. ...
4
votes
2 answers
2k views

How to measure the execution time of every block when using CUDA?

clock() is not accurate enough.
2
votes
1 answer
267 views

Does early exiting a thread disrupt synchronization among CUDA threads in a block?

I am implementing an image processing algorithm with CUDA and I have some questions about thread synchronization in general. The problem at hand can be explained like this: we have an ...
1
vote
1 answer
568 views

Shared memory matrix multiplication kernel

I am attempting to implement a shared memory based matrix multiplication kernel as outlined in the CUDA C Programming Guide. The following is the kernel: __global__ void matrixMultiplyShared(float * ...
0
votes
1 answer
246 views

Dynamic programming in CUDA: global memory allocations to exchange data with child kernels

I have the following code: __global__ void interpolation(const double2* __restrict__ data, double2* __restrict__ result, const double* __restrict__ x, const double* __restrict__ y, const int N1, ...
115
votes
16 answers
7k views

Why aren't we programming on the GPU? [closed]

So I finally took the time to learn CUDA and get it installed and configured on my computer and I have to say, I'm quite impressed! Here's how it does rendering the Mandelbrot set at 1280 x 678 ...
16
votes
16 answers
3k views

What future does the GPU have in computing? [closed]

Your CPU may be a quad-core, but did you know that some graphics cards today have over 200 cores? We've already seen what the GPUs in today's graphics cards can do when it comes to graphics. Now they ...
9
votes
2 answers
1k views

Python Multiprocessing with PyCUDA

I've got a problem that I want to split across multiple CUDA devices, but I suspect my current system architecture is holding me back. What I've set up is a GPU class, with functions that perform ...
5
votes
3 answers
2k views

help me understand cuda

I am having some trouble understanding threads in the NVIDIA GPU architecture with CUDA. Could anybody please clarify the following: an 8800 GPU has 16 SMs with 8 SPs each, so we have 128 SPs. I was ...
5
votes
2 answers
564 views

What is the cheapest way to build an Erlang server farm (for a hobby project)? [closed]

Let's say we have an 'intrinsically parallel' problem to solve with our Erlang software. We have a lot of parallel processes and each of them executes sequential code (not number crunching) and the ...
4
votes
3 answers
370 views

CUDA, NPP Filters

The CUDA NPP library supports image filtering with the nppiFilter_8u_C1R function, but I keep getting errors. I have no problem getting the boxFilterNPP sample code up and running. eStatusNPP = ...
3
votes
3 answers
1k views

CUDA - Implementing Device Hash Map?

Does anyone have any experience implementing a hash map on a CUDA Device? Specifically, I'm wondering how one might go about allocating memory on the Device and copying the result back to the Host, ...
2
votes
1 answer
181 views

vector step addition slower on cuda

I am trying to run the vector step addition function in CUDA C++ code, but even for large float arrays of 5,000,000 elements, it runs slower than my CPU version. Below is the relevant CUDA and CPU code ...
1
vote
1 answer
1k views

Realistic deadlock example in CUDA/OpenCL

For a tutorial I'm writing, I'm looking for a "realistic" and simple example of a deadlock caused by ignorance of SIMT / SIMD. I came up with this snippet, which seems to be a good example. Any ...
