Tagged Questions
0
votes
0answers
7 views
Is there any good tutorial or reference for writing code with Magma?
Currently I am trying to use Magma to do matrix operations on the GPU; however, I have found little documentation for it. The only things I can refer to are its test programs and the online generated documentation (here), ...
0
votes
1answer
24 views
How does multithreading in GPUs work?
How does a GPU handle multithreading?
In CPUs for example there will be independent copies of the Register File for each thread. But with large register files as in GPUs that will be impossible. So ...
0
votes
0answers
5 views
Running AMD GPU Assembly
I am trying to run AMD GPU Assembly on my PC. I am using Ubuntu 12.04 64-bit and Windows 7 Ultimate. I am using a 6XX GPU. Please tell me how to run it. Links to good resources would also be helpful. If you can ...
2
votes
1answer
41 views
CUDA: How does Thrust manage memory when using a Comparator in a sorting function?
I have a char array of 10 characters that I would like to pass as an argument to a comparator which will be used by Thrust's sorting function.
In order to allocate memory for this array I use ...
2
votes
1answer
59 views
Is it possible to deallocate memory for the N last elements of a thrust::device_vector without using resize?
I'm using a device_vector in order to store information about an array of user input data. This information is necessary in order to speed things up when I call the second kernel, which runs the main ...
3
votes
5answers
259 views
High level GPU programming in C++
I've been looking into libraries/extensions for C++ that will allow GPU-based processing on a high level. I'm not an expert in GPU programming and I don't want to dig too deep. I have a neural network ...
1
vote
2answers
69 views
Am I right thinking that modern consumer graphics cards use exactly the same GPU structures for actual graphics rendering and bare computations?
Am I right thinking that modern consumer graphics cards (say those conventional nVidia and ATi models) use exactly the same GPU structures and operations for actual graphics rendering (through ...
0
votes
1answer
103 views
Cuda Kernel with reduction - logic errors for dot product of 2 matrices
I am just starting off with CUDA and am trying to wrap my brain around the CUDA reduction algorithm. In my case, I have been trying to get the dot product of two matrices. But I am getting the right ...
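For readers unfamiliar with the reduction pattern mentioned here, the following is a minimal CPU-side sketch of a tree reduction applied to a dot product. It is not the asker's kernel; the function name `dot_tree_reduce` and the padding to a power of two are assumptions made for illustration. The `partial` array stands in for shared memory, and each pass of the outer loop corresponds to one step that a GPU kernel would separate with a barrier.

```cpp
#include <cstddef>
#include <vector>

// Serial stand-in for a block-level tree reduction (hypothetical helper).
// Each "thread" i first stores a[i] * b[i]; partial sums are then combined
// in pairs with a stride that halves on every pass.
// Assumes a.size() == b.size().
float dot_tree_reduce(const std::vector<float>& a, const std::vector<float>& b) {
    const std::size_t n = a.size();
    if (n == 0) return 0.0f;

    // Pad to the next power of two so the halving stride always pairs cleanly.
    std::size_t padded = 1;
    while (padded < n) padded *= 2;

    std::vector<float> partial(padded, 0.0f);  // plays the role of shared memory
    for (std::size_t i = 0; i < n; ++i) partial[i] = a[i] * b[i];

    // On a GPU, a barrier would separate each pass of this outer loop.
    for (std::size_t stride = padded / 2; stride > 0; stride /= 2) {
        for (std::size_t i = 0; i < stride; ++i) {
            partial[i] += partial[i + stride];
        }
    }
    return partial[0];
}
```

Comparing a kernel's output against a serial reference like this one is a common way to locate logic errors in the indexing or in the placement of barriers.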
0
votes
0answers
139 views
CUDA appears to be extremely slow
Taking my first steps in CUDA, I tried this simple example code, which runs perfectly fine but appears to be extremely slow. I compiled it using nvcc version 5.0 with the commands:
$ ...
0
votes
1answer
140 views
CUDA FFT exception
I'm trying to use the CUDA FFT (aka cufft) library.
The problem occurs when cufftPlan1d(..) throws an exception.
#define NX 256
#define BATCH 10
cufftHandle plan;
cufftComplex *data;
...
0
votes
0answers
32 views
GPU selection on sending image
I have a strange situation.
I installed 2 video cards in the same computer, and now I have to send images/frames through these video cards. But I don't have any idea how to select a GPU for sending my data ...
0
votes
0answers
85 views
Scattering on CUDA
I'm trying to implement the following:
for (unsigned int j = 0; j < numElems; ++j) {
    unsigned int bin = (input[j] & mask) >> offset;
    output[source[bin]] = input[j];
    source[bin]++;
...
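As a point of comparison for anyone parallelising this loop, here is a minimal serial sketch of the same scatter pass in C++. The excerpt does not show how `source` is initialised, so the histogram and exclusive prefix sum below are assumptions (this is how the starting offsets are typically obtained in a radix-style sort), and the function name `scatter_pass` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Serial reference for one scatter pass (hypothetical helper).
// `mask` and `offset` select the digit, as in the loop above.
// Assumption: `source` holds the starting output position of each bin,
// computed here as an exclusive prefix sum over a histogram of the digits.
std::vector<unsigned int> scatter_pass(const std::vector<unsigned int>& input,
                                       unsigned int mask, unsigned int offset) {
    // Largest possible bin index is (mask >> offset).
    const std::size_t numBins = static_cast<std::size_t>(mask >> offset) + 1;

    // Histogram: how many elements fall into each bin.
    std::vector<unsigned int> counts(numBins, 0);
    for (unsigned int v : input) ++counts[(v & mask) >> offset];

    // Exclusive prefix sum: first output slot of each bin.
    std::vector<unsigned int> source(numBins, 0);
    for (std::size_t b = 1; b < numBins; ++b) {
        source[b] = source[b - 1] + counts[b - 1];
    }

    // The scatter itself, identical to the loop in the question.
    std::vector<unsigned int> output(input.size());
    for (std::size_t j = 0; j < input.size(); ++j) {
        unsigned int bin = (input[j] & mask) >> offset;
        output[source[bin]] = input[j];
        source[bin]++;
    }
    return output;
}
```

The `source[bin]++` update is what makes the loop order-dependent: every element in a bin needs a distinct slot, so a parallel version has to compute each element's rank within its bin some other way.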
1
vote
2answers
350 views
Accessing GPU via web browser
I came across this proof of concept earlier today (on TechCrunch.com) and was blown away and intrigued as to how they had managed to accomplish the end result. They state that they don't use webGL or ...
-1
votes
1answer
85 views
reading cuda data in burst mode
I currently have CUDA code that is performing around 3-4x slower than CPU code.
I removed all extraneous CPU/GPU transfers so that most of the computation is being done on the GPU, and only the final ...
0
votes
1answer
96 views
error CL_OUT_OF_RESOURCES while reading back data in host memory while using atomic function in opencl kernel
I am trying to implement atomic functions in my OpenCL kernel. The multiple threads I create all try to write to a single memory location in parallel. I want them to perform serial execution on that ...
1
vote
1answer
145 views
CUDA not so fast against CPU with OpenMP?
I am trying to compute cross-correlation amongst 450 vectors each of size 20000.
While doing this on the CPU, I stored the data in a 2D matrix with rows=20000 and cols=450.
The serial code for the ...
1
vote
0answers
61 views
Number of working GPU SMs [closed]
Is it possible to monitor the number of SMs free at a given point in time? How is gpu_sm_speed calculated? Is that the average, or that of individual SMs (I guess the execution time of each SM can be ...
0
votes
1answer
204 views
GPU gives no performance improvement in Julia set computation
I am trying to compare performance in CPU and GPU. I have
CPU : Intel® Core™ i5 CPU M 480 @ 2.67GHz × 4
GPU : NVidia GeForce GT 420M
I can confirm that GPU is configured and works correctly with ...
0
votes
1answer
173 views
opencl local memory half threads from a group gets correct execution
I have written a kernel in OpenCL using local memory to get faster execution. This is the first time I am using local memory. My global_work_size = 16 and local_work_size = 8.
Opencl kernel: ...
0
votes
1answer
86 views
How to allocate all of the available shared memory to a single block in CUDA?
I want to allocate all the available shared memory of an SM to one block. I am doing this because I don't want multiple blocks to be assigned to the same SM.
My GPU card has 64KB (Shared+L1) memory. ...
0
votes
1answer
138 views
Race condition in opencl kernel threads
If multiple threads simultaneously write to a single memory location, there will be a race condition, right?
The same is happening in my case.
Consider a module from 'reduce.cl'
int i = ...
0
votes
0answers
203 views
Per vertex mesh deformation
I am doing a project where I want to have a vertex buffer (in OpenGL) whose vertices make up a mesh of an image, meaning that each pixel of the image consists of two triangles (a square ...
0
votes
2answers
756 views
Disable Force GPU rendering programmatically
I want to disable Force GPU rendering in my Android program. Currently I have to go to the device settings and disable it, but that is hard for my users.
1
vote
1answer
63 views
What should I consider when choosing a Video Card for GPGPU [closed]
What are the key things to consider when looking for a video card to be used with C++ AMP? I can't afford a high end compute dedicated GPU or workstation GPU so I'm looking at cards in the sub $600 ...
0
votes
1answer
175 views
Which is better? Loop inside kernel or looping kernels for CUDA GPU
Device GeForce GTX 680
In the program, I have a very long array to be processed inside the kernel (approx. 1 GB of integers). As needed, my array is divided into blocks sequentially with some ...
1
vote
1answer
74 views
For different runs, Previous Values are retained in global memory for kernel arguments for CUDA GPU
Device GeForce GTX 680
In my program, a value is copied from the host to a device variable using cudaMemcpy. I can see that previous values are retained in global memory across different executions of ...
1
vote
1answer
138 views
Optimizing Cuda kernel regarding normalisation of array
I'm trying to normalise the array as follows.
Pick the first two elements of the array, find the sum and divide them using that sum.
Do the same for rest of the elements.
It works fine. But when ...
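The pairwise normalisation described in the steps above can be sketched on the CPU as follows. This is a reference for checking results, not the asker's kernel; the name `normalize_pairs` is hypothetical, and the handling of a trailing unpaired element and of pairs that sum to zero (both left unchanged) is an assumption, since the excerpt does not say what should happen in those cases.

```cpp
#include <cstddef>
#include <vector>

// Normalises consecutive pairs in place (hypothetical helper):
// (x0, x1) -> (x0 / s, x1 / s) with s = x0 + x1.
// Assumptions: a trailing unpaired element is left unchanged, and pairs
// whose sum is zero are skipped to avoid dividing by zero.
void normalize_pairs(std::vector<float>& data) {
    for (std::size_t i = 0; i + 1 < data.size(); i += 2) {
        const float sum = data[i] + data[i + 1];
        if (sum == 0.0f) continue;
        data[i] /= sum;
        data[i + 1] /= sum;
    }
}
```

Because each pair is independent of every other pair, a GPU version could assign one thread per pair without any synchronisation between threads.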
3
votes
2answers
2k views
How to configure OpenCL in Visual Studio 2010 for NVIDIA's GPU on Windows?
I am using NVIDIA's GeForce GTX 480 GPU on the Windows operating system. I have already configured Visual Studio 2010 for CUDA 4.2. How do I configure OpenCL for NVIDIA's GPU in Visual Studio 2010?
...
2
votes
1answer
334 views
How many OpenCL registers has ATI Radeon HD 6750M and 6970M?
I cannot find any information about the number of registers in the ATI Radeon HD 6750M and 6970M GPUs. I want to optimize my OpenCL kernels to utilize as many processing units as possible, so I need to ...
-1
votes
1answer
79 views
Bad results with GPU program [closed]
I am not getting good results when solving an equation iteratively.
I am using a 2D array with "size_y" rows with "size_x" elements for each row.
The problem is that the code only does one iteration ...
0
votes
2answers
223 views
Optimization tips for a cuda code
I wrote a piece of code for computing the Self Quotient Image (SQI) in MATLAB, and now I want to rewrite part of it in parallel for a speedup.
This part of the code is:
siz=15;
X=normalize8(X);
...
0
votes
2answers
228 views
Cannot read out Values from Texture Memory
Hi, I'm writing a simple program to practice working with texture memory. I just want to write my data into texture memory and write it back into global memory. But I cannot read out the values. ...
3
votes
1answer
214 views
How does the speed of CUDA program scale with the number of blocks?
I am working on Tesla C1060, which contains 240 processor cores with compute capability 1.3. Knowing that each 8 cores are controlled by a single multi-processor, and that each block of threads is ...
0
votes
2answers
734 views
CUDA kernel doesn't launch
My problem is very much like this one. I run the simplest CUDA program but the kernel doesn't launch. However, I am sure that my CUDA installation is ok, since I can run complicated CUDA projects ...
0
votes
1answer
154 views
cuda invalid configuration error 9
I have a Cuda application; after first allocating cuda memory for various arrays the program loops through: transfer data to GPU, Process kernels on GPU, transfer data back from GPU. The first data ...
0
votes
1answer
156 views
Saving Values after Calculating with Texture Memory
Hi, I have a simple calculation using texture memory, but I am not able to save the right results.
The result should be an interpolation. For example, with angle = 0.5, A[0] = 1, B[0] = 2, result[0] should be ...
3
votes
4answers
254 views
Accuracy of GPU for scientific computing
An electrical engineer recently cautioned me against using GPUs for scientific computing (e.g. where accuracy really matters) on the basis that there are no hardware safeguards like there are in a ...
0
votes
2answers
724 views
Interpolation with CUDA Texture memory
I would like to use the texture Memory for Interpolation of Data. I have 2 Arrays and I would want to interpolate Data between them (between A[i] and B[i]). Now I thought I could bind them to texture ...
4
votes
4answers
553 views
Which Java code can be moved to the GPU?
With the Rootbeer framework, GPU programming in Java is possible.
Which Java code should be run with Rootbeer, and which code is better run in the Java VM itself?
Or, put differently: which code produce ...
3
votes
1answer
422 views
Linking with 3rd party CUDA libraries slows down cudaMalloc
It is not a secret that on CUDA 4.x the first call to cudaMalloc
can be ridiculously slow (which was reported several times), seemingly a bug in CUDA drivers.
Recently, I noticed weird behaviour: the ...
0
votes
2answers
189 views
Creating a copy of the buffer pointed by host ptr on the GPU from GPU kernel in OpenCL
I was trying to understand how exactly CL_MEM_USE_HOST_PTR and CL_MEM_COPY_HOST_PTR work.
Basically when using CL_MEM_USE_HOST_PTR, say in creating a 2D image, this will copy nothing to the device, ...
5
votes
4answers
368 views
GPU reads from CPU or CPU writes to the GPU?
I am a beginner in parallel programming. I have a query which might seem silly, but I didn't get a definitive answer when I googled it.
In GPU computing there is a device i.e. the GPU and ...
2
votes
1answer
180 views
About the compact operation in cudpp
The following kernel function is the compact operation in the cudpp, a cuda library (http://gpgpu.org/developer/cudpp).
My question is why the developer repeats the writing part 8 times? And why ...
2
votes
4answers
1k views
Nsight skips (ignores) over break points in VS10 Cuda works fine, nsight consistently skips over several breakpoints
I'm using Nsight 2.2, Toolkit 4.2, and the latest NVIDIA driver, with a couple of GPUs in my computer. Build customization 4.2.
I have set "generate GPU output" in the CUDA project properties, Nsight Monitor is ...
3
votes
1answer
373 views
What happens when all threads of a warp read the same global memory?
I want to know what happens when all threads of a warp read the same 32-bit address of global memory. How many memory requests are there? Is there any serialization? The GPU is a Fermi card, the ...
0
votes
0answers
188 views
Calling cutilExit(argc, argv) cause error
At the end of the DLL (a slightly modified example), cutilExit(argc, argv) is called and causes:
Error when parsing command line argument string.
I have no idea what the problem is and am not sure which ...
2
votes
1answer
93 views
Is there a way to limit or prioritize how much processing power an OpenCL application can use?
First, I'm not even an OpenCL newbie-- I know what it is but I haven't so much as written one line of code. However, I have looked through some OpenCL on a very simple, open-source project and ...
0
votes
1answer
398 views
Using OpenCV with GPU that is not factory built-in? [closed]
I want to speed up my OpenCV-based software for real-time operation using OpenCV's GPU support library. My computer does not have a built-in GPU supported by OpenCV, so here are my questions:
...
0
votes
2answers
356 views
How to reduce the branch divergence of binary search using CUDA
The application is to intersect two sorted lists of integers (set intersection), say list1 and list2.
Each element of list1 will be assigned a GPU thread, which does a binary search to check whether it ...
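One commonly suggested way to make such a search more uniform across threads is to run a fixed number of steps that depends only on the list length, with a conditional add instead of an early exit. The sketch below shows that idea on the CPU; it is an illustration under assumptions, not the asker's code or a measured fix. The names `contains_fixed_steps` and `intersect_sorted` are hypothetical, and the loop over `list1` stands in for one GPU thread per element.

```cpp
#include <cstddef>
#include <vector>

// Binary search whose iteration count depends only on sorted.size()
// (hypothetical helper). The candidate range [base, base + len) is halved
// on every step; the conditional add can compile to a select rather than
// a branch, so all "threads" follow the same control flow.
bool contains_fixed_steps(const std::vector<int>& sorted, int key) {
    std::size_t len = sorted.size();
    if (len == 0) return false;

    std::size_t base = 0;
    while (len > 1) {
        const std::size_t half = len / 2;
        // Move the window right only if the probe is still below the key.
        base += (sorted[base + half - 1] < key) ? half : 0;
        len -= half;
    }
    return sorted[base] == key;
}

// Serial stand-in for "one thread per element of list1".
std::vector<int> intersect_sorted(const std::vector<int>& list1,
                                  const std::vector<int>& list2) {
    std::vector<int> result;
    for (int x : list1) {
        if (contains_fixed_steps(list2, x)) result.push_back(x);
    }
    return result;
}
```

Whether this helps in practice depends on the hardware and on memory access patterns, which this CPU sketch does not model.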
2
votes
1answer
3k views
What is the difference between “-arch sm_13” and “-arch sm_20”?
I need double-precision calculation in my application. According to what I found on Google, I should add the flag "-arch sm_13" or "-arch sm_20".
Q1: What is the difference between "-arch sm_13" and "-arch ...