
I am currently developing with CUDA on an NVIDIA GTX 480. According to the specification, the card has 15 Streaming Multiprocessors (SMs) with 32 CUDA cores each.

My code launches N blocks of 32 threads each, i.e. one warp per block.
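For reference, a launch configuration like mine looks roughly like this (the kernel name and body are placeholders, not my actual code):

```cuda
// Hypothetical kernel standing in for the real one.
__global__ void work(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;  // placeholder computation
}

// N blocks of 32 threads each (one warp per block).
work<<<N, 32>>>(d_data);
```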

Ideally, if N <= 15, I would expect each block to run just as fast as a single block, since each block can be assigned to a different SM. For N > 15, as the blocks start sharing SMs, the performance of each individual block should decay. If maxocc is the maximum occupancy of my kernel in blocks per SM, then performance should stagnate for N > 15*maxocc, as not all blocks can be scheduled on the SMs at once.

This is almost what I observe in practice: the performance of each individual block starts to decay at N = 12, and performance stagnates at N = 57, i.e. it is as if three extra blocks were occupying the SMs.

I have no other programs running that use the GTX 480. The card is, however, connected to an external display running a text console, i.e. not X-windows.

Now for the question: Does anybody know if using the GTX 480 to drive a console occupies CUDA resources? And if so, how much exactly? And how can I avoid that, i.e. how can I deactivate the video output without deactivating the CUDA device entirely?

2 Answers


The CUDA architecture does not guarantee that on a 15-SM device, 15 blocks will be distributed one per SM. The compute work distributor is likely placing 2 blocks on several SMs. The Parallel Nsight Instruction Statistics experiment shows a graph of Warps Launched Per SM and Active Cycles Per SM. In your case I believe you will find the distribution to be: 9 SMs with 1 block, 3 SMs with 2 blocks, and 3 SMs with no blocks.

If you are launching fewer blocks than there are SMs, you can try to force 1 block per SM by increasing the dynamic shared memory per block to half the SM's shared memory plus 1 byte (this is specified as the third argument in the triple angle brackets). This forces occupancy down to a single block per SM. Note that if you do this while trying to run concurrent kernels, you may hurt concurrency.
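A minimal sketch of that trick (the kernel is hypothetical, and I am assuming the GTX 480's default configuration of 48 KB shared memory per SM):

```cuda
__global__ void kern(float *data)
{
    /* kernel body omitted; it need not touch the dynamic
       shared memory at all -- the allocation only exists to
       limit occupancy. */
}

// Half the SM's shared memory plus 1 byte: two such blocks
// cannot co-reside on one SM, so occupancy is 1 block/SM.
size_t smemBytes = 48 * 1024 / 2 + 1;
kern<<<N, 32, smemBytes>>>(d_data);
```

If the cache/shared-memory split is configured differently (e.g. via cudaFuncSetCacheConfig), the size would need to be adjusted accordingly.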

On the current architectures a CUDA context has exclusive use of all SMs when a kernel is running.

  • Thanks for the reply! It seems a bit odd that the scheduler would let SMs idle. Do you have a good reference for how it works? Since I'm on a non-Windows machine I can't use Parallel Nsight myself to verify this.
    – Pedro
    Commented May 8, 2012 at 8:53
  • I cannot find a reference for this behavior. It is recommended that a grid launch sufficient work to fill the device. Some scheduling artifacts can appear if the launch does not fill the device. If you are interested in investigating the behavior, you can use the PTX special variable %smid (see the inline PTX sample) to create a per-SM software counter: at the beginning of your kernel, read %smid and have each warp (or block) atomically increment the software counter for that SM.
    – Greg Smith
    Commented May 9, 2012 at 3:48
  • Before adding assembler calls to my code to verify this, I would really like to know whether this type of behaviour is documented by NVIDIA somewhere and not just a hunch. Do you have any source at all on how the scheduler works?
    – Pedro
    Commented May 9, 2012 at 10:29
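For reference, the per-SM counter Greg describes could be sketched as follows (the kernel and counter array are illustrative, not from either post):

```cuda
// Read the SM id via the PTX special register %smid.
__device__ unsigned int get_smid()
{
    unsigned int id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// One increment per block: only thread 0 of each block counts.
// 'counters' must hold at least one slot per SM (15 on a GTX 480)
// and be zeroed before launch.
__global__ void countBlocksPerSM(unsigned int *counters)
{
    if (threadIdx.x == 0)
        atomicAdd(&counters[get_smid()], 1u);
}
```

After launching this with the same grid configuration as the real kernel and copying the counters back to the host, the array shows how many blocks landed on each SM.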

A bunch of guesses follow:

I'm guessing that the old CGA text modes are emulated, so there's no dedicated hardware for them on a Fermi chip. Then it's possible that at each vblank, a shader is called that renders the current state of the CGA text buffer.

I'm also guessing that the cards don't support the low resolutions that were in use back then, or the monochrome color depth. The result is that there might be a lot of 32-bit pixels that have to be updated at 60 FPS just to render CGA text.

One thing to try would be to add another graphics card or use the onboard graphics (if available), so that you can run the CUDA card without a monitor attached. If you try that, make sure to set the non-CUDA card as the primary graphics card in the PC BIOS.
