I am currently developing with CUDA on an NVIDIA GTX 480. According to the specification, the card has 15 Streaming Multiprocessors (SMs) with 32 CUDA cores each.
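For reference, the SM count can be double-checked with a device query along these lines (a minimal sketch, assuming the GTX 480 is device 0):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // device 0 assumed to be the GTX 480
        // multiProcessorCount should report 15 on a GTX 480 (compute capability 2.0)
        printf("%s: %d SMs, compute capability %d.%d\n",
               prop.name, prop.multiProcessorCount, prop.major, prop.minor);
        return 0;
    }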
My code runs N blocks of 32 threads each.
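The launch itself looks roughly like the following (simplified sketch; my_kernel and d_data are placeholder names for the real code):

    #include <cuda_runtime.h>

    // Placeholder kernel standing in for the real one.
    __global__ void my_kernel(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;   // dummy work
    }

    void run(float *d_data, int N) {
        my_kernel<<<N, 32>>>(d_data);   // N blocks, 32 threads per block
        cudaDeviceSynchronize();
    }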
Ideally, for N <= 15, I would expect each block to run just as fast as a single block, since every block can be assigned to its own SM. For N > 15, as the blocks start sharing SMs, the performance of each individual block should decay. If maxocc is the maximum occupancy of my kernel, i.e. the maximum number of blocks that can be resident on one SM simultaneously, then performance should stagnate beyond N = 15*maxocc, as not all blocks can be scheduled on the SMs at once.
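If it helps, maxocc can be estimated programmatically; the sketch below assumes a toolkit version (6.5 or later) that provides cudaOccupancyMaxActiveBlocksPerMultiprocessor, with my_kernel again a placeholder for the real kernel:

    // Estimate maxocc: the maximum number of 32-thread blocks resident per SM
    // for this kernel. Assumes CUDA 6.5+ for the occupancy API.
    int maxocc = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxocc, my_kernel, 32, 0);
    printf("max resident blocks per SM: %d -> expected saturation at N = %d\n",
           maxocc, 15 * maxocc);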
This is almost exactly what I observe in practice: the performance of each individual block starts to decay at N = 12, and it stagnates from N = 57 onwards, i.e. it is as if three extra blocks were already occupying the SMs.
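For completeness, the timings come from CUDA events around the launch, roughly like this (simplified sketch with placeholder names):

    // Time the same kernel for varying N and compare elapsed time per block.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    my_kernel<<<N, 32>>>(d_data);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("N = %d: %.3f ms total\n", N, ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);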
I have no other programs running that use the GTX 480. The card is, however, connected to an external display running a text console, i.e. not X-windows.
Now for the question: Does anybody know if using the GTX 480 to drive a console occupies CUDA
resources? And if so, how much exactly? And how can I avoid that, i.e. how can I deactivate the video output without deactivating the CUDA
device entirely?