Blocks, Threads, warpSize

There has been a lot of discussion about how to choose the number of blocks and the block size, but I am still missing something. Many of my concerns relate to this question: How do CUDA blocks/warps/threads map onto CUDA cores? (To simplify the discussion, assume there is enough per-thread and per-block memory. Memory limits are not an issue here.)

kernelA<<<nBlocks, nThreads>>>(varA,constB, nThreadsTotal);

1) To keep the SM as busy as possible, I should set nThreads to a multiple of warpSize. True?

2) An SM can only execute one kernel at a time. That is, all HWcores of that SM run only kernelA. (Not some HWcores running kernelA while others run kernelB.) So if I have only one thread to run, I am "wasting" the other HWcores. True?

3) If the warp scheduler issues work in units of warpSize (32 threads) and each SM has 32 HWcores, then the SM is fully utilized. What happens when an SM has 48 HWcores? How can I keep all 48 cores fully utilized when the scheduler issues work in chunks of 32? (If the previous paragraph is true, wouldn't it be better if the scheduler issued work in units of the HWcore count?)

4) It appears that the warp scheduler queues up 2 tasks at a time, so that when the currently executing kernel stalls or blocks, the second kernel is swapped in. (This is not clear, but I would guess the queue is more than two kernels deep.) Is this correct?

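5) If my HW has an upper limit of 512 threads per block (nThreadsMax), that does not mean that a kernel with 512 threads will run fastest in a single block. (Again, mem is not an issue.) There is a good chance that I will get better performance if I spread the 512-thread kernel across many blocks, not just one. A block is executed on one or many SMs. True?

5a) I am thinking the smaller the better, but does it matter how small I make nBlocks? The question is how to choose a value of nBlocks that is decent enough. (Not necessarily optimal.) Is there a mathematical approach to choosing nBlocks, or is it simply trial-n-err?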

Answer:

1) Yes.

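2) CC 2.0 - 3.0 devices can execute up to 16 kernels concurrently. Each SM is limited to 8 resident blocks, so the device needs at least 2 SMs to reach full concurrency.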

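3) Yes, the warp schedulers select and issue one warp at a time. Forget the concept of CUDA cores, they are irrelevant here. To hide latency, you need either high instruction-level parallelism or high occupancy. The recommendation is > 25% occupancy for CC 1.x and > 50% for CC >= 2.0. In general CC 3.0 needs higher occupancy than 2.0 devices, because the number of schedulers doubled while the number of warps per SM grew by only 33%. The Nsight VSE Issue Efficiency experiment is the best way to determine whether you had enough warps to hide instruction and memory latency. Unfortunately, the Visual Profiler does not have this metric.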

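4) The warp scheduler algorithm is not documented; however, it does not consider which grid a thread block came from. On CC 2.x and 3.0 devices the CUDA work distributor hands out all blocks of one grid before it distributes blocks of the next grid; however, the programming model does not guarantee this.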

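5) To keep the SMs busy, you need enough blocks to fill the device. After that, make sure you have enough warps to reach a reasonable occupancy. Large thread blocks have both pros and cons. They generally use less instruction cache and have a smaller cache footprint; however, large thread blocks stall at __syncthreads() (the SM can become less efficient, as there are fewer warps to choose from) and tend to keep instructions executing on similar execution units. I recommend starting with 128 or 256 threads per block. There are good reasons for both larger and smaller thread blocks.

5a) Use the occupancy calculator. A thread block size that is too large will often leave you limited by registers. A thread block size that is too small can leave you limited by shared memory or by the limit of 8 blocks per SM.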
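As a rough illustration of how the launch `kernelA<<<nBlocks, nThreads>>>(varA, constB, nThreadsTotal)` from the question could be configured, here is a minimal sketch. It makes several assumptions that are not in the original post: the body of `kernelA` is a made-up stand-in, `roundUp`, `blocksFor` and `launchKernelA` are hypothetical helper names, and device 0 is used. It also uses the runtime occupancy functions (`cudaOccupancyMaxPotentialBlockSize` and `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, available in newer CUDA toolkits, 6.5 and later) in place of the occupancy calculator spreadsheet, so treat it as one possible starting point rather than a recommended recipe.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the question's kernelA: one element per thread.
__global__ void kernelA(float *varA, float constB, int nThreadsTotal)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nThreadsTotal)   // the last block may be only partly used
        varA[i] *= constB;
}

// Round n up to the next multiple of m (m > 0).
inline int roundUp(int n, int m) { return ((n + m - 1) / m) * m; }

// Number of blocks of size nThreads needed to cover nThreadsTotal threads.
inline int blocksFor(int nThreadsTotal, int nThreads)
{
    return (nThreadsTotal + nThreads - 1) / nThreads;
}

// Picks a launch configuration, prints the theoretical occupancy, and launches.
// varA must point to device memory holding nThreadsTotal floats.
cudaError_t launchKernelA(float *varA, float constB, int nThreadsTotal)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);   // device 0 is an assumption
    if (err != cudaSuccess) return err;

    // Ask the runtime for the block size with the best theoretical occupancy.
    int minGridSize = 0, nThreads = 0;
    err = cudaOccupancyMaxPotentialBlockSize(&minGridSize, &nThreads, kernelA, 0, 0);
    if (err != cudaSuccess) return err;

    // Safeguard: keep the block size a whole number of warps.
    nThreads = roundUp(nThreads, prop.warpSize);
    int nBlocks = blocksFor(nThreadsTotal, nThreads);

    // Theoretical occupancy = resident threads per SM / maximum threads per SM.
    int blocksPerSM = 0;
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernelA, nThreads, 0);
    if (err != cudaSuccess) return err;
    float occupancy = float(blocksPerSM * nThreads) / float(prop.maxThreadsPerMultiProcessor);

    printf("SMs=%d nBlocks=%d nThreads=%d blocks/SM=%d occupancy=%.0f%%\n",
           prop.multiProcessorCount, nBlocks, nThreads, blocksPerSM, 100.0f * occupancy);

    kernelA<<<nBlocks, nThreads>>>(varA, constB, nThreadsTotal);
    err = cudaGetLastError();            // catches invalid launch configurations
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();      // waits for the kernel and reports its errors
}
```

Host code would allocate `varA` with `cudaMalloc` and then call `launchKernelA(varA, 2.0f, n)`. The printed occupancy is only the theoretical value; whether it is enough to hide latency still has to be checked with a profiler, and a fixed 128 or 256 threads per block remains a reasonable alternative to the runtime's suggestion.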
