The documents you are reading are old. The first two generations of CUDA GPUs had 8 cores per MP and issued instructions from a single warp at a time (put simply, each instruction is executed four times on the 8 cores to serve one warp of 32 threads).
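To make the "four times on 8 cores" arithmetic concrete, here is a small sketch. It is a simplified model, not how the hardware scheduler is actually specified: the function name `passes_per_warp` and its parameters are made up for illustration, and the only hard facts it relies on are the 32-thread warp and the core counts.

```python
WARP_SIZE = 32  # threads per warp on CUDA GPUs


def passes_per_warp(cores_serving_warp: int, warp_size: int = WARP_SIZE) -> int:
    """Return how many times one instruction must run on the given
    number of cores so that every thread of a warp is served.

    Hypothetical helper for illustration only; it models the
    simplification used in the text, nothing more.
    """
    if cores_serving_warp <= 0:
        raise ValueError("cores_serving_warp must be positive")
    # Ceiling division: a partially filled last pass still costs a pass.
    return -(-warp_size // cores_serving_warp)


if __name__ == "__main__":
    # Old 8-core MPs: one warp instruction takes 4 passes.
    print(passes_per_warp(8))
```

Passing 16 instead of 8 gives 2 passes, and 32 gives 1, which is the same calculation applied to wider groups of cores.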
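Fermi is different: a warp instruction is issued to a group of 16 cores, so it runs twice rather than four times. Compute capability 2.1 differs again from 2.0: a 2.0 MP has 32 cores, while a 2.1 MP has 48.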
, - 14 , 336 7 GTX 560, , " " . , , , F .