CUDA: reduction or atomic operations?

I am writing a CUDA kernel that involves calculating the maximum value of a given matrix, and I am evaluating the possibilities. The best way I could find is:

Forcing each thread to store a value in shared memory and then using a reduction algorithm to determine the maximum (pro: minimal divergence; con: shared memory is limited to 48 KB on 2.0 devices)

I could not use atomic operations because there is both a read and a write operation, so the threads cannot be synchronized using __syncthreads().

Does any other idea come to mind?

+3
7 answers

This is the usual way to perform reductions in CUDA.

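Inside each block:

1) Keep a running reduced value in shared memory for each thread. Each thread reads n values from global memory (somewhere between 16 and 32) and updates its reduced value from them.

2) Run the reduction algorithm within the block to get one final reduced value per block.

This way you need no more than (number of threads) * sizeof (datatype) bytes of shared memory.

Since each block produces one reduced value, you need a second reduction pass to get the final value.

For example, with 256 threads per block and 16 values per thread, each block reduces (256 * 16 = 4096) elements.

So, given 1 million elements, you launch around 250 blocks in the first pass and just one block in the second.

You will probably need a third pass with this configuration when the number of elements > (4096) ^ 2.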

, . , , .
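The per-block scheme in this answer can be sketched roughly as follows. This is only a minimal illustration, assuming float data, 256 threads per block and 16 values per thread; the names `blockMax`, `deviceMax`, `THREADS` and `ITEMS_PER_THREAD` are made up for the example, and error checking is left out for brevity.

```cuda
#include <cfloat>
#include <cstddef>
#include <utility>
#include <cuda_runtime.h>

constexpr int THREADS = 256;          // threads per block (assumed)
constexpr int ITEMS_PER_THREAD = 16;  // values read by each thread (assumed)

// Each block reduces up to THREADS * ITEMS_PER_THREAD inputs to one maximum.
__global__ void blockMax(const float* in, float* out, size_t n)
{
    __shared__ float smem[THREADS];   // THREADS * sizeof(float) bytes
    size_t base = (size_t)blockIdx.x * THREADS * ITEMS_PER_THREAD;

    // Step 1: per-thread running maximum. Consecutive threads read
    // consecutive addresses, so the global loads can coalesce.
    float m = -FLT_MAX;
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        size_t idx = base + (size_t)i * THREADS + threadIdx.x;
        if (idx < n) m = fmaxf(m, in[idx]);
    }
    smem[threadIdx.x] = m;
    __syncthreads();

    // Step 2: tree reduction in shared memory.
    for (int s = THREADS / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smem[threadIdx.x] = fmaxf(smem[threadIdx.x], smem[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = smem[0];
}

// Host driver: repeats passes until a single value remains. Assumes n >= 1.
float deviceMax(const float* h_in, size_t n)
{
    const size_t perBlock = (size_t)THREADS * ITEMS_PER_THREAD;
    size_t blocks = (n + perBlock - 1) / perBlock;
    float *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, blocks * sizeof(float));
    cudaMemcpy(d_a, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    while (true) {
        blocks = (n + perBlock - 1) / perBlock;
        blockMax<<<(unsigned)blocks, THREADS>>>(d_a, d_b, n);
        if (blocks == 1) break;       // d_b[0] now holds the maximum
        std::swap(d_a, d_b);          // the output becomes the next input
        n = blocks;
    }

    float result;
    cudaMemcpy(&result, d_b, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return result;
}
```

With these constants, 1,000,000 inputs need ceil(1,000,000 / 4096) = 245 blocks in the first pass and a single block in the second, and the shared memory used per block is only 256 * sizeof(float) = 1 KB.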

+4

You could also use CUDA Thrust, which is included with CUDA 4.0.

nVidia . , /.

, .

. . tkerwin.
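With Thrust, mentioned in this answer, a maximum can be computed without writing a kernel at all. The snippet below is a small sketch, assuming float data that starts on the host; the wrapper name `thrustMax` is hypothetical, while `thrust::reduce`, `thrust::maximum` and `thrust::device_vector` are part of Thrust.

```cuda
#include <cfloat>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

// Copies the data to the device and reduces it with a max operator.
// -FLT_MAX is the initial value, so an empty input returns -FLT_MAX.
float thrustMax(const std::vector<float>& h)
{
    thrust::device_vector<float> d(h.begin(), h.end());
    return thrust::reduce(d.begin(), d.end(), -FLT_MAX,
                          thrust::maximum<float>());
}
```

If the data already lives on the device, the copy can be skipped by passing device iterators straight to `thrust::reduce`.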

+6

NVIDIA CUDA, : . , .

+3

CUDA. , .

+2

, , , . (, ). , , .

, , . - , - libcub nVIDIA Duane Merill. - .

, , , , . ( , ), - . , , , atomicMax(), - , .
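One way atomicMax() could be combined with a block-level reduction is sketched below. It assumes int data, since atomicMax() has integer overloads but no float one, and the names `maxAtomic`, `atomicMaxReduce` and `BLOCK` are made up for the example. Each block reduces in shared memory and only thread 0 issues an atomic, so there is one atomic per block rather than one per element.

```cuda
#include <algorithm>
#include <climits>
#include <cstddef>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block (assumed); must match the launch

__global__ void maxAtomic(const int* in, int* result, size_t n)
{
    __shared__ int smem[BLOCK];

    // Grid-stride loop: each thread keeps a running maximum.
    int m = INT_MIN;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        m = max(m, in[i]);
    smem[threadIdx.x] = m;
    __syncthreads();

    // Tree reduction within the block.
    for (int s = BLOCK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smem[threadIdx.x] = max(smem[threadIdx.x], smem[threadIdx.x + s]);
        __syncthreads();
    }

    // One atomic per block merges the block results in global memory.
    if (threadIdx.x == 0) atomicMax(result, smem[0]);
}

// Host wrapper (error checking omitted). Assumes n >= 1.
int atomicMaxReduce(const int* h_in, size_t n)
{
    int *d_in = nullptr, *d_res = nullptr;
    int init = INT_MIN, res = INT_MIN;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_res, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_res, &init, sizeof(int), cudaMemcpyHostToDevice);

    // Cap the grid at 64 blocks (an arbitrary choice); the grid-stride
    // loop covers whatever the grid does not reach directly.
    unsigned blocks = (unsigned)std::min<size_t>((n + BLOCK - 1) / BLOCK, 64);
    maxAtomic<<<blocks, BLOCK>>>(d_in, d_res, n);

    cudaMemcpy(&res, d_res, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_res);
    return res;
}
```

This needs only a single kernel launch, at the cost of a few contended atomics on one address.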

+1

If you have a K20 or Titan, I suggest dynamic parallelism: launch a single-thread kernel, which launches #items worker threads to produce the data, then launches #items / first-round-reduce-factor threads for the first reduction round, and keeps launching until the result comes out.

0