CUDA: reduction or atomic operations?

Question


I am writing a CUDA kernel that computes the maximum value of a given matrix, and I am evaluating the options. The best approach I have found is:

Forcing each thread to store a value in shared memory and then using a reduction algorithm to determine the maximum (pro: minimal divergence; con: shared memory is limited to 48 KB on 2.0 devices).

I could not use atomic operations because both a read and a write are involved, so the threads could not be synchronized with __syncthreads().

Does any other idea come to mind?

+3

algorithm matrix reduction cuda gpu-atomics

Marco A. May 07, '11 at 21:01


7 answers

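You may also want to use the reduction routines in CUDA Thrust, which is part of CUDA 4.0.

The library was written by nVidia engineers.

For the theory, see the answer by tkerwin.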
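
As a rough illustration of the Thrust route mentioned in this answer, the sketch below wraps a raw device pointer and takes the maximum with `thrust::reduce`. The helper name `thrust_max` is hypothetical, and the choice of `-FLT_MAX` as the initial value assumes the input contains only finite floats.

```cuda
#include <cassert>
#include <cfloat>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

// Hypothetical helper (not from the original answer): returns the maximum
// of n floats that already live in device memory.
float thrust_max(const float* d_data, int n)
{
    assert(n > 0);
    // Wrap the raw device pointer so Thrust runs the reduction on the GPU.
    thrust::device_ptr<const float> first(d_data);
    // -FLT_MAX acts as the identity for max over finite floats (assumption).
    return thrust::reduce(first, first + n, -FLT_MAX, thrust::maximum<float>());
}
```

This keeps an existing kernel untouched: the pointer it wrote to can be passed straight to the helper.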

+6

peakxu 09 '11 12:40

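NVIDIA has a CUDA demo that performs a reduction. A whitepaper accompanies it and explains the motivation behind its design.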

+3

tkerwin 07 '11 22:05

CUDA. , .

+2

jtimon 20 . '12 17:37

, , , . (, ). , , .

, , . - , - libcub nVIDIA Duane Merill. - .

, , , , . ( , ), - . , , , atomicMax(), - , .
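
This answer names `atomicMax()`. One common way to combine it with a block-level reduction, which may or may not be what the answer had in mind, is to reduce inside each block in shared memory and issue a single atomic per block. The sketch below assumes `int` data; the names `max_atomic_kernel` and `max_with_atomics`, the block size of 256, and the cap of 64 blocks are all illustrative assumptions.

```cuda
#include <cassert>
#include <climits>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

__global__ void max_atomic_kernel(const int* in, int n, int* result)
{
    __shared__ int smem[kThreads];
    int tid = threadIdx.x;

    // Grid-stride loop: each thread keeps a running maximum in a register.
    int local = INT_MIN;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        local = max(local, in[i]);
    smem[tid] = local;
    __syncthreads();

    // Tree reduction in shared memory down to smem[0].
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] = max(smem[tid], smem[tid + s]);
        __syncthreads();
    }

    // One atomic per block instead of one per element.
    if (tid == 0) atomicMax(result, smem[0]);
}

// Hypothetical host wrapper: copies the input to the device and returns the maximum.
int max_with_atomics(const int* h_in, int n)
{
    assert(n > 0);
    int *d_in = nullptr, *d_result = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    int init = INT_MIN;  // identity for max over int
    cudaMemcpy(d_result, &init, sizeof(int), cudaMemcpyHostToDevice);

    int blocks = (n + kThreads - 1) / kThreads;
    if (blocks > 64) blocks = 64;  // arbitrary cap; the grid-stride loop covers the rest
    max_atomic_kernel<<<blocks, kThreads>>>(d_in, n, d_result);

    int out = 0;
    cudaMemcpy(&out, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_result);
    return out;
}
```

Because only thread 0 of each block touches the global result, contention on the atomic stays low and no second kernel launch is needed.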

+1

einpoklum 22 . '16 9:58

atomicAdd , , . http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/

0

Sayan 18 . '11 18:29

If you have a K20 or Titan, I suggest dynamic parallelism: launch a single-thread kernel, which launches #items worker threads to produce the data, then launches #items / first-round-reduce-factor threads for the first round of reduction, and keeps launching until the result comes out.

0

W.Sun May 02, '13 at 20:58


Pavan Yalamanchili · Accepted Answer · 2011-05-07T23:04:37+0000

This is the usual way to perform reductions in CUDA.

Inside each block

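1) Keep a running reduced value in shared memory for each thread. Each thread reads n values from global memory (between 16 and 32) and updates its reduced value from them.

2) Run the reduction algorithm within the block to get one final reduced value per block.

This way you need no more than (number of threads) * sizeof(datatype) bytes of shared memory.

Since each block produces one reduced value, you need a second reduction pass to get the final value.

For example, if you launch 256 threads per block and read 16 values per thread, you can reduce (256 * 16 = 4096) elements per block.

So, given 1 million elements, you need to launch around 250 blocks in the first pass and just one block in the second.

With this configuration you will need a third pass when the number of elements is > (4096) ^ 2.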

, . , , .
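
A minimal sketch of the scheme in this answer, using its example figures of 256 threads per block and 16 values per thread (4096 elements per block). The names `block_max` and `device_max`, the use of `float` data, and the ping-pong buffer handling are assumptions for illustration, not part of the original answer; error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cfloat>
#include <cuda_runtime.h>

constexpr int kThreads = 256;        // threads per block, as in the example
constexpr int kItemsPerThread = 16;  // values read per thread, as in the example

// Each block reduces up to kThreads * kItemsPerThread inputs to one output value.
__global__ void block_max(const float* in, int n, float* out)
{
    __shared__ float smem[kThreads];  // kThreads * sizeof(float) bytes
    int tid = threadIdx.x;
    int base = blockIdx.x * kThreads * kItemsPerThread;

    // Step 1: per-thread running maximum. Consecutive threads read
    // consecutive addresses, so the global loads can be coalesced.
    float local = -FLT_MAX;
    for (int k = 0; k < kItemsPerThread; ++k) {
        int i = base + k * kThreads + tid;
        if (i < n) local = fmaxf(local, in[i]);
    }
    smem[tid] = local;
    __syncthreads();

    // Step 2: tree reduction within the block.
    for (int s = kThreads / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] = fmaxf(smem[tid], smem[tid + s]);
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = smem[0];
}

// Hypothetical host driver: repeats passes until a single value remains.
float device_max(const float* h_in, int n)
{
    assert(n > 0);
    const int per_block = kThreads * kItemsPerThread;  // 4096
    int blocks = (n + per_block - 1) / per_block;
    float *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, blocks * sizeof(float));
    cudaMemcpy(d_a, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    float *src = d_a, *dst = d_b;
    int count = n;
    while (true) {
        blocks = (count + per_block - 1) / per_block;
        block_max<<<blocks, kThreads>>>(src, count, dst);
        count = blocks;                       // one partial result per block
        if (count == 1) break;                // final value is in dst[0]
        float* t = src; src = dst; dst = t;   // next pass reads the partial results
    }

    float result = 0.0f;
    cudaMemcpy(&result, dst, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return result;
}
```

For 1,000,000 inputs the first pass launches ceil(1,000,000 / 4096) = 245 blocks, and the second pass reduces those 245 partial results in a single block, which matches the figures quoted in the answer.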
