I am writing the CUDA core, which involves calculating the maximum value on a given matrix, and I evaluate the possibilities. The best way to find:
Forcing each thread to store the value in shared memory and using the reduction algorithm after that to determine the maximum (pro: minimal deviations: total memory is limited to 48 KB on 2.0 devices)
I could not use atomic operations because there is a read and write operation, so threads cannot be synchronized using synchthreads.
Any other idea comes to your mind?
source
share