My question is about coalescing global memory writes to a dynamically changing set of array elements in CUDA. Consider the following kernel:
    __global__ void
    kernel(int n, int *odata, int *idata, int *hash)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            odata[hash[i]] = idata[i];
    }
Here, the first n elements of the array hash contain the indices of odata to be updated from the first n elements of idata. Obviously, this leads to a terrible, terrible lack of coalescing. In my code, the hash on one kernel invocation is completely unrelated to the hash on another (and other kernels update the data in other ways), so simply reordering the data to optimize this particular kernel is not an option.
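To make the access pattern concrete, here is a minimal, self-contained sketch of a harness around this kernel. It is illustrative only and not the real application: the problem size, the assumption that hash is a random permutation of 0..n-1 (so odata has exactly n elements and no two threads write the same slot), the block size of 256, and the CUDA_CHECK helper macro are all assumptions made for the example.

```cuda
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <numeric>
#include <random>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the question): abort on any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(e_), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Kernel from the question: thread i scatters idata[i] to odata[hash[i]].
// Reads of idata and hash are contiguous; the write to odata is not.
__global__ void
kernel(int n, int *odata, int *idata, int *hash)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        odata[hash[i]] = idata[i];
}

int main()
{
    const int n = 1 << 20;                 // assumed problem size
    const size_t bytes = n * sizeof(int);

    // Assumption: hash is a random permutation, so every target is unique.
    std::vector<int> h_idata(n), h_hash(n), h_odata(n, -1);
    std::iota(h_idata.begin(), h_idata.end(), 0);
    std::iota(h_hash.begin(), h_hash.end(), 0);
    std::mt19937 rng(12345);
    std::shuffle(h_hash.begin(), h_hash.end(), rng);

    int *d_idata = nullptr, *d_hash = nullptr, *d_odata = nullptr;
    CUDA_CHECK(cudaMalloc(&d_idata, bytes));
    CUDA_CHECK(cudaMalloc(&d_hash, bytes));
    CUDA_CHECK(cudaMalloc(&d_odata, bytes));
    CUDA_CHECK(cudaMemcpy(d_idata, h_idata.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_hash, h_hash.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;               // assumed block size
    const int blocks = (n + threads - 1) / threads;
    kernel<<<blocks, threads>>>(n, d_odata, d_idata, d_hash);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaMemcpy(h_odata.data(), d_odata, bytes, cudaMemcpyDeviceToHost));

    // Check the scatter on the host: odata[hash[i]] must equal idata[i].
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_odata[h_hash[i]] != h_idata[i])
            ++mismatches;
    printf("mismatches: %d\n", mismatches);

    CUDA_CHECK(cudaFree(d_idata));
    CUDA_CHECK(cudaFree(d_hash));
    CUDA_CHECK(cudaFree(d_odata));
    return mismatches == 0 ? 0 : 1;
}
```

With a shuffled hash, neighbouring threads in a warp read adjacent elements of idata and hash but write to unrelated addresses in odata, which is the uncoalesced store I am asking about. Replacing the shuffle with the identity permutation should give a coalesced baseline for comparison in a profiler.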
Is there any feature in CUDA that would allow me to improve performance in this situation? I have heard a lot of talk about texture memory, but I have not been able to translate what I have read into a solution to this problem.