My CUDA program suffers from uncoalesced access to global memory. Although the idx-th thread deals only with the [idx]-th cell of the array, there are many indirect memory accesses, as shown below.
int idx=blockDim.x*blockIdx.x+threadIdx.x;
.... = FF[m_front[m_fside[idx]]];
The reads of m_fside[idx] are coalesced, but what we actually need is FF[m_front[m_fside[idx]]], which is a two-level indirect access.
I tried to find a pattern in the contents of m_front or m_fside that would let me turn this into direct sequential access, but they turned out to be almost "random".
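To make the access pattern concrete, here is a minimal, self-contained sketch of it. The element types (int index arrays, float FF), the kernel name gather, the output array out, and the small test data are all assumptions made for illustration; the real program does not show them. It only reproduces the pattern, it does not fix it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: the types and the `out` array are assumptions.
// const + __restrict__ tells the compiler the inputs are read-only and
// do not alias, which may allow loads through the read-only data cache.
__global__ void gather(const int* __restrict__ m_fside,
                       const int* __restrict__ m_front,
                       const float* __restrict__ FF,
                       float* out, int n)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx >= n) return;
    int side  = m_fside[idx];   // coalesced: neighbouring threads read neighbouring cells
    int front = m_front[side];  // first gather: address depends on loaded data
    out[idx]  = FF[front];      // second gather: address depends on loaded data
}

int main()
{
    const int n = 8;
    // Arbitrary permutations standing in for the "random" index arrays.
    std::vector<int> h_fside = {3, 0, 7, 1, 6, 2, 5, 4};
    std::vector<int> h_front = {5, 2, 7, 0, 3, 6, 1, 4};
    std::vector<float> h_FF(n), h_out(n);
    for (int i = 0; i < n; ++i) h_FF[i] = 10.0f * i;

    int *d_fside, *d_front;
    float *d_FF, *d_out;
    cudaMalloc(&d_fside, n * sizeof(int));
    cudaMalloc(&d_front, n * sizeof(int));
    cudaMalloc(&d_FF,    n * sizeof(float));
    cudaMalloc(&d_out,   n * sizeof(float));
    cudaMemcpy(d_fside, h_fside.data(), n * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(d_front, h_front.data(), n * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(d_FF,    h_FF.data(),    n * sizeof(float), cudaMemcpyHostToDevice);

    gather<<<1, 32>>>(d_fside, d_front, d_FF, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against the same double lookup done on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_FF[h_front[h_fside[i]]]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_fside); cudaFree(d_front); cudaFree(d_FF); cudaFree(d_out);
    return mismatches != 0;
}
```

Only the first load is contiguous across a warp; the two dependent loads can touch a different memory segment per thread, which is where the uncoalesced traffic comes from.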
Is there any way to handle this?