I have a problem that I cannot solve, described below.
CPP Code
const int dataSize = 65535;
const int category = 10;
float data[dataSize][category];
const float threshold = 0.5f;
int cnt = 0;
for (int i = 0; i < dataSize; i++)
{
    if (data[i][9] > threshold)
    {
        data[cnt][0] = data[i][0];
        data[cnt][1] = data[i][1];
        data[cnt][2] = data[i][2];
        data[cnt][3] = data[i][3];
        data[cnt][4] = data[i][4];
        data[cnt][5] = data[i][5];
        data[cnt][6] = data[i][6];
        data[cnt][7] = data[i][7];
        data[cnt][8] = data[i][8];
        data[cnt][9] = data[i][9];
        cnt++;
    }
}
With this code, the rows of "data" whose last column (index 9) exceeds the threshold are packed to the front of the array, and cnt ends up holding the number of rows kept. (Rows that do not exceed the threshold are not important to me. Only the rows that exceed it matter.)
I want CUDA code that produces the same result, so I tried the following.
CUDA Code
__global__ void checkOverThreshold(float *data, float threshold, int *nCount)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if( data[idx*10+9] > threshold )
    {
        data[nCount+0] = data[idx*10+0];
        data[nCount+1] = data[idx*10+1];
        data[nCount+2] = data[idx*10+2];
        data[nCount+3] = data[idx*10+3];
        data[nCount+4] = data[idx*10+4];
        data[nCount+5] = data[idx*10+5];
        data[nCount+6] = data[idx*10+6];
        data[nCount+7] = data[idx*10+7];
        data[nCount+8] = data[idx*10+8];
        data[nCount+9] = data[idx*10+9];
        atomicAdd( nCount, 1);
    }
}
....
checkOverThreshold<<< dataSize / 128, 128 >>>(d_data, threshold, d_count);
But the result of the CUDA code is not what I expected.
It contains many garbage values, and it does not even match the result of the C++ code.
I suspect the cause is a synchronization problem with the nCount variable,
but I have no idea how to solve it.
Please help me fix my code. Thank you in advance.
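
For reference, here is a minimal sketch of one way the kernel might be repaired. It is not a definitive implementation. It addresses four things in the kernel as posted: nCount is a pointer, so it has to be read through atomicAdd rather than used directly as an index; the slot must be reserved before the row is written, not after; compacting in place lets one thread overwrite a row that another thread has not read yet; and dataSize / 128 rounds down, so the last rows are never visited. The sketch assumes a separate output buffer is acceptable. The names compactOverThreshold, compactRows and kCols are hypothetical and do not come from the original code, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kCols = 10;  // floats per row, as in the question

// Copies every row whose last column exceeds `threshold` from `in` to `out`.
// Each kept row reserves its own output slot with atomicAdd, so no two
// threads write to the same row. Output order is NOT the input order.
__global__ void compactOverThreshold(const float* in, float* out, int numRows,
                                     float threshold, int* outCount)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRows) return;  // guard the partially filled last block

    if (in[row * kCols + (kCols - 1)] > threshold) {
        // atomicAdd returns the counter value before the increment,
        // which is this thread's private destination row.
        int dst = atomicAdd(outCount, 1);
        for (int c = 0; c < kCols; ++c)
            out[dst * kCols + c] = in[row * kCols + c];
    }
}

// Hypothetical host wrapper (an assumption, not part of the original post).
// Returns the kept rows as a flat vector of kCols floats per row.
std::vector<float> compactRows(const std::vector<float>& host, float threshold)
{
    const int numRows = static_cast<int>(host.size()) / kCols;
    if (numRows == 0) return {};
    const size_t bytes = static_cast<size_t>(numRows) * kCols * sizeof(float);

    float* dIn = nullptr;
    float* dOut = nullptr;
    int* dCount = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);  // worst case: every row is kept
    cudaMalloc(&dCount, sizeof(int));
    cudaMemcpy(dIn, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(int));  // the counter must start at zero

    const int threads = 128;
    const int blocks = (numRows + threads - 1) / threads;  // round up
    compactOverThreshold<<<blocks, threads>>>(dIn, dOut, numRows, threshold, dCount);

    // This blocking copy also waits for the kernel to finish.
    int count = 0;
    cudaMemcpy(&count, dCount, sizeof(int), cudaMemcpyDeviceToHost);

    std::vector<float> result(static_cast<size_t>(count) * kCols);
    if (count > 0)
        cudaMemcpy(result.data(), dOut, result.size() * sizeof(float),
                   cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dCount);
    return result;
}
```

Calling compactRows(hostData, 0.5f) on the flattened array should return the same set of rows as the C++ loop, but generally in a different order, because the threads reach atomicAdd in no particular sequence. If the original order has to be preserved, a stable stream compaction such as thrust::copy_if may be a better fit than the atomic counter.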