While profiling my program, I noticed that at some point it loses up to 100 ms. I timed every operation, but no individual operation accounted for that time. Then I noticed that wherever I place a call to cudaThreadSynchronize, the first call takes 100 ms. So I wrote the example below. When cudaThreadSynchronize is called at the start of main, the time measured at the end is less than 1 ms. If it is not called, the measured time averages about 110 ms.
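#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cutil.h>  // legacy CUDA SDK utilities: CUDA_SAFE_CALL, cut*Timer

int main(int argc, char **argv)
{
    // Without this call, the ~110 ms shows up in the timed region below.
    cudaThreadSynchronize();

    unsigned int timer;
    cutCreateTimer(&timer);
    cutStartTimer(timer);

    float *data;
    CUDA_SAFE_CALL(cudaMalloc(&data, sizeof(float) * 1024));

    cutStopTimer(timer);
    printf("CUT Elapsed: %.3f\n", cutGetTimerValue(timer));
    cutDeleteTimer(timer);
    return EXIT_SUCCESS;
}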
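So it seems that the first call to cudaThreadSynchronize() is what initializes CUDA. Is that right, and why? Should I call cudaThreadSynchronize at program start to avoid this?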
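The measurement can also be reproduced without the cutil timers. Below is a minimal sketch, assuming the extra time comes from the CUDA runtime creating its context lazily on the first runtime call (the usual explanation for this pattern, but not something the numbers above prove on their own). The helper names `elapsed_ms` and `time_cuda_startup` are made up for illustration, and `cudaFree(0)` is used as the warm-up call, which is a common idiom rather than a required one. The `__CUDACC__` guard is only there so the timing helper also builds with a plain C++ compiler.

```cpp
// Build with nvcc, e.g.: nvcc -o warmup warmup.cu
#include <chrono>
#include <cstdio>
#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

// Hypothetical helper: wall-clock time of one call to fn, in milliseconds.
template <typename Fn>
double elapsed_ms(Fn fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

#ifdef __CUDACC__
// Hypothetical helper: times a warm-up call, then the same cudaMalloc
// as in the question, and prints both so they can be compared.
void time_cuda_startup() {
    // Assumption: the first runtime call pays the one-time setup cost.
    double warmup = elapsed_ms([] { cudaFree(0); });

    float *data = nullptr;
    double alloc = elapsed_ms([&] {
        cudaMalloc((void **)&data, sizeof(float) * 1024);
    });
    cudaFree(data);

    std::printf("warm-up call: %.3f ms\n", warmup);
    std::printf("cudaMalloc:   %.3f ms\n", alloc);
}
#endif
```

Calling `time_cuda_startup()` from `main` should show whether most of the time lands in the warm-up call rather than in `cudaMalloc`. If it does, doing one cheap runtime call at startup keeps that cost out of later timings. Note that `cudaThreadSynchronize` is deprecated in newer CUDA releases in favor of `cudaDeviceSynchronize`.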