I ran into a strange problem. I am implementing some linear algebra in OpenCL (only matrix multiplication so far) and tested it on my laptop. The code is very simple:
__kernel void matrix_mult(__global float* a,
                          __global float* b,
                          __global float* c,
                          const int N)
{
    // One work-item computes one element c[row][col] of the N x N result.
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        sum += a[row*N+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}
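Before comparing timings, it may help to confirm that both machines produce the same result. The following is a minimal CPU sketch of the same row-major product, not part of my original code; `matrix_mult_cpu` and `nearly_equal` are hypothetical helper names, and the tolerance is an assumption that would need tuning for 512 x 512 inputs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for the matrix_mult kernel: c = a * b, all N x N, row-major.
std::vector<float> matrix_mult_cpu(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   std::size_t N)
{
    assert(a.size() == N * N && b.size() == N * N);
    std::vector<float> c(N * N, 0.0f);
    for (std::size_t row = 0; row < N; ++row) {
        for (std::size_t col = 0; col < N; ++col) {
            float sum = 0.0f;
            for (std::size_t i = 0; i < N; ++i) {
                sum += a[row * N + i] * b[i * N + col];
            }
            c[row * N + col] = sum;
        }
    }
    return c;
}

// Hypothetical helper: true if x and y have the same size and every pair of
// elements differs by at most tol.
bool nearly_equal(const std::vector<float>& x,
                  const std::vector<float>& y,
                  float tol)
{
    if (x.size() != y.size()) return false;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (std::fabs(x[i] - y[i]) > tol) return false;
    }
    return true;
}
```

The buffer read back from the device could then be compared against `matrix_mult_cpu(a, b, N)` with `nearly_equal`. The GPU may accumulate in a different order or use fused multiply-add, so small differences are expected.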
I test the hardware by running the code 100 times as follows:
clock_t begin = clock();
const unsigned int repeats = 100;
for (unsigned int i = 0; i != repeats; i++) {
    runCL(a, b, results, N, N*N);
}
clock_t end = clock();
On my MBP, the matrix multiplication takes about 1.2 ms for 512 x 512 matrices, while the same code takes about 3 ms on a Linux box with a GTX 480. This bothers me: I would expect the expensive GTX card to be faster than a laptop, not slower.
As far as I can see, either my code is "wrong" or I'm wrong.
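One possible source of the odd numbers is the timer itself. `clock()` reports processor time used by the program rather than elapsed wall-clock time, and `clEnqueueNDRangeKernel` returns as soon as the command is queued. The sketch below measures wall-clock time with `std::chrono` instead. `average_wall_ms` is a hypothetical helper, and I am assuming the timed callable blocks until the device has finished, for example by calling `clFinish(cmd_queue)` after `runCL`.

```cpp
#include <cassert>
#include <chrono>

// Average wall-clock time in milliseconds per call of `work`, measured over
// `repeats` consecutive calls with a monotonic clock.
template <typename Work>
double average_wall_ms(Work work, unsigned int repeats)
{
    assert(repeats > 0);
    const auto begin = std::chrono::steady_clock::now();
    for (unsigned int i = 0; i != repeats; ++i) {
        // Assumption: work() does not return until the device is done,
        // e.g. runCL(a, b, results, N, N*N) followed by clFinish(cmd_queue).
        work();
    }
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = end - begin;
    return elapsed.count() / repeats;
}
```

With this, the loop above would become something like `average_wall_ms([&] { runCL(a, b, results, N, N*N); }, 100)`. If `runCL` also creates buffers and copies data on every call, that overhead is included in the figure as well.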
I tried to use the event-based synchronization system in the OpenCL specification, which gave somewhat more realistic results.
// cmd_queue must have been created with CL_QUEUE_PROFILING_ENABLE,
// and err is assumed to be declared earlier as cl_int.
cl_event event = {0};
err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 2, NULL, global_work_size, NULL, 0, NULL, &event);
assert(err == CL_SUCCESS);
err = clWaitForEvents(1, &event);
assert(err == CL_SUCCESS);
cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
// Profiling timestamps are in nanoseconds; convert to milliseconds.
double executionTimeInMilliseconds = (end - start) * 1.0e-6;
std::cout << "execution time in millis: " << executionTimeInMilliseconds << std::endl;
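With this method the GT 330M takes 46 ms and the GTX 480 takes 2.5 ms. This raises another question: with PROFILING, the GT 330M [...] 30 [...], while the GTX 480 [...]. Why is that?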
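To judge whether any of these timings are plausible, they can be converted into throughput. The sketch below assumes the usual count of 2*N^3 floating-point operations for an N x N product (one multiply and one add per inner-loop iteration) and assumes the figures quoted above are in milliseconds. `elapsed_ms` and `gflops` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Convert a pair of OpenCL profiling timestamps (nanoseconds) to milliseconds.
double elapsed_ms(std::uint64_t start_ns, std::uint64_t end_ns)
{
    assert(end_ns >= start_ns);
    return static_cast<double>(end_ns - start_ns) * 1.0e-6;
}

// Throughput in GFLOP/s for an n x n matrix product that took `ms`
// milliseconds, counting 2*n^3 floating-point operations.
double gflops(std::uint64_t n, double ms)
{
    assert(ms > 0.0);
    const double flops = 2.0 * static_cast<double>(n) * n * n;
    return flops / (ms * 1.0e-3) / 1.0e9;
}
```

For N = 512 this gives roughly 107 GFLOP/s at 2.5 ms and about 5.8 GFLOP/s at 46 ms. The 1.2 ms figure from the `clock()` loop would correspond to about 224 GFLOP/s, which seems high for an untuned kernel on a laptop GPU, so that measurement may not cover the full kernel execution.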