OpenCL code runs faster on MBP than on NVIDIA GTX 480

I ran into a strange problem. I am implementing some linear algebra in OpenCL, only matrix multiplication so far, and tested it on my laptop. The code is very simple:

__kernel void matrix_mult(__global float* a, 
              __global float* b, 
              __global float* c,
              const int N) 
{
  int row = get_global_id(1);
  int col = get_global_id(0);
  float sum = 0.0f;
  for (int i = 0; i < N; i++) {
    sum += a[row*N+i] * b[i*N+col];
  }
  c[row*N+col] = sum;
}

I test the hardware by running the code 100 times as follows:

  clock_t begin=clock(); 

  const unsigned int repeats = 100;
  for(int  i = 0; i != repeats; i++){
    runCL(a, b, results,N, N*N);
  }

  clock_t end=clock();

On my MBP, one matrix multiplication takes about 1.2 ms for 512 x 512 matrices, while the same code takes about 3 ms on a Linux box with a GTX 480. This bothers me: I would expect the expensive GTX card to be considerably faster than a laptop GPU, not slower.
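One quick plausibility check on these timings is to convert them into arithmetic throughput. The sketch below is only an illustration (the function name `achieved_gflops` is made up here); it assumes the reported time covers exactly one N x N product and counts one multiply and one add per inner-loop iteration of the kernel.

```cpp
#include <cassert>

// Achieved throughput in GFLOP/s for one N x N matrix product that took
// `milliseconds`. The kernel performs one multiply and one add per
// inner-loop iteration, so the total is 2 * N^3 floating-point operations.
// (Hypothetical helper, not part of the original code.)
double achieved_gflops(double N, double milliseconds) {
  assert(N > 0.0 && milliseconds > 0.0);
  const double flop = 2.0 * N * N * N;
  const double seconds = milliseconds * 1.0e-3;
  return flop / seconds / 1.0e9;
}
```

For N = 512 this gives roughly 224 GFLOP/s at 1.2 ms and roughly 89 GFLOP/s at 3 ms. If a figure is higher than the device could plausibly sustain with a naive kernel, the timer is probably not measuring kernel execution.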

As far as I can see, either my code is "wrong" or I'm wrong.
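To rule out the "wrong code" possibility, the kernel output can be compared against a plain CPU reference. This is a minimal sketch, assuming the host buffers are row-major `std::vector<float>` of size N*N; the helper names `matmul_reference` and `max_abs_diff` are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference using the same row-major indexing as the kernel:
// c[row*N + col] = sum over i of a[row*N + i] * b[i*N + col].
std::vector<float> matmul_reference(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    std::size_t N) {
  assert(a.size() == N * N && b.size() == N * N);
  std::vector<float> c(N * N, 0.0f);
  for (std::size_t row = 0; row < N; ++row) {
    for (std::size_t i = 0; i < N; ++i) {
      const float a_ri = a[row * N + i];
      // Innermost loop walks b and c contiguously, which is cache friendly.
      for (std::size_t col = 0; col < N; ++col) {
        c[row * N + col] += a_ri * b[i * N + col];
      }
    }
  }
  return c;
}

// Largest absolute element-wise difference between two result buffers.
float max_abs_diff(const std::vector<float>& x, const std::vector<float>& y) {
  assert(x.size() == y.size());
  float worst = 0.0f;
  for (std::size_t k = 0; k < x.size(); ++k) {
    worst = std::fmax(worst, std::fabs(x[k] - y[k]));
  }
  return worst;
}
```

The GPU may round differently (for example by fusing the multiply and add), so the comparison should use a small tolerance rather than exact equality.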

I then tried the event-based profiling described in the OpenCL specification, which gave somewhat more realistic results.

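// Note: the command queue must be created with CL_QUEUE_PROFILING_ENABLE,
// otherwise no timestamps are recorded for the event.
cl_event event = NULL;
err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 2, NULL, global_work_size, NULL, 0, NULL, &event);
assert(err == CL_SUCCESS);

// Block until the kernel has finished before reading its timestamps.
err = clWaitForEvents(1, &event);
assert(err == CL_SUCCESS);

cl_ulong start, end; // device timestamps in nanoseconds
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,   sizeof(cl_ulong), &end,   NULL);
double executionTimeInMilliseconds = (end - start) * 1.0e-6; // ns -> ms
std::cout << "execution time in ms: " << executionTimeInMilliseconds << std::endl;
clReleaseEvent(event);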

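Now the GT330M takes 46 ms and the GTX480 takes 2.5 ms. This raises another interesting question: with PROFILING turned on, the GT 330M becomes about 30 times slower, which makes some sense, but the GTX480 keeps the same performance. Can anyone explain why?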

+3
3

, GTX480 .

, , ; B, - .

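The GTX480 has 3 times the memory bus width (384 bits) and 2 times the memory clock (1840 MHz) of the GT330M (128 bits, 800 MHz). That works out to a peak bandwidth of 177.4 GB/s versus 25.6 GB/s, and the speedup should land somewhere in that range. However, each access to b uses only 32 bits of a 384-bit memory transaction on the GTX480, and 32 bits of a 128-bit transaction on the 330M. The effective bandwidth for accesses to b is therefore 14.8 GB/s versus 6.4 GB/s; that is a factor of about 2 rather than 7, so much of the faster card's advantage is lost. In addition, that bandwidth has to be shared among 10x as many cores, so each core waits longer for its data. I suspect that with larger matrices the result would approach the 2x speedup, rather than the 2.5x slowdown you are seeing.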
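The bandwidth figures in this answer can be reproduced with a few lines of arithmetic. This is a sketch under assumptions: the helper names are made up, and the peak formula assumes two transfers per quoted clock cycle, which is what matches the quoted numbers (the 1840 MHz clock gives about 176.6 GB/s, slightly below the quoted 177.4 GB/s).

```cpp
#include <cassert>

// Peak bandwidth in GB/s, assuming two transfers per quoted clock cycle:
// (bus width in bytes) * (clock in MHz) * 2 / 1000.
// (Hypothetical helper for illustration.)
double peak_bandwidth_gbs(double bus_width_bits, double clock_mhz) {
  assert(bus_width_bits > 0.0 && clock_mhz > 0.0);
  return (bus_width_bits / 8.0) * clock_mhz * 2.0 / 1000.0;
}

// Bandwidth left over when only `used_bits` of every bus-wide
// transaction carry data the kernel actually needs.
double effective_bandwidth_gbs(double peak_gbs, double used_bits,
                               double bus_width_bits) {
  assert(used_bits > 0.0 && used_bits <= bus_width_bits);
  return peak_gbs * used_bits / bus_width_bits;
}
```

With the quoted peaks, `effective_bandwidth_gbs(177.4, 32, 384)` is about 14.8 GB/s and `effective_bandwidth_gbs(25.6, 32, 128)` is 6.4 GB/s, a ratio of roughly 2.3 instead of roughly 7.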

.

, , . , 330M , ? GTX , , , , .

+4

The timer resolution on the Nvidia device may be a factor here. Query clGetDeviceInfo() with CL_DEVICE_PROFILING_TIMER_RESOLUTION to check it.
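To judge whether the timer resolution matters for a given measurement, the resolution can be compared with the measured interval. The sketch below is plain C++ with made-up helper names; it assumes `resolution_ns` was obtained from clGetDeviceInfo() with CL_DEVICE_PROFILING_TIMER_RESOLUTION (which reports a `size_t` in nanoseconds) and that each timestamp may be off by at most one timer tick.

```cpp
#include <cassert>
#include <cstdint>

// Elapsed time in milliseconds between two profiling timestamps,
// both given in nanoseconds as returned by clGetEventProfilingInfo.
double elapsed_ms(std::uint64_t start_ns, std::uint64_t end_ns) {
  assert(end_ns >= start_ns);
  return static_cast<double>(end_ns - start_ns) * 1.0e-6;
}

// Worst-case relative error of the interval, assuming (as a simplification)
// that each of the two timestamps can be wrong by one timer tick.
double relative_timer_error(std::uint64_t start_ns, std::uint64_t end_ns,
                            std::uint64_t resolution_ns) {
  assert(end_ns > start_ns);
  return 2.0 * static_cast<double>(resolution_ns) /
         static_cast<double>(end_ns - start_ns);
}
```

If the relative error is a large fraction of the interval, a single measurement says little and the kernel should run longer or be timed over many repetitions.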

+2

A difference of a few milliseconds can easily come from initialization overhead alone, especially when the two test systems have different hardware. I recommend starting with a larger problem size, one that takes at least a few seconds on both the laptop and the NVIDIA card.
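A timing loop along these lines keeps one-time setup cost out of the measurement. It is a minimal sketch, not the asker's code: `average_ms` is a made-up helper, and it assumes the callable blocks until the device has finished (for OpenCL, for example by calling clFinish on the queue), because clEnqueueNDRangeKernel only queues the work. It uses a wall-clock timer since clock() reports processor time rather than elapsed time.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Average wall-clock milliseconds per call of `work`.
// The first `warmup` calls are executed but not timed, so one-time
// initialization cost does not distort the average.
// (Hypothetical helper for illustration.)
double average_ms(const std::function<void()>& work, int warmup, int repeats) {
  assert(warmup >= 0 && repeats > 0);
  for (int i = 0; i < warmup; ++i) {
    work();
  }
  const auto begin = std::chrono::steady_clock::now();
  for (int i = 0; i < repeats; ++i) {
    work();
  }
  const auto end = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::milli> total = end - begin;
  return total.count() / repeats;
}
```

Running this for increasing matrix sizes until a single call takes seconds rather than milliseconds should show whether the gap between the two machines is fixed overhead or real kernel time.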

+1