The following Python code uses PyOpenCL to fill the array a_plus_b with the sum of the elements of the array b (this is not my actual goal, but it is the simplest code I could find that still shows the problem).
import pyopencl as cl
import numpy as np
import numpy.linalg as la
height = 50
width = 32
b = np.arange(width,dtype=np.int32)
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, height*4)
prg = cl.Program(ctx, """
__kernel void sum(__global const int *b, __global int *c)
{
    int x = get_global_id(1);
    int y;
    c[x] = 0;
    for(y=0;y<get_global_size(0);y++) {
        c[x] += b[y];
    }
}
""").build()
prg.sum(queue, (width,height), None, b_buf, dest_buf)
a_plus_b = np.empty(height,dtype=np.int32)
cl.enqueue_copy(queue, a_plus_b, dest_buf)
print(np.sum(b))
print(a_plus_b)
print(np.sum(a_plus_b-np.sum(b)))
This gives the output:
496
[496 496 496 496 496 496 496 496 496 496 496 496 496 496 496 496 496 496
496 496 496 496 496 496 496 496 496 496 496 496 496 496 496 496 496 496
496 496 496 496 496 496 496 496 496 496 496 496 496 496]
0
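For reference, the result I expect can be written in plain NumPy, without any OpenCL involved. This is only a minimal sketch of the intended output; the name `expected` is mine and does not appear in the code above.

```python
import numpy as np

height = 50
width = 32
b = np.arange(width, dtype=np.int32)

# Every one of the `height` output elements should equal the sum of b.
expected = np.full(height, b.sum(), dtype=np.int32)

print(expected[:5])
```

Comparing a_plus_b against an array like this is what the final print in my code does, via np.sum(a_plus_b-np.sum(b)).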
However, if I change width from 32 to 33, the elements of the array are no longer all identical:
528
[555 557 555 559 560 528 560 560 528 528 528 528 528 528 528 528 528 528
528 528 528 531 540 569 581 528 705 591 560 560 545 560 560 528 560 528
528 528 528 528 528 528 528 528 528 528 528 532 533 535]
752
In fact, every time the code is executed, it produces a different result.
528
[560 560 559 560 560 560 560 528 528 528 528 528 528 528 528 528 528 528
528 528 528 560 528 514 565 553 621 650 560 560 560 560 560 528 528 528
528 528 528 528 528 528 528 528 549 528 528 544 528 537]
724
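One observation that may or may not be relevant: the kernel is launched over a (width, height) grid, but it indexes c only by get_global_id(1), so several work items seem to target the same output element. Below is a minimal host-side sketch of that counting, assuming I am reading the indexing correctly. It does not run any OpenCL, and the `writers` array and the loop variable names are only for illustration.

```python
import numpy as np

height = 50
width = 33

# Hypothetical host-side model of the (width, height) launch grid:
# gid0 stands in for get_global_id(0), gid1 for get_global_id(1).
# The kernel writes c[get_global_id(1)], so count the work items per index.
writers = np.zeros(height, dtype=np.int64)
for gid0 in range(width):
    for gid1 in range(height):
        writers[gid1] += 1

print(writers.min(), writers.max())
```

By this count, each c[x] is written by `width` different work items, and the kernel contains no synchronization between them.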
What makes the difference? What not