CUDAFunctionLoad in Mathematica - Indexing Issue

I am trying to debug an indexing problem that I have on a CUDA machine.

Cuda Machine Info:

{1 -> {Name -> Tesla C2050, Clock Rate -> 1147000, Compute Capabilities -> 2., GPU Overlap -> 1,
  Maximum Block Dimensions -> {1024, 1024, 64}, Maximum Grid Dimensions -> {65535, 65535, 65535},
  Maximum Threads Per Block -> 1024, Maximum Shared Memory Per Block -> 49152,
  Total Constant Memory -> 65536, Warp Size -> 32, Maximum Pitch -> 2147483647,
  Maximum Registers Per Block -> 32768, Texture Alignment -> 512, Multiprocessor Count -> 14,
  Core Count -> 448, Execution Timeout -> 0, Integrated -> False, Can Map Host Memory -> True,
  Compute Mode -> Default, Texture1D Width -> 65536, Texture2D Width -> 65536,
  Texture2D Height -> 65535, Texture3D Width -> 2048, Texture3D Height -> 2048,
  Texture3D Depth -> 2048, Texture2D Array Width -> 16384, Texture2D Array Height -> 16384,
  Texture2D Array Slices -> 2048, Surface Alignment -> 512, Concurrent Kernels -> True,
  ECC Enabled -> True, Total Memory -> 2817982462}}

All this code does is set the values of a three-dimensional array equal to the index that CUDA uses:

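__global__ void cudaMatExp(
    float *matrix1, float *matrixStore, int lengthx, int lengthy, int lengthz) {

  // Flatten the 2D grid so that every block gets one unique number.
  long UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;

  // Global linear index of this thread (matrix1 is not used in this test).
  long index = UniqueBlockIndex * blockDim.z * blockDim.y * blockDim.x +
      threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x +
      threadIdx.x;

  // Skip threads that fall beyond the end of the array.
  if (index < lengthx * lengthy * lengthz) {
    matrixStore[index] = index;
  }
}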
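To check whether the index formula itself is sound, the kernel's arithmetic can be replayed on the host. The sketch below is plain Python rather than CUDA so that it runs without a GPU. The function name `kernel_indices` and the grid and block sizes are made up for illustration; the formula is copied from the kernel above.

```python
from itertools import product

def kernel_indices(grid_dim, block_dim):
    """Return the index computed by every thread of one launch.

    grid_dim is (gridDim.x, gridDim.y); block_dim is
    (blockDim.x, blockDim.y, blockDim.z). Hypothetical helper that
    only mirrors the arithmetic in cudaMatExp.
    """
    gx, gy = grid_dim
    dx, dy, dz = block_dim
    indices = []
    for block_y, block_x in product(range(gy), range(gx)):
        unique_block = block_y * gx + block_x
        for tz, ty, tx in product(range(dz), range(dy), range(dx)):
            indices.append(unique_block * dz * dy * dx
                           + tz * dy * dx + ty * dx + tx)
    return indices

# Illustrative launch: a 2 x 2 grid of 8 x 8 x 8 blocks (4 blocks of 512 threads).
idx = kernel_indices((2, 2), (8, 8, 8))
print(len(idx), min(idx), max(idx))  # 2048 0 2047
```

The indices form a gap-free run from 0 up to (number of blocks x threads per block) - 1, so the kernel can only reach index dim^3 - 1 if enough blocks are launched.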

For some reason, once my 3D array gets too large, the indexing stops before it reaches the end of the array.

I tried different block dimensions (blockDim.x by blockDim.y by blockDim.z):

8x8x8 only gives correct indexing up to an array size of 12x12x12

9x9x9 only gives correct indexing up to an array size of 14x14x14

10x10x10 only gives correct indexing up to an array size of 15x15x15

For arrays larger than these, the maximum index eventually starts to increase again for every block size, but it never reaches dim^3 - 1 (which is the maximum index that the CUDA threads should reach).

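Here are some plots that illustrate this behavior.

For example, this one plots the dimension of the 3D array (which is x by x by x) on the x-axis and the maximum index reached during CUDA execution on the y-axis. This plot is for block dimensions 10x10x10.

[Plot: maximum index reached vs. array dimension, block dimensions 10x10x10]

Here is the (Mathematica) code that generates this kind of plot; for this run, however, I used block dimensions 1024x1x1: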

CUDAExp = CUDAFunctionLoad[codeexp, "cudaMatExp",
  {{"Float", _,"Input"}, {"Float", _,"Output"},
    _Integer, _Integer, _Integer},
  {1024, 1, 1}]; (*These last three numbers are the block dimensions*)

max = 100; (* the maximum dimension of the 3D array *)
hold = Table[1, {i, 1, max}];
compare = Table[i^3, {i, 1, max}];
Do[
   dim = ii;
   AA  = CUDAMemoryLoad[ConstantArray[1.0, {dim, dim, dim}], Real, 
                                     "TargetPrecision" -> "Single"];
   BB  = CUDAMemoryLoad[ConstantArray[1.0, {dim, dim, dim}], Real, 
                                     "TargetPrecision" -> "Single"];

   hold[[ii]] = Max[Flatten[
                  CUDAMemoryGet[CUDAExp[AA, BB, dim, dim, dim][[1]]]]];

 , {ii, 1, max}]

ListLinePlot[{compare, Flatten[hold]}, PlotRange -> All]

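Here is another plot, this time of x^3 against hold. Note that the two curves diverge once x > 32:

[Plot: x^3 and the maximum index reached vs. array dimension, block dimensions 1024x1x1]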

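I vary the dimension of the 3D array, record how far the indexing gets, and compare that with dim^3 - 1. For example, for dim = 32 the CUDA maximum index is 32767 (which is 32^3 - 1), but for dim = 33 the CUDA output is 33791 when it should be 35936 (33^3 - 1). Note that 33791 - 32767 = 1024 = blockDim.x.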
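The figures reported so far are consistent with one simple launch model, which the following sketch checks. This is only a hypothesis that happens to fit the numbers, not documented Mathematica behavior: it assumes the grid is 2D and sized from the first two array dimensions, as ceil(dim / blockDim.x) by ceil(dim / blockDim.y) blocks, with the third dimension ignored. All function names are made up, and the sketch is Python so that it runs without a GPU.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return (a + b - 1) // b

def launched_threads(dim, block_dim):
    """Threads launched under the ASSUMED 2D-grid model (a hypothesis)."""
    bx, by, bz = block_dim
    blocks = ceil_div(dim, bx) * ceil_div(dim, by)
    return blocks * bx * by * bz

def max_index(dim, block_dim):
    """Largest index the kernel would write for a dim x dim x dim array."""
    return min(launched_threads(dim, block_dim), dim ** 3) - 1

def last_good_dim(block_dim, max_dim=100):
    """Largest dim such that every size up to it is fully covered."""
    dim = 0
    while dim < max_dim and max_index(dim + 1, block_dim) == (dim + 1) ** 3 - 1:
        dim += 1
    return dim

for block in [(8, 8, 8), (9, 9, 9), (10, 10, 10), (1024, 1, 1)]:
    print(block, last_good_dim(block))
print(max_index(33, (1024, 1, 1)))
```

Under this assumption the loop prints 12, 14, 15 and 32 for the four block shapes, and the last line prints 33791, which matches the observations above. Independently of the model, 33791 + 1 = 33 x 1024, so the dim = 33 run covered exactly 33 full blocks, while 33^3 = 35937 elements need 36 blocks of 1024 threads.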

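Question:

Is there a way to get correct indexing for arrays of this size in Mathematica?

I know that some people use __mul24(threadIdx.y, blockDim.x) in their index calculation, but it does not seem to help in my case.

Also, I have seen it suggested that code should be compiled with -arch=sm_11, because by default it is compiled for compute capability 1.0. I do not know whether that applies in Mathematica. I would assume that CUDAFunctionLoad[] knows to compile for compute capability 2.0. Does anyone know?

Any suggestions would be appreciated!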

Answer:

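It turns out that Mathematica does let you control the launch size, although the way to do it is not obvious. Here is the fix I ended up with.

First, when loading the function: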

CUDAExp = CUDAFunctionLoad[codeexp, "cudaMatExp",
  {{"Float", _, "Input"}, {"Float", _, "Output"},
    _Integer, _Integer, _Integer},
  {8, 8, 8}, "ShellOutputFunction" -> Print];

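Here {8, 8, 8} are the block dimensions.

Then, when you call CUDAExp[], you can pass the number of threads to launch as an extra argument:

(* AA and BB are 3D arrays of zeros with dimensions dim x dim x dim *)
dim = 64;
CUDAExp[AA, BB, dim, dim, dim, 4096];

Note that the function loaded with CUDAFunctionLoad[] expects only 5 arguments: the first is the input array (of dimensions dim x dim x dim), the second is the array where the result is stored, and the third, fourth and fifth are the dimensions.

When you pass a 6th argument, Mathematica interprets it as gridDim.x * blockDim.x. Since I need gridDim.x = 512 for every element of the array to be processed, I set this number to 512 * 8 = 4096.

I hope this helps anyone who runs into the same issue.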
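The arithmetic behind the extra argument can be sketched as follows. This assumes, as the answer describes, that the value is interpreted as gridDim.x * blockDim.x and that one thread handles one array element; the helper name `launch_size_argument` is made up, and the sketch is Python so that it runs without a GPU.

```python
def launch_size_argument(n_elements, block_dim):
    """Value to pass as the extra (sixth) argument.

    ASSUMES the value is interpreted as gridDim.x * blockDim.x,
    as described in the answer. Hypothetical helper name.
    """
    bx, by, bz = block_dim
    threads_per_block = bx * by * bz
    # Blocks needed so that blocks * threads_per_block >= n_elements.
    grid_x = (n_elements + threads_per_block - 1) // threads_per_block
    return grid_x * bx

dim = 64
print(launch_size_argument(dim ** 3, (8, 8, 8)))  # 4096
```

For dim = 64 and 8x8x8 blocks there are 64^3 = 262144 elements and 512 threads per block, so 512 blocks are needed and the argument is 512 x 8 = 4096.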
