CUDA is awesome and I use it heavily, but I am not using its full potential because I have a memory transfer problem, and I was wondering whether there is a better way to handle a variable amount of output memory. Basically, I send an array of 65535 elements to CUDA, and CUDA analyzes each data item in about 20,000 different ways; whenever there is a match according to my program logic, it saves a list of 30 ints as the result. Think of my logic as analyzing each different combination and then looking at the total: if the sum equals the number I'm looking for, it stores the result (a list of 30 ints for each analyzed element).
The problem: 65535 (blocks / elements in the data array) * 20,000 (total combinations tested per element) = 1,310,700,000. This means I need to create an array of that size to cover the case where every test gives a positive result (which is extremely unlikely, and creating int output[1310700000][30] seems crazy for memory). I have been forced to make it smaller and send fewer blocks to process, because I do not know how CUDA can efficiently write to a linked list or a dynamically sized list (with my current approach, each result is written to host memory at a position computed as block * number_of_different_way_tests).
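To put a number on the worst case, here is a rough sketch of the sizing arithmetic in plain C. The function names are made up for illustration, and I am assuming a 4-byte int, which is passed in as a parameter rather than hard-coded.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two 64-bit values, asserting that the product does not overflow. */
uint64_t checked_mul(uint64_t a, uint64_t b)
{
    assert(a == 0 || b <= UINT64_MAX / a);
    return a * b;
}

/* Number of result slots if every test of every element matched. */
uint64_t result_slots(uint64_t elements, uint64_t tests_per_element)
{
    return checked_mul(elements, tests_per_element);
}

/* Worst-case output size in bytes: one list of ints_per_result ints per slot.
   int_size is an assumption supplied by the caller (4 on typical platforms). */
uint64_t worst_case_bytes(uint64_t elements, uint64_t tests_per_element,
                          uint64_t ints_per_result, uint64_t int_size)
{
    uint64_t slots = result_slots(elements, tests_per_element);
    return checked_mul(checked_mul(slots, ints_per_result), int_size);
}
```

With my numbers, worst_case_bytes(65535, 20000, 30, 4) comes to 157,284,000,000 bytes, roughly 157 GB, which is why allocating for the worst case is not realistic.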
Is there a better way to do this? Can CUDA somehow write to free memory at a location that is not derived from the block ID? When I test this process on the CPU, less than 10% of the elements in the array have a positive match, so it is highly unlikely that I will use that much memory each time I send work to the kernel.
P.S. I looked over the above, and although it is exactly what I am doing, if it is confusing, here is another way to think about it (not quite what I am doing, but close enough to understand the problem): I send 20,000 arrays (each containing 65,535 elements) and add each element to its counterpart in the other arrays; if the total equals a target number (say, 200-210), I want to know the numbers that were added to get that result. If the numbers vary widely, not all of them will match, but with my approach I have to malloc that huge amount of memory. Can I capture the results while mallocing less memory? My current approach is to malloc as much as I have free, but that forces me to run fewer blocks, which is inefficient (I want to run as many blocks and threads at once as possible, because I like the way CUDA organizes and runs the blocks). Are there any CUDA or C tricks I can use for this, or am I stuck mallocing the maximum possible results (and buying more memory)?
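In case it helps, the simplified version could be sketched on the CPU roughly like this. It is only an illustration in plain C, not my real code: the function name, the row-major layout, and the idea of recording just the matching position (instead of the full list of numbers) are assumptions made to keep it short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference for the simplified problem.
   data is row-major: data[a * num_elements + i] is element i of array a.
   For each position i, the elements at position i of all arrays are summed.
   If the sum lies in [lo, hi], position i is appended to out_index
   (as long as there is room). Returns the total number of matches found,
   which may exceed capacity. */
size_t collect_matches(const int *data, size_t num_arrays, size_t num_elements,
                       long lo, long hi, size_t *out_index, size_t capacity)
{
    assert(data != NULL && out_index != NULL);
    size_t found = 0;
    for (size_t i = 0; i < num_elements; ++i) {
        long sum = 0;
        for (size_t a = 0; a < num_arrays; ++a)
            sum += data[a * num_elements + i];
        if (sum >= lo && sum <= hi) {
            if (found < capacity)
                out_index[found] = i;   /* results are packed back to back */
            ++found;
        }
    }
    return found;
}
```

On the CPU the output only grows when there is a match, because results are appended one after another. What I do not know is how to get the same packed output when many CUDA blocks are producing matches at the same time.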