I know that CUDA kernels can be “overlapped” by putting them in separate streams, but I wonder whether memory can be transferred while kernels are executing. CUDA kernel launches are asynchronous, after all.
You can run kernels, transfers from host to device, and transfers from device to host concurrently.
http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
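As a rough sketch of what this can look like (not taken from the linked slides): the data is split into chunks and each chunk gets its own stream, so the copy for one chunk can overlap with the kernel for another. The kernel `scale`, the chunk count and the sizes are made up for illustration. Note the host buffer is allocated with `cudaMallocHost`, since `cudaMemcpyAsync` generally only overlaps with kernels when the host memory is page-locked.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the error and bail out of main() if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel for illustration: doubles every element of a chunk.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int kChunks = 4;            // illustrative values, not tuned
    const int kChunkSize = 1 << 20;   // elements per chunk
    const size_t chunkBytes = kChunkSize * sizeof(float);

    float *h_data = nullptr;
    float *d_data = nullptr;
    // Page-locked (pinned) host memory, so the async copies can overlap.
    CHECK(cudaMallocHost((void **)&h_data, kChunks * chunkBytes));
    CHECK(cudaMalloc((void **)&d_data, kChunks * chunkBytes));
    for (int i = 0; i < kChunks * kChunkSize; ++i) h_data[i] = 1.0f;

    cudaStream_t streams[kChunks];
    for (int s = 0; s < kChunks; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Work inside one stream runs in order; work in different streams
    // may overlap if the device supports it.
    for (int s = 0; s < kChunks; ++s) {
        float *h = h_data + s * kChunkSize;
        float *d = d_data + s * kChunkSize;
        CHECK(cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, streams[s]));
        scale<<<(kChunkSize + 255) / 256, 256, 0, streams[s]>>>(d, kChunkSize);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaDeviceSynchronize());   // wait for all streams to finish

    // Every element started at 1.0f, so each should now be 2.0f.
    int bad = 0;
    for (int i = 0; i < kChunks * kChunkSize; ++i) bad += (h_data[i] != 2.0f);
    std::printf("mismatches: %d\n", bad);

    for (int s = 0; s < kChunks; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(d_data));
    CHECK(cudaFreeHost(h_data));
    return bad == 0 ? 0 : 1;
}
```

Whether the copies and kernels actually overlap depends on the hardware; a profiler timeline is the easiest way to confirm it.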
Just to be clear, the above is valid only if your device supports it. You can check by running the deviceQuery sample and looking at the "Concurrent copy and kernel execution" attribute.
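If you would rather check from your own code, something like the following should work. It is a minimal sketch that reads the `asyncEngineCount` and `concurrentKernels` fields of `cudaDeviceProp` for every device; the output wording is my own and is not what deviceQuery prints.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "no CUDA device found\n");
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            std::fprintf(stderr, "cannot query device %d\n", dev);
            return 1;
        }

        // asyncEngineCount: 0 means copies cannot overlap with kernels,
        // 1 means a copy in one direction can overlap with a kernel,
        // 2 or more means copies in both directions can overlap with a kernel.
        std::printf("device %d (%s)\n", dev, prop.name);
        std::printf("  copy engines (asyncEngineCount): %d\n", prop.asyncEngineCount);
        std::printf("  concurrent kernels supported:    %s\n",
                    prop.concurrentKernels ? "yes" : "no");
    }
    return 0;
}
```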