With cudaMemcpy, the CUDA driver detects that you are copying from a host pointer to a host pointer, and the copy is performed on the CPU. You can of course call memcpy on the CPU yourself if you prefer.
If you use cudaMemcpy, an extra stream synchronization may be performed before the copy is executed (you may be able to see this in the profiler, but I am guessing here, so test and see).
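As a minimal sketch of the two options above, the following host-side code performs the same host-to-host copy once through the CUDA runtime and once with plain memcpy. The helper names `copy_host_to_host_cuda` and `copy_host_to_host_cpu` are hypothetical and exist only for illustration; the runtime version can only succeed on a machine with a working CUDA driver.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>
#include <cstring>

// Host-to-host copy routed through the CUDA runtime. Returns true on success.
// (Hypothetical helper name, for illustration only.)
bool copy_host_to_host_cuda(void* dst, const void* src, std::size_t n) {
    return cudaMemcpy(dst, src, n, cudaMemcpyHostToHost) == cudaSuccess;
}

// The same copy with plain memcpy; the CUDA runtime is not involved.
// (Hypothetical helper name, for illustration only.)
void copy_host_to_host_cpu(void* dst, const void* src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, n);
}
```

Both functions should leave identical bytes in the destination. Timing them in a profiler is one way to check whether the runtime path adds a synchronization on your system.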
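On a UVA system you can simply use cudaMemcpyDefault. Without UVA (which requires sm_20 or later and a 64-bit OS), you have to specify the correct copy kind yourself (for example, cudaMemcpyDeviceToDevice). If you cudaHostRegister() the memory you are interested in, cudaMemcpyDeviceToDevice ends up doing one of the following, depending on where the memory is located: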
- Host ↔ Host: CPU (memcpy)
- Host ↔ Device: DMA (device copy engine)
- Device ↔ Device: memcpy CUDA kernel (runs on the SMs)
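For the UVA case, a hedged sketch of choosing the copy kind at run time might look like the following. It queries the `unifiedAddressing` field of `cudaDeviceProp` and falls back to an explicit kind supplied by the caller when UVA is not reported. The helper names `device_has_uva`, `choose_copy_kind`, and `copy_any` are hypothetical and not part of the CUDA API.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>

// Returns true if the given device reports unified virtual addressing.
// Any query failure (for example, no device present) is treated as "no UVA".
bool device_has_uva(int device) {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
    return prop.unifiedAddressing != 0;
}

// With UVA the runtime can infer the direction from the pointer values, so
// cudaMemcpyDefault is enough; otherwise use the caller's explicit kind.
cudaMemcpyKind choose_copy_kind(bool has_uva, cudaMemcpyKind explicit_kind) {
    return has_uva ? cudaMemcpyDefault : explicit_kind;
}

// Copies n bytes, using cudaMemcpyDefault when the device reports UVA.
cudaError_t copy_any(void* dst, const void* src, std::size_t n,
                     cudaMemcpyKind explicit_kind, int device = 0) {
    assert(dst != nullptr && src != nullptr);
    cudaMemcpyKind kind = choose_copy_kind(device_has_uva(device), explicit_kind);
    return cudaMemcpy(dst, src, n, kind);
}
```

A call such as `copy_any(d_dst, h_src, bytes, cudaMemcpyHostToDevice)` then works on both kinds of system, as long as the explicit kind matches where the two buffers actually live.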