Better or the same: CPU memcpy() versus device cudaMemcpy() on pinned, mapped memory in CUDA?

I have:

  • Host memory that was successfully pinned and mapped with cudaHostAlloc(..., cudaHostAllocMapped) or cudaHostRegister(..., cudaHostRegisterMapped);
  • Device pointers were obtained using cudaHostGetDevicePointer(...).

I run cudaMemcpy(..., cudaMemcpyDeviceToDevice) on src and dest device pointers that point to two different regions of pinned + mapped memory obtained by the method above. Everything works fine.

Question: should I keep doing this, or just use the traditional CPU-style memcpy(), since everything is in system memory anyway? Or are the two the same (i.e. does cudaMemcpy map to a straight memcpy when both src and dest are pinned)?

(I am still using cudaMemcpy because everything was previously in device global memory, but I have since switched to pinned memory because of gmem size limitations.)

+5
2 answers

With cudaMemcpy, the CUDA driver detects that you are copying from a host pointer to a host pointer, and the copy is performed on the CPU. You can, of course, use memcpy on the CPU yourself if you prefer.

If you use cudaMemcpy, an extra stream synchronization may be performed before the copy is executed (you may be able to see it in the profiler, but I am guessing here; test and see).

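On a system with UVA you can simply use cudaMemcpyDefault. Without UVA (which requires sm_20+ and a 64-bit OS), you have to specify the correct copy kind yourself (for example, cudaMemcpyDeviceToDevice). If you cudaHostRegister() everything you are interested in, then cudaMemcpyDeviceToDevice ends up doing the following, depending on where the memory is located:

  • Host ↔ Host: performed by the CPU (memcpy)
  • Host ↔ Device: DMA (device copy engine)
  • Device ↔ Device: memcpy CUDA kernel (runs on the SMs, launched by the driver)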
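To make the two options from the question concrete, here is a minimal sketch that copies between two pinned, mapped buffers once with plain memcpy() through the host pointers and once with cudaMemcpy(..., cudaMemcpyDeviceToDevice) through the device aliases. The helper name compare_copy_paths and the buffer size are hypothetical, chosen only for illustration, and the sketch assumes a device that supports mapped host memory. It only checks that both paths produce the same bytes; it says nothing about which one is faster on a given system.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original posts).
// Returns 0 if both copy paths reproduce the source bytes, 1 otherwise.
int compare_copy_paths(size_t n) {
    unsigned char *hSrc = nullptr, *hDst = nullptr;  // host views of the buffers
    unsigned char *dSrc = nullptr, *dDst = nullptr;  // device aliases of the same memory
    int rc = 1;

    if (cudaHostAlloc((void**)&hSrc, n, cudaHostAllocMapped) == cudaSuccess &&
        cudaHostAlloc((void**)&hDst, n, cudaHostAllocMapped) == cudaSuccess &&
        cudaHostGetDevicePointer((void**)&dSrc, hSrc, 0) == cudaSuccess &&
        cudaHostGetDevicePointer((void**)&dDst, hDst, 0) == cudaSuccess) {
        std::memset(hSrc, 0xAB, n);

        // Option A: ordinary CPU copy through the host pointers.
        std::memset(hDst, 0, n);
        std::memcpy(hDst, hSrc, n);
        bool okCpu = std::memcmp(hDst, hSrc, n) == 0;

        // Option B: cudaMemcpy through the device aliases, as in the question.
        // Synchronize before reading the result back on the host.
        std::memset(hDst, 0, n);
        bool okCuda =
            cudaMemcpy(dDst, dSrc, n, cudaMemcpyDeviceToDevice) == cudaSuccess &&
            cudaDeviceSynchronize() == cudaSuccess &&
            std::memcmp(hDst, hSrc, n) == 0;

        rc = (okCpu && okCuda) ? 0 : 1;
    }

    if (hSrc) cudaFreeHost(hSrc);
    if (hDst) cudaFreeHost(hDst);
    return rc;
}

int main() {
    int rc = compare_copy_paths(1 << 20);  // 1 MiB, arbitrary size
    std::printf("%s\n", rc == 0 ? "both paths match" : "mismatch or CUDA error");
    return rc;
}
```

To see whether cudaMemcpy adds a synchronization or any other overhead on your setup, time the two options separately or look at them in the profiler.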
+3

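If you are working on a platform with UVA (unified virtual addressing), I would suggest using cudaMemcpy with cudaMemcpyDefault. That way, the choice of the fastest path becomes an internal API implementation detail that you do not have to worry about.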
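A minimal sketch of that approach, assuming a UVA-capable device: it queries cudaDevAttrUnifiedAddressing first and then passes the host pointers straight to cudaMemcpy with cudaMemcpyDefault, leaving the direction to the runtime. The function name copy_with_default_kind and its return codes are hypothetical.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original posts).
// Returns 0 on success, 1 on a CUDA error or data mismatch, 2 if UVA is unavailable.
int copy_with_default_kind(size_t n) {
    int dev = 0, uva = 0;
    if (cudaGetDevice(&dev) != cudaSuccess) return 1;
    if (cudaDeviceGetAttribute(&uva, cudaDevAttrUnifiedAddressing, dev) != cudaSuccess) return 1;
    if (!uva) return 2;  // without UVA the copy kind must be given explicitly

    unsigned char *src = nullptr, *dst = nullptr;
    int rc = 1;
    if (cudaHostAlloc((void**)&src, n, cudaHostAllocMapped) == cudaSuccess &&
        cudaHostAlloc((void**)&dst, n, cudaHostAllocMapped) == cudaSuccess) {
        std::memset(src, 0x5A, n);
        std::memset(dst, 0, n);

        // With UVA the runtime infers where each pointer lives.
        if (cudaMemcpy(dst, src, n, cudaMemcpyDefault) == cudaSuccess &&
            cudaDeviceSynchronize() == cudaSuccess &&
            std::memcmp(dst, src, n) == 0) {
            rc = 0;
        }
    }

    if (src) cudaFreeHost(src);
    if (dst) cudaFreeHost(dst);
    return rc;
}

int main() {
    int rc = copy_with_default_kind(1 << 20);  // 1 MiB, arbitrary size
    std::printf("copy_with_default_kind returned %d\n", rc);
    return rc == 1 ? 1 : 0;
}
```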

+2
