I am trying to overlap kernel execution with cudaMemcpyAsync, but it does not work. I follow all the recommendations in the programming guide: pinned memory, different streams, etc. I see that kernel executions do overlap with each other, but not with memory transfers. I know that my card has only one copy engine and one execution engine, but execution and transfers should overlap, right?
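For reference, the engine counts mentioned above can be read back from the runtime. This is only a minimal sketch: it assumes device 0 is the card in question, and the helper name `queryConcurrency` is made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: reads the number of async copy engines and the
// concurrent-kernel capability of device `dev`. Returns 0 on success.
int queryConcurrency(int dev, int *copyEngines, int *concurrentKernels) {
    cudaError_t err =
        cudaDeviceGetAttribute(copyEngines, cudaDevAttrAsyncEngineCount, dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetAttribute: %s\n", cudaGetErrorString(err));
        return 1;
    }
    err = cudaDeviceGetAttribute(concurrentKernels,
                                 cudaDevAttrConcurrentKernels, dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetAttribute: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main() {
    int copyEngines = 0, concurrentKernels = 0;
    // Assumption: device 0 is the GPU used for compute.
    if (queryConcurrency(0, &copyEngines, &concurrentKernels) != 0) return 1;
    printf("async copy engines: %d\n", copyEngines);
    printf("concurrent kernels supported: %d\n", concurrentKernels);
    return 0;
}
```

A copy engine count of 1 means a copy in one direction can run alongside a kernel, but HtoD and DtoH copies cannot run alongside each other.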
It seems that the “copy engine” and the “execution engine” always enforce the order in which I call the functions. The work consists of 4 streams, each performing [HtoD x2, Kernel, DtoH]. If I issue the series HtoD x2, Kernel, DtoH for each stream in turn (depth-first), I see in the profiler that the first HtoD operation of stream 2 does not start until the first DtoH operation has completed. If I instead issue the first HtoD in each stream, then the second HtoD, then the kernel, and then the DtoH (breadth-first), I do not see any overlap either, and the issue order is again enforced by the GPU.
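To make the two issue orders concrete, here is a minimal sketch of what is described above. It is not my actual application: the kernel `add`, the buffer size, and the helper `run` are placeholders chosen for illustration, and only the order of the calls matters.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// On a CUDA error, print it and bail out of run() (resources are leaked;
// acceptable for a sketch).
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    return -1; } } while (0)

const int kStreams = 4;   // 4 streams, as in the description
const int kN = 1 << 20;   // elements per stream (arbitrary)

// Placeholder kernel: c = a + b.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Issues [HtoD x2, Kernel, DtoH] on each stream, depth-first or breadth-first.
// Returns the number of wrong results, or -1 on a CUDA error.
int run(bool breadthFirst) {
    cudaStream_t s[kStreams];
    float *ha[kStreams], *hb[kStreams], *hc[kStreams];
    float *da[kStreams], *db[kStreams], *dc[kStreams];
    const size_t bytes = kN * sizeof(float);
    for (int i = 0; i < kStreams; ++i) {
        CHECK(cudaStreamCreate(&s[i]));
        CHECK(cudaMallocHost((void **)&ha[i], bytes));  // pinned host memory
        CHECK(cudaMallocHost((void **)&hb[i], bytes));
        CHECK(cudaMallocHost((void **)&hc[i], bytes));
        CHECK(cudaMalloc((void **)&da[i], bytes));
        CHECK(cudaMalloc((void **)&db[i], bytes));
        CHECK(cudaMalloc((void **)&dc[i], bytes));
        for (int j = 0; j < kN; ++j) { ha[i][j] = 1.0f; hb[i][j] = (float)i; }
    }
    const int block = 256, grid = (kN + block - 1) / block;
    if (!breadthFirst) {
        // Depth-first: the whole series for stream 0, then stream 1, ...
        for (int i = 0; i < kStreams; ++i) {
            CHECK(cudaMemcpyAsync(da[i], ha[i], bytes, cudaMemcpyHostToDevice, s[i]));
            CHECK(cudaMemcpyAsync(db[i], hb[i], bytes, cudaMemcpyHostToDevice, s[i]));
            add<<<grid, block, 0, s[i]>>>(da[i], db[i], dc[i], kN);
            CHECK(cudaMemcpyAsync(hc[i], dc[i], bytes, cudaMemcpyDeviceToHost, s[i]));
        }
    } else {
        // Breadth-first: the same operation for every stream before the next one.
        for (int i = 0; i < kStreams; ++i)
            CHECK(cudaMemcpyAsync(da[i], ha[i], bytes, cudaMemcpyHostToDevice, s[i]));
        for (int i = 0; i < kStreams; ++i)
            CHECK(cudaMemcpyAsync(db[i], hb[i], bytes, cudaMemcpyHostToDevice, s[i]));
        for (int i = 0; i < kStreams; ++i)
            add<<<grid, block, 0, s[i]>>>(da[i], db[i], dc[i], kN);
        for (int i = 0; i < kStreams; ++i)
            CHECK(cudaMemcpyAsync(hc[i], dc[i], bytes, cudaMemcpyDeviceToHost, s[i]));
    }
    CHECK(cudaGetLastError());       // catch kernel launch errors
    CHECK(cudaDeviceSynchronize());  // wait for all streams to finish
    int bad = 0;
    for (int i = 0; i < kStreams; ++i)
        for (int j = 0; j < kN; ++j)
            if (hc[i][j] != 1.0f + (float)i) ++bad;
    for (int i = 0; i < kStreams; ++i) {
        cudaFree(da[i]); cudaFree(db[i]); cudaFree(dc[i]);
        cudaFreeHost(ha[i]); cudaFreeHost(hb[i]); cudaFreeHost(hc[i]);
        cudaStreamDestroy(s[i]);
    }
    return bad;
}

int main() {
    printf("depth-first mismatches: %d\n", run(false));
    printf("breadth-first mismatches: %d\n", run(true));
    return 0;
}
```

Both orders compute the same results; they differ only in how the operations are queued, which is what shows up in the profiler timeline.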
I also tried the simpleStreams example from the CUDA SDK and see the same behavior.
I am attaching some screenshots showing the problem in both the Visual Profiler and Nsight for VS2008.
P.S. I have not set the CUDA_LAUNCH_BLOCKING environment variable.
[Screenshot: simpleStreams]

[Screenshot: MyApp, Nsight]

x4 (2 HtoD, 5 ..., 1 DtoH) → nvprof --concurrent-kernels-off, ... env CUDA_LAUNCH_BLOCKING=1, (...) 7.5%!
System configuration:
- Windows 7
- NVIDIA 6800 VGA PCI-E
- GTX480 PCI-E
- NVIDIA driver: 306.94
- Visual Studio 2008
- CUDA v5.0
- Visual Profiler 5.0
- Nsight 3.0