GPU Offloading: Overlap communications and computations
Implement asynchronous memory copies and kernel launches.
This is already possible in CUDA, OpenMP, and OpenACC.
Proof of concept with CUDA by @sebkelle1: https://github.com/unibas-dmi-hpc/SPH-EXA_mini-app/blob/hpxDist/src/include/sph/cuda/cudaDensity.cu
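
For discussion, here is a minimal sketch of the general pattern in CUDA. It is not taken from the linked file. The data is split into chunks, and each chunk gets its own stream, so the copy of one chunk can overlap with the kernel of another. `scaleKernel` and `runOverlapped` are hypothetical names, and the kernel is only a placeholder for a real SPH kernel. The sketch assumes pinned host memory, which asynchronous copies need in order to actually overlap with kernel execution.

```cuda
#include <algorithm>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Sketch-level error handling: bail out with -1 on any CUDA error
// (resources are not released on the error path).
#define CUDA_CHECK(call) do { if ((call) != cudaSuccess) return -1.0f; } while (0)

// Hypothetical placeholder for a real SPH kernel: out[i] = 2 * in[i].
__global__ void scaleKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Processes n elements in numStreams chunks, one stream per chunk.
// Returns the maximum absolute error against a CPU reference, or -1 on error.
float runOverlapped(int n, int numStreams)
{
    const size_t bytes = size_t(n) * sizeof(float);
    float *hIn = nullptr, *hOut = nullptr, *dIn = nullptr, *dOut = nullptr;

    // Pinned (page-locked) host buffers, so cudaMemcpyAsync can be truly asynchronous.
    CUDA_CHECK(cudaMallocHost((void**)&hIn, bytes));
    CUDA_CHECK(cudaMallocHost((void**)&hOut, bytes));
    CUDA_CHECK(cudaMalloc((void**)&dIn, bytes));
    CUDA_CHECK(cudaMalloc((void**)&dOut, bytes));

    for (int i = 0; i < n; ++i) hIn[i] = float(i);

    std::vector<cudaStream_t> streams(numStreams);
    for (auto& s : streams) CUDA_CHECK(cudaStreamCreate(&s));

    const int chunk   = (n + numStreams - 1) / numStreams;
    const int threads = 256;

    for (int s = 0; s < numStreams; ++s)
    {
        const int offset = s * chunk;
        const int count  = std::min(chunk, n - offset);
        if (count <= 0) break;

        // Copy in, compute, copy out: ordered within one stream,
        // but free to overlap with the work queued on other streams.
        CUDA_CHECK(cudaMemcpyAsync(dIn + offset, hIn + offset, count * sizeof(float),
                                   cudaMemcpyHostToDevice, streams[s]));
        scaleKernel<<<(count + threads - 1) / threads, threads, 0, streams[s]>>>(
            dIn + offset, dOut + offset, count);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(hOut + offset, dOut + offset, count * sizeof(float),
                                   cudaMemcpyDeviceToHost, streams[s]));
    }

    // The host is free to do other work here (for example inter-node
    // communication) before waiting for the GPU.
    CUDA_CHECK(cudaDeviceSynchronize());

    float maxErr = 0.0f;
    for (int i = 0; i < n; ++i)
        maxErr = std::max(maxErr, std::fabs(hOut[i] - 2.0f * hIn[i]));

    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return maxErr;
}
```

Calling `runOverlapped(1 << 20, 4)` from a `main` function would run the pipeline on four streams. How much overlap is actually achieved depends on the device's copy engines and on the chunk size, so this would need profiling on the target hardware.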
We should address issue #36 first.