GPU Offloading: Overlap communication and computation

Implement asynchronous memory copies and kernel launches so that host-device transfers overlap with computation.

This is already possible in CUDA (streams), OpenMP (`nowait` on `target` constructs), and OpenACC (the `async` clause).
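For reference, a minimal CUDA sketch of the idea, not taken from the SPH-EXA code base: the data is split into chunks, and each chunk's host-to-device copy, kernel launch, and device-to-host copy are queued on its own stream so that transfers for one chunk can overlap with computation on another. `scaleKernel` and `scaleOverlapped` are hypothetical names, and the trivial scaling kernel is only a stand-in for a real SPH kernel. It assumes the host buffer is pinned (allocated with `cudaMallocHost`), since copies from pageable memory are not truly asynchronous.

```cuda
#include <cuda_runtime.h>

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an SPH kernel: scales each element in place.
__global__ void scaleKernel(float* data, std::size_t n, float factor)
{
    std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Pipelines copy-in, kernel, and copy-out per chunk, one chunk per stream.
// Assumption: hostData points to pinned memory (cudaMallocHost).
cudaError_t scaleOverlapped(float* hostData, std::size_t n, float factor, int numStreams = 4)
{
    if (numStreams < 1) return cudaErrorInvalidValue;
    if (n == 0) return cudaSuccess;

    float* devData = nullptr;
    cudaError_t err = cudaMalloc(&devData, n * sizeof(float));
    if (err != cudaSuccess) return err;

    std::vector<cudaStream_t> streams(numStreams);
    for (auto& s : streams) cudaStreamCreate(&s);

    const std::size_t chunk   = (n + numStreams - 1) / numStreams;
    const unsigned    threads = 256;

    for (int s = 0; s < numStreams; ++s)
    {
        const std::size_t offset = static_cast<std::size_t>(s) * chunk;
        if (offset >= n) break;
        const std::size_t count = std::min(chunk, n - offset);
        const std::size_t bytes = count * sizeof(float);
        const unsigned    blocks = static_cast<unsigned>((count + threads - 1) / threads);

        // All three operations are ordered within streams[s],
        // but may overlap with work queued on the other streams.
        cudaMemcpyAsync(devData + offset, hostData + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scaleKernel<<<blocks, threads, 0, streams[s]>>>(devData + offset, count, factor);
        cudaMemcpyAsync(hostData + offset, devData + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Launch-configuration errors show up here; execution errors at the sync below.
    const cudaError_t launchErr = cudaGetLastError();
    err = cudaDeviceSynchronize(); // wait for every stream to finish

    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(devData);
    return (launchErr != cudaSuccess) ? launchErr : err;
}
```

The host is free to do other work between queuing the chunks and the final synchronization. Whether the copies and kernels actually overlap depends on the device's copy engines and on the chunk size, so this should be checked with a profiler rather than assumed.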

Proof of concept with CUDA by @sebkelle1: https://github.com/unibas-dmi-hpc/SPH-EXA_mini-app/blob/hpxDist/src/include/sph/cuda/cudaDensity.cu

We should address issue #36 first.