Reduce code redundancy between CPU and CUDA implementations
Currently kernels are implemented twice, meaning that if we modify, e.g., momentumAndEnergyIAD.hpp, then we also need to modify cuda/cudaMomentumAndEnergyIAD.cu.
However, the computation itself is the same for every particle.
For every computeXXX function in sph-exa, we should have a:
namespace kernel {
inline void computeXXX(int pi, int *clist, ...)
}
function that takes the particle index as a parameter and only does the computation for that one particle. This function should accept only simple variables and raw pointers, all passed by value, and no references.
In short, the same function should be usable from OpenMP, OpenACC, and CUDA.
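As a minimal sketch of what such a per-particle function could look like: the macro SPH_HOST_DEVICE, the helper computeXXXHost, and the parameters after clist (neighbors, neighborsCount, ngmax, m, out) are assumptions made for illustration, not existing sph-exa names, and the "physics" is a toy sum of neighbor masses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical macro: adds CUDA qualifiers when compiled by nvcc, nothing otherwise.
#if defined(__CUDACC__)
#define SPH_HOST_DEVICE __host__ __device__
#else
#define SPH_HOST_DEVICE
#endif

namespace kernel
{
// Work for ONE particle. Only plain values and raw pointers, passed by value,
// so the same body can be called from an OpenMP loop, an OpenACC loop, or a CUDA kernel.
// Toy computation (assumption): out[i] = sum of the masses of the neighbors of particle pi.
SPH_HOST_DEVICE inline void computeXXX(int pi, const int* clist, const int* neighbors,
                                       const int* neighborsCount, int ngmax,
                                       const double* m, double* out)
{
    const int i = clist[pi];
    double sum = 0.0;
    for (int pj = 0; pj < neighborsCount[pi]; ++pj)
    {
        sum += m[neighbors[pi * ngmax + pj]];
    }
    out[i] = sum;
}
} // namespace kernel

// Hypothetical host-side caller: a plain serial loop over all particles in clist.
inline void computeXXXHost(const std::vector<int>& clist, const std::vector<int>& neighbors,
                           const std::vector<int>& neighborsCount, int ngmax,
                           const std::vector<double>& m, std::vector<double>& out)
{
    assert(neighborsCount.size() == clist.size());
    assert(neighbors.size() >= clist.size() * static_cast<std::size_t>(ngmax));
    for (int pi = 0; pi < static_cast<int>(clist.size()); ++pi)
    {
        kernel::computeXXX(pi, clist.data(), neighbors.data(), neighborsCount.data(), ngmax,
                           m.data(), out.data());
    }
}
```

Under nvcc, a __global__ wrapper could call the same kernel::computeXXX with a particle index derived from blockIdx and threadIdx, leaving the per-particle body untouched.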
The workflow is something like this:
computeDensity(taskList) -> calls computeDensity(task) -> calls inline computeDensity(particleArray) -> calls inline kernel::computeDensity(int pi, int *clist, ...)
computeDensity(task) will handle data movement for OpenMP / OpenACC offloading / CUDA.
computeDensity(particleArray) will handle omp / acc directives / CUDA kernel launch.
kernel::computeDensity(int pi, int *clist) is identical for all models. Data movement and CUDA kernel launch are handled separately in computeDensity(task) and computeDensity(particleArray).
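The call chain above could be sketched on the CPU side roughly as follows. The Task struct, its fields, the argument lists, and the toy "density" (own mass plus neighbor masses) are all assumptions for illustration; only the layering mirrors the proposal.

```cpp
#include <cassert>
#include <vector>

namespace kernel
{
// Innermost layer: one particle, raw pointers only, identical for all programming models.
inline void computeDensity(int pi, const int* clist, const int* neighbors,
                           const int* neighborsCount, int ngmax, const double* m, double* ro)
{
    const int i = clist[pi];
    double sum = m[i]; // toy density (assumption): own mass plus neighbor masses
    for (int pj = 0; pj < neighborsCount[pi]; ++pj)
    {
        sum += m[neighbors[pi * ngmax + pj]];
    }
    ro[i] = sum;
}
} // namespace kernel

// Hypothetical task description: a list of particle indices and their neighbor lists.
struct Task
{
    std::vector<int> clist;
    std::vector<int> neighbors;
    std::vector<int> neighborsCount;
    int ngmax;
};

// computeDensity(particleArray): owns the loop, i.e. the omp / acc directive
// (or, in the .cu variant, the CUDA kernel launch).
inline void computeDensity(int n, const int* clist, const int* neighbors,
                           const int* neighborsCount, int ngmax, const double* m, double* ro)
{
#pragma omp parallel for
    for (int pi = 0; pi < n; ++pi)
    {
        kernel::computeDensity(pi, clist, neighbors, neighborsCount, ngmax, m, ro);
    }
}

// computeDensity(task): the place for data movement; on the CPU it only unpacks raw pointers.
inline void computeDensity(const Task& t, const std::vector<double>& m, std::vector<double>& ro)
{
    assert(t.neighborsCount.size() == t.clist.size());
    computeDensity(static_cast<int>(t.clist.size()), t.clist.data(), t.neighbors.data(),
                   t.neighborsCount.data(), t.ngmax, m.data(), ro.data());
}

// computeDensity(taskList): entry point, iterates over tasks.
inline void computeDensity(const std::vector<Task>& taskList, const std::vector<double>& m,
                           std::vector<double>& ro)
{
    for (const Task& t : taskList)
    {
        computeDensity(t, m, ro);
    }
}
```

Without OpenMP enabled the pragma is ignored and the loop runs serially, so the same source works in both builds. Only the two middle layers would need a CUDA counterpart.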
The easiest way to do this is probably by starting from the existing CUDA code, which is the most constrained.
The challenge is to compile the CUDA parts independently with nvcc. I am thinking of using a simple #include to import the kernel::computeXXX function. Code structure should look like:
include/sph/
    density.hpp: contains computeDensity(taskList) as well as CPU implementations of computeDensity(task) and computeDensity(particleArray)
    cuda/
        density.cu: contains CUDA implementations of computeDensity(task) and computeDensity(particleArray)
    kernel/
        density.hpp: contains kernel::computeDensity(int pi, int *clist, ...)
kernel/density.hpp is included in both sph/density.hpp and sph/cuda/density.cu.
Of course, we want the same pattern for all computeXXX functions, not just density.