Multi-GPU Support#

Kokkos has support for launching kernels on multiple GPUs from a single host process, e.g., a single MPI rank. This feature currently exists for the CUDA, HIP, and SYCL backends.

Using this feature requires knowledge of backend-specific API calls for creating non-default execution space instances. Once an execution space instance has been created, it can be used to create execution policies, allocate views, etc., all on the device chosen by the user.

Constructing Execution Spaces#

CUDA#

For the CUDA backend, the user creates a cudaStream_t object and passes it to the Kokkos::Cuda constructor.

Note

The lifetime of all Kokkos objects related to the stream (e.g., the execution space) needs to end before the stream itself is invalidated via cudaStreamDestroy. This can be done by adding a scope around such objects, as is done below.

// Query number of devices available
int n_devices;
cudaGetDeviceCount(&n_devices);

// Choose some 0 <= N < n_devices
int N = ...;

// Create stream on device N
cudaStream_t stream;
cudaSetDevice(N);
cudaStreamCreate(&stream);

// Scope execution space to ensure stream
// is destroyed *after* execution space
{
  // Create execution space
  Kokkos::Cuda exec_space(stream);

  // Use execution space
  /* ... */
}

// Destroy the stream after use
cudaSetDevice(N);
cudaStreamDestroy(stream);
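
The same ingredients extend to several devices at once. Below is a minimal, hypothetical sketch (not taken verbatim from Kokkos) that creates one stream and one Kokkos::Cuda instance per visible device and submits independent work to each; it assumes Kokkos has already been initialized and that <vector> is included.

// Hypothetical sketch: one stream and one execution space per device
int n_devices;
cudaGetDeviceCount(&n_devices);

std::vector<cudaStream_t> streams(n_devices);

// Scope the execution space instances so they are destroyed
// *before* the streams they wrap
{
  std::vector<Kokkos::Cuda> instances;
  for (int d = 0; d < n_devices; ++d) {
    // Create the stream and the instance with device d active
    cudaSetDevice(d);
    cudaStreamCreate(&streams[d]);
    instances.emplace_back(streams[d]);
  }

  // Submit one kernel per device; the kernels may run concurrently
  for (int d = 0; d < n_devices; ++d)
    Kokkos::parallel_for("work_on_device",
      Kokkos::RangePolicy(instances[d], 0, 100),
      KOKKOS_LAMBDA (const int i) { /* ... */ });

  // Wait for every instance before the streams are destroyed
  for (auto& inst : instances) inst.fence();
}

for (int d = 0; d < n_devices; ++d) {
  cudaSetDevice(d);
  cudaStreamDestroy(streams[d]);
}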

HIP#

For the HIP backend, as with CUDA, the user creates a hipStream_t object and passes it to the Kokkos::HIP constructor.

Note

The lifetime of all Kokkos objects related to the stream (e.g., the execution space) needs to end before the stream itself is invalidated via hipStreamDestroy. This can be done by adding a scope around such objects, as is done below.

Warning

Multi-GPU is only supported with ROCm 5.6 and later: earlier ROCm versions lack the HIP API functions needed to query a stream’s device, so constructing a Kokkos::HIP instance on a non-default device is not supported with them.

// Query number of devices available
int n_devices;
hipGetDeviceCount(&n_devices);

// Choose some 0 <= N < n_devices
int N = ...;

// Create stream on device N
hipStream_t stream;
hipSetDevice(N);
hipStreamCreate(&stream);

// Scope execution space to ensure stream
// is destroyed *after* execution space
{
  // Create execution space
  Kokkos::HIP exec_space(stream);

  // Use execution space
  /* ... */
}

// Destroy the stream after use
hipSetDevice(N);
hipStreamDestroy(stream);

SYCL#

For the SYCL backend, the user creates a sycl::queue object and passes it to the Kokkos::SYCL constructor.

// Get list of devices available
std::vector<sycl::device> gpu_devices =
  sycl::device::get_devices(sycl::info::device_type::gpu);

// Choose some 0 <= N < gpu_devices.size()
int N = ...;

// Create a queue on device N.
// Note: Kokkos requires SYCL queues to be "in_order"
sycl::queue queue{gpu_devices[N], sycl::property::queue::in_order()};

// Create execution space
Kokkos::SYCL exec_space(queue);

// Use execution space
/* ... */

Using Kokkos Methods#

Once an execution space instance has been created for the chosen device, it must be passed to every Kokkos operation intended to run on that device. If no execution space is passed, Kokkos uses the default execution space instance, which is associated with the device Kokkos was initialized with.
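
For example, Kokkos::deep_copy() accepts an execution space instance as its first argument. The following sketch is hypothetical (the views dst and src are assumed to already exist and be accessible on the chosen device):

// Enqueue the copy on the chosen device's stream/queue,
// asynchronously with respect to the host
Kokkos::deep_copy(exec_space, dst, src);

// Block the host until all work submitted to this instance has finished
exec_space.fence();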

Allocating Managed Views#

To allocate a managed view on the chosen device, pass the execution space to Kokkos::view_alloc().

Example:

using ExecutionSpace = decltype(exec_space);
Kokkos::View<int*, typename ExecutionSpace::memory_space> V(
  Kokkos::view_alloc("V", exec_space), 10);
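
The entries of V are zero-initialized as part of the allocation, and that initialization runs on the passed instance. If initialization is not wanted, Kokkos::WithoutInitializing can be combined with the execution space in the same view_alloc() call; a sketch (the view W is hypothetical):

// Allocate on the chosen device without initializing the entries
Kokkos::View<int*, typename ExecutionSpace::memory_space> W(
  Kokkos::view_alloc(exec_space, Kokkos::WithoutInitializing, "W"), 10);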

Launching Kernels#

To launch a kernel on the chosen device, pass the execution space to the execution policy constructor.

Example:

Kokkos::parallel_for("inc_V", Kokkos::RangePolicy(exec_space, 0, 10),
  KOKKOS_LAMBDA (const int i) {
    V(i) += i;
});
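
Reductions work the same way, since the policy carries the execution space instance. A short sketch reusing V and exec_space from the examples above:

long sum = 0;
Kokkos::parallel_reduce("sum_V", Kokkos::RangePolicy(exec_space, 0, 10),
  KOKKOS_LAMBDA (const int i, long& lsum) { lsum += V(i); }, sum);

// Wait for all outstanding work submitted to this instance
exec_space.fence();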

Notes#

  • A tutorial for using multi-GPU on CUDA is available.