*********************
NCCL interoperability
*********************

There are several challenges with supporting NCCL.

* For multi-process NCCL, we need a way to share a unique ID between processes
  so that the different processes know they're part of the same NCCL
  communicator. For the time being, this is accomplished via MPI in the NCCL
  tests.

* NCCL has the concept of a non-blocking communicator. On such a communicator,
  any NCCL operation may return ``ncclInProgress`` *before* it has actually
  enqueued any GPU operations into streams. This means we can't simply
  synchronize on a CUDA stream in the NCCL backend's ``wait`` implementation.
  Either:

  * our NCCL operations need to effectively become blocking (polling
    ``ncclCommGetAsyncError`` until the communicator is no longer in
    progress), or
  * our ``wait`` implementation needs to do that polling itself.

* It is not guaranteed to be safe for NCCL operations and GPU-aware MPI
  operations to be simultaneously active on the same set of GPUs. This is a
  challenge both if we permit multiple backends to exist, and for interop
  with existing MPI and/or NCCL applications.
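The unique-ID exchange described above follows the standard NCCL bootstrap
pattern. A minimal sketch, assuming one GPU per MPI rank and with error
checking elided for brevity (not taken from this codebase's tests):

.. code-block:: cpp

   #include <mpi.h>
   #include <nccl.h>
   #include <cuda_runtime.h>

   int main(int argc, char** argv) {
     MPI_Init(&argc, &argv);
     int rank, nranks;
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &nranks);

     // Rank 0 creates the unique ID; every rank must hold the same bytes
     // before calling ncclCommInitRank, so broadcast it over MPI.
     ncclUniqueId id;
     if (rank == 0) ncclGetUniqueId(&id);
     MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

     cudaSetDevice(rank);  // assumption: one GPU per rank
     ncclComm_t comm;
     ncclCommInitRank(&comm, nranks, &id, rank);

     // ... collective operations on comm ...

     ncclCommDestroy(comm);
     MPI_Finalize();
     return 0;
   }

Any out-of-band channel (sockets, a file, an environment variable) could carry
the ID instead; MPI is simply the channel the tests happen to have available.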
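Either option for handling ``ncclInProgress`` reduces to the same polling
loop. A sketch of what a blocking helper might look like (``waitForEnqueue``
is a hypothetical name, not an API of this project or of NCCL):

.. code-block:: cpp

   #include <nccl.h>

   // Spin until the communicator's previously issued operations have been
   // enqueued to their CUDA streams, or an error is reported.
   ncclResult_t waitForEnqueue(ncclComm_t comm) {
     ncclResult_t state = ncclInProgress;
     while (state == ncclInProgress) {
       // ncclCommGetAsyncError reports the communicator's async state;
       // ncclSuccess means the operations are now in the streams.
       ncclResult_t err = ncclCommGetAsyncError(comm, &state);
       if (err != ncclSuccess) return err;
     }
     return state;
   }

Only after this returns ``ncclSuccess`` is it meaningful to synchronize on the
CUDA stream (e.g. via ``cudaStreamSynchronize``) to wait for the work itself
to finish on the GPU.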