markhembrow/nccl-mig-patch

NCCL A100 MIG Compatibility Patch

Overview

This repository contains a patched version of NCCL for NVIDIA A100 MIG (Multi-Instance GPU) environments and for heterogeneous Ampere clusters (e.g., A100 MIG partitions alongside A40 nodes).

The Fix: A100 MIG Memory Pool Bypass

On certain A100 MIG configurations, the CUDA driver rejects the cudaMemPoolCreate call that src/init.cc makes during NCCL initialization, and NCCL reports the error: cuda failure 'operation not supported'.

This patch bypasses memory pool creation by forcing comm->memPool = nullptr. NCCL then falls back to its standard memory allocation paths, which the MIG driver supports.
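
To check whether a given setup hits this failure before switching to the patched build, one option is to capture NCCL's output and search it for the error string quoted above. The helper below is a minimal sketch: the function name and the log file name are hypothetical, and it assumes the output was captured with NCCL_DEBUG=WARN so that warnings are printed.

```shell
# Hypothetical helper: succeeds (exit 0) if the given log file contains
# the failure string quoted in this README. The match is case-insensitive
# because the capitalisation in real logs may differ.
log_shows_mempool_failure() {
    grep -qi "cuda failure 'operation not supported'" "$1"
}

# Usage sketch (your_app and nccl.log are placeholders):
#   NCCL_DEBUG=WARN ./your_app 2>&1 | tee nccl.log
#   log_shows_mempool_failure nccl.log && echo "affected: try the patched build"
```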

Changes

  • Modified src/init.cc to skip the cudaMemPoolCreate block and set comm->memPool = nullptr instead.

Build Instructions

To build for the A100 (SM80, compute capability 8.0). Binaries built for sm_80 also run on compute capability 8.6 devices such as the A40:

make clean
make -j$(nproc) NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80"
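
After the build, applications need to load the patched library rather than a system-wide NCCL. The sketch below assumes NCCL's default output directory (build/lib under the source tree, unless BUILDDIR was overridden); the function name is hypothetical.

```shell
# Hypothetical helper: prepend the patched NCCL build to LD_LIBRARY_PATH.
# The first argument is the library directory; it defaults to ./build/lib,
# which is where NCCL's Makefile places libnccl.so unless BUILDDIR is set.
use_patched_nccl() {
    local libdir="${1:-$PWD/build/lib}"
    if [ ! -e "$libdir/libnccl.so" ]; then
        echo "libnccl.so not found in $libdir" >&2
        return 1
    fi
    # Only add a separator if LD_LIBRARY_PATH already has content.
    export LD_LIBRARY_PATH="$libdir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Usage sketch, run from the repository root after `make`:
#   use_patched_nccl && echo "using NCCL from $PWD/build/lib"
```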

Performance Tuning for MIG & Heterogeneous Clusters

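When running benchmarks (e.g., nccl-tests) across MIG partitions and remote A40 nodes, the following environment variables are recommended to ensure stability and bypass driver restrictions. They trade throughput for stability: with P2P and shared memory disabled, NCCL routes all traffic between ranks through the network transport.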

export NCCL_P2P_DISABLE=1        # Disable Peer-to-Peer (MIG restriction)
export NCCL_SHM_DISABLE=1        # Disable Shared Memory (MIG restriction)
export NCCL_NET_GDR_LEVEL=0      # Disable GPU Direct RDMA
export NCCL_MIG_MODE=1           # Enable MIG mode (not among upstream NCCL's documented variables)
export NCCL_SOCKET_IFNAME=ens33  # Network interface for socket traffic; replace ens33 with your own
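
The interface name ens33 is specific to one machine. As a rough sketch, the helper below picks the first interface from a list that is not a loopback, Docker, or virtual bridge device. The function name and the exclusion pattern are assumptions for illustration; on multi-homed hosts the right interface should still be chosen by hand.

```shell
# Hypothetical helper: read interface names on stdin (one per line) and
# print the first that is not loopback, docker*, veth*, or br-*.
pick_ifname() {
    grep -Ev '^(lo|docker[0-9]*|veth.*|br-.*)$' | head -n 1
}

# Usage sketch on Linux, followed by an nccl-tests run. The all_reduce_perf
# path assumes nccl-tests was built in the current directory.
#   export NCCL_SOCKET_IFNAME="$(ls /sys/class/net | pick_ifname)"
#   NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```

Running with NCCL_DEBUG=INFO prints which interface and transport NCCL selected, which makes it easier to confirm that the settings above took effect.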
