Support Data Center precompiled driver container for Arm (Ubuntu 24.04) #533
shivakunv wants to merge 2 commits into
Pull request overview
This pull request adds ARM64 (aarch64) platform support to the Ubuntu 24.04 precompiled driver container builds, while maintaining AMD64 as the default architecture. The changes enable multi-platform Docker builds and update the CI/CD pipeline to handle both architectures.
Changes:
- Added ARM64 platform support for Ubuntu 24.04 precompiled driver containers with architecture-specific package handling
- Updated CI workflow to build, test, and publish both AMD64 and ARM64 artifacts with platform-specific suffixes
- Modified Holodeck test infrastructure to support ARM64 instances (g5g.xlarge in us-west-2) and Ubuntu 24.04 OS specification
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| ubuntu24.04/precompiled/nvidia-driver | Added conditional installation of libnvidia-fbc1 package (AMD64 only) |
| ubuntu24.04/precompiled/local-repo.sh | Added conditional downloads for ARM64-incompatible packages (linux-signatures-nvidia, libnvidia-fbc1) |
| ubuntu24.04/precompiled/Dockerfile | Made i386 architecture and CUDA repository URLs conditional based on target architecture |
| tests/scripts/findkernelversion.sh | Added optional PLATFORM_SUFFIX parameter for artifact matching and platform-specific manifest inspection |
| tests/scripts/ci-precompiled-helpers.sh | Added PLATFORM_SUFFIX parameter support for kernel version testing |
| tests/holodeck_ubuntu24.04.yaml | Removed file (merged into holodeck_ubuntu.yaml) |
| tests/holodeck_ubuntu.yaml | Removed hardcoded ingressIpRanges and AMI, added OS specification support |
| multi-arch.mk | Removed AMD64-only platform restriction for ubuntu24.04 builds |
| Makefile | Added DOCKER_BUILD_PLATFORM_OPTIONS to base image build targets |
| .github/workflows/precompiled.yaml | Added platform matrix dimension, platform-aware artifact naming, ARM64 e2e testing with appropriate instance types, and Holodeck version update |
| # Fetch GPG keys for CUDA repo | ||
| RUN apt-key del 3bf863cc && \ | ||
| # Fetch GPG keys for CUDA repo (architecture-specific) | ||
| RUN CUDA_ARCH=$([ "$TARGETARCH" = "arm64" ] && echo "sbsa" || echo "x86_64") && \ |
Why are we using sbsa? If I remember correctly, sbsa is specifically for Tegra-based arm64 machines
Please elaborate. "followed doc" is not a helpful response
Reference: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
Attaching the supported distro tables from that guide:
- Table 1: Supported Linux Distributions
- Table 2: Native Linux Distribution Support and Validated OS Versions for CUDA 13.3
Statement from the guide:
Cross development for arm64-sbsa is supported on Ubuntu 20.04, Ubuntu 22.04, Ubuntu 24.04, KylinOS 10, Red Hat Enterprise Linux 8, Red Hat Enterprise Linux 9, and SUSE Linux Enterprise Server 15.
Cross development for arm64-sbsa-jetson is only supported on Ubuntu 24.04.
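For readers following the sbsa question, here is a minimal sketch of the architecture mapping that the Dockerfile line under review performs. The function names are hypothetical, and the repository URL layout (`ubuntu2404/<arch>`) is an assumption based on the public CUDA apt repository rather than something taken from this PR. Unlike the one-liner in the diff, which treats every non-arm64 value as x86_64, this version rejects unknown architectures.

```shell
#!/bin/bash
# Map Docker's TARGETARCH to the directory name used by the CUDA apt repository.
# Assumption: only amd64 and arm64 are built; anything else is rejected.
cuda_repo_arch() {
    case "$1" in
        amd64) echo "x86_64" ;;
        arm64) echo "sbsa" ;;
        *) echo "unsupported TARGETARCH: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: build the Ubuntu 24.04 repository URL for an architecture.
cuda_repo_url() {
    local arch
    arch="$(cuda_repo_arch "$1")" || return 1
    echo "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/${arch}"
}

# Example: resolve the URL for amd64 and arm64.
cuda_repo_url amd64
cuda_repo_url arm64
```

Failing on an unknown architecture makes a misconfigured build stop early instead of silently pulling x86_64 packages.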

Is matrix.kernel_version the right suffix here or should it be env.KERNEL_VERSION?
It should be env.KERNEL_VERSION. Done.
| if [[ "${{ matrix.dist }}" == "ubuntu24.04" ]] && [[ "${{ matrix.flavor }}" != "azure-fde" ]]; then | ||
| export DOCKER_BUILD_PLATFORM_OPTIONS="--platform=linux/amd64,linux/arm64" | ||
| else | ||
| export DOCKER_BUILD_PLATFORM_OPTIONS="--platform=linux/amd64" | ||
| fi |
This check is repeated in quite a few places. Can we move this to multi-arch.mk? There are already single arch overrides in that file.
    # add after the existing single-arch overrides
    ifeq ($(KERNEL_FLAVOR),azure-fde)
    build-signed_ubuntu24.04%: DOCKER_BUILD_PLATFORM_OPTIONS = --platform=linux/amd64
    endif
This can then become:
    run: |
      source kernel_version.txt
      export DOCKER_BUILD_OPTIONS="--output=type=oci,dest=./driver-images-...tar"
      make DRIVER_VERSIONS=${DRIVER_VERSIONS} DRIVER_BRANCH=${{ matrix.driver_branch }} \
        KERNEL_FLAVOR=${{ matrix.flavor }} \
        KERNEL_VERSION=${KERNEL_VERSION} build-${DIST}-${DRIVER_VERSION}
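For comparison, if the platform selection had to stay in shell, the repeated check could be wrapped in a single function. This is only a sketch of the condition quoted from the workflow; the function name `docker_build_platform_options` is hypothetical, and the Make-based override suggested in this comment would make it unnecessary.

```shell
#!/bin/bash
# Return the --platform option for a given dist and kernel flavor.
# Mirrors the workflow condition: only ubuntu24.04 builds that are not
# azure-fde get both platforms; everything else stays amd64-only.
docker_build_platform_options() {
    local dist="$1" flavor="$2"
    if [[ "$dist" == "ubuntu24.04" && "$flavor" != "azure-fde" ]]; then
        printf '%s\n' "--platform=linux/amd64,linux/arm64"
    else
        printf '%s\n' "--platform=linux/amd64"
    fi
}

# Example usage with placeholder values for the matrix entries.
DOCKER_BUILD_PLATFORM_OPTIONS="$(docker_build_platform_options ubuntu24.04 generic)"
export DOCKER_BUILD_PLATFORM_OPTIONS
echo "$DOCKER_BUILD_PLATFORM_OPTIONS"
```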
| # Convert array to JSON format and assign | ||
| echo "[]" > ./matrix_values_${{ matrix.dist }}_${{ matrix.lts_kernel }}.json | ||
| printf '%s\n' "${KERNEL_VERSIONS[@]}" | jq -R . | jq -s . > ./matrix_values_${{ matrix.dist }}_${{ matrix.lts_kernel }}.json | ||
| platforms_json='${{ needs.set-driver-version-matrix.outputs.platforms }}' |
Please extract this into a separate helper. Inline scripting is hard to read.
It can look something like this:
    - name: Set kernel version
      env:
        KERNEL_FLAVORS_JSON: ${{ needs.set-driver-version-matrix.outputs.kernel_flavors }}
        DRIVER_BRANCHES_JSON: ${{ needs.set-driver-version-matrix.outputs.driver_branch }}
        EXCLUDE_PAIRS_JSON: ${{ needs.set-driver-version-matrix.outputs.exclude_build_matrix_pairs }}
        PLATFORMS_JSON: ${{ needs.set-driver-version-matrix.outputs.platforms }}
      run: ./tests/scripts/build-kernel-matrix.sh "${{ matrix.dist }}" "${{ matrix.lts_kernel }}"
build-kernel-matrix.sh can have something like this (untested):
    #!/bin/bash
    # Args: DIST LTS_KERNEL (reads KERNEL_FLAVORS_JSON, DRIVER_BRANCHES_JSON,
    # EXCLUDE_PAIRS_JSON, PLATFORMS_JSON from env)
    set -euo pipefail
    DIST="$1"; LTS_KERNEL="$2"
    mapfile -t KERNEL_FLAVORS < <(jq -r '.[]' <<<"$KERNEL_FLAVORS_JSON")
    mapfile -t PLATFORMS < <(jq -r '.[]' <<<"$PLATFORMS_JSON")
    # Keep only the driver branches that are not excluded for this dist
    DRIVER_BRANCHES=()
    for b in $(jq -r '.[]' <<<"$DRIVER_BRANCHES_JSON"); do
        jq -e --arg dist "$DIST" --arg b "$b" \
            'any(.[]; .dist==$dist and .driver_branch==$b)' <<<"$EXCLUDE_PAIRS_JSON" \
            >/dev/null || DRIVER_BRANCHES+=("$b")
    done
    source ./tests/scripts/ci-precompiled-helpers.sh
    for platform in "${PLATFORMS[@]}"; do
        [[ "$platform" == arm64 && "$DIST" == ubuntu22.04 ]] && continue
        suffix=""; flavors=("${KERNEL_FLAVORS[@]}")
        if [[ "$platform" == arm64 ]]; then
            suffix="-arm64"
            # Drop azure-fde; a pattern substitution would leave an empty element behind
            flavors=()
            for f in "${KERNEL_FLAVORS[@]}"; do
                [[ "$f" == azure-fde ]] || flavors+=("$f")
            done
        fi
        versions=( $(get_kernel_versions_to_test flavors[@] DRIVER_BRANCHES[@] \
            "$DIST" "$LTS_KERNEL" "$suffix") )
        # Use if rather than && so an empty result does not become the script's exit status
        if [[ -n "${versions[*]:-}" ]]; then
            printf '%s\n' "${versions[@]}" | jq -R . | jq -s . \
                > "./matrix_values_${DIST}_${LTS_KERNEL}${suffix}.json"
        fi
    done
| libnvidia-encode-${DRIVER_BRANCH}-server \ | ||
| libnvidia-fbc1-${DRIVER_BRANCH}-server \ | ||
| libnvidia-gl-${DRIVER_BRANCH}-server | ||
| libnvidia-encode-${DRIVER_BRANCH}-server |
Consider splitting this into one userspace install and one kernel module install:
    # Userspace packages
    USERSPACE=(
        nvidia-utils-${DRIVER_BRANCH}-server
        nvidia-headless-no-dkms-${DRIVER_BRANCH}-server
        libnvidia-decode-${DRIVER_BRANCH}-server
        libnvidia-extra-${DRIVER_BRANCH}-server
        libnvidia-encode-${DRIVER_BRANCH}-server
        libnvidia-gl-${DRIVER_BRANCH}-server
    )
    if [ "$TARGETARCH" = "amd64" ]; then
        # libnvidia-fbc1 is not published for arm64
        USERSPACE+=( libnvidia-fbc1-${DRIVER_BRANCH}-server )
    fi
    # Install userspace packages
    apt-get install -y --no-install-recommends "${USERSPACE[@]}"
    # Kernel modules
    if [ "$KERNEL_TYPE" = "kernel-open" ]; then
        KMOD=( linux-modules-nvidia-${DRIVER_BRANCH}-server-open-${KERNEL_VERSION} )
    else
        KMOD=(
            linux-objects-nvidia-${DRIVER_BRANCH}-server-${KERNEL_VERSION}
            linux-modules-nvidia-${DRIVER_BRANCH}-server-${KERNEL_VERSION}
        )
    fi
    if [ "$TARGETARCH" = "amd64" ]; then
        # secure-boot signatures are not published for arm64
        KMOD+=( linux-signatures-nvidia-${KERNEL_VERSION} )
    fi
    # Install kernel modules
    apt-get install -y --no-install-recommends "${KMOD[@]}"
This is a cosmetic change; I will handle it in a separate PR.
A similar update is needed for the other distros (ubuntu22.04, rhel) as well.
Signed-off-by: Shiva Kumar (SW-CLOUD) <shivaku@nvidia.com>
| # build-ubuntu22.04-$(DRIVER_VERSION) triggers a build for a specific $(DRIVER_VERSION) | ||
| $(DISTRIBUTIONS): %: build-% | ||
| $(BUILD_TARGETS): %: $(foreach driver_version, $(DRIVER_VERSIONS), $(addprefix %-, $(driver_version))) | ||
| DRIVER_BUILD_TAG = $(if $(findstring type=oci,$(DOCKER_BUILD_OPTIONS)),,--tag $(IMAGE)) |
This is not the right variable name.
Done. Used DOCKER_BUILD_TAG_OPTION.
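To make the intent of the Make expression in the diff easier to follow, here is a shell rendering of the same rule: emit `--tag <image>` unless the build options request OCI archive output. It is an illustrative sketch only; the function name and the example image reference are hypothetical, and the Makefile itself uses `$(if $(findstring ...))` rather than shell.

```shell
#!/bin/bash
# Print "--tag IMAGE" unless the build options contain "type=oci",
# in which case print nothing (the image is written to an archive instead).
docker_build_tag_option() {
    local build_options="$1" image="$2"
    case "$build_options" in
        *type=oci*) ;;                              # OCI archive output: no tag
        *) printf '%s\n' "--tag ${image}" ;;        # regular build: tag the image
    esac
}

# Example with a placeholder image reference.
docker_build_tag_option "" "example.test/driver:placeholder"
docker_build_tag_option "--output=type=oci,dest=./out.tar" "example.test/driver:placeholder"
```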
| linux-signatures-nvidia-${KERNEL_VERSION} \ | ||
| linux-modules-nvidia-${DRIVER_BRANCH}-server-${KERNEL_VERSION} | ||
| if [ "$TARGETARCH" = "amd64" ]; then | ||
| apt-get install --no-install-recommends -y \ |
You can reduce the duplication here by conditionally installing just linux-objects-nvidia-${DRIVER_BRANCH}-server-${KERNEL_VERSION}
| pattern: driver-images-*-${{ env.KERNEL_VERSION }}-${{ env.DIST }}* | ||
| path: ./tests/ | ||
| merge-multiple: true | ||
| - name: Install skopeo |
Line 478 (Line: 356): This step pushes the multi-arch oci-archive to the registry as a manifest list.
On the GitHub Actions runner, docker load --platform exists in the CLI, but the daemon cannot hold multi-arch images.
Using docker load + docker push would mean looping over the platforms, pushing each image, and then running docker manifest create to stitch the manifest list together.
skopeo copy streams the oci-archive to the registry in one step and preserves the multi-arch manifest.
Line 379: The build outputs one multi-arch oci-archive (amd64 + arm64), but the e2e test needs a single platform as a docker-archive (for docker load).
docker save --platform does not help on the GitHub Actions runner either, so without skopeo we would have to build amd64 and arm64 as separate single-arch images and upload them to GitHub as separate artifacts.
skopeo copy --override-arch extracts the platform we need.
Consistency: the same tool is used in both places.
As far as I recall, I tried regctl earlier and hit an oci-archive error. The GitHub pipeline logs have been cleared since three months have passed. I can reinvestigate if preferred.
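The two skopeo uses described above could look roughly like the following. This is a sketch under stated assumptions, not the commands from the workflow: the function names, archive paths, and image references are placeholders, the use of `--all` for the push is an assumption, and `--override-arch` is written as a global option placed before `copy`. The functions only print the commands so the shape can be inspected without a registry.

```shell
#!/bin/bash
# Print (do not run) the command that pushes every platform in a
# multi-arch OCI archive to a registry as a single manifest list.
skopeo_push_cmd() {
    local archive="$1" dest="$2"
    printf '%s\n' "skopeo copy --all oci-archive:${archive} docker://${dest}"
}

# Print (do not run) the command that extracts one platform from the
# archive as a docker-archive suitable for `docker load`.
skopeo_extract_cmd() {
    local archive="$1" arch="$2" out="$3" ref="$4"
    printf '%s\n' "skopeo --override-os linux --override-arch ${arch} copy oci-archive:${archive} docker-archive:${out}:${ref}"
}

# Example with placeholder names.
skopeo_push_cmd ./driver-images.tar example.test/driver:placeholder
skopeo_extract_cmd ./driver-images.tar arm64 ./driver-arm64.tar example.test/driver:placeholder
```

Keeping both operations on one tool means the archive produced by the build is the only artifact that needs to be uploaded and downloaded between jobs.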
Signed-off-by: Shiva Kumar (SW-CLOUD) <shivaku@nvidia.com>
Code Changes Summary:
- Platform Support
  - Added support for the ARM64 platform.
  - AMD64 remains the default architecture.
- Artifacts Update
  - ARM64 build artifacts are now uploaded with the -arm64 suffix.
- Instance Type and Region Mapping
  - g4dn.xlarge
    - Architecture: AMD64
    - Supported Region: us-west-1
    - Used for AMD64 builds.
  - g5g.xlarge
    - Architecture: ARM64
    - Supported Region: us-west-2
    - Used for ARM64 builds.
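The mapping and the artifact suffix described in this summary can be expressed as a small lookup. This is a sketch only: the function names are hypothetical and the workflow may encode the same information differently (for example, directly in the matrix).

```shell
#!/bin/bash
# Return "<instance type> <region>" for a platform, per the summary above.
holodeck_target() {
    case "$1" in
        amd64) echo "g4dn.xlarge us-west-1" ;;
        arm64) echo "g5g.xlarge us-west-2" ;;
        *) echo "unsupported platform: $1" >&2; return 1 ;;
    esac
}

# Return the artifact name suffix: empty for the default (amd64), -arm64 for arm64.
artifact_suffix() {
    case "$1" in
        arm64) echo "-arm64" ;;
        *) echo "" ;;
    esac
}

# Example: print the target and suffix for each platform.
for platform in amd64 arm64; do
    read -r instance region <<<"$(holodeck_target "$platform")"
    echo "${platform}: instance=${instance} region=${region} suffix='$(artifact_suffix "$platform")'"
done
```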
Fixes https://github.com/NVIDIA/cloud-native-team/issues/276
passed pipeline: https://github.com/NVIDIA/gpu-driver-container/actions/runs/22180871853
passed pipeline: https://github.com/NVIDIA/gpu-driver-container/actions/runs/22337833186