[CP 1567] docs(test-runner): add MI350P test recipe support matrix #587

Merged: spraveenio merged 2 commits into ROCm:main from ci-penbot-01:CP.O2O.pensando.gpu-operator.1567.rocm.gpu-operator.main on Jul 1, 2026
@ci-penbot-01

Contributor

Cherry-pick of pensando/gpu-operator#1567


Source PR Description (pensando/gpu-operator#1567):

References

  • Related: pensando/device-metrics-exporter#1410, #1421

Motivation

  • MI350P is now supported by the test runner with TheRock 7.14 tarball builds
  • Recipe support was validated on real hardware (3× AMD Instinct MI350P, device 0x75a8)
  • Documentation needs to reflect which recipes are available and verified

Plan

  • docs/test/appendix-test-recipe.md: add MI350P row to the RVS Instinct GPU table
  • docs/test/agfhc.md: add MI350P row to the AGFHC recipe matrix and partition profile table

MI350P verified recipes

RVS (TheRock 7.14.0rc0 + RVS 1.4.24, MI350P-450W and MI350P-600W TDP variants):

| Recipe | Status |
| --- | --- |
| babel_single | ✅ PASS |
| gst_single | ✅ PASS |
| iet_stress | ✅ Available (hardware-level results vary by node health) |

AGFHC (AGFHC 1.32.0):

| Recipe | Status |
| --- | --- |
| all_lvl1 through all_lvl4 | ✅ Verified |
| all_lvl5 / single_pass | Not marked: minihpl binary incompatible with TheRock 7.14 rocblas kernel layout, pending AGFHC team fix |
| gfx_lvl1 through gfx_lvl4 | ✅ Verified |
| hbm_lvl1 through hbm_lvl4 | ✅ Verified |
| dma_lvl1 through dma_lvl4 | ✅ Verified |
| pcie_lvl1 through pcie_lvl4 | ✅ Verified |
| all_perf | ✅ Verified |
| thermal | ✅ Verified (~2h run) |
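
For readers who want to check a recipe name against the MI350P matrix programmatically, the following is a minimal sketch that encodes the statuses listed above as plain Python data. The names `LEVELED_SUITES`, `AGFHC_VERIFIED`, `AGFHC_NOT_MARKED`, `RVS_STATUS`, and `agfhc_status` are hypothetical and chosen for illustration only; they are not part of the test runner, RVS, or AGFHC.

```python
# Illustrative encoding of the MI350P recipe matrix from this PR description.
# All identifiers here are assumptions, not the test runner's actual schema.

# AGFHC suites that the description lists as verified for levels 1 through 4.
LEVELED_SUITES = ("all", "gfx", "hbm", "dma", "pcie")


def build_agfhc_verified():
    """Expand the 'lvl1 through lvl4' ranges into explicit recipe names."""
    recipes = {
        f"{suite}_lvl{level}"
        for suite in LEVELED_SUITES
        for level in range(1, 5)  # levels 1..4 inclusive
    }
    # Standalone recipes listed as verified.
    recipes |= {"all_perf", "thermal"}
    return recipes


AGFHC_VERIFIED = build_agfhc_verified()

# Recipes the description leaves unmarked pending an AGFHC team fix.
AGFHC_NOT_MARKED = {"all_lvl5", "single_pass"}

# RVS statuses as reported for both MI350P TDP variants.
RVS_STATUS = {
    "babel_single": "PASS",
    "gst_single": "PASS",
    "iet_stress": "available",
}


def agfhc_status(recipe: str) -> str:
    """Return the status of an AGFHC recipe on MI350P per this description."""
    if recipe in AGFHC_VERIFIED:
        return "verified"
    if recipe in AGFHC_NOT_MARKED:
        return "not marked"
    return "not listed"


if __name__ == "__main__":
    for name in ("hbm_lvl3", "all_lvl5", "thermal"):
        print(name, "->", agfhc_status(name))
```

Expanding the ranges into explicit names keeps lookups exact, so a recipe such as `hbm_lvl5`, which the commit message says is not offered for MI350P, falls through to "not listed" rather than being matched by a prefix.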

Risks / Limitations

  • N/A (documentation only)

Cherrypick triggered by: ACP-Automation

yansun1996 and others added 2 commits June 30, 2026 16:20
Add MI350P to both the RVS appendix and AGFHC recipe tables.

RVS: split into MI350P-450W and MI350P-600W TDP variants, both
verified on real hardware (3x AMD Instinct MI350P, device 0x75a8,
TheRock 7.14.0rc0 + RVS 1.4.24):
- babel_single: PASS
- gst_single: PASS
- iet_stress: available

AGFHC (AGFHC 1.32.0, verified on MI350P node):
- all_lvl1 through all_lvl4: verified
- all_lvl5 / single_pass: not marked (minihpl binary incompatibility
  with TheRock 7.14 rocblas kernel layout — pending AGFHC team fix)
- gfx_lvl1 through gfx_lvl4: verified
- hbm_lvl1 through hbm_lvl4: verified
- dma_lvl1 through dma_lvl4: verified
- pcie_lvl1 through pcie_lvl4: verified
- all_perf, thermal: verified
- No xgmi, burnin, hsio, rochpl_isolation, hbm_lvl5 for MI350P

(cherry picked from commit 2f83516d5e09c3e436584aa2aecdf9b1cd83a404)

@spraveenio spraveenio left a comment

Collaborator

Lgtm

@spraveenio spraveenio left a comment

Collaborator

Lgtm

@spraveenio spraveenio merged commit 5fad372 into ROCm:main Jul 1, 2026
4 checks passed