From fff22e05702830b6db1ae7e109ff59719105f220 Mon Sep 17 00:00:00 2001
From: Yan Sun
Date: Tue, 30 Jun 2026 09:19:17 -0700
Subject: [PATCH 1/2] docs(test-runner): add MI350P test recipe support matrix (#1567)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add MI350P to both the RVS appendix and AGFHC recipe tables.

RVS: split into MI350P-450W and MI350P-600W TDP variants, both verified
on real hardware (3x AMD Instinct MI350P, device 0x75a8,
TheRock 7.14.0rc0 + RVS 1.4.24):
- babel_single: PASS
- gst_single: PASS
- iet_stress: available

AGFHC (AGFHC 1.32.0, verified on MI350P node):
- all_lvl1 through all_lvl4: verified
- all_lvl5 / single_pass: not marked (minihpl binary incompatibility
  with TheRock 7.14 rocblas kernel layout — pending AGFHC team fix)
- gfx_lvl1 through gfx_lvl4: verified
- hbm_lvl1 through hbm_lvl4: verified
- dma_lvl1 through dma_lvl4: verified
- pcie_lvl1 through pcie_lvl4: verified
- all_perf, thermal: verified
- No xgmi, burnin, hsio, rochpl_isolation, hbm_lvl5 for MI350P

(cherry picked from commit 2f83516d5e09c3e436584aa2aecdf9b1cd83a404)
---
 .../2026-06-30-mi350p-test-recipe-docs.md     | 32 +++++++++++++++++++
 docs/test/agfhc.md                            |  1 +
 docs/test/appendix-test-recipe.md             |  4 ++-
 3 files changed, 36 insertions(+), 1 deletion(-)
 create mode 100644 docs-internal/knowledge/plans/2026-06-30-mi350p-test-recipe-docs.md

diff --git a/docs-internal/knowledge/plans/2026-06-30-mi350p-test-recipe-docs.md b/docs-internal/knowledge/plans/2026-06-30-mi350p-test-recipe-docs.md
new file mode 100644
index 000000000..7e84c67d8
--- /dev/null
+++ b/docs-internal/knowledge/plans/2026-06-30-mi350p-test-recipe-docs.md
@@ -0,0 +1,32 @@
+# MI350P Test Recipe Documentation
+
+- **Date:** 2026-06-30
+- **Author:** yan.sun3@amd.com
+- **Related PR(s):** #1567
+- **Related issue(s) / JIRA:** N/A
+
+## Context
+
+MI350P (AMD Instinct MI350P, device 0x75a8) support was added to the
+test runner in device-metrics-exporter PR #1410 and #1421. Recipes were
+validated on real hardware (3× AMD Instinct MI350P node, TheRock 7.14.0rc0,
+RVS 1.4.24, AGFHC 1.32.0). The gpu-operator docs need to reflect which
+recipes are available and verified for MI350P.
+
+## Approach
+
+Two documentation files updated:
+
+**`docs/test/appendix-test-recipe.md`** (RVS appendix):
+- MI350P ships in two TDP variants with separate RVS recipe folders
+  (`MI350P-450W` and `MI350P-600W`), so it is listed as two rows
+- Both variants support: `babel_single`, `gst_single`, `iet_stress`
+
+**`docs/test/agfhc.md`** (AGFHC recipe matrix):
+- MI350P row added with verified recipes marked
+- `all_lvl5` and `single_pass` intentionally left blank — `minihpl`
+  binary in AGFHC 1.32.0 is incompatible with TheRock 7.14 rocblas
+  kernel layout; pending fix from AGFHC team
+- `hbm_lvl5`, xgmi, burnin, hsio, rochpl_isolation left blank — these
+  recipes do not exist for MI350P in AGFHC 1.32.0
+- MI350P not added to partition profile table — not validated
diff --git a/docs/test/agfhc.md b/docs/test/agfhc.md
index f9712ae73..93449d90c 100644
--- a/docs/test/agfhc.md
+++ b/docs/test/agfhc.md
@@ -69,6 +69,7 @@ Here is the AGFHC test recipe support matrix and brief introduction to each reci
 | MI308X | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
 | MI308X-HF | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
 | MI325X | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
+| MI350P | ✓ | ✓ | ✓ | ✓ | | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | | ✓ | | | | | | | | | |
 | MI350X | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ | | | | | |
 | MI355X | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ | | | | | |
 
diff --git a/docs/test/appendix-test-recipe.md b/docs/test/appendix-test-recipe.md
index e3e74bded..1d71850c5 100644
--- a/docs/test/appendix-test-recipe.md
+++ b/docs/test/appendix-test-recipe.md
@@ -5,7 +5,7 @@
 The test runner's test recipes are built upon ROCm Validation Suite (RVS). Here is a full list of supported test recipes by RVS.
 
 | GPU | babel | gpup_single | gst_single | iet_single | pbqt_single | pebb_single | tst_single | gst_ext | gst_selfcheck | gst_stress | iet_stress | gst_thermal | iet_thermal |
-|-----------|-------|-------------|------------|------------|-------------|-------------|------------|---------|---------------|------------|------------|-------------|-------------|
+|--------------|-------|-------------|------------|------------|-------------|-------------|------------|---------|---------------|------------|------------|-------------|-------------|
 | MI210 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | | | | | |
 | MI300X | ✓ | | ✓ | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | | |
 | MI300A | | | | | | ✓ | | | | | ✓ | | |
@@ -13,6 +13,8 @@ The test runner's test recipes are built upon ROCm Validation Suite (RVS). Here
 | MI308X | ✓ | | ✓ | ✓ | | | | | | | ✓ | ✓ | ✓ |
 | MI308X-HF | ✓ | | ✓ | | | | | | | | ✓ | ✓ | ✓ |
 | MI325X | ✓ | | ✓ | | ✓ | ✓ | | | | | ✓ | | |
+| MI350P-450W | ✓ | | ✓ | | | | | | | | ✓ | | |
+| MI350P-600W | ✓ | | ✓ | | | | | | | | ✓ | | |
 | MI350X | ✓ | | ✓ | | ✓ | ✓ | | | | | ✓ | | |
 | MI355X | ✓ | | ✓ | | ✓ | ✓ | | | | | ✓ | | |
 

From 5c620a86951dde91f69d36a1013781c4a64ad141 Mon Sep 17 00:00:00 2001
From: Praveen Kumar Shanmugam <58961022+spraveenio@users.noreply.github.com>
Date: Tue, 30 Jun 2026 09:57:43 -0700
Subject: [PATCH 2/2] Delete docs-internal/knowledge/plans/2026-06-30-mi350p-test-recipe-docs.md

---
 .../2026-06-30-mi350p-test-recipe-docs.md     | 32 -------------------
 1 file changed, 32 deletions(-)
 delete mode 100644 docs-internal/knowledge/plans/2026-06-30-mi350p-test-recipe-docs.md

diff --git a/docs-internal/knowledge/plans/2026-06-30-mi350p-test-recipe-docs.md b/docs-internal/knowledge/plans/2026-06-30-mi350p-test-recipe-docs.md
deleted file mode 100644
index 7e84c67d8..000000000
--- a/docs-internal/knowledge/plans/2026-06-30-mi350p-test-recipe-docs.md
+++ /dev/null
@@ -1,32 +0,0 @@
-# MI350P Test Recipe Documentation
-
-- **Date:** 2026-06-30
-- **Author:** yan.sun3@amd.com
-- **Related PR(s):** #1567
-- **Related issue(s) / JIRA:** N/A
-
-## Context
-
-MI350P (AMD Instinct MI350P, device 0x75a8) support was added to the
-test runner in device-metrics-exporter PR #1410 and #1421. Recipes were
-validated on real hardware (3× AMD Instinct MI350P node, TheRock 7.14.0rc0,
-RVS 1.4.24, AGFHC 1.32.0). The gpu-operator docs need to reflect which
-recipes are available and verified for MI350P.
-
-## Approach
-
-Two documentation files updated:
-
-**`docs/test/appendix-test-recipe.md`** (RVS appendix):
-- MI350P ships in two TDP variants with separate RVS recipe folders
-  (`MI350P-450W` and `MI350P-600W`), so it is listed as two rows
-- Both variants support: `babel_single`, `gst_single`, `iet_stress`
-
-**`docs/test/agfhc.md`** (AGFHC recipe matrix):
-- MI350P row added with verified recipes marked
-- `all_lvl5` and `single_pass` intentionally left blank — `minihpl`
-  binary in AGFHC 1.32.0 is incompatible with TheRock 7.14 rocblas
-  kernel layout; pending fix from AGFHC team
-- `hbm_lvl5`, xgmi, burnin, hsio, rochpl_isolation left blank — these
-  recipes do not exist for MI350P in AGFHC 1.32.0
-- MI350P not added to partition profile table — not validated