
[CP 1557] feat: add bmc-einj-enable and ras-inject-test skills for RAS error injection testing#586

Open
ci-penbot-01 wants to merge 1 commit into ROCm:main from ci-penbot-01:CP.O2O.pensando.gpu-operator.1557.rocm.gpu-operator.main

@ci-penbot-01

Cherry-pick of pensando/gpu-operator#1557


Source PR Description (pensando/gpu-operator#1557):

Summary

Add two new Claude Code skills for hardware RAS error injection testing on AMD GPUs, with end-to-end validation against real hardware and Confluence reporting.

/bmc-einj-enable — BMC EINJ Enablement via Redfish

Automates the multi-step process of enabling Error Injection (EINJ) on AMD GPU OAM slots through the BMC Redfish API. This is a prerequisite before amdgpuras can inject hardware RAS errors.

What it does:

  1. Iterates OAM_0 through OAM_7 to discover which slots expose EINJState
  2. Checks both Oem.EINJState and Oem.Ami.AMD.EINJState paths (varies by firmware)
  3. POSTs {"ErrInjection": "Enable"} to each disabled slot
  4. Issues FullPowerCycle via Redfish (GracefulRestart is NOT sufficient for EINJ activation)
  5. Polls power state and verifies EINJState = "Enabled" after reboot
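
The two-path lookup in step 2 and the slot selection in step 3 can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the function names and the `oem_by_slot` dictionary are hypothetical, and the sketch assumes each OAM resource's `Oem` object has already been fetched from the BMC. Only the two key paths and the `"Enabled"` value come from the description above.

```python
from typing import Optional

# The two OEM key paths named in the description, relative to the "Oem" object.
# Which one is present varies by firmware.
EINJ_PATHS = (
    ("EINJState",),
    ("Ami", "AMD", "EINJState"),
)


def find_einj_state(oem: dict) -> Optional[str]:
    """Return the EINJState string from an Oem object, or None if absent."""
    for path in EINJ_PATHS:
        node = oem
        for key in path:
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        if isinstance(node, str):
            return node
    return None


def slots_to_enable(oem_by_slot: dict) -> list:
    """List slots that expose EINJState but do not yet report "Enabled"."""
    pending = []
    for slot, oem in oem_by_slot.items():
        state = find_einj_state(oem)
        if state is not None and state != "Enabled":
            pending.append(slot)
    return pending
```

Slots that expose neither path are left out of the result, which mirrors the discovery step: only slots that report an EINJState are candidates for the enable request.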

Supported BMC vendors: SMCI (Supermicro) — confirmed working.

Invocation: /bmc-einj-enable <BMC_IP> <USERNAME> <PASSWORD>

/ras-inject-test — RAS Error Injection Testing with Cross-Validation

Runs end-to-end RAS error injection tests using amdgpuras, cross-validates ECC counters between amd-smi (hardware ground truth) and the Device Metrics Exporter (DME), collects AFID data, and generates structured test reports published to Confluence.

What it does:

  1. Gathers system info: driver version, ROCm version, amd-smi version, amdgpuras version, GPU series, DME version
  2. Discovers injectable blocks via amdgpuras -l
  3. For each GPU × block: captures baseline from amd-smi + DME, injects error, waits 35s, captures post-injection, cross-validates per-block and total counters
  4. Collects AFID data via amd-smi ras --cper and correlates with gpu_afid_errors metric
  5. Generates a markdown report and publishes to Confluence under a release-based page hierarchy
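
The per-GPU, per-block loop in step 3 can be sketched as below. This is an illustrative outline under stated assumptions: `capture` and `inject` are hypothetical callables standing in for the amd-smi/DME reads and the amdgpuras call, and the `Snapshot` shape is invented for the example. The 35-second wait and the one-injection-at-a-time ordering are taken from the description.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple

Snapshot = Dict[str, int]  # hypothetical shape: counter name -> value


def run_injections(
    gpus: Iterable[int],
    blocks: Iterable[str],
    capture: Callable[[int, str], Snapshot],
    inject: Callable[[int, str], None],
    settle_seconds: float = 35.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[int, str, Snapshot, Snapshot]]:
    """Inject one error at a time and record (gpu, block, baseline, post)."""
    results = []
    blocks = list(blocks)  # materialize so the block list can be reused per GPU
    for gpu in gpus:
        for block in blocks:
            baseline = capture(gpu, block)  # counters before injection
            inject(gpu, block)              # a single injection, never stacked
            sleep(settle_seconds)           # 35s wait from the description
            post = capture(gpu, block)      # counters after injection
            results.append((gpu, block, baseline, post))
    return results
```

Passing `sleep` as a parameter keeps the loop testable without actually waiting; on real hardware the default `time.sleep` applies.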

Cross-validation approach:

  • amd-smi metric --ecc-block is the ground truth (per-block HW counters)
  • DME Prometheus metrics (gpu_ecc_uncorrect_<block>, gpu_ecc_uncorrect_total, gpu_health) are the system under test
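  • Results: PASS (both match), PARTIAL (amd-smi changed but DME didn't, which indicates a DME bug), FAIL (e.g. injection accepted but amd-smi counters don't increment), RESET (injection triggered a GPU reset), SKIP (amd-smi exposes no hardware counters for the block)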
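
One possible way to turn the counter deltas into the result labels listed above is sketched here. The description only defines PASS and PARTIAL explicitly; the rules for FAIL, RESET, and SKIP in this sketch are assumptions inferred from the validation results, and the function name and parameters are hypothetical.

```python
from typing import Optional


def classify(
    smi_before: Optional[int],
    smi_after: Optional[int],
    dme_before: int,
    dme_after: int,
    gpu_reset: bool = False,
) -> str:
    """Map one injection's counter readings to a result label."""
    if gpu_reset:
        return "RESET"    # assumption: a reset makes the counters unusable
    if smi_before is None or smi_after is None:
        return "SKIP"     # amd-smi reports N/A, so there is no ground truth
    smi_delta = smi_after - smi_before
    dme_delta = dme_after - dme_before
    if smi_delta <= 0:
        return "FAIL"     # assumption: injection had no visible effect
    if dme_delta == smi_delta:
        return "PASS"     # exporter agrees with the ground truth
    if dme_delta == 0:
        return "PARTIAL"  # amd-smi changed but DME did not
    return "FAIL"         # assumption: any other mismatch counts as a failure
```

For example, `classify(0, 1, 0, 1)` corresponds to a block where both amd-smi and DME go from 0 to 1, and `classify(None, None, 0, 0)` to a block for which amd-smi reports N/A.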

Invocation: /ras-inject-test <HOST_IP> <USER> <PASS> [--release v1.5.1] [--blocks gfx,mmhub]

Files Changed

| File | Change |
|------|--------|
| .claude/skills/bmc-einj-enable/SKILL.md | New skill definition |
| .claude/skills/ras-inject-test/SKILL.md | New skill definition |
| .claude/commands/bmc-einj-enable.md | Symlink to skill |
| .claude/commands/ras-inject-test.md | Symlink to skill |
| .claude/skills/README.md | Added both skills under "Hardware / BMC" section |
| docs-internal/knowledge/plans/2026-06-24-bmc-einj-enable-skill.md | Plan file |
| docs-internal/knowledge/plans/2026-06-24-ras-inject-test-skill.md | Plan file |

Validation — Live Hardware Testing

Tested on smci350-rck-g03-b19-03 (SMCI, MI350X, 8 GPUs, driver 6.16.13, ROCm 7.2.1, DME exporter-0.0.1-342).

EINJ Enablement

  • All 8 OAM slots: EINJState changed from Disabled → pending to enable → Enabled after FullPowerCycle

RAS Injection Results (GPU 0)

| Block | amd-smi per-block | DME per-block | DME total | Health | AFID | Result |
|-------|-------------------|---------------|-----------|--------|------|--------|
| GFX (b=2) | UE 0→1 | gpu_ecc_uncorrect_gfx 0→1 | uncorrect_total 0→1 | 1→0 | 30 (FATAL) | PASS |
| MMHUB (b=3) | UE 0→1 | gpu_ecc_uncorrect_mmhub 0→1 | uncorrect_total 1→2 | 0→0 | 30 (FATAL) | PASS |
| PCIe BIF (b=5) | N/A (no HW counters) | gpu_ecc_uncorrect_bif 0→0 | unchanged | unchanged | | SKIP |
| XGMI/WAFL (b=7) | UE 0→0 | gpu_ecc_uncorrect_xgmi_wafl 0→0 | unchanged | unchanged | | FAIL |

Block Risk Classification

| Block | Risk | Behavior |
|-------|------|----------|
| GFX | Safe | Reliably works, no GPU reset |
| MMHUB | Safe | Works when GPU is not recovering from prior injection |
| PCIe BIF | No HW counters | amd-smi reports N/A; hardware doesn't expose ECC counters |
| XGMI/WAFL | Unreliable | Injection accepted but counters don't increment |
| UMC | Fatal reset | All GPUs enter "resuming", requires FullPowerCycle |
| SDMA | Fatal reset | Same as UMC |
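
The classification above could be used to filter a requested block list before a run, for instance to leave out the blocks that reset every GPU. The sketch below is only an illustration: the dictionary layout, the `select_blocks` name, and the `allow_fatal` switch are hypothetical, while the risk labels are copied from the table.

```python
# Risk labels copied from the table above; the dictionary layout is illustrative.
BLOCK_RISK = {
    "GFX": "Safe",
    "MMHUB": "Safe",
    "PCIe BIF": "No HW counters",
    "XGMI/WAFL": "Unreliable",
    "UMC": "Fatal reset",
    "SDMA": "Fatal reset",
}


def select_blocks(requested, allow_fatal=False):
    """Drop blocks whose injection resets the GPUs unless explicitly allowed."""
    selected = []
    for block in requested:
        risk = BLOCK_RISK.get(block)
        if risk is None:
            raise ValueError(f"unknown block: {block}")
        if risk == "Fatal reset" and not allow_fatal:
            continue  # UMC and SDMA require a FullPowerCycle to recover
        selected.append(block)
    return selected
```

Blocks marked "No HW counters" or "Unreliable" are kept in the selection here so that their SKIP or FAIL outcome still appears in the report.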

Key Learnings

  • FullPowerCycle (not GracefulRestart) required for EINJ activation
  • OEM Redfish path varies by firmware: Oem.EINJState vs Oem.Ami.AMD.EINJState
  • PCIe BIF requires -s 1 -m 2 flags (sub-block 1, method ecrc_tx)
  • Sequential injection is critical — stacking causes GPU resets that clear counters

Results

Plan: docs-internal/knowledge/plans/2026-06-24-bmc-einj-enable-skill.md
Plan: docs-internal/knowledge/plans/2026-06-24-ras-inject-test-skill.md

Test plan

  • /bmc-einj-enable on SMCI BMC — EINJ enabled on all 8 OAM slots
  • /ras-inject-test GFX block — PASS (amd-smi + DME match)
  • /ras-inject-test MMHUB block — PASS
  • Documented PCIe BIF (SKIP) and XGMI/WAFL (FAIL) limitations
  • Confluence report created with parent/child page structure

🤖 Generated with Claude Code

Cherrypick triggered by: ACP-Automation

…jection testing (#1557)

* feat: add bmc-einj-enable and ras-inject-test skills for RAS error injection testing

Add two new Claude Code skills for hardware RAS error injection testing:

1. /bmc-einj-enable — enables EINJ on AMD GPU OAM slots via BMC Redfish API,
   discovers OAM slots, POSTs enable action, issues FullPowerCycle, and verifies
   activation. Supports SMCI BMCs. Tested on MI350X.

2. /ras-inject-test — runs end-to-end RAS error injection tests using amdgpuras,
   cross-validates ECC counters between amd-smi (ground truth) and the device
   metrics exporter, collects AFID data, generates structured test reports, and
   publishes results to Confluence under a release-based page hierarchy.

Both skills were validated on smci350-rck-g03-b19-03 (MI350X, 8 GPUs):
- GFX and MMHUB blocks: PASS (amd-smi and DME counters match)
- PCIe BIF: SKIP (hardware doesn't expose ECC counters)
- XGMI/WAFL: FAIL (injection accepted but counters don't increment)
- UMC/SDMA: excluded (cause fatal GPU resets)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: address Copilot review — placeholder creds, plan alignment, timeout

- Replace literal IPs/passwords with placeholders in all examples
- Add sshpass security note recommending SSHPASS env var for production
- Use discovered Redfish reset target URI instead of hardcoded path
- Add timeout to amd-smi metric --ecc-block example command
- Align plans with implementation: FullPowerCycle (not GracefulRestart),
  both OEM paths, Confluence publishing is in-scope (optional)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
(cherry picked from commit 42b51df8c88fde01787cc6961f2c5dae60b5511f)
@ci-penbot-01


AI-Assisted Cherry-Pick

Source PR: #1557
Target Branch: main

The cherry-pick operation encountered merge conflicts which were resolved automatically using AI assistance.

Files with conflicts (resolved by AI):

  • .claude/skills/README.md:394-444
<details>
<summary>Original conflict in .claude/skills/README.md</summary>

<<<<<<< HEAD
(file deleted in HEAD)
=======
### Hardware / BMC

#### `/bmc-einj-enable`

**File**: `bmc-einj-enable/SKILL.md` | **Component**: BMC / RAS

Enable EINJ (Error Injection) on AMD GPU OAM slots via BMC Redfish API. Pre-requisite for `amdgpuras` RAS error injection testing.

**Use cases**:

- Enable EINJ on SMCI BMC before running `amdgpuras` RAS tests
- Discover which OAM slot(s) support EINJ
- Power cycle the host via Redfish to activate EINJ
- Verify EINJ state after reboot

**Supported BMC vendors**: SMCI (Supermicro) — other vendors may not expose the OEM endpoint.

**Example**:

```bash
/bmc-einj-enable <BMC_IP> <USERNAME> <PASSWORD>
```

#### `/ras-inject-test`

**File**: `ras-inject-test/SKILL.md` | **Component**: RAS / ECC Testing

Run end-to-end RAS error injection tests on AMD GPUs. Injects errors via `amdgpuras`, verifies ECC counters in `amd-smi` (ground truth) and device-metrics-exporter, collects AFID data, and generates a test report.

**Use cases**:

- Validate DME ECC metric accuracy against `amd-smi` after hardware error injection
- Test all injectable blocks (UMC, SDMA, GFX, MMHUB, PCIe, XGMI) across all GPUs
- Collect AFID data and correlate with `gpu_afid_errors` metric
- Generate structured test reports for Confluence upload

**Prerequisites**: EINJ enabled (`/bmc-einj-enable`), `amdgpuras` installed, DME running on host.

**Example**:

/ras-inject-test <HOST_IP> <USERNAME> <PASSWORD>
/ras-inject-test <HOST_IP> <USERNAME> <PASSWORD> --release v1.5.2
/ras-inject-test <HOST_IP> <USERNAME> <PASSWORD> --release v1.5.1 --blocks gfx,mmhub,pcie_bif

>>>>>>> 42b51df8 (feat: add bmc-einj-enable and ras-inject-test skills for RAS error injection testing (#1557))

</details>


*Cherry-pick triggered by: ACP-Automation*
