[CP 1557] feat: add bmc-einj-enable and ras-inject-test skills for RAS error injection testing #586
Open
ci-penbot-01 wants to merge 1 commit into
…jection testing (#1557)

* feat: add bmc-einj-enable and ras-inject-test skills for RAS error injection testing

Add two new Claude Code skills for hardware RAS error injection testing:

1. /bmc-einj-enable - enables EINJ on AMD GPU OAM slots via BMC Redfish API, discovers OAM slots, POSTs enable action, issues FullPowerCycle, and verifies activation. Supports SMCI BMCs. Tested on MI350X.
2. /ras-inject-test - runs end-to-end RAS error injection tests using amdgpuras, cross-validates ECC counters between amd-smi (ground truth) and the device metrics exporter, collects AFID data, generates structured test reports, and publishes results to Confluence under a release-based page hierarchy.

Both skills were validated on smci350-rck-g03-b19-03 (MI350X, 8 GPUs):
- GFX and MMHUB blocks: PASS (amd-smi and DME counters match)
- PCIe BIF: SKIP (hardware doesn't expose ECC counters)
- XGMI/WAFL: FAIL (injection accepted but counters don't increment)
- UMC/SDMA: excluded (cause fatal GPU resets)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: address Copilot review - placeholder creds, plan alignment, timeout

- Replace literal IPs/passwords with placeholders in all examples
- Add sshpass security note recommending SSHPASS env var for production
- Use discovered Redfish reset target URI instead of hardcoded path
- Add timeout to amd-smi metric --ecc-block example command
- Align plans with implementation: FullPowerCycle (not GracefulRestart), both OEM paths, Confluence publishing is in-scope (optional)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
(cherry picked from commit 42b51df8c88fde01787cc6961f2c5dae60b5511f)
AI-Assisted Cherry-Pick
Source PR: #1557
The cherry-pick operation encountered merge conflicts, which were resolved automatically using AI assistance.
Files with conflicts (resolved by AI):
Original conflict in .claude/skills/README.md
<<<<<<< HEAD
(file deleted in HEAD)
=======
### Hardware / BMC
#### `/bmc-einj-enable`
**File**: `bmc-einj-enable/SKILL.md` | **Component**: BMC / RAS
Enable EINJ (Error Injection) on AMD GPU OAM slots via BMC Redfish API. Pre-requisite for `amdgpuras` RAS error injection testing.
**Use cases**:
- Enable EINJ on SMCI BMC before running `amdgpuras` RAS tests
- Discover which OAM slot(s) support EINJ
- Power cycle the host via Redfish to activate EINJ
- Verify EINJ state after reboot
**Supported BMC vendors**: SMCI (Supermicro) — other vendors may not expose the OEM endpoint.
**Example**:
```bash
/bmc-einj-enable <BMC_IP> <USERNAME> <PASSWORD>
```
cp of pensando/gpu-operator#1557
Source PR Description (pensando/gpu-operator#1557):
Summary
Add two new Claude Code skills for hardware RAS error injection testing on AMD GPUs, with end-to-end validation against real hardware and Confluence reporting.
/bmc-einj-enable: BMC EINJ Enablement via Redfish
Automates the multi-step process of enabling Error Injection (EINJ) on AMD GPU OAM slots through the BMC Redfish API. This is a prerequisite before `amdgpuras` can inject hardware RAS errors.
What it does:
- Discovers OAM slots that expose `EINJState`
- Checks both the `Oem.EINJState` and `Oem.Ami.AMD.EINJState` paths (varies by firmware)
- POSTs `{"ErrInjection": "Enable"}` to each disabled slot
- Issues `FullPowerCycle` via Redfish (`GracefulRestart` is NOT sufficient for EINJ activation)
- Verifies `EINJState = "Enabled"` after reboot
Supported BMC vendors: SMCI (Supermicro), confirmed working.
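The state lookup and slot selection described above can be sketched in a few lines of Python. This is only an illustration, not the skill's actual implementation: the function names and the `oam0`/`oam1` slot keys are hypothetical, and no HTTP calls are shown because the Redfish resource and action URIs do not appear in this PR text. Only the two OEM paths, the `Disabled`/`Enabled` state strings, and the enable payload are taken from the description.

```python
from typing import Any, Optional

# The two OEM locations named in the PR; which one is populated varies by firmware.
EINJ_PATHS = (("Oem", "EINJState"), ("Oem", "Ami", "AMD", "EINJState"))

# Request body the PR says is POSTed to each disabled slot.
ENABLE_PAYLOAD = {"ErrInjection": "Enable"}


def find_einj_state(resource: dict) -> Optional[str]:
    """Return the EINJState string from a Redfish resource body, or None if absent."""
    for path in EINJ_PATHS:
        node: Any = resource
        for key in path:
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        if isinstance(node, str):
            return node
    return None


def slots_to_enable(slots: dict) -> list:
    """Pick the slots whose EINJState is 'Disabled' (hypothetical helper).

    Slots without an EINJState, or already pending/enabled, are left alone.
    """
    return [slot for slot, body in slots.items() if find_einj_state(body) == "Disabled"]


# Illustrative input: slot identifiers are made up, one body per OEM layout.
example_slots = {
    "oam0": {"Oem": {"EINJState": "Disabled"}},
    "oam1": {"Oem": {"Ami": {"AMD": {"EINJState": "Enabled"}}}},
    "oam2": {"Oem": {}},  # slot that does not expose EINJ at all
}
pending = slots_to_enable(example_slots)
```

Trying both paths in order keeps the caller independent of the firmware layout; a real client would then POST `ENABLE_PAYLOAD` for each entry in `pending` and re-read the state after the power cycle.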
Invocation:
/bmc-einj-enable <BMC_IP> <USERNAME> <PASSWORD>
/ras-inject-test: RAS Error Injection Testing with Cross-Validation
Runs end-to-end RAS error injection tests using `amdgpuras`, cross-validates ECC counters between `amd-smi` (hardware ground truth) and the Device Metrics Exporter (DME), collects AFID data, and generates structured test reports published to Confluence.
What it does:
- … `amdgpuras -l`
- Collects AFID data via `amd-smi ras --cper` and correlates it with the `gpu_afid_errors` metric
Cross-validation approach:
- `amd-smi metric --ecc-block` is the ground truth (per-block HW counters)
- DME metrics (`gpu_ecc_uncorrect_<block>`, `gpu_ecc_uncorrect_total`, `gpu_health`) are the system under test
Invocation:
/ras-inject-test <HOST_IP> <USER> <PASS> [--release v1.5.1] [--blocks gfx,mmhub]
Files Changed
- .claude/skills/bmc-einj-enable/SKILL.md
- .claude/skills/ras-inject-test/SKILL.md
- .claude/commands/bmc-einj-enable.md
- .claude/commands/ras-inject-test.md
- .claude/skills/README.md
- docs-internal/knowledge/plans/2026-06-24-bmc-einj-enable-skill.md
- docs-internal/knowledge/plans/2026-06-24-ras-inject-test-skill.md
Validation: Live Hardware Testing
Tested on smci350-rck-g03-b19-03 (SMCI, MI350X, 8 GPUs, driver 6.16.13, ROCm 7.2.1, DME exporter-0.0.1-342).
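As a rough illustration of the cross-validation idea (amd-smi as ground truth, DME as the system under test), the comparison can be expressed as a check on counter deltas around one injection. This is a sketch under assumptions: the `cross_validate` function, the `Verdict` type, and the verdict rules are my reading of the PASS/SKIP/FAIL outcomes reported in this PR, and the sample counter values are illustrative rather than captured output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    block: str
    result: str  # "PASS", "FAIL", or "SKIP"
    reason: str


def cross_validate(block: str,
                   smi_before: Optional[int], smi_after: Optional[int],
                   dme_before: int, dme_after: int) -> Verdict:
    """Compare counter deltas for one block around a single injection.

    smi_* are the amd-smi per-block counters (None when the hardware does not
    expose a counter for the block); dme_* are the exporter's values.
    """
    if smi_before is None or smi_after is None:
        return Verdict(block, "SKIP", "hardware does not expose an ECC counter")
    smi_delta = smi_after - smi_before
    dme_delta = dme_after - dme_before
    if smi_delta <= 0:
        return Verdict(block, "FAIL", "ground-truth counter did not increment")
    if dme_delta != smi_delta:
        return Verdict(block, "FAIL",
                       f"DME delta {dme_delta} != amd-smi delta {smi_delta}")
    return Verdict(block, "PASS", "amd-smi and DME deltas match")


# Illustrative values modeled on the outcomes reported for GPU 0.
results = [
    cross_validate("gfx", 0, 1, 0, 1),
    cross_validate("bif", None, None, 0, 0),
    cross_validate("xgmi_wafl", 0, 0, 0, 0),
]
for v in results:
    print(f"{v.block}: {v.result} ({v.reason})")
```

Comparing deltas rather than absolute values keeps the check valid when counters are already non-zero from earlier injections, as with `uncorrect_total` going 1→2 on the second block.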
EINJ Enablement
- `EINJState` changed from `Disabled` → `pending to enable` → `Enabled` after FullPowerCycle
RAS Injection Results (GPU 0)
- `gpu_ecc_uncorrect_gfx`: 0→1 (`uncorrect_total`: 0→1)
- `gpu_ecc_uncorrect_mmhub`: 0→1 (`uncorrect_total`: 1→2)
- `gpu_ecc_uncorrect_bif`: 0→0
- `gpu_ecc_uncorrect_xgmi_wafl`: 0→0
Block Risk Classification
Key Learnings
- `FullPowerCycle` (not `GracefulRestart`) required for EINJ activation
- `Oem.EINJState` vs `Oem.Ami.AMD.EINJState`
- `-s 1 -m 2` flags (sub-block 1, method ecrc_tx)
Results
Plan: docs-internal/knowledge/plans/2026-06-24-bmc-einj-enable-skill.md
Plan: docs-internal/knowledge/plans/2026-06-24-ras-inject-test-skill.md
Test plan
- `/bmc-einj-enable` on SMCI BMC: EINJ enabled on all 8 OAM slots
- `/ras-inject-test` GFX block: PASS (amd-smi + DME match)
- `/ras-inject-test` MMHUB block: PASS
🤖 Generated with Claude Code
Cherrypick triggered by: ACP-Automation