feat: add vLLM + LMCache CPU offloading for MiniMax-M2.5 agentic benchmark on AMD GPUs#1262
Conversation
…hmark on AMD GPUs

Add `offloading: lmcache` as a new KV cache offloading option for the agentic trace replay benchmark on MI300X/MI325X/MI355X. LMCache offloads cold KV cache pages to CPU DRAM via LMCacheConnectorV1, enabling larger working sets than HBM-only prefix caching.

- Add benchmark scripts for MiniMax-M2.5 FP8 on MI300X/MI325X/MI355X
- Add `install_lmcache_hip()` helper (the PyPI wheel is CUDA-only, so LMCache must be built from source)
- Extend the `offloading` Literal to include `"lmcache"` in validation
- Add agentic-coding scenarios to the AMD master config with lmcache/none sweeps
- Add 21 new tests for agentic + LMCache validation
- Bump MiniMax vLLM images to v0.19.1

Smoke-tested on MI300X (TP=2) and MI355X (TP=4) with MiniMax-M2.7.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective company's CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```python
def test_cannot_mix_tp_and_prefill(self):
    """Cannot specify both tp and prefill/decode."""
    with pytest.raises(Exception):
        AgenticCodingSearchSpaceEntry(**{
            "tp": 8,
            "prefill": {
                "num-worker": 1, "tp": 4, "ep": 4, "dp-attn": False,
            },
            "decode": {
                "num-worker": 1, "tp": 8, "ep": 8, "dp-attn": False,
            },
            "conc-list": [4],
        })
```
@andyluo7 I don't think this Claude test case is accurate.
Pretty sure vLLM & SGLang allow prefill TP to differ from decode TP on most models.
You're right that vLLM/SGLang support different prefill vs decode TP. This test is just verifying the existing validation logic from PR #1201 (validate_topology_fields in AgenticCodingSearchSpaceEntry), not something we introduced. Updated the docstring to clarify this and note that a future PR could relax the constraint.
Claude finished @functionstackx's task in 2m 21s
Fixed `validate_topology_fields`

- Read current `validate_topology_fields` implementation
- Read related test cases
- Fix validation to allow different prefill vs decode TP
- Update tests accordingly
- Run tests to verify (174/174 pass)
- Commit and push
Changes in e7ab020c:

`validation.py` — Relaxed `validate_topology_fields` to allow `tp` + `prefill`/`decode` to coexist (for disaggregated prefill/decode serving with different TP values). The new rules are:

- If `prefill` is specified, `decode` must also be specified (and vice versa)
- At least one of `tp` or `prefill`+`decode` must be provided

`test_validation.py` — Replaced `test_cannot_mix_tp_and_prefill` (which expected failure) with three new tests:

- `test_tp_with_prefill_decode_allowed` — verifies `tp` + `prefill`/`decode` coexist with different TP values
- `test_prefill_without_decode_rejected` — verifies prefill-only is rejected
- `test_decode_without_prefill_rejected` — verifies decode-only is rejected
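The relaxed rules above can be sketched as a standalone validator. This is a minimal illustration using a plain dataclass rather than the repo's actual pydantic model; the class name `TopologyEntry` and the reduced field set are assumptions made for the sketch:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TopologyEntry:
    """Simplified stand-in for the topology fields of a search-space entry."""
    tp: Optional[int] = None
    prefill: Optional[dict] = None
    decode: Optional[dict] = None

    def __post_init__(self):
        # Rule 1: prefill and decode must be specified together.
        if (self.prefill is None) != (self.decode is None):
            raise ValueError("prefill and decode must both be specified")
        # Rule 2: at least one of tp or prefill+decode must be provided.
        if self.tp is None and self.prefill is None:
            raise ValueError("specify tp and/or prefill+decode")

# tp alone, prefill+decode alone, and both together are all accepted;
# note prefill and decode may carry different TP values.
TopologyEntry(tp=8)
TopologyEntry(prefill={"tp": 4}, decode={"tp": 8})
TopologyEntry(tp=8, prefill={"tp": 4}, decode={"tp": 8})
```

An entry with only `prefill` (or only `decode`), or with no topology at all, raises `ValueError`, matching the two rules listed above.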
```diff
  dp_attn: bool = Field(alias=Fields.DP_ATTN.value)
  conc: int
- offloading: Literal["none", "cpu", "ssd"] = Field(alias=Fields.OFFLOADING.value)
+ offloading: Literal["none", "cpu", "lmcache", "ssd"] = Field(alias=Fields.OFFLOADING.value)
```
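To show what the extended `Literal` enforces, here is a small sketch using `typing.get_args`; the helper function `check_offloading` is hypothetical, not part of the repo:

```python
from typing import Literal, get_args

# Mirrors the extended annotation in the diff above.
Offloading = Literal["none", "cpu", "lmcache", "ssd"]

def check_offloading(value: str) -> str:
    """Hypothetical helper: reject values outside the Literal's choices."""
    allowed = get_args(Offloading)
    if value not in allowed:
        raise ValueError(f"offloading must be one of {allowed}, got {value!r}")
    return value
```

With this sketch, `check_offloading("lmcache")` passes, while an out-of-set value such as `"nvme"` raises `ValueError`, which is the behavior pydantic provides automatically for the `Literal` annotation.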
- Rename offloading value from "lmcache" to "lmcache_cpu" to distinguish it from potential future LMCache backends (NVMe, WEKA, etc.)
- Clarify the `test_cannot_mix_tp_and_prefill` docstring: this tests existing validation behavior from PR #1201, not a new constraint. Note that a future PR may relax this to allow different prefill/decode TP values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…code TP

vLLM and SGLang support different prefill vs decode TP on most models. The previous validation rejected entries specifying both `tp` and prefill/decode configs. Now `tp` can coexist with prefill/decode for disaggregated serving, while still requiring both prefill and decode if either is specified.

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Summary

- Add `offloading: lmcache` as a new KV cache offloading option for the agentic trace replay benchmark, alongside the existing `none` and `cpu` (native vLLM) options
- Add an `install_lmcache_hip()` helper that builds LMCache from source with HIP support (the PyPI wheel is CUDA-only)

Files changed

- `benchmarks/single_node/agentic/minimaxm2.5_fp8_mi{300x,325x,355x}.sh`
- `benchmarks/benchmark_lib.sh`: Added `install_lmcache_hip()` helper
- `utils/matrix_logic/validation.py`: Extended `offloading` Literal to include `"lmcache"`
- `utils/matrix_logic/test_validation.py`: 21 new tests for agentic + LMCache validation
- `.github/configs/amd-master.yaml`: Added `agentic-coding` scenarios with lmcache/none sweeps; bumped images to v0.19.1

LMCache on ROCm — critical settings
The PyPI `lmcache` wheel is CUDA-only — must build from source: `git clone` + `SETUPTOOLS_SCM_PRETEND_VERSION=0.4.4 BUILD_WITH_HIP=1 pip install -e . --no-build-isolation`
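One possible shape for the `install_lmcache_hip()` helper, shown as a sketch: the clone URL, checkout directory, and `DRY_RUN` switch are assumptions; only the `SETUPTOOLS_SCM_PRETEND_VERSION`/`BUILD_WITH_HIP` env vars and pip flags come from the command above:

```shell
# Sketch of install_lmcache_hip(): build LMCache from source with HIP support,
# since the PyPI wheel is CUDA-only. Set DRY_RUN=1 to print the commands
# instead of running them (useful for inspection).
install_lmcache_hip() {
  local version="${LMCACHE_VERSION:-0.4.4}"
  local src="${LMCACHE_SRC:-/tmp/lmcache-src}"   # assumed checkout location
  local run="eval"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    run="echo"
  fi
  $run "git clone https://github.com/LMCache/LMCache.git $src"
  $run "cd $src && SETUPTOOLS_SCM_PRETEND_VERSION=$version BUILD_WITH_HIP=1 pip install -e . --no-build-isolation"
}

# Dry run: print the two commands without cloning or installing anything.
DRY_RUN=1 install_lmcache_hip
```

`SETUPTOOLS_SCM_PRETEND_VERSION` pins the package version for setuptools-scm when building outside a tagged checkout, and `--no-build-isolation` lets the build see the already-installed ROCm PyTorch instead of pulling a CUDA one into an isolated build env.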
Smoke-tested vLLM 0.19.1 + LMCache 0.4.4 + MiniMax-M2.7 (same architecture as M2.5):
Test plan

- `bash -n` syntax check on all 3 scripts
- `lmcache`/`none` offloading

🤖 Generated with Claude Code