
feat: add vLLM + LMCache CPU offloading for MiniMax-M2.5 agentic benchmark on AMD GPUs #1262

Draft

andyluo7 wants to merge 3 commits into main from feat/lmcache-agentic-amd

Conversation

andyluo7 (Collaborator) commented May 2, 2026

Summary

  • Add offloading: lmcache as a new KV cache offloading option for the agentic trace replay benchmark, alongside existing none and cpu (native vLLM)
  • Add MiniMax-M2.5 FP8 agentic benchmark scripts for MI300X, MI325X, and MI355X
  • Add install_lmcache_hip() helper that builds LMCache from source with HIP support (PyPI wheel is CUDA-only)

Files changed

  • 3 new scripts: benchmarks/single_node/agentic/minimaxm2.5_fp8_mi{300x,325x,355x}.sh
  • benchmarks/benchmark_lib.sh: Added install_lmcache_hip() helper
  • utils/matrix_logic/validation.py: Extended offloading Literal to include "lmcache"
  • utils/matrix_logic/test_validation.py: 21 new tests for agentic + LMCache validation
  • .github/configs/amd-master.yaml: Added agentic-coding scenarios with lmcache/none sweeps; bumped images to v0.19.1

LMCache on ROCm — critical settings

PYTHONHASHSEED=0              # mandatory: cache key consistency across TP workers
LMCACHE_LOCAL_CPU=true        # enable CPU DRAM offload tier
LMCACHE_CHUNK_SIZE=256        # token granularity
--enable-prefix-caching       # LMCache reuses vLLM's prefix cache hash function
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
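
Taken together, these settings map onto a server launch along the following lines. This is an illustrative sketch only: the model path and --tensor-parallel-size value are placeholders, not the exact values used by the benchmark scripts.

# Illustrative launch; <minimax-m2.5-fp8-model-path> and the TP size are placeholders.
PYTHONHASHSEED=0 \
LMCACHE_LOCAL_CPU=true \
LMCACHE_CHUNK_SIZE=256 \
vllm serve <minimax-m2.5-fp8-model-path> \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'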

The PyPI lmcache wheel is CUDA-only — LMCache must be built from source with git clone + SETUPTOOLS_SCM_PRETEND_VERSION=0.4.4 BUILD_WITH_HIP=1 pip install -e . --no-build-isolation, as sketched below.
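
A minimal sketch of what install_lmcache_hip() amounts to, assuming the upstream LMCache GitHub repository and a throwaway clone directory (the real helper lives in benchmarks/benchmark_lib.sh):

install_lmcache_hip() {
    # The PyPI wheel ships CUDA kernels only, so build the HIP extension from source.
    # Repo URL and clone path are assumptions for illustration.
    git clone https://github.com/LMCache/LMCache.git /tmp/LMCache
    cd /tmp/LMCache || return 1
    SETUPTOOLS_SCM_PRETEND_VERSION=0.4.4 BUILD_WITH_HIP=1 \
        pip install -e . --no-build-isolation
}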

Hardware validation

Smoke-tested vLLM 0.19.1 + LMCache 0.4.4 + MiniMax-M2.7 (same architecture as M2.5):

  • MI300X (8x gfx942, TP=2): LMCache c_ops HIP backend loads, server healthy, coherent output, prefix reuse confirmed
  • MI355X (8x gfx950, TP=4): LMCache c_ops HIP backend loads, server healthy, coherent output, prefix reuse confirmed

Test plan

  • 172 unit tests pass (82 validation + 90 sweep config)
  • bash -n syntax check on all 3 scripts
  • Full AMD master config validation (47 entries)
  • Sweep generation produces correct matrix with lmcache/none offloading
  • MI300X smoke test: vLLM + LMCache + MiniMax-M2.7 TP=2
  • MI355X smoke test: vLLM + LMCache + MiniMax-M2.7 TP=4
  • Full 30-min trace replay benchmark (requires CI runner)

🤖 Generated with Claude Code

…hmark on AMD GPUs

Add `offloading: lmcache` as a new KV cache offloading option for the
agentic trace replay benchmark on MI300X/MI325X/MI355X. LMCache offloads
cold KV cache pages to CPU DRAM via LMCacheConnectorV1, enabling larger
working sets than HBM-only prefix caching.

- Add benchmark scripts for MiniMax-M2.5 FP8 on MI300X/MI325X/MI355X
- Add install_lmcache_hip() helper (PyPI wheel is CUDA-only, must build from source)
- Extend offloading Literal to include "lmcache" in validation
- Add agentic-coding scenarios to AMD master config with lmcache/none sweeps
- Add 21 new tests for agentic + LMCache validation
- Bump MiniMax vLLM images to v0.19.1

Smoke-tested on MI300X (TP=2) and MI355X (TP=4) with MiniMax-M2.7.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions Bot commented May 2, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment thread: utils/matrix_logic/test_validation.py (outdated)
Comment on lines +958 to +970
def test_cannot_mix_tp_and_prefill(self):
    """Cannot specify both tp and prefill/decode."""
    with pytest.raises(Exception):
        AgenticCodingSearchSpaceEntry(**{
            "tp": 8,
            "prefill": {
                "num-worker": 1, "tp": 4, "ep": 4, "dp-attn": False,
            },
            "decode": {
                "num-worker": 1, "tp": 8, "ep": 8, "dp-attn": False,
            },
            "conc-list": [4],
        })

Contributor

@andyluo7 i don't think this claude test case is accurate

pretty sure vllm & sglang allows prefill tp being different from decode tp on most models

Collaborator (Author)

You're right that vLLM/SGLang support different prefill vs decode TP. This test is just verifying the existing validation logic from PR #1201 (validate_topology_fields in AgenticCodingSearchSpaceEntry), not something we introduced. Updated the docstring to clarify this and note that a future PR could relax the constraint.

Contributor

@cquil11 can u fix validate_topology_fields

Contributor

@claude fix validate_topology_fields

Klaud-Cold (Collaborator) commented May 2, 2026

Claude finished @functionstackx's task in 2m 21s.


Fixed validate_topology_fields

  • Read current validate_topology_fields implementation
  • Read related test cases
  • Fix validation to allow different prefill vs decode TP
  • Update tests accordingly
  • Run tests to verify (174/174 pass)
  • Commit and push

Changes in e7ab020c:

validation.py — Relaxed validate_topology_fields to allow tp + prefill/decode to coexist (for disaggregated prefill/decode serving with different TP values). The new rules are:

  1. If prefill is specified, decode must also be specified (and vice versa)
  2. At least one of tp or prefill+decode must be provided

test_validation.py — Replaced test_cannot_mix_tp_and_prefill (which expected failure) with three new tests:

  • test_tp_with_prefill_decode_allowed — verifies tp + prefill/decode coexist with different TP values
  • test_prefill_without_decode_rejected — verifies only-prefill is rejected
  • test_decode_without_prefill_rejected — verifies only-decode is rejected

Comment thread: utils/matrix_logic/validation.py (outdated)
  dp_attn: bool = Field(alias=Fields.DP_ATTN.value)
  conc: int
- offloading: Literal["none", "cpu", "ssd"] = Field(alias=Fields.OFFLOADING.value)
+ offloading: Literal["none", "cpu", "lmcache", "ssd"] = Field(alias=Fields.OFFLOADING.value)

Contributor

@andyluo7 should this be named as lmcache_cpu

Collaborator (Author)

Good call — renamed to lmcache_cpu in 603fbb8 to leave room for future backends (NVMe, WEKA, etc.).

andyluo7 and others added 2 commits May 2, 2026 12:33
- Rename offloading value from "lmcache" to "lmcache_cpu" to distinguish
  from potential future LMCache backends (NVMe, WEKA, etc.)
- Clarify test_cannot_mix_tp_and_prefill docstring: this tests existing
  validation behavior from PR #1201, not a new constraint. Note that a
  future PR may relax this to allow different prefill/decode TP values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…code TP

vLLM and SGLang support different prefill vs decode TP on most models.
The previous validation rejected entries specifying both tp and
prefill/decode configs. Now tp can coexist with prefill/decode for
disaggregated serving, while still requiring both prefill and decode
if either is specified.

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>