feat: add vLLM + LMCache CPU offloading for MiniMax-M2.5 agentic benchmark on AMD GPUs#1262
Conversation
…hmark on AMD GPUs

Add `offloading: lmcache` as a new KV cache offloading option for the agentic trace replay benchmark on MI300X/MI325X/MI355X. LMCache offloads cold KV cache pages to CPU DRAM via LMCacheConnectorV1, enabling larger working sets than HBM-only prefix caching.

- Add benchmark scripts for MiniMax-M2.5 FP8 on MI300X/MI325X/MI355X
- Add `install_lmcache_hip()` helper (the PyPI wheel is CUDA-only, so LMCache must be built from source)
- Extend the `offloading` Literal to include `"lmcache"` in validation
- Add agentic-coding scenarios to the AMD master config with lmcache/none sweeps
- Add 21 new tests for agentic + LMCache validation
- Bump MiniMax vLLM images to v0.19.1

Smoke-tested on MI300X (TP=2) and MI355X (TP=4) with MiniMax-M2.7.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective company's CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```python
def test_cannot_mix_tp_and_prefill(self):
    """Cannot specify both tp and prefill/decode."""
    with pytest.raises(Exception):
        AgenticCodingSearchSpaceEntry(**{
            "tp": 8,
            "prefill": {
                "num-worker": 1, "tp": 4, "ep": 4, "dp-attn": False,
            },
            "decode": {
                "num-worker": 1, "tp": 8, "ep": 8, "dp-attn": False,
            },
            "conc-list": [4],
        })
```
@andyluo7 I don't think this Claude test case is accurate.
Pretty sure vLLM & SGLang allow prefill TP to differ from decode TP on most models.
You're right that vLLM/SGLang support different prefill vs decode TP. This test is just verifying the existing validation logic from PR #1201 (validate_topology_fields in AgenticCodingSearchSpaceEntry), not something we introduced. Updated the docstring to clarify this and note that a future PR could relax the constraint.
Claude finished @functionstackx's task in 2m 21s
Fixed `validate_topology_fields`

- Read current `validate_topology_fields` implementation
- Read related test cases
- Fix validation to allow different prefill vs decode TP
- Update tests accordingly
- Run tests to verify (174/174 pass)
- Commit and push
Changes in e7ab020c:

`validation.py` — Relaxed `validate_topology_fields` to allow `tp` + `prefill`/`decode` to coexist (for disaggregated prefill/decode serving with different TP values). The new rules are:

- If `prefill` is specified, `decode` must also be specified (and vice versa)
- At least one of `tp` or `prefill`+`decode` must be provided

`test_validation.py` — Replaced `test_cannot_mix_tp_and_prefill` (which expected failure) with three new tests:

- `test_tp_with_prefill_decode_allowed` — verifies `tp` + `prefill`/`decode` coexist with different TP values
- `test_prefill_without_decode_rejected` — verifies prefill-only is rejected
- `test_decode_without_prefill_rejected` — verifies decode-only is rejected
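The relaxed rules above can be sketched as a standalone validator. This is a minimal illustration using a plain dataclass rather than the repo's actual pydantic model; the class name `TopologyEntry` and the reduced field set are assumptions made for the sketch:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TopologyEntry:
    """Simplified stand-in for the topology fields of a search-space entry."""
    tp: Optional[int] = None
    prefill: Optional[dict] = None
    decode: Optional[dict] = None

    def __post_init__(self):
        # Rule 1: prefill and decode must be specified together.
        if (self.prefill is None) != (self.decode is None):
            raise ValueError("prefill and decode must both be specified")
        # Rule 2: at least one of tp or prefill+decode must be provided.
        if self.tp is None and self.prefill is None:
            raise ValueError("specify tp and/or prefill+decode")

# tp alone, prefill+decode alone, and both together are all accepted;
# note prefill and decode may carry different TP values.
TopologyEntry(tp=8)
TopologyEntry(prefill={"tp": 4}, decode={"tp": 8})
TopologyEntry(tp=8, prefill={"tp": 4}, decode={"tp": 8})
```

An entry with only `prefill` (or only `decode`), or with no topology at all, raises `ValueError`, matching the two rules listed above.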
```diff
  dp_attn: bool = Field(alias=Fields.DP_ATTN.value)
  conc: int
- offloading: Literal["none", "cpu", "ssd"] = Field(alias=Fields.OFFLOADING.value)
+ offloading: Literal["none", "cpu", "lmcache", "ssd"] = Field(alias=Fields.OFFLOADING.value)
```
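To show what the extended `Literal` enforces, here is a small sketch using `typing.get_args`; the helper function `check_offloading` is hypothetical, not part of the repo:

```python
from typing import Literal, get_args

# Mirrors the extended annotation in the diff above.
Offloading = Literal["none", "cpu", "lmcache", "ssd"]

def check_offloading(value: str) -> str:
    """Hypothetical helper: reject values outside the Literal's choices."""
    allowed = get_args(Offloading)
    if value not in allowed:
        raise ValueError(f"offloading must be one of {allowed}, got {value!r}")
    return value
```

With this sketch, `check_offloading("lmcache")` passes, while an out-of-set value such as `"nvme"` raises `ValueError`, which is the behavior pydantic provides automatically for the `Literal` annotation.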
- Rename offloading value from "lmcache" to "lmcache_cpu" to distinguish it from potential future LMCache backends (NVMe, WEKA, etc.)
- Clarify the `test_cannot_mix_tp_and_prefill` docstring: this tests existing validation behavior from PR #1201, not a new constraint. Note that a future PR may relax this to allow different prefill/decode TP values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…code TP

vLLM and SGLang support different prefill vs decode TP on most models. The previous validation rejected entries specifying both `tp` and prefill/decode configs. Now `tp` can coexist with prefill/decode for disaggregated serving, while still requiring both prefill and decode if either is specified.

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Summary

- Add `offloading: lmcache` as a new KV cache offloading option for the agentic trace replay benchmark, alongside the existing `none` and `cpu` (native vLLM) options
- Add an `install_lmcache_hip()` helper that builds LMCache from source with HIP support (the PyPI wheel is CUDA-only)

Files changed

- `benchmarks/single_node/agentic/minimaxm2.5_fp8_mi{300x,325x,355x}.sh`
- `benchmarks/benchmark_lib.sh`: Added `install_lmcache_hip()` helper
- `utils/matrix_logic/validation.py`: Extended `offloading` Literal to include `"lmcache"`
- `utils/matrix_logic/test_validation.py`: 21 new tests for agentic + LMCache validation
- `.github/configs/amd-master.yaml`: Added `agentic-coding` scenarios with lmcache/none sweeps; bumped images to v0.19.1

LMCache on ROCm — critical settings
The PyPI `lmcache` wheel is CUDA-only — must build from source: `git clone` + `SETUPTOOLS_SCM_PRETEND_VERSION=0.4.4 BUILD_WITH_HIP=1 pip install -e . --no-build-isolation`
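One possible shape for the `install_lmcache_hip()` helper, shown as a sketch: the clone URL, checkout directory, and `DRY_RUN` switch are assumptions; only the `SETUPTOOLS_SCM_PRETEND_VERSION`/`BUILD_WITH_HIP` env vars and pip flags come from the command above:

```shell
# Sketch of install_lmcache_hip(): build LMCache from source with HIP support,
# since the PyPI wheel is CUDA-only. Set DRY_RUN=1 to print the commands
# instead of running them (useful for inspection).
install_lmcache_hip() {
  local version="${LMCACHE_VERSION:-0.4.4}"
  local src="${LMCACHE_SRC:-/tmp/lmcache-src}"   # assumed checkout location
  local run="eval"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    run="echo"
  fi
  $run "git clone https://github.com/LMCache/LMCache.git $src"
  $run "cd $src && SETUPTOOLS_SCM_PRETEND_VERSION=$version BUILD_WITH_HIP=1 pip install -e . --no-build-isolation"
}

# Dry run: print the two commands without cloning or installing anything.
DRY_RUN=1 install_lmcache_hip
```

`SETUPTOOLS_SCM_PRETEND_VERSION` pins the package version for setuptools-scm when building outside a tagged checkout, and `--no-build-isolation` lets the build see the already-installed ROCm PyTorch instead of pulling a CUDA one into an isolated build env.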
Smoke-tested vLLM 0.19.1 + LMCache 0.4.4 + MiniMax-M2.7 (same architecture as M2.5):
Test plan

- `bash -n` syntax check on all 3 scripts
- `lmcache`/`none` offloading

🤖 Generated with Claude Code