Conversation
Adds end-to-end agentic-coding benchmark infrastructure on top of the
existing fixed-seq-len harness. New components:
Trace replayer
- New utils/trace-replay submodule (kv-cache-tester @ agentx-minimized)
driving multi-turn HF-dataset traces against any OpenAI-compatible
endpoint at fixed concurrency.
- --debug-trace captures full per-request prompt/response, every
streamed chunk via chunk.model_dump(), and integer token IDs
(apply_chat_template prompt + logprobs.content completion) into
debug_trace.jsonl.
- Per-model delta-field abstraction (gpt-oss → delta.reasoning, default
  → delta.reasoning_content) so reasoning-heavy responses are counted
  and appended to conversation history correctly (see the sketch after
  this list).
- Input-token metric reads the server's usage.prompt_tokens (authoritative)
  rather than the local apply_chat_template estimate, which breaks for
  gpt-oss's harmony chat template.
- Per-user 8-token salt prefix on conversation[0] so two in-flight
users replaying the same trace_id don't accidentally share KV-cache
blocks.
- Period summary: counts elapsed time up instead of counting remaining
  time down; replaces the dispatch-jitter "Wait time" with the trace's
  true "Inter-turn time" sourced from RequestMetrics.delay_expected.
- 5s quiesce between warmup completion and metrics-collector start so
warmup-tail prefill doesn't bleed into period 1.
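For illustration, a minimal sketch of the delta-field routing and the per-user
salting described above; the names (DELTA_FIELD_MAP, pick_delta_field,
salt_first_turn) are hypothetical, not the replayer's actual identifiers:

```python
import secrets

# Substring -> delta attribute that carries reasoning tokens.
# Only the gpt-oss mapping is taken from this change; later commits add more substrings.
DELTA_FIELD_MAP = {
    "gpt-oss": "reasoning",          # gpt-oss streams reasoning in delta.reasoning
}
DEFAULT_DELTA_FIELD = "reasoning_content"

def pick_delta_field(model_name: str) -> str:
    """Choose which streamed delta attribute to count and append for this model."""
    for substring, field in DELTA_FIELD_MAP.items():
        if substring in model_name.lower():
            return field
    return DEFAULT_DELTA_FIELD

def salt_first_turn(conversation: list[dict], user_id: int) -> None:
    """Prefix conversation[0] so two in-flight users replaying the same trace_id
    never share identical KV-cache prefixes. The real replayer uses an ~8-token
    salt; token_hex(4) here is just illustrative entropy."""
    salt = f"[session {user_id}:{secrets.token_hex(4)}] "
    conversation[0]["content"] = salt + conversation[0]["content"]
```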
Workflow plumbing
- e2e-tests.yml: workflow_dispatch + workflow_call inputs for
debug-trace (boolean) and duration-override (string seconds), forwarded
to test-sweep-agentic and test-sweep-multi-node-agentic jobs.
- benchmark-tmpl.yml + benchmark-multinode-tmpl.yml: debug-trace input
mapped to DEBUG_TRACE env var; duration override threads through to
matrix.config.duration.
- benchmark_lib.sh: build_replay_cmd / resolve_trace_source /
install_agentic_deps / write_agentic_result_json helpers; consumes
DEBUG_TRACE → --debug-trace.
- runners/launch_*.sh: shared agentic mode dispatch + scenario routing.
- runners/launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh rename to
match the actual runner.name observed by the workflow.
Result aggregation
- utils/agentic-benchmark/{bench,analysis,scripts}: metrics collector
(vllm/sglang Prometheus parsers), pareto plotter, per-config
distribution analyzer, sweep aggregator.
- utils/process_agentic_result.py: per-job results.json builder.
- utils/matrix_logic: agentic-coding scenario plumbing in
generate_sweep_configs.py + validation.py.
Examples (one per vendor)
- benchmarks/single_node/agentic/dsr1_fp4_b200.sh — NVIDIA.
- benchmarks/single_node/agentic/dsr1_fp4_mi355x.sh — AMD.
- Matching agentic-coding sections in nvidia-master.yaml
(dsr1-fp4-b200-sglang) and amd-master.yaml (dsr1-fp4-mi355x-sglang).
All other model-specific launchers and matrix entries are deliberately
left out of this PR; downstream PRs add them on a per-model basis.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same value, two names — collapse to one. Workflow templates already
exposed both CONC and USERS env vars (USERS was a mirror of inputs.conc),
and the agentic matrix entries carried both `users: int` and
`conc: [users]`. Drop the duplicates and standardize on conc/CONC:
- benchmark-tmpl.yml / benchmark-multinode-tmpl.yml: drop redundant
USERS env var (CONC remains)
- e2e-tests.yml / run-sweep.yml: pass `conc: ${{ matrix.config.conc }}`
to template; build agentic conc-list as `'[${{ matrix.config.conc }}]'`
since matrix.config.conc is now a scalar
- generate_sweep_configs.py: agentic entries emit Fields.CONC.value (int)
only; loop variable renamed from `users` to `conc`; exp-name template
now uses `_conc{N}` instead of `_users{N}`
- validation.py: drop Fields.USERS; agentic Pydantic models use `conc: int`
- process_agentic_result.py: read CONC env var, emit single `"conc"` key
- collect_sweep_results.py: regex updated to match `_conc{N}_offload`
- benchmark_lib.sh / agentic launcher scripts: $USERS → $CONC
The trace-replayer's --start-users / --max-users CLI flags are upstream's
API and are left unchanged; benchmark_lib.sh just passes $CONC into them.
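For illustration, roughly the shape an agentic matrix entry takes after the
rename; this is a hedged sketch, only the scalar conc and the _conc{N}
exp-name fragment come from this change, the other keys and the helper name
are illustrative:

```python
# Hypothetical shape of one agentic sweep entry after USERS -> CONC normalization.
def agentic_entry(model_code: str, tp: int, conc: int, offloading: str) -> dict:
    return {
        "model_code": model_code,
        "tp": tp,
        "conc": conc,  # was: "users": conc plus a redundant "conc": [conc]
        "offloading": offloading,
        # exp-name template now uses _conc{N} instead of _users{N}
        "exp_name": f"{model_code}_tp{tp}_conc{conc}_offload{offloading}",
    }
```

The workflow then rebuilds the one-element list the replayer expects as
`'[${{ matrix.config.conc }}]'`, since matrix.config.conc is a scalar.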
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pick up these submodule commits (callanjfox/kv-cache-tester):
- 7b7f883 silence kimi: target the actual loaded-tokenizer module logger
- 5b87e43 silence kimi: replace static logger lookup with content filter
- 3394450 silence Kimi tokenization_kimi.py per-call encode warning
- 7ad6a9e delta-field map: add 'kimi' substring (uses delta.reasoning like gpt-oss)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 new agentic-coding launcher scripts brought over from chore/agentx-integration, with USERS → CONC normalization:
- benchmarks/single_node/agentic/gptoss_fp4_h100.sh
- benchmarks/single_node/agentic/gptoss_fp4_h200.sh
- benchmarks/single_node/agentic/gptoss_fp4_mi300x.sh
- benchmarks/single_node/agentic/gptoss_fp4_mi325x.sh
- benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings utils/agentic-benchmark/analysis/ (plot_pareto.py — sweep visualizer for cross-config performance comparison) and updates requirements.txt with transformers/xlsxwriter/tqdm/datasets/tiktoken needed by the analyzer and by trace-replay's tokenizer paths.
The bench/ directory is intentionally NOT added: bench/metrics_collector.py duplicated utils/trace-replay/server_metrics.py and was already removed on this branch; bench/run_metrics_collector.py depends on it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds agentic-coding scenario blocks to the master configs for the five models whose launchers were just brought over:
- kimik2.5-fp4-b200-vllm (image bumped to v0.19.1)
- gptoss-fp4-h100-vllm
- gptoss-fp4-h200-vllm
- gptoss-fp4-mi300x-vllm
- gptoss-fp4-mi325x-vllm
Each scenario sweeps tp 4/8 (and 1/2 on AMD/H200) at offloading=none for low/mid concurrency and offloading=cpu for high concurrency, with a crossover at conc=64.
Other agentic-coding sections present on chore/agentx-integration (trtllm/srt-slurm based) are left for follow-up since several of the underlying model entries were restructured by main.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The agentic-coding scenario type uses benchmarks/single_node/agentic/ launchers, gated by SCENARIO_SUBDIR='agentic/' from benchmark-tmpl.yml.
b200-cw, b200-dgxc, b200-nb, and b300-nv all built BENCH_BASE without honoring SCENARIO_SUBDIR, so dispatch always landed in single_node/ even for agentic runs. Other runners (h100-*, h200-*, mi*) already had this plumbing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…H200
- minimaxm2.5-fp8-b200-vllm
- qwen3.5-bf16-b200-sglang
- glm5-fp8-b200-sglang
- dsv4-fp8-h200-vllm
Each launcher mirrors its fixed-seq-len sibling but: uses CONC env for max-num-seqs / cuda-graph-max-bs, sources benchmark_lib.sh, calls the trace replayer via build_replay_cmd, and emits the agentic result JSON.
Master config gets an agentic-coding scenario block sweeping conc 1..32 at offloading=none; b200-dsv4 entries left untouched since that runner type isn't registered in runners.yaml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- minimaxm2.5-fp8-mi355x-vllm
- qwen3.5-fp8-mi355x-sglang
- glm5.1-fp4-mi355x-sglang
- kimik2.5-fp4-mi355x-vllm
Each mirrors its fixed-seq-len sibling with ROCm-specific tweaks (VLLM_ROCM_USE_AITER, ROCM_QUICK_REDUCE_QUANTIZATION, etc.) and feeds CONC into max-num-seqs / cuda-graph-max-bs.
Master configs gain matching agentic-coding scenarios sweeping conc 1..32 at offloading=none.
dsv4-fp8-mi355x is intentionally skipped since the existing fixed-seq launcher requires a bespoke vLLM PR rebuild that adds risk to trace-replayer testing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…5-fp4
Phase-2 coverage extension across precision (int4 vs fp4 for kimi, fp4 vs fp8 for minimax) and runner (b200 vs h100/h200 for gptoss):
- gptoss-fp4-b200-vllm
- kimik2.5-int4-b200-vllm
- minimaxm2.5-fp4-b200-vllm
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bf16 image lmsysorg/sglang:nightly-dev-20260216-d3bae71e fails on B200 with PyTorch/CuDNN compatibility errors at server start. Add an fp8 variant using lmsysorg/sglang:v0.5.9-cu130-amd64 to provide a working qwen3.5 trace-replayer test on NVIDIA.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the launcher matrix at benchmarks/single_node/agentic/, how to dispatch debug runs via gh workflow run, and which fields in the result JSON to inspect for verification (num_requests_successful, total_generation_tokens, median_ttft, median_tpot, total_tput_tps, etc.).
Notes the two known-failing configs (qwen3.5 sglang on B200 — pytorch/pytorch#168167; dsv4-fp4-b200-sglang — runner b200-dsv4 not in runners.yaml) so future testers don't repeat them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
15 debug runs across 7 model families × NVIDIA/AMD HW. 10 PASS / 5 FAIL (1 still in flight); failures are all image- or vLLM-parser-level, not replayer bugs. Replayer's per-model delta-field routing + long-prefill agentic flow verified end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 16 dispatched runs are now complete. Final tally: 10 PASS, 6 FAIL. The 6 failures are all infrastructure or vLLM-side issues (PyTorch/CuDNN image incompatibility, vLLM deepseek_v4 reasoning parser bug, sglang-rocm qwen3.5 streaming, SLURM time limit) — none indicate a bug in the trace replayer itself. All 7 active model families have at least one PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The exp-name template emits offload{none|cpu|ssd} (per the matrix
generator's f"{model_code}_tp{tp}_conc{conc}_offload{offloading}"),
but the regex was looking for offload(on|off) — so every artifact
directory failed to parse, the aggregator wrote nothing to aggregated/,
and collect-agentic-results uploaded no files ("No files were found
with the provided path: aggregated/").
Verified the fix matches real artifact names from this branch's runs
(b200/h100, none/cpu).
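Illustrative form of the corrected match, assuming the f-string template quoted
above; the exact pattern in collect_sweep_results.py may differ:

```python
import re

# Matches exp-names produced by the matrix generator's
# f"{model_code}_tp{tp}_conc{conc}_offload{offloading}" template.
EXP_NAME_RE = re.compile(
    r"_tp(?P<tp>\d+)_conc(?P<conc>\d+)_offload(?P<offload>none|cpu|ssd)$"
)

for name in ["dsr1_fp4_b200_tp8_conc64_offloadnone",
             "gptoss_fp4_h100_tp4_conc96_offloadcpu"]:
    m = EXP_NAME_RE.search(name)
    assert m, f"artifact dir {name!r} failed to parse"  # the old (on|off) pattern failed here
    print(m.group("tp"), m.group("conc"), m.group("offload"))
```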
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For the 5 vllm models (kimik2.5-fp4/int4-b200, minimaxm2.5-fp8-b200,
gptoss-fp4-b200, kimik2.5-fp4-mi355x, minimaxm2.5-fp8-mi355x): add
offloading=cpu at high concurrency (typically conc 64+) where KV cache
pressure exceeds GPU HBM. Overlap at conc=64 between none and cpu so
the crossover region is sampled by both. cpu-offload sweep tail uses
larger conc points (96, 128, 192, 256) since the only reason to enable
cpu offload is when concurrency stresses HBM.
For glm5-fp8-b200-sglang and glm5.1-fp4-mi355x-sglang (sglang launchers
without the OFFLOADING=cpu plumbing): expand the conc range on
offloading=none. sglang manages its own KV eviction via the radix
cache, so concurrency above HBM capacity is handled internally rather
than via vLLM's --kv_offloading_backend.
dsr1-fp4-{b200,mi355x}-sglang sweeps already cover conc 1..256 (b200
also has tp=4 ep=4 / tp=8 ep=8 split and tp=8 going to conc=512), so
left as-is.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both nodes are currently dropping every job that lands on them:
- NCCL barrier dies during sglang Scheduler.init_model_worker with RuntimeError: NCCL error: unhandled cuda error (stale CUDA contexts from a previous job that didn't tear down cleanly)
- HuggingFace CAS download for moonshotai/Kimi-K2.5 fails with RuntimeError: Data processing error: CAS service error : IO Error: No space left on device (os error 28)
Adding --exclude=gpu-10,gpu-15 to salloc keeps SLURM from allocating to them. Drop this once sa-shared admins clean up the nodes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vLLM's OffloadingConnector (--kv_offloading_backend native) is incompatible with the hybrid-KV-cache-manager (HMA) for models with mixed attention layouts. When HMA is enabled, the OffloadingConnector init fails with:
RuntimeError: Worker failed with error 'Connector OffloadingConnector does not support HMA but HMA is enabled. Please set --disable-hybrid-kv-cache-manager'.
This bit kimik2.5-fp4-mi355x's full sweep: every offload=cpu sub-job failed with the above error while every offload=none sub-job passed (see run 25117841192). Kimi-K2.5 uses hybrid attention so HMA kicks in. MiniMax-M2.5 doesn't, which is why its prior cpu-offload sweeps passed even with the broken flag.
Switching all 11 cpu-offload launchers to --disable-hybrid-kv-cache-manager is correctness-safe across the board: HMA is a pure optimization, and disabling it is required for OffloadingConnector regardless of model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nfigs
KV offloading via OffloadingConnector hits multiple upstream bugs on
older vllm tags:
- v0.15.1 (gpt-oss-fp4-b200, kimi-int4-b200): flashinfer kv_cache_permute
assertion in TRTLLM-attention path
- v0.18.0-rocm (kimi-fp4-mi355x): HMA + OffloadingConnector incompat
- v0.19.0 (minimaxm2.5-fp8 b200/mi355x): not yet verified clean
Bumping to v0.19.1 (or v0.19.1-rocm) — proven-good on kimi-fp4-b200
(23/23 sweep PASS) and gptoss-fp4 h100/h200/mi300x/mi325x.
Add agentic-coding sections + launchers for MiniMax-M2.5 FP8 across
H100, H200, B200, B300, MI300X, MI355X (excluding MI325X). Conc ranges
sized from per-SKU GPU KV cache capacity:
KV per token (fp8, 62 layers × 8 KV heads × 128 dim × 2): ~124 KB
Per-SKU GPU cache cap with tp=4 + 0.90 mem-util:
H100 58 GB -> 0.46M tok (saturate ~conc 6)
H200 277 GB -> 2.19M tok (saturate ~conc 29)
B200 461 GB -> 3.63M tok (saturate ~conc 48)
B300 807 GB -> 6.35M tok (saturate ~conc 85)
MI300X 500 GB -> 3.93M tok (saturate ~conc 52)
MI355X 864 GB -> 6.81M tok (saturate ~conc 91)
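A quick sketch of the arithmetic behind the table above; the ~75K
tokens-per-conversation figure is inferred from the stated saturation points
and is an assumption, not a number from this PR:

```python
# layers * kv_heads * head_dim * (K+V) * 1 byte (fp8)
KV_BYTES_PER_TOKEN = 62 * 8 * 128 * 2 * 1
assert KV_BYTES_PER_TOKEN == 126_976            # ~124 KB per token

TOKENS_PER_CONVERSATION = 75_000                # assumed mean agentic working set per user

for sku, cache_gb in [("H100", 58), ("H200", 277), ("B200", 461),
                      ("B300", 807), ("MI300X", 500), ("MI355X", 864)]:
    cache_tokens = cache_gb * 1e9 / KV_BYTES_PER_TOKEN
    sat_conc = cache_tokens / TOKENS_PER_CONVERSATION
    print(f"{sku}: {cache_tokens/1e6:.2f}M tok, saturates ~conc {sat_conc:.0f}")
# H100 -> 0.46M tok, ~conc 6;  B200 -> 3.63M tok, ~conc 48;  MI355X -> 6.81M tok, ~conc 91
```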
NVIDIA configs include offload=cpu starting at the saturation point
(simple cpu offload via OffloadingConnector requires vllm ≥ 0.19.1).
AMD configs do not enable cpu offload — vllm simple offloading isn't
supported on the rocm build for these models. AMD pushes offload=none
to a higher conc to demonstrate where GPU cache saturates.
Image bumps: h100/h200/mi300x v0.18.0/v0.16.0 -> v0.19.1; b300
v0.19.0-cu130 -> v0.19.1.
vllm v0.19.1 fp8 quantization rejects tp=8 for MiniMax-M2.5: gate/up weight output_size 1536 / tp=8 = 192, not divisible by block_n=128. Same constraint at vllm/model_executor/layers/quantization/fp8.py:638.
Per fixed-seq-len reference TPs:
H100   tp=4 ep=4 (tp=8 ep=8 commented out in fixed-seq-len for fp8)
H200   fixed-seq-len has only tp=8 (broken on v0.19.1 fp8); winging tp=4
B200   tp=4 (fixed-seq-len has tp=2,4; tp=2 too tight for agentic ISL)
B300   tp=4 (primary; fixed-seq-len has tp=1,2,4 with various ep)
MI300X tp=4 (fixed-seq-len has tp=2,4)
MI355X tp=4 ep=4 (fixed-seq-len has tp=2 ep=2, tp=4 ep=4, tp=8 ep=8)
Concurrency expanded across the saturation cliff for each SKU; cpu offload range extended to 384/512 on NVIDIA where applicable.
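A quick check of that divisibility constraint, using the values quoted above
(block_n=128 is the fp8 block-quant granularity):

```python
# Each TP shard's output dim must be a multiple of the fp8 quant block size.
output_size, block_n = 1536, 128      # MiniMax-M2.5 gate/up projection, fp8 block_n
for tp in (4, 8):
    shard = output_size // tp
    print(f"tp={tp}: shard={shard}, divisible by {block_n}: {shard % block_n == 0}")
# tp=4: shard=384, divisible: True   -> works
# tp=8: shard=192, divisible: False  -> rejected by the v0.19.1 fp8 path
```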
Per empirical compute ceilings observed in prior runs (mean in-flight reqs mid-test on each platform):
H100   tp=4 ep=4 ceiling ~10 (KV cliff ~6 -> cpu zone 6-10)
H200   tp=4 ceiling ~35 (KV cliff ~29 -> cpu zone 29-35)
B200   tp=4 ceiling ~50 (KV cliff ~48 -> very narrow)
B300   tp=4 ceiling ~60 (KV cliff ~85 -> compute saturates first)
MI300X tp=4 ceiling ~20 (estimated)
MI355X tp=4 ep=4 ceiling ~60
Previous conc lists (1..256, even up to 512) wasted 30-min slots on sub-jobs that just queue 200+ requests waiting on a server only running 4-50 in flight, leading to client-side 600s timeout cascades. New lists "creep up" to 2-3x the ceiling, then stop.
NVIDIA cpu offload range narrowed to the zone between KV cliff and compute ceiling, where offloading can actually relieve KV pressure without compute already being the bottleneck. AMD (mi300x, mi355x) keeps offload=none only.
Per user feedback: past the compute ceiling, throughput plateaus and extra conc just adds queue depth and client timeouts -- wasted slots. Reallocate sampling budget to densify around the cliff(s) for each SKU.
Per-SKU strategy (compute ceiling empirical, KV cliff analytical):
H100   tp=4 ep=4 ceil 10 KV 6  -> dense 4-12 (sweet spot for cpu demo)
H200   tp=4 ceil 35 KV 29      -> dense 24-40 (narrow cpu window)
B200   tp=4 ceil 50 KV 48      -> dense 32-56 (cliffs colocated)
B300   tp=4 ceil 60 KV 85      -> dense 48-72 (compute first; cpu won't help)
MI300X tp=4 ceil 25 KV 52      -> dense 16-32 (compute first; AMD no cpu)
MI355X tp=4 ep=4 ceil 60 KV 91 -> dense 48-72 (compute first; AMD no cpu)
Dense step (every 4-8 conc) around the cliffs to resolve the inflection; sparse step (doubling) below the cliffs for baseline; one point ~1.3-1.5x ceiling to confirm plateau. NVIDIA cpu offload range overlaps with none from KV cliff to ~ceiling for direct same-conc comparison; doesn't extend past 1.3x ceiling.
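A hypothetical helper sketching this sampling strategy; this is not code from
the PR, the actual conc lists are written by hand in the master configs:

```python
def conc_list(kv_cliff: int, ceiling: int, dense_step: int = 4) -> list[int]:
    """Sparse doubling below the cliffs, dense steps across them, one plateau point."""
    cliff_lo, cliff_hi = min(kv_cliff, ceiling), max(kv_cliff, ceiling)
    pts, c = [], 1
    while c < cliff_lo // 2:                 # baseline: doubling well below the first cliff
        pts.append(c)
        c *= 2
    dense_start = max(1, cliff_lo - 2 * dense_step)
    dense_end = cliff_hi + dense_step
    pts += list(range(dense_start, dense_end + 1, dense_step))  # resolve the inflection
    pts.append(int(1.4 * ceiling))           # single point to confirm the plateau
    return sorted(set(pts))

# e.g. B200 tp=4 (KV cliff ~48, compute ceiling ~50):
# conc_list(48, 50, dense_step=8) -> [1, 2, 4, 8, 16, 32, 40, 48, 56, 70]
```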
- AMD launchers (mi300x, mi355x) drop the VLLM_USE_SIMPLE_KV_OFFLOAD env var. SimpleCPUOffloadConnector isn't supported on rocm; native OffloadingConnector works (still passes the --kv_offloading_backend native flag).
- Add cpu offload entries to AMD master configs (mi300x, mi355x).
- Add b300-p1 runner group (subset of b300 nodes 13-17 with the b300-p1 label) and target it from the b300 minimax config.
The agentic-coding benchmark IS a prefix-cache benchmark — the whole point is measuring KV reuse across multi-turn conversations and across users (with the per-user salt enabling deterministic prefix overlap). Disabling prefix caching defeats the entire purpose.
Removed from 7 launchers that had it:
dsv4_fp8_h200.sh
gptoss_fp4_b200.sh (was in config.yaml)
kimik2.5_fp4_mi355x.sh
kimik2.5_int4_b200.sh
minimaxm2.5_fp4_b200.sh
minimaxm2.5_fp8_mi300x.sh
minimaxm2.5_fp8_mi355x.sh
vLLM defaults to prefix caching ON when no flag is passed.
ROCM_AITER_FA was the suspect for both:
1. Worker dies on cpu offload (gpt-oss using UNIFIED_ATTN works fine on the same launcher pattern + image)
2. Prefix-cache Prometheus counters never increment (observability gap on the FA backend, while UNIFIED_ATTN reports correctly on mi300x)
Swap to ROCM_AITER_UNIFIED_ATTN to test both fixes in one shot.
The cpu range needs full overlap with none past the KV cliff so the no-offload throughput collapse is visible at the same conc points where cpu offload sustains throughput.
B200 tp=4 (KV cliff conc=48):
none: [1,2,4,8,16,32,48,56,64,96,128] (was capped at 64)
cpu:  [48,56,64,96,128] (was capped at 64)
B300 tp=4 (KV cliff conc=85):
none: [1,2,4,8,16,32,48,64,96,128,192] (was capped at 96)
cpu:  [48,64,96,128,192] (was capped at 96)
Past the cliff, the no-offload curve should collapse (recompute storm, client-side timeouts), while cpu-offload sustains the compute ceiling.
…0-dgxc
- Add b200-dgxc runner pool (subset of b200 excluding b200-cw / b200-nb).
- Switch minimax-fp8-b200-vllm runner from b200 to b200-dgxc.
- Hardcode TOTAL_CPU_DRAM_GB=1500 in cpu branch of b200 launcher (1.95x HBM total at tp=4, comfortably above the 1.5x threshold so the offload tier doesn't hit a secondary cliff).
…ting
# Conflicts:
# .github/configs/amd-master.yaml
# .github/configs/nvidia-master.yaml
# .github/workflows/benchmark-multinode-tmpl.yml
# .github/workflows/benchmark-tmpl.yml
# benchmarks/single_node/agentic/dsr1_fp4_b200.sh
# benchmarks/single_node/agentic/gptoss_fp4_h100.sh
# benchmarks/single_node/agentic/gptoss_fp4_h200.sh
# benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh
# runners/launch_b200-dgxc.sh
# runners/launch_b300-nv.sh
# runners/launch_gb300-nv.sh
# runners/launch_h100-dgxc-slurm.sh
# runners/launch_h200-dgxc-slurm.sh
# utils/agentic-benchmark/scripts/collect_sweep_results.py
# utils/trace-replay
The merge with origin/main pulled in main's agentic-coding loop in generate_test_config_sweep alongside our pre-existing one — both blocks were byte-identical, so every sub-job got emitted twice (e.g., b300 generated 60 entries instead of 30). Drop the duplicate block and restore the function's return statement that was lost in the dedup.
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
…n b300-nv
Adds agentic trace replay configs and launchers for DeepSeek-V4-Pro fp4 on
B200 and B300 via vLLM, mirroring the fixed-seq-len recipe (tp=8 ep=1, no
DP-attn) at the low-conc range. Initial conc list [1..64] for none and
[16,32,64] for cpu offload; cpu DRAM defaults to 1.5 TB on B200 and 2.2 TB
on B300 in the launcher (overrides the workflow 600 GB default).
Switches dsv4-fp4-b200-vllm runner from b200-dsv4 (not in our runners.yaml)
to b200-dgxc to match the established minimax B200 pattern.
Also restores ${SCENARIO_SUBDIR} in launch_b300-nv.sh BENCH_BASE: the
post-revert main state landed without it after the v0.1 squash merge, so
agentic dispatch on B300 was resolving to benchmarks/single_node/ instead
of benchmarks/single_node/agentic/. The b200-dgxc launcher already had
this prefix; b300-nv was the asymmetry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…=8 EP=8)
The first attempt OOM'd at vLLM startup on every conc=64 cpu-offload job (and would have on conc=32 cpu) because I used TP=8 EP=1 with FULL_AND_PIECEWISE + max-num-batched-tokens=2048 + max-cudagraph-capture-size=2048 (copied from the fixed-seq-len recipe). At TP=8 every layer's attention output goes through an NCCL all-reduce; cudagraph capture pre-allocated activation/all-reduce workspace proportional to max-batched-tokens × hidden_dim × layers, consuming ~134 GiB per rank on top of the ~134 GiB DSv4-Pro fp4 weight footprint (1.6T-total / 49B-active model, 800 GiB checkpoint). KV cache profiling then had nothing left to allocate.
The official vLLM blog recipe for 8xB200/8xB300 (https://vllm.ai/blog/deepseek-v4) uses DP=8 + EP=8 instead — each rank does its own attention on its own sequences (no per-layer TP all-reduce) and the MoE all-to-all is the only collective. Smaller activation workspace at capture time → cudagraph + KV cache both fit.
Switching to that layout:
- both launchers: drop the TP/DP-attn branching, always --data-parallel-size $TP --enable-expert-parallel; drop the max-cudagraph-capture-size and max-num-batched-tokens overrides (recipe doesn't set them, defaults are fine for DP-only collectives); keep FULL_AND_PIECEWISE + custom_ops=["all"] per recipe; max-model-len pinned at 1M (full DSv4 context — recipe suggests 800K but user wants 1M tested).
- nvidia-master.yaml: agentic-coding entries become tp=8 ep=8 dp-attn=true for both B200 and B300; image at the config-block level switches from v0.20.0-cu130 to deepseekv4-cu130 (the DSv4-tuned tag from the recipe).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…inned)
Per user direction, stay on vllm/vllm-openai:v0.20.0-cu130 instead of the DSv4-tuned deepseekv4-cu130 tag from the blog recipe — that tag isn't currently pinned in this pipeline. Parallelism layout (DP=8 + EP=8) is unchanged from the prior commit since the OOM fix is what actually mattered.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cpu-offload jobs hit a clean ValueError at vLLM startup on B300:
442.99 GiB KV cache is needed [for max_model_len=1M], which is larger than the available KV cache memory (104.74 GiB). [...] estimated maximum model length is 236288.
The cause is in the warning right above: SimpleCPUOffloadConnector forces --disable-hybrid-kv-cache-manager, which switches off DSv4's per-layer KV compaction (the "drop KV outside the local sliding window" optimization that gives DSv4 its "10% of V3.2's KV per token at 1M" claim). Without HMA, every layer stores full per-token KV and the per-rank budget blows up well below 1M context.
HMA is DSv4's intended long-context mechanism — leave KV management to it and skip cpu offload until upstream supports HMA + KV connector together. Re-introduce a cpu-offload sweep at lower max-model-len in a follow-up if a meaningful KV cliff appears in the offload=none data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-enables the cpu-offload path for DSv4-Pro on B200/B300 now that we understand SimpleCPUOffloadConnector (selected via VLLM_USE_SIMPLE_KV_OFFLOAD=1) already inherits SupportsHMA in v0.20.0 (PR #37160 by njhill, merged 2026-04-01). The earlier failure was caused by --disable-hybrid-kv-cache-manager in OFFLOAD_ARGS, which forced HMA off and made vLLM size the KV pool for full per-layer storage (442 GiB needed for 1M context vs 104 GiB available per rank).
Changes:
- Both launchers: drop --disable-hybrid-kv-cache-manager from cpu OFFLOAD_ARGS; add explicit --enable-prefix-caching and --no-disable-hybrid-kv-cache-manager to the vllm serve command (matches PR #37160's documented example).
- nvidia-master.yaml: restore the offloading=cpu search-space entries on both dsv4-fp4-b200-vllm and dsv4-fp4-b300-vllm with conc-list [16, 32, 64], and rewrite the comment to reflect the actual mechanism rather than the prior (incorrect) "wait for upstream HMA + connector support" framing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…artitioned)
The b200-dgxc cluster was re-partitioned: the old "gpu" partition no longer exists. salloc now rejects with "invalid partition specified: gpu", breaking every B200 single-node agentic dispatch.
Current sinfo:
cpu    cpu-[0-2]
all*   cpu-[0-2] + gpu-1-* + gpu-2-* (default, mixed)
gpu-1  gpu-1-[0-3,5-7,9] (8 idle, gpu-1-4 / gpu-1-8 drained)
gpu-2  gpu-2-[0-9] (10 idle, none drained)
Land on gpu-2 since it's a clean GPU-only pool with no drained nodes.
Drop the --exclude=gpu-10,gpu-15 list — those node names were from the pre-repartition layout (now gpu-1-* / gpu-2-*) and no longer match anything on the cluster.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-divides TOTAL_CPU_DRAM_GB by $TP (= DP size, since the launcher passes
--data-parallel-size $TP) so each DP engine ends up with its fair share.
Without this, each of the 8 DP engines independently torch.zeros + pin_tensor
its own ~1500/2200 GB region, blowing past the SLURM memory cgroup limit
(direct dmesg evidence on gpu-2-6: 7 separate VLLM::Worker_DP processes
OOM-killed in sequence by the cgroup OOM-killer at growing anon_rss values).
Root cause is in vllm v0.20.0:
- vllm/config/parallel.py defines world_size := TPxPP, with a separate
world_size_across_dp := TPxPPxDP property
- vllm/distributed/.../simple_cpu_offload_connector.py uses parallel_config
.world_size for the divide, picking up TPxPP only
- LMCacheConnector explicitly divides by num_kv_ranks (incl DP); Simple's
path does not — see vllm/config/vllm.py
So with DP=8 EP=8 TP=1, world_size=1 inside each engine, no DP-aware
adjustment, and each DP engine commits the full --kv_offloading_size value
to physical pinned host RAM.
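In Python terms, the sizing bug and the pre-divide fix look roughly like this;
the launcher itself does this in shell, and the helper variables and B200
numbers here are illustrative:

```python
# Illustrative arithmetic for the DP=8 EP=8 layout on vllm v0.20.0 (B200 numbers).
TOTAL_CPU_DRAM_GB = 1500          # host budget for the whole node
dp = 8                            # launcher passes --data-parallel-size $TP, so $TP acts as the DP size
tp_per_engine, pp = 1, 1          # intra-engine tensor / pipeline parallelism

# SimpleCPUOffloadConnector divides the requested size by parallel_config.world_size,
# which is TP*PP only (DP is tracked separately as world_size_across_dp).
world_size = tp_per_engine * pp   # = 1 inside each DP engine, so the divide is a no-op

# Before the fix: every DP engine was handed the full budget and pinned it independently,
# so the node tried to commit dp * TOTAL = 12 TB of host RAM and the cgroup OOM-killer fired.
naive_commit_gb = dp * (TOTAL_CPU_DRAM_GB / world_size)   # 12000

# Fix: pre-divide by the DP size in the launcher so the aggregate commit stays ~= TOTAL.
per_engine_gb = TOTAL_CPU_DRAM_GB / dp                     # 187.5
fixed_commit_gb = dp * (per_engine_gb / world_size)        # 1500
print(naive_commit_gb, fixed_commit_gb)
```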
Also temporarily removes the offloading=none agentic-coding search-space
entries on both dsv4-fp4-{b200,b300}-vllm — we already have that data from
Friday's runs (25234821661, 25234822495). The next dispatch will be
cpu-only to validate the host-budget fix without re-running the none cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ting
# Conflicts:
# .github/configs/nvidia-master.yaml
…ffload sizing
Mirrors the fixed-seq-len recipe's parallelism options for the agentic sweep — pure TP for low-conc / interactivity, DEP (DP-attn + EP-MoE) for high-conc / throughput per the vLLM blog recipe — and adapts the cpu offload sizing logic to the connector's actual divide-by-world_size behavior:
- DP-attn=true (DEP modes): each DP engine has parallel_config.world_size=1 (TP×PP only — see vllm/config/parallel.py docstring), so the connector's internal divide is a no-op and each DP engine independently torch.zeros + pin_tensor allocates the full --kv_offloading_size value. Pre-divide TOTAL_CPU_DRAM_GB by $TP (the DP size in this layout) so 8 DP engines × (TOTAL/8) keeps aggregate host commit ≈ TOTAL.
- DP-attn=false (pure TP, TP+EP): single engine with world_size=TP. Pass the full TOTAL — the connector's internal divide gives TOTAL/TP per rank and PR #37206's TP-shared mmap keeps the aggregate at TOTAL.
Restored conditional PARALLEL_ARGS / EP_ARGS in both launchers (we had removed them when simplifying to DEP-only). Now handles all three modes (pure TP, TP+EP, DEP) cleanly via the matrix's tp / ep / dp-attn fields.
Sweep coverage:
- B200 (16 jobs): TP=8 + DEP=8, each with both offloading modes
- B300 (32 jobs): TP=4, TP=8, DEP=4, DEP=8, each with both offloading modes
Conc lists are agentic-scaled (smaller than fixed-seq-len): pure-TP modes sweep [1..32], DEP modes sweep [16..128] (none) and [64..256] / [128..512] (cpu offload, where the larger CPU pool extends the working-set ceiling).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>