Add pure TP configuration to H200 vLLM DSv4 deployment #1285
Conversation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should generally request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```bash
    --no-enable-prefix-caching \
    --enable-expert-parallel \
    --data-parallel-size $TP \
    --tensor-parallel-size $TP \
```
🔴 This PR makes a major recipe change (EP+DP=8 → pure TP=8) on dsv4-fp8-h200-vllm and dsv4-fp8-h200-vllm-mtp, but perf-changelog.yaml is not updated. Per AGENTS.md, the changelog is the trigger for benchmark re-runs, so without a new entry, push-to-main will not re-benchmark these configs and inferencex.com will keep publishing the old EP+DP numbers. Please append a new entry covering both config-keys to the END of perf-changelog.yaml (matching the pattern from #1279/#1222).
Extended reasoning...
What the bug is: This PR modifies benchmarks/single_node/dsv4_fp8_h200.sh and benchmarks/single_node/dsv4_fp8_h200_mtp.sh to swap the parallelism strategy from expert-parallel + data-parallel size 8 to pure tensor-parallel size 8 — a structural recipe change that will materially change throughput, latency, and GPU memory characteristics. However, no entry is added to perf-changelog.yaml.
Why this matters / code path: AGENTS.md is explicit:
- Lines 161–162: "`perf-changelog.yaml` triggers which configs to benchmark ... New entries MUST be appended to the END of the file — never insert in the middle or prepend."
- Line 174: "Changes to `perf-changelog.yaml` trigger benchmark runs"
- Lines 192–195 (Adding a New Benchmark Configuration): "Add corresponding entry to `perf-changelog.yaml` to trigger benchmark"
- Lines 295–300 (Updating Docker Images / parameter changes): "MUST: Add an entry to `perf-changelog.yaml`"
The run-sweep.yml workflow only fires off post-merge benchmark sweeps in response to changes to perf-changelog.yaml. With no entry, push-to-main will not re-run the sweep for dsv4-fp8-h200-vllm or dsv4-fp8-h200-vllm-mtp, so the published dashboard will still reflect numbers measured under the old EP+DP recipe even after this code lands.
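For illustration, a minimal sketch of that trigger wiring, assuming a standard GitHub Actions `paths` filter (the actual `run-sweep.yml` is not reproduced here and may differ):

```yaml
# Hypothetical sketch of run-sweep.yml's trigger, inferred from the behavior
# described above; branch names and structure are assumptions.
on:
  push:
    branches: [main]
    paths:
      - perf-changelog.yaml   # only changelog edits launch the benchmark sweep
```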
Strong precedent: The directly preceding PR #1279 added a changelog entry for an arguably much smaller change — bumping num_speculative_tokens from 1 → 2 on this same config. The tail of perf-changelog.yaml shows that entry (and #1222's MTP-add entry just before it). A switch from EP+DP=8 to pure TP=8 is a materially larger recipe shift than a single token-budget bump, so if #1279 warranted an entry, this PR definitely does.
Step-by-step proof:
1. `git diff main...HEAD --stat` for this PR shows only the two `.sh` files changed; `perf-changelog.yaml` is untouched.
2. Push-to-main triggers `run-sweep.yml`, which (per its `paths:` filter) only invokes the benchmark matrix when `perf-changelog.yaml` is in the changed-paths list.
3. Since this PR doesn't touch `perf-changelog.yaml`, the post-merge job will skip both `dsv4-fp8-h200-vllm` and `dsv4-fp8-h200-vllm-mtp`.
4. Result: inferencex.com continues serving the old EP+DP numbers despite the merged code now running pure TP=8, a silent staleness that is exactly what the changelog mechanism is designed to prevent.
How to fix: Append a new entry to the END of perf-changelog.yaml (per AGENTS.md, never insert in the middle), e.g.:
```yaml
- config-keys:
    - dsv4-fp8-h200-vllm
    - dsv4-fp8-h200-vllm-mtp
  description:
    - "Switch H200 vLLM DSv4 recipe from EP + DP=8 to pure TP=8"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1285
```

This will cause the post-merge sweep to re-benchmark both configs under the new recipe and refresh the published numbers.
```bash
    --no-enable-prefix-caching \
    --enable-expert-parallel \
    --data-parallel-size $TP \
    --tensor-parallel-size $TP \
```
🔴 The PR switches both dsv4_fp8_h200.sh and dsv4_fp8_h200_mtp.sh to pure TP, but .github/configs/nvidia-master.yaml lines 2627, 2631 (dsv4-fp8-h200-vllm) and lines 2649, 2653 (dsv4-fp8-h200-vllm-mtp) still specify { tp: 8, ep: 8, dp-attn: true, ... }. Those values are exported as EP_SIZE/DP_ATTENTION by benchmark-tmpl.yml, embedded in RESULT_FILENAME as ep8-dpaTrue, and recorded by utils/process_result.py as the ep and dp_attention metadata fields — so pure-TP runs will be silently mislabeled and aggregated under the old EP+DP-attn label. Update both search-space entries to ep: 1, dp-attn: false (or drop those keys) in this PR.
Extended reasoning...
Bug
After this PR, the script no longer passes `--enable-expert-parallel` or `--data-parallel-size $TP`; it just passes `--tensor-parallel-size $TP`. So the actual deployment is pure TP=8, with EP effectively 1 and DP-attention disabled.
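For concreteness, a sketch of the post-PR launch, assuming the flag set shown in the diff above; `$MODEL` is a placeholder, not taken from the script:

```bash
# Pure-TP launch after this PR: no expert parallelism, no DP-attention.
TP=8
vllm serve "$MODEL" \
  --tensor-parallel-size "$TP" \
  --no-enable-prefix-caching
```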
However, .github/configs/nvidia-master.yaml was not updated. Lines 2614–2653 still configure both DSv4 H200 vLLM entries with EP+DP-attn search-space:
```yaml
dsv4-fp8-h200-vllm:
  ...
  scenarios:
    fixed-seq-len:
      - isl: 1024
        osl: 1024
        search-space:
          - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64 }
      - isl: 8192
        osl: 1024
        search-space:
          - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64 }
dsv4-fp8-h200-vllm-mtp:
  ...
  search-space:
    - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64, spec-decoding: mtp }
  ...
    - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64, spec-decoding: mtp }
```

How it manifests
.github/workflows/benchmark-tmpl.yml lines 103–105 export the search-space fields as env vars:
```yaml
TP: ${{ inputs.tp }}
EP_SIZE: ${{ inputs.ep }}
DP_ATTENTION: ${{ inputs.dp-attn }}
```

Line 180 builds the result filename by interpolating those env vars verbatim:
```yaml
RESULT_FILENAME: ${{ env.EXP_NAME }}_${{ env.PRECISION }}_${{ env.FRAMEWORK }}_tp${{ env.TP }}-ep${{ env.EP_SIZE }}-dpa${{ env.DP_ATTENTION }}_disagg-...
```

`utils/process_result.py` lines 110–119 then read those env vars as required and write them into the result-metadata JSON:
```python
single_node_env = get_required_env_vars(['TP', 'EP_SIZE', 'DP_ATTENTION'])
tp_size = int(single_node_env['TP'])
ep_size = int(single_node_env['EP_SIZE'])
dp_attention = single_node_env['DP_ATTENTION']
single_node_data = {
    'is_multinode': False,
    'tp': tp_size,
    'ep': ep_size,
    'dp_attention': dp_attention,
    ...
}
```

Step-by-step proof
1. The CI workflow consumes `nvidia-master.yaml` and dispatches a `dsv4-fp8-h200-vllm` job with inputs `tp=8, ep=8, dp-attn=true`.
2. `benchmark-tmpl.yml` exports `TP=8`, `EP_SIZE=8`, `DP_ATTENTION=true` into the job environment.
3. The launcher invokes `benchmarks/single_node/dsv4_fp8_h200.sh`, which after this PR runs `vllm serve ... --tensor-parallel-size 8` only: no EP, no DP-attention. The `EP_SIZE` and `DP_ATTENTION` env vars are now ignored by the script.
4. `RESULT_FILENAME` is computed as `..._tp8-ep8-dpaTrue_disagg-...` (see the sketch after this list).
5. `utils/process_result.py` writes `{ tp: 8, ep: 8, dp_attention: "true", ... }` into the result JSON despite the actual run being pure TP.
6. Aggregated/published TP-only numbers are filed under an EP+DP-attn label, contaminating downstream comparisons.
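A quick illustration of step 4; the `EXP_NAME`/`PRECISION`/`FRAMEWORK` values here are assumptions, but the `ep`/`dpa` fields come straight from the search-space exports, regardless of what the server actually ran:

```bash
# Illustrative values only; EXP_NAME, PRECISION, and FRAMEWORK are guesses.
EXP_NAME=dsv4-fp8-h200-vllm
PRECISION=fp8
FRAMEWORK=vllm
TP=8; EP_SIZE=8; DP_ATTENTION=True
echo "${EXP_NAME}_${PRECISION}_${FRAMEWORK}_tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}_disagg-..."
# -> dsv4-fp8-h200-vllm_fp8_vllm_tp8-ep8-dpaTrue_disagg-...
```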
Why existing code doesn't prevent it
process_result.py has no way to discover what flags vllm serve was actually launched with — it just trusts the env vars from the search-space. The script itself silently ignores EP_SIZE and DP_ATTENTION now (since it no longer references them), so there is no error or warning. The previous commit message "switch from DP+EP to pure TP" matches the diff but missed the YAML side of the configuration.
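One way to make this fail loudly (hypothetical; no such check exists in `process_result.py` today) would be to cross-check the env-var metadata against the captured launch command. `LAUNCH_CMD` is an assumed variable that the launcher would have to export:

```python
import os
import shlex

def validate_parallelism_metadata() -> None:
    """Hypothetical guard: compare search-space env vars against the flags
    actually passed to `vllm serve` (assumes the launcher exports LAUNCH_CMD)."""
    tokens = shlex.split(os.environ['LAUNCH_CMD'])
    ep_enabled = '--enable-expert-parallel' in tokens
    ep_size = int(os.environ['EP_SIZE'])
    if ep_enabled != (ep_size > 1):
        raise RuntimeError(
            f"metadata says ep={ep_size} but --enable-expert-parallel is "
            f"{'present' if ep_enabled else 'absent'} in the launch command"
        )
```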
Fix
Update both search-space entries in .github/configs/nvidia-master.yaml to match the new deployment:
```yaml
dsv4-fp8-h200-vllm:
  ...
  search-space:
    - { tp: 8, ep: 1, dp-attn: false, conc-start: 4, conc-end: 64 }
  ...
    - { tp: 8, ep: 1, dp-attn: false, conc-start: 4, conc-end: 64 }
dsv4-fp8-h200-vllm-mtp:
  ...
  search-space:
    - { tp: 8, ep: 1, dp-attn: false, conc-start: 4, conc-end: 64, spec-decoding: mtp }
  ...
    - { tp: 8, ep: 1, dp-attn: false, conc-start: 4, conc-end: 64, spec-decoding: mtp }
```

This should be done in the same PR; otherwise the first run after merge produces mislabeled `ep8-dpaTrue` artifacts that are hard to correct retroactively.
Re-introduces the DP-attention + expert-parallel path on top of pure TP
by reading DP_ATTENTION/EP_SIZE env vars (matching dsv4_fp4_b200_vllm.sh).
nvidia-master.yaml now sweeps both {tp:8,ep:1,dp-attn:false} and
{tp:8,ep:8,dp-attn:true}, with conc-end raised from 64 to 256.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
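A rough sketch of the env-driven branch that commit describes, assuming it mirrors the flags from the original diff (`dsv4_fp4_b200_vllm.sh` itself is not reproduced here, and `$MODEL` is a placeholder):

```bash
# Hypothetical sketch of the restored DP-attention + expert-parallel path;
# the real script may structure this differently.
TP=8
EXTRA_ARGS=()
if [ "${DP_ATTENTION:-false}" = "true" ] && [ "${EP_SIZE:-1}" -gt 1 ]; then
  EXTRA_ARGS+=(--enable-expert-parallel --data-parallel-size "$TP")
fi
vllm serve "$MODEL" \
  --tensor-parallel-size "$TP" \
  --no-enable-prefix-caching \
  "${EXTRA_ARGS[@]}"
```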
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp8-h200-vllm
@ywang96 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25403559794
…ntry (#1287)

* Revert "Revert "Add pure TP configuration to H200 vLLM DSv4 deployment (#1285)""

  This reverts commit 2d8547c.

* Add perf-changelog entry for dsv4-fp8-h200-vllm{,-mtp} pure-TP config

  The original PR #1285 landed without this changelog entry; it was reverted in #1286 and is being re-landed here together with the missing entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No description provided.