Add pure TP configuration to H200 vLLM DSv4 deployment #1285
Conversation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should generally request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```bash
    --no-enable-prefix-caching \
    --enable-expert-parallel \
    --data-parallel-size $TP \
    --tensor-parallel-size $TP \
```
🔴 This PR makes a major recipe change (EP+DP=8 → pure TP=8) on dsv4-fp8-h200-vllm and dsv4-fp8-h200-vllm-mtp, but perf-changelog.yaml is not updated. Per AGENTS.md, the changelog is the trigger for benchmark re-runs, so without a new entry, push-to-main will not re-benchmark these configs and inferencex.com will keep publishing the old EP+DP numbers. Please append a new entry covering both config-keys to the END of perf-changelog.yaml (matching the pattern from #1279/#1222).
Extended reasoning...
What the bug is: This PR modifies benchmarks/single_node/dsv4_fp8_h200.sh and benchmarks/single_node/dsv4_fp8_h200_mtp.sh to swap the parallelism strategy from expert-parallel + data-parallel size 8 to pure tensor-parallel size 8 — a structural recipe change that will materially change throughput, latency, and GPU memory characteristics. However, no entry is added to perf-changelog.yaml.
Why this matters / code path: AGENTS.md is explicit:
- Lines 161–162: "`perf-changelog.yaml` triggers which configs to benchmark ... New entries MUST be appended to the END of the file — never insert in the middle or prepend."
- Line 174: "Changes to `perf-changelog.yaml` trigger benchmark runs"
- Lines 192–195 (Adding a New Benchmark Configuration): "Add corresponding entry to `perf-changelog.yaml` to trigger benchmark"
- Lines 295–300 (Updating Docker Images / parameter changes): "MUST: Add an entry to `perf-changelog.yaml`"
The run-sweep.yml workflow only fires off post-merge benchmark sweeps in response to changes to perf-changelog.yaml. With no entry, push-to-main will not re-run the sweep for dsv4-fp8-h200-vllm or dsv4-fp8-h200-vllm-mtp, so the published dashboard will still reflect numbers measured under the old EP+DP recipe even after this code lands.
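For illustration, a minimal sketch of that trigger wiring, assuming a standard GitHub Actions `paths` filter (the actual `run-sweep.yml` is not reproduced here and may differ):

```yaml
# Hypothetical sketch of run-sweep.yml's trigger, inferred from the behavior
# described above; branch names and structure are assumptions.
on:
  push:
    branches: [main]
    paths:
      - perf-changelog.yaml   # only changelog edits launch the benchmark sweep
```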
Strong precedent: The directly preceding PR #1279 added a changelog entry for an arguably much smaller change — bumping num_speculative_tokens from 1 → 2 on this same config. The tail of perf-changelog.yaml shows that entry (and #1222's MTP-add entry just before it). A switch from EP+DP=8 to pure TP=8 is a materially larger recipe shift than a single token-budget bump, so if #1279 warranted an entry, this PR definitely does.
Step-by-step proof:
1. `git diff main...HEAD --stat` for this PR shows only the two `.sh` files changed; `perf-changelog.yaml` is untouched.
2. Push-to-main triggers `run-sweep.yml`, which (per its `paths:` filter) only invokes the benchmark matrix when `perf-changelog.yaml` is in the changed-paths list.
3. Since this PR doesn't touch `perf-changelog.yaml`, the post-merge job will skip both `dsv4-fp8-h200-vllm` and `dsv4-fp8-h200-vllm-mtp`.
4. Result: inferencex.com continues serving the old EP+DP numbers despite the merged code now running pure TP=8, a silent staleness that is exactly what the changelog mechanism is designed to prevent.
How to fix: Append a new entry to the END of perf-changelog.yaml (per AGENTS.md, never insert in the middle), e.g.:
```yaml
- config-keys:
    - dsv4-fp8-h200-vllm
    - dsv4-fp8-h200-vllm-mtp
  description:
    - "Switch H200 vLLM DSv4 recipe from EP + DP=8 to pure TP=8"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1285
```

This will cause the post-merge sweep to re-benchmark both configs under the new recipe and refresh the published numbers.
```bash
    --no-enable-prefix-caching \
    --enable-expert-parallel \
    --data-parallel-size $TP \
    --tensor-parallel-size $TP \
```
🔴 The PR switches both dsv4_fp8_h200.sh and dsv4_fp8_h200_mtp.sh to pure TP, but .github/configs/nvidia-master.yaml lines 2627, 2631 (dsv4-fp8-h200-vllm) and lines 2649, 2653 (dsv4-fp8-h200-vllm-mtp) still specify { tp: 8, ep: 8, dp-attn: true, ... }. Those values are exported as EP_SIZE/DP_ATTENTION by benchmark-tmpl.yml, embedded in RESULT_FILENAME as ep8-dpaTrue, and recorded by utils/process_result.py as the ep and dp_attention metadata fields — so pure-TP runs will be silently mislabeled and aggregated under the old EP+DP-attn label. Update both search-space entries to ep: 1, dp-attn: false (or drop those keys) in this PR.
Extended reasoning...
Bug
After this PR, the script no longer passes `--enable-expert-parallel` or `--data-parallel-size $TP`; it just passes `--tensor-parallel-size $TP`. So the actual deployment is pure TP=8, with EP effectively 1 and DP-attention disabled.
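For concreteness, a sketch of the post-PR launch, assuming the flag set shown in the diff above; `$MODEL` is a placeholder, not taken from the script:

```bash
# Pure-TP launch after this PR: no expert parallelism, no DP-attention.
TP=8
vllm serve "$MODEL" \
  --tensor-parallel-size "$TP" \
  --no-enable-prefix-caching
```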
However, .github/configs/nvidia-master.yaml was not updated. Lines 2614–2653 still configure both DSv4 H200 vLLM entries with EP+DP-attn search-space:
```yaml
dsv4-fp8-h200-vllm:
  ...
  scenarios:
    fixed-seq-len:
      - isl: 1024
        osl: 1024
        search-space:
          - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64 }
      - isl: 8192
        osl: 1024
        search-space:
          - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64 }
dsv4-fp8-h200-vllm-mtp:
  ...
  search-space:
    - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64, spec-decoding: mtp }
  ...
    - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64, spec-decoding: mtp }
```

How it manifests
.github/workflows/benchmark-tmpl.yml lines 103–105 export the search-space fields as env vars:
```yaml
TP: ${{ inputs.tp }}
EP_SIZE: ${{ inputs.ep }}
DP_ATTENTION: ${{ inputs.dp-attn }}
```

Line 180 builds the result filename by interpolating those env vars verbatim:
```yaml
RESULT_FILENAME: ${{ env.EXP_NAME }}_${{ env.PRECISION }}_${{ env.FRAMEWORK }}_tp${{ env.TP }}-ep${{ env.EP_SIZE }}-dpa${{ env.DP_ATTENTION }}_disagg-...
```

`utils/process_result.py` lines 110–119 then read those env vars as required and write them into the result-metadata JSON:
```python
single_node_env = get_required_env_vars(['TP', 'EP_SIZE', 'DP_ATTENTION'])
tp_size = int(single_node_env['TP'])
ep_size = int(single_node_env['EP_SIZE'])
dp_attention = single_node_env['DP_ATTENTION']
single_node_data = {
    'is_multinode': False,
    'tp': tp_size,
    'ep': ep_size,
    'dp_attention': dp_attention,
    ...
}
```

Step-by-step proof
1. The CI workflow consumes `nvidia-master.yaml` and dispatches a `dsv4-fp8-h200-vllm` job with inputs `tp=8, ep=8, dp-attn=true`.
2. `benchmark-tmpl.yml` exports `TP=8`, `EP_SIZE=8`, `DP_ATTENTION=true` into the job environment.
3. The launcher invokes `benchmarks/single_node/dsv4_fp8_h200.sh`, which after this PR runs `vllm serve ... --tensor-parallel-size 8` only: no EP, no DP-attention. The `EP_SIZE` and `DP_ATTENTION` env vars are now ignored by the script.
4. `RESULT_FILENAME` is computed as `..._tp8-ep8-dpaTrue_disagg-...` (see the sketch after this list).
5. `utils/process_result.py` writes `{ tp: 8, ep: 8, dp_attention: "true", ... }` into the result JSON despite the actual run being pure TP.
6. Aggregated/published TP-only numbers are filed under an EP+DP-attn label, contaminating downstream comparisons.
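A quick illustration of step 4; the `EXP_NAME`/`PRECISION`/`FRAMEWORK` values here are assumptions, but the `ep`/`dpa` fields come straight from the search-space exports, regardless of what the server actually ran:

```bash
# Illustrative values only; EXP_NAME, PRECISION, and FRAMEWORK are guesses.
EXP_NAME=dsv4-fp8-h200-vllm
PRECISION=fp8
FRAMEWORK=vllm
TP=8; EP_SIZE=8; DP_ATTENTION=True
echo "${EXP_NAME}_${PRECISION}_${FRAMEWORK}_tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}_disagg-..."
# -> dsv4-fp8-h200-vllm_fp8_vllm_tp8-ep8-dpaTrue_disagg-...
```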
Why existing code doesn't prevent it
process_result.py has no way to discover what flags vllm serve was actually launched with — it just trusts the env vars from the search-space. The script itself silently ignores EP_SIZE and DP_ATTENTION now (since it no longer references them), so there is no error or warning. The previous commit message "switch from DP+EP to pure TP" matches the diff but missed the YAML side of the configuration.
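One way to make this fail loudly (hypothetical; no such check exists in `process_result.py` today) would be to cross-check the env-var metadata against the captured launch command. `LAUNCH_CMD` is an assumed variable that the launcher would have to export:

```python
import os
import shlex

def validate_parallelism_metadata() -> None:
    """Hypothetical guard: compare search-space env vars against the flags
    actually passed to `vllm serve` (assumes the launcher exports LAUNCH_CMD)."""
    tokens = shlex.split(os.environ['LAUNCH_CMD'])
    ep_enabled = '--enable-expert-parallel' in tokens
    ep_size = int(os.environ['EP_SIZE'])
    if ep_enabled != (ep_size > 1):
        raise RuntimeError(
            f"metadata says ep={ep_size} but --enable-expert-parallel is "
            f"{'present' if ep_enabled else 'absent'} in the launch command"
        )
```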
Fix
Update both search-space entries in .github/configs/nvidia-master.yaml to match the new deployment:
```yaml
dsv4-fp8-h200-vllm:
  ...
  search-space:
    - { tp: 8, ep: 1, dp-attn: false, conc-start: 4, conc-end: 64 }
  ...
    - { tp: 8, ep: 1, dp-attn: false, conc-start: 4, conc-end: 64 }
dsv4-fp8-h200-vllm-mtp:
  ...
  search-space:
    - { tp: 8, ep: 1, dp-attn: false, conc-start: 4, conc-end: 64, spec-decoding: mtp }
  ...
    - { tp: 8, ep: 1, dp-attn: false, conc-start: 4, conc-end: 64, spec-decoding: mtp }
```

This should be done in the same PR; otherwise the first run after merge produces mislabeled `ep8-dpaTrue` artifacts that are hard to correct retroactively.
Re-introduces the DP-attention + expert-parallel path on top of pure TP
by reading DP_ATTENTION/EP_SIZE env vars (matching dsv4_fp4_b200_vllm.sh).
nvidia-master.yaml now sweeps both {tp:8,ep:1,dp-attn:false} and
{tp:8,ep:8,dp-attn:true}, with conc-end raised from 64 to 256.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
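A rough sketch of the env-driven branch that commit describes, assuming it mirrors the flags from the original diff (`dsv4_fp4_b200_vllm.sh` itself is not reproduced here, and `$MODEL` is a placeholder):

```bash
# Hypothetical sketch of the restored DP-attention + expert-parallel path;
# the real script may structure this differently.
TP=8
EXTRA_ARGS=()
if [ "${DP_ATTENTION:-false}" = "true" ] && [ "${EP_SIZE:-1}" -gt 1 ]; then
  EXTRA_ARGS+=(--enable-expert-parallel --data-parallel-size "$TP")
fi
vllm serve "$MODEL" \
  --tensor-parallel-size "$TP" \
  --no-enable-prefix-caching \
  "${EXTRA_ARGS[@]}"
```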
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp8-h200-vllm
@ywang96 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25403559794
…ntry (#1287)

* Revert "Revert "Add pure TP configuration to H200 vLLM DSv4 deployment (#1285)""

  This reverts commit 2d8547c.

* Add perf-changelog entry for dsv4-fp8-h200-vllm{,-mtp} pure-TP config

  The original PR #1285 landed without this changelog entry; it was reverted in #1286 and is being re-landed here together with the missing entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No description provided.