dsv4-fp8-h200-vllm-mtp: bump num_speculative_tokens 1 → 2 by functionstackx · Pull Request #1279 · SemiAnalysisAI/InferenceX

functionstackx · 2026-05-04T23:25:54Z

Summary

Bump num_speculative_tokens from 1 to 2 in the dsv4-fp8-h200-vllm-mtp recipe.

Single change: --speculative-config '{"method":"mtp","num_speculative_tokens":1}' → '{"method":"mtp","num_speculative_tokens":2}' in benchmarks/single_node/dsv4_fp8_h200_mtp.sh.
Companion comment in the script header is updated to match.
Adds a perf-changelog.yaml entry to trigger the new sweep.

Context

The original PR (#1222) landed at num_speculative_tokens=1 because the recipe says H200 MTP kernels only support 1 draft token. Inferact folks indicated H200 kernels now support 2, and Blackwell MTP runs at 2 per @wzhao18's vLLM Blackwell MTP submission. This PR re-tests that assumption on H200. If acceptance / throughput look healthy, we keep 2; otherwise we revert.

Test plan

Trigger the dsv4-fp8-h200-vllm-mtp benchmark workflow on an H200 runner and confirm the engine starts and the sweep completes for at least one cell from each of the two seq-len-configs.
server.log shows --speculative-config '{"method":"mtp","num_speculative_tokens":2}'.
Acceptance rate is in a sane range and throughput at the same concurrency points beats the num_speculative_tokens=1 numbers from the Add DSv4 FP8 H200 vLLM MTP benchmark #1222 sweep — otherwise the kernel-support assumption was wrong and we revert.

🤖 Generated with Claude Code

github-actions · 2026-05-04T23:26:03Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude · 2026-05-04T23:32:17Z

+  description:
+    - "Bump --speculative-config num_speculative_tokens from 1 to 2 (`{\"method\":\"mtp\",\"num_speculative_tokens\":2}`)"
+    - "Re-test whether H200 MTP kernels accept 2 draft tokens — Blackwell MTP runs at 2 (per @wzhao18's vLLM Blackwell MTP submission); checking if H200 has parity now"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1279


🔴 The new perf-changelog.yaml entry for this PR has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER on line 2216 — the seed placeholder from the AGENTS.md template was never replaced with the real PR number (1279). Clicking through the changelog entry to find the originating PR returns a GitHub 404, defeating the audit-trail purpose of the file. Trivial one-token fix on line 2216: PLACEHOLDER → 1279.

Extended reasoning...

What the bug is

The new entry appended to perf-changelog.yaml for the dsv4-fp8-h200-vllm-mtp sweep ends with:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER

The rendered diff text in the GitHub UI happens to show pull/1279, but the actual committed content at HEAD (026143c) on line 2216 still says PLACEHOLDER. tail -1 perf-changelog.yaml and git show HEAD -- perf-changelog.yaml | tail -1 both confirm this directly.

Why the placeholder is there

AGENTS.md (around lines 281 and 304) instructs authors to seed new changelog entries with pull/XXX-style placeholder text and substitute the real PR number after pushing. The author of this PR followed the template but skipped the substitution step.

Why nothing caught it

utils/matrix_logic/validation.py:483 declares pr_link as a plain str, with no URL/format/regex constraint:

pr_link: str

So PLACEHOLDER validates fine. There is no other test or lint that would have rejected it.

How rare this is

This is the only non-numeric pr-link in the entire 252-entry file:

$ grep -n 'pr-link' perf-changelog.yaml | grep -v 'pull/[0-9]' 2216: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER

Every other one of the ~250 entries points at a real numeric PR.

Step-by-step proof

Reviewer opens perf-changelog.yaml at HEAD after this PR merges.

They scroll to the bottom (line 2216) — the entry for dsv4-fp8-h200-vllm-mtp sweep [NVIDIA] Add TRT 70B (FP8 and FP4) #2.

They click the pr-link to find the PR that introduced this sweep.

Browser navigates to https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER.

GitHub returns a 404 — PLACEHOLDER is not a numeric PR id.

The reviewer cannot trace the perf delta back to the originating PR via the changelog, which is precisely the workflow the changelog exists to support.

Impact

Not a runtime/behavior bug — the benchmark recipe itself is untouched. But the changelog entry is broken-as-merged: the audit-trail link 404s, and the entry sits next to ~250 well-formed siblings as the lone outlier. Easy to miss once it lands on main.

Fix

One-token edit on line 2216 of perf-changelog.yaml:

- pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER + pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1279

(Optional follow-up, out of scope for this PR: tighten pr_link in utils/matrix_logic/validation.py:483 to require a numeric PR id, e.g. pydantic.HttpUrl + a regex check or a custom validator that asserts the path ends with /pull/<int>. That would prevent the same template-substitution miss from sneaking past in the future.)

github-actions · 2026-05-05T00:26:11Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25349104528
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25349104528

github-actions · 2026-05-05T01:33:01Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25349104528
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25349104528

@wzhao18

Tests whether H200 MTP kernels accept 2 draft tokens. Blackwell runs at 2 per @wzhao18's vLLM Blackwell MTP submission; checking if H200 has parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

functionstackx · 2026-05-05T03:31:48Z

thanks to inferact vllm community maintainers for fixing hopper top_k kernel and merging my pr for recipes vllm-project/recipes#435

github-actions · 2026-05-05T04:17:56Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25356292856
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25356292856

@wzhao18

…isAI#1279) * dsv4-fp8-h200-vllm-mtp: bump num_speculative_tokens 1 → 2 Tests whether H200 MTP kernels accept 2 draft tokens. Blackwell runs at 2 per @wzhao18's vLLM Blackwell MTP submission; checking if H200 has parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf-changelog: fill in PR SemiAnalysisAI#1279 link Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

functionstackx requested a review from a team May 4, 2026 23:25

github-project-automation Bot added this to InferenceMAX Board May 4, 2026

functionstackx added a commit that referenced this pull request May 4, 2026

perf-changelog: fill in PR #1279 link

a95421e

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

functionstackx added the full-sweep-enabled label May 4, 2026

claude Bot reviewed May 4, 2026

View reviewed changes

functionstackx and others added 2 commits May 4, 2026 23:29

dsv4-fp8-h200-vllm-mtp: bump num_speculative_tokens 1 → 2

562ac10

Tests whether H200 MTP kernels accept 2 draft tokens. Blackwell runs at 2 per @wzhao18's vLLM Blackwell MTP submission; checking if H200 has parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

perf-changelog: fill in PR #1279 link

db615ed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

functionstackx force-pushed the claude/dsv4-h200-mtp-num-spec-tokens-2 branch from a95421e to db615ed Compare May 5, 2026 03:29

functionstackx merged commit 9d5bbb8 into main May 5, 2026
22 checks passed

functionstackx deleted the claude/dsv4-h200-mtp-num-spec-tokens-2 branch May 5, 2026 03:31

github-project-automation Bot moved this to Done in InferenceMAX Board May 5, 2026

This was referenced May 5, 2026

Add pure TP configuration to H200 vLLM DSv4 deployment #1285

Merged

Re-PR #1285: pure TP H200 vLLM DSv4 config + missing perf-changelog entry #1287

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dsv4-fp8-h200-vllm-mtp: bump num_speculative_tokens 1 → 2#1279

dsv4-fp8-h200-vllm-mtp: bump num_speculative_tokens 1 → 2#1279
functionstackx merged 2 commits intomainfrom
claude/dsv4-h200-mtp-num-spec-tokens-2

functionstackx commented May 4, 2026

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

claude Bot May 4, 2026

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

functionstackx commented May 5, 2026

Uh oh!

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

functionstackx commented May 4, 2026

Summary

Context

Test plan

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

claude Bot May 4, 2026

Choose a reason for hiding this comment

What the bug is

Why the placeholder is there

Why nothing caught it

How rare this is

Step-by-step proof

Impact

Fix

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

functionstackx commented May 5, 2026

Uh oh!

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant