Skip to content

dsv4-fp8-h200-vllm-mtp: bump num_speculative_tokens 1 → 2#1279

Merged
functionstackx merged 2 commits intomainfrom
claude/dsv4-h200-mtp-num-spec-tokens-2
May 5, 2026
Merged

dsv4-fp8-h200-vllm-mtp: bump num_speculative_tokens 1 → 2#1279
functionstackx merged 2 commits intomainfrom
claude/dsv4-h200-mtp-num-spec-tokens-2

Conversation

@functionstackx
Copy link
Copy Markdown
Contributor

Summary

Bump num_speculative_tokens from 1 to 2 in the dsv4-fp8-h200-vllm-mtp recipe.

  • Single change: --speculative-config '{"method":"mtp","num_speculative_tokens":1}''{"method":"mtp","num_speculative_tokens":2}' in benchmarks/single_node/dsv4_fp8_h200_mtp.sh.
  • Companion comment in the script header is updated to match.
  • Adds a perf-changelog.yaml entry to trigger the new sweep.

Context

The original PR (#1222) landed at num_speculative_tokens=1 because the recipe says H200 MTP kernels only support 1 draft token. Inferact folks indicated H200 kernels now support 2, and Blackwell MTP runs at 2 per @wzhao18's vLLM Blackwell MTP submission. This PR re-tests that assumption on H200. If acceptance / throughput look healthy, we keep 2; otherwise we revert.

Test plan

  • Trigger the dsv4-fp8-h200-vllm-mtp benchmark workflow on an H200 runner and confirm the engine starts and the sweep completes for at least one cell from each of the two seq-len-configs.
  • server.log shows --speculative-config '{"method":"mtp","num_speculative_tokens":2}'.
  • Acceptance rate is in a sane range and throughput at the same concurrency points beats the num_speculative_tokens=1 numbers from the Add DSv4 FP8 H200 vLLM MTP benchmark #1222 sweep — otherwise the kernel-support assumption was wrong and we revert.

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

functionstackx added a commit that referenced this pull request May 4, 2026
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread perf-changelog.yaml
description:
- "Bump --speculative-config num_speculative_tokens from 1 to 2 (`{\"method\":\"mtp\",\"num_speculative_tokens\":2}`)"
- "Re-test whether H200 MTP kernels accept 2 draft tokens — Blackwell MTP runs at 2 (per @wzhao18's vLLM Blackwell MTP submission); checking if H200 has parity now"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1279
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new perf-changelog.yaml entry for this PR has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER on line 2216 — the seed placeholder from the AGENTS.md template was never replaced with the real PR number (1279). Clicking through the changelog entry to find the originating PR returns a GitHub 404, defeating the audit-trail purpose of the file. Trivial one-token fix on line 2216: PLACEHOLDER1279.

Extended reasoning...

What the bug is

The new entry appended to perf-changelog.yaml for the dsv4-fp8-h200-vllm-mtp sweep ends with:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER

The rendered diff text in the GitHub UI happens to show pull/1279, but the actual committed content at HEAD (026143c) on line 2216 still says PLACEHOLDER. tail -1 perf-changelog.yaml and git show HEAD -- perf-changelog.yaml | tail -1 both confirm this directly.

Why the placeholder is there

AGENTS.md (around lines 281 and 304) instructs authors to seed new changelog entries with pull/XXX-style placeholder text and substitute the real PR number after pushing. The author of this PR followed the template but skipped the substitution step.

Why nothing caught it

utils/matrix_logic/validation.py:483 declares pr_link as a plain str, with no URL/format/regex constraint:

pr_link: str

So PLACEHOLDER validates fine. There is no other test or lint that would have rejected it.

How rare this is

This is the only non-numeric pr-link in the entire 252-entry file:

$ grep -n 'pr-link' perf-changelog.yaml | grep -v 'pull/[0-9]'
2216:  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER

Every other one of the ~250 entries points at a real numeric PR.

Step-by-step proof

  1. Reviewer opens perf-changelog.yaml at HEAD after this PR merges.
  2. They scroll to the bottom (line 2216) — the entry for dsv4-fp8-h200-vllm-mtp sweep [NVIDIA] Add TRT 70B (FP8 and FP4) #2.
  3. They click the pr-link to find the PR that introduced this sweep.
  4. Browser navigates to https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER.
  5. GitHub returns a 404 — PLACEHOLDER is not a numeric PR id.
  6. The reviewer cannot trace the perf delta back to the originating PR via the changelog, which is precisely the workflow the changelog exists to support.

Impact

Not a runtime/behavior bug — the benchmark recipe itself is untouched. But the changelog entry is broken-as-merged: the audit-trail link 404s, and the entry sits next to ~250 well-formed siblings as the lone outlier. Easy to miss once it lands on main.

Fix

One-token edit on line 2216 of perf-changelog.yaml:

-  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1279

(Optional follow-up, out of scope for this PR: tighten pr_link in utils/matrix_logic/validation.py:483 to require a numeric PR id, e.g. pydantic.HttpUrl + a regex check or a custom validator that asserts the path ends with /pull/<int>. That would prevent the same template-substitution miss from sneaking past in the future.)

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

functionstackx and others added 2 commits May 4, 2026 23:29
Tests whether H200 MTP kernels accept 2 draft tokens. Blackwell
runs at 2 per @wzhao18's vLLM Blackwell MTP submission; checking
if H200 has parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the claude/dsv4-h200-mtp-num-spec-tokens-2 branch from a95421e to db615ed Compare May 5, 2026 03:29
@functionstackx
Copy link
Copy Markdown
Contributor Author

thanks to inferact vllm community maintainers for fixing hopper top_k kernel and merging my pr for recipes vllm-project/recipes#435

@functionstackx functionstackx merged commit 9d5bbb8 into main May 5, 2026
22 checks passed
@functionstackx functionstackx deleted the claude/dsv4-h200-mtp-num-spec-tokens-2 branch May 5, 2026 03:31
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

xiaohuguo2023 pushed a commit to xiaohuguo2023/InferenceX that referenced this pull request May 6, 2026
…isAI#1279)

* dsv4-fp8-h200-vllm-mtp: bump num_speculative_tokens 1 → 2

Tests whether H200 MTP kernels accept 2 draft tokens. Blackwell
runs at 2 per @wzhao18's vLLM Blackwell MTP submission; checking
if H200 has parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf-changelog: fill in PR SemiAnalysisAI#1279 link

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

1 participant