dsv4-fp8-h200-vllm-mtp: bump num_speculative_tokens 1 → 2#1279
dsv4-fp8-h200-vllm-mtp: bump num_speculative_tokens 1 → 2#1279functionstackx merged 2 commits intomainfrom
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| description: | ||
| - "Bump --speculative-config num_speculative_tokens from 1 to 2 (`{\"method\":\"mtp\",\"num_speculative_tokens\":2}`)" | ||
| - "Re-test whether H200 MTP kernels accept 2 draft tokens — Blackwell MTP runs at 2 (per @wzhao18's vLLM Blackwell MTP submission); checking if H200 has parity now" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1279 |
There was a problem hiding this comment.
🔴 The new perf-changelog.yaml entry for this PR has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER on line 2216 — the seed placeholder from the AGENTS.md template was never replaced with the real PR number (1279). Clicking through the changelog entry to find the originating PR returns a GitHub 404, defeating the audit-trail purpose of the file. Trivial one-token fix on line 2216: PLACEHOLDER → 1279.
Extended reasoning...
What the bug is
The new entry appended to perf-changelog.yaml for the dsv4-fp8-h200-vllm-mtp sweep ends with:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDERThe rendered diff text in the GitHub UI happens to show pull/1279, but the actual committed content at HEAD (026143c) on line 2216 still says PLACEHOLDER. tail -1 perf-changelog.yaml and git show HEAD -- perf-changelog.yaml | tail -1 both confirm this directly.
Why the placeholder is there
AGENTS.md (around lines 281 and 304) instructs authors to seed new changelog entries with pull/XXX-style placeholder text and substitute the real PR number after pushing. The author of this PR followed the template but skipped the substitution step.
Why nothing caught it
utils/matrix_logic/validation.py:483 declares pr_link as a plain str, with no URL/format/regex constraint:
pr_link: strSo PLACEHOLDER validates fine. There is no other test or lint that would have rejected it.
How rare this is
This is the only non-numeric pr-link in the entire 252-entry file:
$ grep -n 'pr-link' perf-changelog.yaml | grep -v 'pull/[0-9]'
2216: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER
Every other one of the ~250 entries points at a real numeric PR.
Step-by-step proof
- Reviewer opens
perf-changelog.yamlatHEADafter this PR merges. - They scroll to the bottom (line 2216) — the entry for
dsv4-fp8-h200-vllm-mtpsweep [NVIDIA] Add TRT 70B (FP8 and FP4) #2. - They click the
pr-linkto find the PR that introduced this sweep. - Browser navigates to
https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER. - GitHub returns a 404 —
PLACEHOLDERis not a numeric PR id. - The reviewer cannot trace the perf delta back to the originating PR via the changelog, which is precisely the workflow the changelog exists to support.
Impact
Not a runtime/behavior bug — the benchmark recipe itself is untouched. But the changelog entry is broken-as-merged: the audit-trail link 404s, and the entry sits next to ~250 well-formed siblings as the lone outlier. Easy to miss once it lands on main.
Fix
One-token edit on line 2216 of perf-changelog.yaml:
- pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER
+ pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1279(Optional follow-up, out of scope for this PR: tighten pr_link in utils/matrix_logic/validation.py:483 to require a numeric PR id, e.g. pydantic.HttpUrl + a regex check or a custom validator that asserts the path ends with /pull/<int>. That would prevent the same template-substitution miss from sneaking past in the future.)
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25349104528 |
1 similar comment
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25349104528 |
Tests whether H200 MTP kernels accept 2 draft tokens. Blackwell runs at 2 per @wzhao18's vLLM Blackwell MTP submission; checking if H200 has parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a95421e to
db615ed
Compare
|
thanks to inferact vllm community maintainers for fixing hopper top_k kernel and merging my pr for recipes vllm-project/recipes#435 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25356292856 |
…isAI#1279) * dsv4-fp8-h200-vllm-mtp: bump num_speculative_tokens 1 → 2 Tests whether H200 MTP kernels accept 2 draft tokens. Blackwell runs at 2 per @wzhao18's vLLM Blackwell MTP submission; checking if H200 has parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf-changelog: fill in PR SemiAnalysisAI#1279 link Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Bump
num_speculative_tokensfrom 1 to 2 in thedsv4-fp8-h200-vllm-mtprecipe.--speculative-config '{"method":"mtp","num_speculative_tokens":1}'→'{"method":"mtp","num_speculative_tokens":2}'inbenchmarks/single_node/dsv4_fp8_h200_mtp.sh.perf-changelog.yamlentry to trigger the new sweep.Context
The original PR (#1222) landed at
num_speculative_tokens=1because the recipe says H200 MTP kernels only support 1 draft token. Inferact folks indicated H200 kernels now support 2, and Blackwell MTP runs at 2 per @wzhao18's vLLM Blackwell MTP submission. This PR re-tests that assumption on H200. If acceptance / throughput look healthy, we keep 2; otherwise we revert.Test plan
dsv4-fp8-h200-vllm-mtpbenchmark workflow on an H200 runner and confirm the engine starts and the sweep completes for at least one cell from each of the twoseq-len-configs.server.logshows--speculative-config '{"method":"mtp","num_speculative_tokens":2}'.num_speculative_tokens=1numbers from the Add DSv4 FP8 H200 vLLM MTP benchmark #1222 sweep — otherwise the kernel-support assumption was wrong and we revert.🤖 Generated with Claude Code