[NV] qwen3.5 fp4 b200 sglang mtp #1257
Conversation
|
@claude - add PR description |
|
Claude finished @hshrivastava-droid's task in 49s —— View job Adding PR Description
PR description has been added summarizing all changes: image bump, search space expansion (TP=2 sweep), server flag overhaul, and perf-changelog entry. |
```shell
--chunked-prefill-size 16384 \
--mem-fraction-static 0.8 \
--stream-interval 50 \
--scheduler-recv-interval $( [[ $CONC -gt 4 ]] && echo 30 || echo 10 ) \
```
🟡 The new `scheduler-recv-interval` uses `$( [[ $CONC -gt 4 ]] && echo 30 || echo 10 )`, so any CONC >= 5 (e.g. CONC=8 in the new TP=2 sweep) gets interval=30. The perf-changelog says this PR aims to "Align server flags with FP4 B200 STP", but the FP4 B200 STP companion (qwen3.5_fp4_b200.sh:32) and every other qwen3.5 *_mtp.sh sibling still use `CONC -ge 16` as the 30/10 cutoff. Could you confirm `CONC > 4` is intentional (matching qwen3.5_fp8_b200.sh), or change it to `-ge 16` to actually match FP4 B200 STP?
Extended reasoning...
What's happening
benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh:52 sets:

```shell
--scheduler-recv-interval $( [[ $CONC -gt 4 ]] && echo 30 || echo 10 )
```

So the threshold for switching to interval=30 is CONC >= 5. The deleted code in the same file (and every other qwen3.5 *_mtp.sh and the FP4 B200 STP companion) used:

```shell
if [[ $CONC -ge 16 ]]; then SCHEDULER_RECV_INTERVAL=30; else SCHEDULER_RECV_INTERVAL=10; fi
```

i.e. threshold CONC >= 16.
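The two threshold rules can be compared side by side in a standalone sketch; the function names here are illustrative, not taken from the scripts:

```shell
# Compare the PR's new cutoff (CONC > 4) against the sibling scripts' cutoff (CONC >= 16).
new_rule() { [[ $1 -gt 4  ]] && echo 30 || echo 10; }  # this PR / qwen3.5_fp8_b200.sh pattern
old_rule() { [[ $1 -ge 16 ]] && echo 30 || echo 10; }  # qwen3.5_fp4_b200.sh STP pattern
for CONC in 4 8 16; do
  echo "CONC=$CONC new=$(new_rule $CONC) old=$(old_rule $CONC)"
done
# → CONC=4 new=10 old=10
# → CONC=8 new=30 old=10
# → CONC=16 new=30 old=30
```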
Why this is worth flagging
The perf-changelog entry for this PR says: "Align server flags with FP4 B200 STP: --enable-symm-mem, --expert-parallel-size, dynamic scheduler-recv-interval". But the named reference benchmarks/single_node/qwen3.5_fp4_b200.sh:32 (the FP4 B200 STP script) still uses CONC -ge 16. So the chosen threshold does not match what the PR claims to align with.
What it does match
The new threshold matches benchmarks/single_node/qwen3.5_fp8_b200.sh:51, which was updated in PR #1027 to the CONC -gt 4 pattern. The full launch block in the new MTP script is in fact much closer to qwen3.5_fp8_b200.sh than to qwen3.5_fp4_b200.sh (same --enable-symm-mem, same --max-prefill-tokens 16384, same --stream-interval 50, same --mem-fraction-static 0.8). So the most likely scenario is that the author copy-pasted the launch block from qwen3.5_fp8_b200.sh, not from qwen3.5_fp4_b200.sh.
Concrete impact
In the new TP=2 search space (conc-start: 4, conc-end: 128), the swept concurrencies that exist in both this MTP script and the FP4 STP companion are 4, 8, 16, 32, 64, 128. At CONC=8:
- This MTP script: scheduler-recv-interval = 30
- FP4 B200 STP (qwen3.5_fp4_b200.sh): scheduler-recv-interval = 10

That's a 3x divergence in scheduler batching at one swept point. CONC >= 16 already matches under both rules, and CONC=4 also matches. So only CONC=8 actually diverges among standard sweep points — measurement effect is small but real.
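The "only CONC=8 diverges" claim can be checked mechanically; a minimal sketch, with the sweep values taken from the text:

```shell
# Enumerate the shared sweep points and record where the two cutoff rules disagree.
diverging=""
for CONC in 4 8 16 32 64 128; do
  new=$( [[ $CONC -gt 4  ]] && echo 30 || echo 10 )   # this PR's rule
  old=$( [[ $CONC -ge 16 ]] && echo 30 || echo 10 )   # sibling scripts' rule
  if [[ $new != $old ]]; then diverging="$diverging $CONC"; fi
done
echo "diverging:$diverging"   # → diverging: 8
```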
Why I'm filing as nit, not normal
- It only affects perf data at CONC=8 — this is a tuning knob, not a correctness bug.
- It's plausibly intentional: the launch block mirrors the recently-updated qwen3.5_fp8_b200.sh, so the author may have deliberately picked the FP8-B200 pattern.
I'm flagging it because the PR description explicitly names FP4 B200 STP as the alignment target, and the chosen threshold does not actually match that target. Easiest fix: either change to -ge 16 to truly mirror qwen3.5_fp4_b200.sh, or update the perf-changelog to say "align with FP8 B200 STP / FP8 B200 SGLang" instead.
- "Update image to lmsysorg/sglang:nightly-dev-20260422-de962f32"
- "Add tp:2 ep:1 conc 4-128 search-space for 1k1k and 8k1k"
- "Align server flags with FP4 B200 STP: --enable-symm-mem, --expert-parallel-size, dynamic scheduler-recv-interval"
- "Add MTP flags: SGLANG_ENABLE_SPEC_V2=1, EAGLE speculative decoding (steps=3, topk=1, draft=4)"
🟡 The new perf-changelog.yaml entry for qwen3.5-fp4-b200-sglang-mtp uses the literal placeholder https://github.com/SemiAnalysisAI/InferenceX/pull/XXX for its pr-link. Every other entry in this file points to a real PR — please replace XXX with this PR's number (1257) before merge so the changelog stays traceable.
Extended reasoning...
What the bug is
The new entry appended to perf-changelog.yaml (the last block in the file, covering qwen3.5-fp4-b200-sglang-mtp) ends with:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```

XXX is a literal placeholder, not a substituted value. The actual PR number is 1257, as shown in the PR metadata.
Why existing entries don't have this issue
Every other entry in perf-changelog.yaml (well over 140 of them) uses a real PR number — e.g. the immediately preceding entry uses pull/1027, and the others span pull/95 through pull/1223. The placeholder XXX is unique to this newly added block and is clearly a stub the author forgot to fill in before pushing.
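A quick audit for stray placeholders can be sketched as below; the sample file is a hypothetical stand-in for the real perf-changelog.yaml:

```shell
# Find pr-link lines whose path segment after pull/ is not a number.
sample=$(mktemp)
printf '%s\n' \
  'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1027' \
  'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX' > "$sample"
grep -n 'pull/[^0-9]' "$sample"
# → 2:pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```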
How it manifests / impact
This is a documentation/metadata defect, not a runtime bug. The benchmark scripts and nvidia-master.yaml config changes work regardless of what is written in perf-changelog.yaml. However, this file is the project's authoritative log mapping config-key changes to the PRs that introduced them; any tooling, reviewer, or future bisecting effort that follows the pr-link for this entry will hit GitHub's 404 page for /pull/XXX (since XXX is not a valid PR number) instead of landing on PR #1257.
Step-by-step proof
1. Open perf-changelog.yaml and scroll to the bottom — the new entry added by this PR is the last block.
2. The block's config-keys lists qwen3.5-fp4-b200-sglang-mtp (the new key being introduced in .github/configs/nvidia-master.yaml in this same PR).
3. Its pr-link field reads https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
4. The PR metadata shows this PR is number 1257, so the link should read https://github.com/SemiAnalysisAI/InferenceX/pull/1257.
5. Compare against the entry directly above it (for qwen3.5-fp8-b200-sglang), which correctly resolves to pull/1027.
How to fix
Replace XXX with 1257 in the new entry, e.g.:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1257
```

This is the only change needed; nothing else in the file or the rest of the diff needs to be touched. Severity is nit because it doesn't affect benchmark execution, but it should be fixed before merge to maintain the file's traceability invariant.
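The replacement itself can be scripted; this sketch operates on a hypothetical sample file rather than the real perf-changelog.yaml:

```shell
# Substitute the literal XXX placeholder with the PR number 1257.
entry=$(mktemp)
echo 'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX' > "$entry"
sed -i 's#pull/XXX$#pull/1257#' "$entry"
cat "$entry"   # → pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1257
```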
|
@claude - rebase and reverse conflicts |
|
Claude finished @hshrivastava-droid's task in 3m 34s —— View job Rebase and Resolve Conflicts
Rebased successfully onto origin/main. Resolved conflicts in:
Branch is now 0 commits behind main. |
Force-pushed 9d2a2cc to 40aab28
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25234242656 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25237446161 |
|
@Klaud-Cold add PR description |
|
sglang recipe - sgl-project/sgl-cookbook#266 |
|
Claude finished @hshrivastava-droid's task in 48s —— View job Adding PR Description
PR description has been updated with a summary of all changes: image bump, search space expansion (TP=2 sweep + TP=4 pinned to conc=4), server flag overhaul (SPEC_V2, symm-mem, reduced prefill/chunked, dynamic scheduler interval), and perf-changelog entry. |
Summary

Update Qwen3.5 FP4 B200 SGLang MTP benchmark configuration and server flags for improved performance.

Changes

Image Update
- `nightly-dev-20260402-d7256eb6` → `nightly-dev-20260422-de962f32`

Search Space Expansion (`nvidia-master.yaml`)
- Add tp:2 ep:1 conc 4-128 search-space for 1k1k and 8k1k

Server Flag Overhaul (`qwen3.5_fp4_b200_mtp.sh`)
- Add `SGLANG_ENABLE_SPEC_V2=1` for v2 speculative decoding path
- Add `--enable-symm-mem` and `--expert-parallel-size=$EP_SIZE`
- Dynamic `--scheduler-recv-interval` based on concurrency (10 if ≤4, 30 if >4)
- Reduce `--max-prefill-tokens` and `--chunked-prefill-size` from 32768 → 16384
- Reduce `--mem-fraction-static` from 0.85 → 0.8
- Set `--max-running-requests` and `--cuda-graph-max-bs` to `$CONC` (was hardcoded 128 / `$CONC`)
- Add `--tokenizer-path $MODEL` explicitly
- Increase `--stream-interval` from 30 → 50
- Remove env vars (`NCCL_NVLS_ENABLE`, `SGL_ENABLE_JIT_DEEPGEMM`, `SGLANG_ENABLE_FLASHINFER_GEMM`)
- Remove `--fp4-gemm-backend flashinfer_cutlass` (use default)
- Remove `--enable-flashinfer-allreduce-fusion` for TP=8

Changelog
- `perf-changelog.yaml` entry for `qwen3.5-fp4-b200-sglang-mtp` documenting all changes
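Pulling the flag bullets together, a rough sketch of the resulting launch block is below; variable names (`$MODEL`, `$TP_SIZE`, `$EP_SIZE`, `$CONC`), flag order, and the exact speculative-decoding flag spellings are assumptions, not the script verbatim.

```shell
# Hedged sketch only — not the actual benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh.
# $MODEL, $TP_SIZE, $EP_SIZE, $CONC are assumed to be exported by the benchmark harness.
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path "$MODEL" \
  --tokenizer-path "$MODEL" \
  --tp-size "$TP_SIZE" \
  --expert-parallel-size "$EP_SIZE" \
  --enable-symm-mem \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --max-prefill-tokens 16384 \
  --chunked-prefill-size 16384 \
  --mem-fraction-static 0.8 \
  --max-running-requests "$CONC" \
  --cuda-graph-max-bs "$CONC" \
  --stream-interval 50 \
  --scheduler-recv-interval $( [[ $CONC -gt 4 ]] && echo 30 || echo 10 )
```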