[NV] qwen3.5 fp4 b200 sglang mtp #1257
Conversation
|
@claude - add PR description |
|
Claude finished @hshrivastava-droid's task in 49s —— View job Adding PR Description
PR description has been added summarizing all changes: image bump, search space expansion (TP=2 sweep), server flag overhaul, and perf-changelog entry. |
```shell
--chunked-prefill-size 16384 \
--mem-fraction-static 0.8 \
--stream-interval 50 \
--scheduler-recv-interval $( [[ $CONC -gt 4 ]] && echo 30 || echo 10 ) \
```
🟡 The new `scheduler-recv-interval` uses `$( [[ $CONC -gt 4 ]] && echo 30 || echo 10 )`, so any CONC >= 5 (e.g. CONC=8 in the new TP=2 sweep) gets interval=30. The perf-changelog says this PR aims to "Align server flags with FP4 B200 STP", but the FP4 B200 STP companion (qwen3.5_fp4_b200.sh:32) and every other qwen3.5 *_mtp.sh sibling still use `CONC -ge 16` as the 30/10 cutoff. Could you confirm `CONC > 4` is intentional (matching qwen3.5_fp8_b200.sh), or change it to `-ge 16` to actually match FP4 B200 STP?
Extended reasoning...
What's happening
benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh:52 sets:

```shell
--scheduler-recv-interval $( [[ $CONC -gt 4 ]] && echo 30 || echo 10 )
```

So the threshold for switching to interval=30 is CONC >= 5. The deleted code in the same file (and every other qwen3.5 *_mtp.sh and the FP4 B200 STP companion) used:

```shell
if [[ $CONC -ge 16 ]]; then SCHEDULER_RECV_INTERVAL=30; else SCHEDULER_RECV_INTERVAL=10; fi
```

i.e. threshold CONC >= 16.
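The two threshold rules can be compared side by side in a standalone sketch; the function names here are illustrative, not taken from the scripts:

```shell
# Compare the PR's new cutoff (CONC > 4) against the sibling scripts' cutoff (CONC >= 16).
new_rule() { [[ $1 -gt 4  ]] && echo 30 || echo 10; }  # this PR / qwen3.5_fp8_b200.sh pattern
old_rule() { [[ $1 -ge 16 ]] && echo 30 || echo 10; }  # qwen3.5_fp4_b200.sh STP pattern
for CONC in 4 8 16; do
  echo "CONC=$CONC new=$(new_rule $CONC) old=$(old_rule $CONC)"
done
# → CONC=4 new=10 old=10
# → CONC=8 new=30 old=10
# → CONC=16 new=30 old=30
```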
Why this is worth flagging
The perf-changelog entry for this PR says: "Align server flags with FP4 B200 STP: --enable-symm-mem, --expert-parallel-size, dynamic scheduler-recv-interval". But the named reference benchmarks/single_node/qwen3.5_fp4_b200.sh:32 (the FP4 B200 STP script) still uses CONC -ge 16. So the chosen threshold does not match what the PR claims to align with.
What it does match
The new threshold matches benchmarks/single_node/qwen3.5_fp8_b200.sh:51, which was updated in PR #1027 to the CONC -gt 4 pattern. The full launch block in the new MTP script is in fact much closer to qwen3.5_fp8_b200.sh than to qwen3.5_fp4_b200.sh (same --enable-symm-mem, same --max-prefill-tokens 16384, same --stream-interval 50, same --mem-fraction-static 0.8). So the most likely scenario is that the author copy-pasted the launch block from qwen3.5_fp8_b200.sh, not from qwen3.5_fp4_b200.sh.
Concrete impact
In the new TP=2 search space (conc-start: 4, conc-end: 128), the swept concurrencies that exist in both this MTP script and the FP4 STP companion are 4, 8, 16, 32, 64, 128. At CONC=8:
- This MTP script: scheduler-recv-interval = 30
- FP4 B200 STP (qwen3.5_fp4_b200.sh): scheduler-recv-interval = 10

That's a 3x divergence in scheduler batching at one swept point. CONC >= 16 already matches under both rules, and CONC=4 also matches. So only CONC=8 actually diverges among standard sweep points — measurement effect is small but real.
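The "only CONC=8 diverges" claim can be checked mechanically; a minimal sketch, with the sweep values taken from the text:

```shell
# Enumerate the shared sweep points and record where the two cutoff rules disagree.
diverging=""
for CONC in 4 8 16 32 64 128; do
  new=$( [[ $CONC -gt 4  ]] && echo 30 || echo 10 )   # this PR's rule
  old=$( [[ $CONC -ge 16 ]] && echo 30 || echo 10 )   # sibling scripts' rule
  if [[ $new != $old ]]; then diverging="$diverging $CONC"; fi
done
echo "diverging:$diverging"   # → diverging: 8
```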
Why I'm filing as nit, not normal
- It only affects perf data at CONC=8 — this is a tuning knob, not a correctness bug.
- It's plausibly intentional: the launch block mirrors the recently-updated qwen3.5_fp8_b200.sh, so the author may have deliberately picked the FP8-B200 pattern.
I'm flagging it because the PR description explicitly names FP4 B200 STP as the alignment target, and the chosen threshold does not actually match that target. Easiest fix: either change to -ge 16 to truly mirror qwen3.5_fp4_b200.sh, or update the perf-changelog to say "align with FP8 B200 STP / FP8 B200 SGLang" instead.
- "Update image to lmsysorg/sglang:nightly-dev-20260422-de962f32"
- "Add tp:2 ep:1 conc 4-128 search-space for 1k1k and 8k1k"
- "Align server flags with FP4 B200 STP: --enable-symm-mem, --expert-parallel-size, dynamic scheduler-recv-interval"
- "Add MTP flags: SGLANG_ENABLE_SPEC_V2=1, EAGLE speculative decoding (steps=3, topk=1, draft=4)"
🟡 The new perf-changelog.yaml entry for qwen3.5-fp4-b200-sglang-mtp uses the literal placeholder https://github.com/SemiAnalysisAI/InferenceX/pull/XXX for its pr-link. Every other entry in this file points to a real PR — please replace XXX with this PR's number (1257) before merge so the changelog stays traceable.
Extended reasoning...
What the bug is
The new entry appended to perf-changelog.yaml (the last block in the file, covering qwen3.5-fp4-b200-sglang-mtp) ends with:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```

XXX is a literal placeholder, not a substituted value. The actual PR number is 1257, as shown in the PR metadata.
Why existing entries don't have this issue
Every other entry in perf-changelog.yaml (well over 140 of them) uses a real PR number — e.g. the immediately preceding entry uses pull/1027, and the others span pull/95 through pull/1223. The placeholder XXX is unique to this newly added block and is clearly a stub the author forgot to fill in before pushing.
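A quick audit for stray placeholders can be sketched as below; the sample file is a hypothetical stand-in for the real perf-changelog.yaml:

```shell
# Find pr-link lines whose path segment after pull/ is not a number.
sample=$(mktemp)
printf '%s\n' \
  'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1027' \
  'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX' > "$sample"
grep -n 'pull/[^0-9]' "$sample"
# → 2:pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```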
How it manifests / impact
This is a documentation/metadata defect, not a runtime bug. The benchmark scripts and nvidia-master.yaml config changes work regardless of what is written in perf-changelog.yaml. However, this file is the project's authoritative log mapping config-key changes to the PRs that introduced them; any tooling, reviewer, or future bisecting effort that follows the pr-link for this entry will hit GitHub's 404 page for /pull/XXX (since XXX is not a valid PR number) instead of landing on PR #1257.
Step-by-step proof
1. Open perf-changelog.yaml and scroll to the bottom — the new entry added by this PR is the last block.
2. The block's config-keys lists qwen3.5-fp4-b200-sglang-mtp (the new key being introduced in .github/configs/nvidia-master.yaml in this same PR).
3. Its pr-link field reads https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
4. The PR metadata shows this PR is number 1257, so the link should read https://github.com/SemiAnalysisAI/InferenceX/pull/1257.
5. Compare against the entry directly above it (for qwen3.5-fp8-b200-sglang), which correctly resolves to pull/1027.
How to fix
Replace XXX with 1257 in the new entry, e.g.:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1257
```

This is the only change needed; nothing else in the file or the rest of the diff needs to be touched. Severity is nit because it doesn't affect benchmark execution, but it should be fixed before merge to maintain the file's traceability invariant.
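The replacement itself can be scripted; this sketch operates on a hypothetical sample file rather than the real perf-changelog.yaml:

```shell
# Substitute the literal XXX placeholder with the PR number 1257.
entry=$(mktemp)
echo 'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX' > "$entry"
sed -i 's#pull/XXX$#pull/1257#' "$entry"
cat "$entry"   # → pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1257
```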
|
@claude - rebase and reverse conflicts |
|
Claude finished @hshrivastava-droid's task in 3m 34s —— View job Rebase and Resolve Conflicts
Rebased successfully onto origin/main. Resolved conflicts in:
Branch is now 0 commits behind main. |
Force-pushed 9d2a2cc to 40aab28
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25234242656 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25237446161 |
|
@Klaud-Cold add PR description |
|
sglang recipe - sgl-project/sgl-cookbook#266 |
|
Claude finished @hshrivastava-droid's task in 48s —— View job Adding PR Description
PR description has been updated with a summary of all changes: image bump, search space expansion (TP=2 sweep + TP=4 pinned to conc=4), server flag overhaul (SPEC_V2, symm-mem, reduced prefill/chunked, dynamic scheduler interval), and perf-changelog entry. |
Summary

Update Qwen3.5 FP4 B200 SGLang MTP benchmark configuration and server flags for improved performance.

Changes

Image Update
- `nightly-dev-20260402-d7256eb6` → `nightly-dev-20260422-de962f32`

Search Space Expansion (`nvidia-master.yaml`)
- Add tp:2 ep:1 conc 4-128 search-space for 1k1k and 8k1k

Server Flag Overhaul (`qwen3.5_fp4_b200_mtp.sh`)
- Add `SGLANG_ENABLE_SPEC_V2=1` for v2 speculative decoding path
- Add `--enable-symm-mem` and `--expert-parallel-size=$EP_SIZE`
- Dynamic `--scheduler-recv-interval` based on concurrency (10 if ≤4, 30 if >4)
- Reduce `--max-prefill-tokens` and `--chunked-prefill-size` from 32768 → 16384
- Reduce `--mem-fraction-static` from 0.85 → 0.8
- Set `--max-running-requests` and `--cuda-graph-max-bs` to `$CONC` (was hardcoded 128 / `$CONC`)
- Add `--tokenizer-path $MODEL` explicitly
- Increase `--stream-interval` from 30 → 50
- Remove env vars (`NCCL_NVLS_ENABLE`, `SGL_ENABLE_JIT_DEEPGEMM`, `SGLANG_ENABLE_FLASHINFER_GEMM`)
- Remove `--fp4-gemm-backend flashinfer_cutlass` (use default)
- Remove `--enable-flashinfer-allreduce-fusion` for TP=8

Changelog
- `perf-changelog.yaml` entry for `qwen3.5-fp4-b200-sglang-mtp` documenting all changes
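Pulling the flag bullets together, a rough sketch of the resulting launch block is below; variable names (`$MODEL`, `$TP_SIZE`, `$EP_SIZE`, `$CONC`), flag order, and the exact speculative-decoding flag spellings are assumptions, not the script verbatim.

```shell
# Hedged sketch only — not the actual benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh.
# $MODEL, $TP_SIZE, $EP_SIZE, $CONC are assumed to be exported by the benchmark harness.
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path "$MODEL" \
  --tokenizer-path "$MODEL" \
  --tp-size "$TP_SIZE" \
  --expert-parallel-size "$EP_SIZE" \
  --enable-symm-mem \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --max-prefill-tokens 16384 \
  --chunked-prefill-size 16384 \
  --mem-fraction-static 0.8 \
  --max-running-requests "$CONC" \
  --cuda-graph-max-bs "$CONC" \
  --stream-interval 50 \
  --scheduler-recv-interval $( [[ $CONC -gt 4 ]] && echo 30 || echo 10 )
```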