[NV] qwen3.5 fp4 b200 sglang mtp #1257

Open
hshrivastava-droid wants to merge 3 commits into main from nv/qwen3.5-fp4-b200-sglang-mtp

Conversation

@hshrivastava-droid
Collaborator

@hshrivastava-droid hshrivastava-droid commented May 1, 2026

Summary

Update Qwen3.5 FP4 B200 SGLang MTP benchmark configuration and server flags for improved performance.

Changes

Image Update

  • Bump SGLang image from nightly-dev-20260402-d7256eb6 → nightly-dev-20260422-de962f32

Search Space Expansion (nvidia-master.yaml)

  • Add TP=2 EP=1 sweep (conc 4–128) alongside existing TP=4 for both 1k1k and 8k1k sequence lengths
  • Pin TP=4 to conc=4 only (was conc 4–128)
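A minimal sketch of how a conc 4–128 range expands into sweep points, assuming the sweep doubles at each step (the review discussion on this PR lists 4, 8, 16, 32, 64, 128). `conc_sweep` is a hypothetical helper for illustration, not something defined in nvidia-master.yaml or the benchmark tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand a conc-start/conc-end pair into doubling sweep points.
# The doubling step is an assumption based on the points listed in review.
conc_sweep() {
  local conc="$1" end="$2"
  # Guard against a non-positive start, which would never reach the end value.
  (( conc > 0 )) || return 1
  while (( conc <= end )); do
    echo "$conc"
    (( conc *= 2 ))
  done
}

conc_sweep 4 128   # TP=2 sweep
conc_sweep 4 4     # TP=4 pinned to a single point
```

Under that assumption, the TP=2 entry covers six concurrency points while the pinned TP=4 entry covers one.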

Server Flag Overhaul (qwen3.5_fp4_b200_mtp.sh)

  • Enable SGLANG_ENABLE_SPEC_V2=1 for v2 speculative decoding path
  • Add --enable-symm-mem and --expert-parallel-size=$EP_SIZE
  • Switch to dynamic --scheduler-recv-interval based on concurrency (10 if ≤4, 30 if >4)
  • Reduce --max-prefill-tokens and --chunked-prefill-size from 32768 → 16384
  • Reduce --mem-fraction-static from 0.85 → 0.8
  • Set --max-running-requests and --cuda-graph-max-bs to $CONC (was hardcoded 128 / $CONC)
  • Add --tokenizer-path $MODEL explicitly
  • Increase --stream-interval from 30 → 50
  • Remove hardcoded env vars (NCCL_NVLS_ENABLE, SGL_ENABLE_JIT_DEEPGEMM, SGLANG_ENABLE_FLASHINFER_GEMM)
  • Remove --fp4-gemm-backend flashinfer_cutlass (use default)
  • Remove conditional --enable-flashinfer-allreduce-fusion for TP=8
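For orientation, the flag changes listed above could be assembled roughly as follows. This is a sketch under assumptions: `build_server_args` is a hypothetical helper, the flag order is arbitrary, and the real script likely passes additional flags (for example the speculative decoding options) that are not reproduced here.

```shell
#!/usr/bin/env bash
# Sketch only: prints the flags named in this PR description, one token per line.
# build_server_args and its argument order are illustrative, not from the script.
build_server_args() {
  local model="$1" conc="$2" ep_size="$3"
  local recv_interval

  # 10 when CONC <= 4, otherwise 30
  if (( conc > 4 )); then recv_interval=30; else recv_interval=10; fi

  printf '%s\n' \
    --tokenizer-path "$model" \
    --enable-symm-mem \
    --expert-parallel-size="$ep_size" \
    --scheduler-recv-interval "$recv_interval" \
    --max-prefill-tokens 16384 \
    --chunked-prefill-size 16384 \
    --mem-fraction-static 0.8 \
    --max-running-requests "$conc" \
    --cuda-graph-max-bs "$conc" \
    --stream-interval 50
}

export SGLANG_ENABLE_SPEC_V2=1   # v2 speculative decoding path
build_server_args "/path/to/model" 8 1   # placeholder model path
```

Printing the arguments instead of launching keeps the sketch runnable without a GPU; the benchmark script itself passes them to the SGLang server.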

Changelog

  • Add perf-changelog.yaml entry for qwen3.5-fp4-b200-sglang-mtp documenting all changes

@hshrivastava-droid
Collaborator Author

@claude - add PR description

@Klaud-Cold
Collaborator

Klaud-Cold commented May 1, 2026

Claude finished @hshrivastava-droid's task in 49s


Adding PR Description

  • Gather context from changed files
  • Draft PR description
  • Update PR description

PR description has been added summarizing all changes: image bump, search space expansion (TP=2 sweep), server flag overhaul, and perf-changelog entry.

Comment thread benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh Outdated
--chunked-prefill-size 16384 \
--mem-fraction-static 0.8 \
--stream-interval 50 \
--scheduler-recv-interval $( [[ $CONC -gt 4 ]] && echo 30 || echo 10 ) \
Contributor

🟡 The new scheduler-recv-interval uses $( [[ $CONC -gt 4 ]] && echo 30 || echo 10 ), so any CONC>=5 (e.g. CONC=8 in the new TP=2 sweep) gets interval=30. The perf-changelog says this PR aims to "Align server flags with FP4 B200 STP", but the FP4 B200 STP companion (qwen3.5_fp4_b200.sh:32) and every other qwen3.5 *_mtp.sh sibling still use CONC -ge 16 for the 30/10 cutoff. Could you confirm that CONC>4 is intentional (matching qwen3.5_fp8_b200.sh), or change it to -ge 16 so it actually matches FP4 B200 STP?

Extended reasoning...

What's happening

benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh:52 sets:

--scheduler-recv-interval $( [[ $CONC -gt 4 ]] && echo 30 || echo 10 )

So the threshold for switching to interval=30 is CONC >= 5. The deleted code in the same file (and every other qwen3.5 *_mtp.sh and the FP4 B200 STP companion) used:

if [[ $CONC -ge 16 ]]; then SCHEDULER_RECV_INTERVAL=30; else SCHEDULER_RECV_INTERVAL=10; fi

i.e. threshold CONC >= 16.

Why this is worth flagging

The perf-changelog entry for this PR says: "Align server flags with FP4 B200 STP: --enable-symm-mem, --expert-parallel-size, dynamic scheduler-recv-interval". But the named reference benchmarks/single_node/qwen3.5_fp4_b200.sh:32 (the FP4 B200 STP script) still uses CONC -ge 16. So the chosen threshold does not match what the PR claims to align with.

What it does match

The new threshold matches benchmarks/single_node/qwen3.5_fp8_b200.sh:51, which was updated in PR #1027 to the CONC -gt 4 pattern. The full launch block in the new MTP script is in fact much closer to qwen3.5_fp8_b200.sh than to qwen3.5_fp4_b200.sh (same --enable-symm-mem, same --max-prefill-tokens 16384, same --stream-interval 50, same --mem-fraction-static 0.8). So the most likely scenario is that the author copy-pasted the launch block from qwen3.5_fp8_b200.sh, not from qwen3.5_fp4_b200.sh.

Concrete impact

In the new TP=2 search space conc-start: 4, conc-end: 128, the swept concurrencies that exist in both this MTP script and the FP4 STP companion are 4, 8, 16, 32, 64, 128. At CONC=8:

  • This MTP script: scheduler-recv-interval = 30
  • FP4 B200 STP (qwen3.5_fp4_b200.sh): scheduler-recv-interval = 10

That's a 3x difference in scheduler-recv-interval at one swept point. CONC>=16 yields 30 under both rules and CONC=4 yields 10 under both, so CONC=8 is the only standard sweep point that diverges. The measurement effect is small but real.
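The comparison can be reproduced with a short sketch. The function names below are hypothetical; each one only restates one of the two threshold rules quoted in this comment, and the sweep points are the six listed above.

```shell
#!/usr/bin/env bash
# Rule in this MTP script (and qwen3.5_fp8_b200.sh): 30 when CONC > 4.
interval_gt4() { if (( $1 > 4 )); then echo 30; else echo 10; fi; }

# Rule in qwen3.5_fp4_b200.sh: 30 when CONC >= 16.
interval_ge16() { if (( $1 >= 16 )); then echo 30; else echo 10; fi; }

# Print every sweep point where the two rules pick different intervals.
diverging_points() {
  local conc
  for conc in "$@"; do
    if [[ "$(interval_gt4 "$conc")" != "$(interval_ge16 "$conc")" ]]; then
      echo "$conc"
    fi
  done
}

diverging_points 4 8 16 32 64 128   # prints 8
```

Any concurrency from 5 through 15 would diverge as well; 8 is simply the only such value among the standard sweep points.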

Why I'm filing as nit, not normal

  • It only affects perf data at CONC=8 — this is a tuning knob, not a correctness bug.
  • It's plausibly intentional: the launch block mirrors the recently-updated qwen3.5_fp8_b200.sh, so the author may have deliberately picked the FP8-B200 pattern.

I'm flagging it because the perf-changelog entry explicitly names FP4 B200 STP as the alignment target, and the chosen threshold does not match that target. Easiest fix: either change to -ge 16 to truly mirror qwen3.5_fp4_b200.sh, or update the perf-changelog to say "align with FP8 B200 STP / FP8 B200 SGLang" instead.

Comment thread perf-changelog.yaml
- "Update image to lmsysorg/sglang:nightly-dev-20260422-de962f32"
- "Add tp:2 ep:1 conc 4-128 search-space for 1k1k and 8k1k"
- "Align server flags with FP4 B200 STP: --enable-symm-mem, --expert-parallel-size, dynamic scheduler-recv-interval"
- "Add MTP flags: SGLANG_ENABLE_SPEC_V2=1, EAGLE speculative decoding (steps=3, topk=1, draft=4)"
Contributor

🟡 The new perf-changelog.yaml entry for qwen3.5-fp4-b200-sglang-mtp uses the literal placeholder https://github.com/SemiAnalysisAI/InferenceX/pull/XXX for its pr-link. Every other entry in this file points to a real PR — please replace XXX with this PR's number (1257) before merge so the changelog stays traceable.

Extended reasoning...

What the bug is

The new entry appended to perf-changelog.yaml (the last block in the file, covering qwen3.5-fp4-b200-sglang-mtp) ends with:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

XXX is a literal placeholder, not a substituted value. The actual PR number is 1257, as shown in the PR metadata.

Why existing entries don't have this issue

Every other entry in perf-changelog.yaml (well over 140 of them) uses a real PR number — e.g. the immediately preceding entry uses pull/1027, and the others span pull/95 through pull/1223. The placeholder XXX is unique to this newly added block and is clearly a stub the author forgot to fill in before pushing.

How it manifests / impact

This is a documentation/metadata defect, not a runtime bug. The benchmark scripts and nvidia-master.yaml config changes work regardless of what is written in perf-changelog.yaml. However, this file is the project's authoritative log mapping config-key changes to the PRs that introduced them; any tooling, reviewer, or future bisecting effort that follows the pr-link for this entry will hit GitHub's 404 page for /pull/XXX (since XXX is not a valid PR number) instead of landing on PR #1257.

Step-by-step proof

  1. Open perf-changelog.yaml and scroll to the bottom — the new entry added by this PR is the last block.
  2. The block's config-keys lists qwen3.5-fp4-b200-sglang-mtp (the new key being introduced in .github/configs/nvidia-master.yaml in this same PR).
  3. Its pr-link field reads https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
  4. The PR metadata shows this PR is number 1257, so the link should read https://github.com/SemiAnalysisAI/InferenceX/pull/1257.
  5. Compare against the entry directly above it (for qwen3.5-fp8-b200-sglang), which correctly resolves to pull/1027.

How to fix

Replace XXX with 1257 in the new entry, e.g.:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1257

This is the only change needed; nothing else in the file or the rest of the diff needs to be touched. Severity is nit because it doesn't affect benchmark execution, but it should be fixed before merge to maintain the file's traceability invariant.
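To catch this class of stub before merge, a check along these lines could be run locally or in CI. It is a sketch, not existing project tooling: `find_placeholder_pr_links` is a hypothetical helper, and it assumes each `pr-link:` value sits on a single line as in the entry quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print pr-link lines whose URL does not end in a numeric PR id.
# Exits 0 when at least one such line is found, 1 when every link looks valid.
find_placeholder_pr_links() {
  local file="$1"
  grep -nE '^[[:space:]]*pr-link:' "$file" \
    | grep -vE '/pull/[0-9]+[[:space:]]*$'
}

# Example usage (assumes perf-changelog.yaml is in the current directory):
#   if find_placeholder_pr_links perf-changelog.yaml; then
#     echo "placeholder pr-link found" >&2
#     exit 1
#   fi
```

The check is purely textual and does not verify that the PR number exists, only that it is numeric.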

@hshrivastava-droid
Collaborator Author

hshrivastava-droid commented May 1, 2026

@claude - rebase and resolve conflicts

@Klaud-Cold
Collaborator

Klaud-Cold commented May 1, 2026

Claude finished @hshrivastava-droid's task in 3m 34s


Rebase and Resolve Conflicts

  • Fetch latest origin/main (branch was 2 ahead, 12 behind)
  • Rebase branch onto origin/main
  • Resolve conflicts (2 files)
  • Push rebased branch

Rebased successfully onto origin/main. Resolved conflicts in:

  • .github/configs/nvidia-master.yaml — adopted the new scenarios: fixed-seq-len: format from main while keeping this PR's search-space changes (TP=4 pinned to conc=4, TP=2 sweep conc 4–128)
  • perf-changelog.yaml — kept all new entries from main (glm5-fp8-mi355x-atom, dsv4-fp4-gb200-dynamo-vllm-mtp2, glm5-fp8-mi355x-sglang-mtp, qwen3.5-fp4-b200-sglang) and appended this PR's qwen3.5-fp4-b200-sglang-mtp entry with the corrected PR link (#1257)

Branch is now 0 commits behind main.

Klaud-Cold force-pushed the nv/qwen3.5-fp4-b200-sglang-mtp branch from 9d2a2cc to 40aab28 on May 1, 2026 21:36
Comment thread benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh
github-actions Bot commented May 1, 2026

github-actions Bot commented May 2, 2026

2 similar comments

github-actions Bot commented May 4, 2026

github-actions Bot commented May 4, 2026

@hshrivastava-droid
Collaborator Author

@Klaud-Cold add PR description

@hshrivastava-droid
Collaborator Author

SGLang recipe: sgl-project/sgl-cookbook#266

hshrivastava-droid changed the title from "[WIP] [NV] qwen3.5 fp4 b200 sglang mtp" to "[NV] qwen3.5 fp4 b200 sglang mtp" on May 4, 2026
@Klaud-Cold
Collaborator

Klaud-Cold commented May 4, 2026

Claude finished @hshrivastava-droid's task in 48s


Adding PR Description

  • Gather context from changed files
  • Draft PR description
  • Update PR description

PR description has been updated with a summary of all changes: image bump, search space expansion (TP=2 sweep + TP=4 pinned to conc=4), server flag overhaul (SPEC_V2, symm-mem, reduced prefill/chunked, dynamic scheduler interval), and perf-changelog entry.

github-actions Bot commented May 5, 2026

github-actions Bot commented May 5, 2026

4 participants