Add examples for MoE models - Mixtral in TE#2642
Add examples for MoE models - Mixtral in TE#2642faradawn wants to merge 35 commits intoNVIDIA:mainfrom
Conversation
Greptile SummaryThis PR adds a complete TE Mixtral tutorial with Confidence Score: 3/5Not safe to merge: EXPERIMENT_LOG.md exposes internal NVIDIA cluster details in a public repo, and several critical runtime bugs in te_mixtral.py flagged in earlier review rounds remain unaddressed. The P1 finding (EXPERIMENT_LOG.md with internal paths) pulls the ceiling to 4, but the overall state of te_mixtral.py — which still carries unresolved P1 bugs from prior rounds (wrong DTensor isinstance check in _sync_expert_views, null-pointer crashes in the InferenceParams branch, incorrect padding slot assignment) — warrants a lower score. docs/examples/te_mixtral/EXPERIMENT_LOG.md (internal cluster details must be removed), docs/examples/te_mixtral/te_mixtral.py (multiple unresolved runtime bugs from prior review rounds). Important Files Changed
Sequence DiagramsequenceDiagram
participant Input as Input [N, H]
participant Gate as Router (gate)
participant Dispatch as TokenDispatcher<br/>(AllToAll / FusedDeepEP)
participant GU as GroupedLinear<br/>gate_up [E, 2*F, H]
participant Act as SiLU x up
participant Down as GroupedLinear<br/>down [E, H, F]
participant Combine as Combine<br/>(moe_unpermute + all_to_all)
participant Output as Output [N, H]
Input->>Gate: router_logits [N, num_experts]
Gate->>Gate: softmax -> topk -> normalize
Gate->>Dispatch: hidden_states, selected_experts [N, top_k], routing_weights
Dispatch->>Dispatch: moe_permute(index) + all_to_all
Dispatch->>GU: expert_input [T, H], m_splits
GU->>Act: gate_up_output [T, 2F]
Act->>Down: intermediate [T, F]
Down->>Combine: expert_output [T, H]
Combine->>Combine: moe_unpermute(index) + reverse all_to_all
Combine->>Output: weighted sum [N, H]
Reviews (20): Last reviewed commit: "docs(te_mixtral): add architecture diagr..." | Re-trigger Greptile |
|
Hi @sudhakarsingh27 can you check if this addresses your comments? Tested in 2x H100. |
Agreed! I think any example should show some perf gain and include the whole weight mapping so the user can run the example. |
|
Documentation build is not working, if you fix it please ping me and I'll review. |
d7301a6 to
fcc7e35
Compare
72e0e45 to
226173a
Compare
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
- Fix m_splits bug: use (selected_experts == i).sum() instead of .any(dim=-1).sum() * top_k, which caused dimension mismatches in GroupedLinear - Add te_mixtral.py: TEMixtralSparseMoeBlock (GroupedLinear + moe_permute/ unpermute), replace_moe_block context manager, TEMixtralForCausalLM with HF weight loading, and replace_params for expert weight packing - Add utils.py: HyperParameters, data loading, BF16/FP8 model init, Accelerate wrapping, fine-tuning loop — mirrors te_llama/utils.py style - Add requirements.txt matching te_llama versions - Expand notebook from bare code snippet to full tutorial covering: architecture overview, HF vs TE comparison table, unit-test cell, baseline BF16 run, BF16 TE improvement, FP8 TE improvement, expert routing/scaling discussion, generalisation guide for other MoE models Addresses reviewer feedback: fixes the critical runtime bug (greptile) and expands to tutorial quality comparable to the Llama/Gemma examples (sudhakarsingh27), covering the scope requested in issue NVIDIA#2573. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
for more information, see https://pre-commit.ci Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
A few bugs found during code review:
- moe_permute and moe_unpermute were missing map_type='index'. The
selected_experts tensor is [num_tokens, top_k] indices, not a mask,
so without this the routing is completely wrong at runtime.
- num_out_tokens=None should be -1 (the API expects an int).
- moe_unpermute: replaced deprecated probs= with merging_probs= and
kept routing_weights in float32 as TE recommends.
- utils.py: num_warmup_steps was hardcoded to 100 instead of using
hyperparams.num_warmup_steps, which made the benchmark meaningless.
- requirements.txt: transformers==4.57.0 doesn't exist, fixed to 4.47.1.
- Notebook generalisation guide: updated code template with the same fixes.
Tested on 2xH100 in nvcr.io/nvidia/pytorch:26.01-py3 (PyTorch 2.10.0, CUDA 13.1):
$ python3 -c "
from te_mixtral import TEMixtralSparseMoeBlock
from transformers import MixtralConfig
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
import torch
cfg = MixtralConfig(hidden_size=256, intermediate_size=512,
num_local_experts=4, num_experts_per_tok=2)
x = torch.randn(2, 8, cfg.hidden_size, device='cuda', dtype=torch.bfloat16)
hf_out, hf_logits = MixtralSparseMoeBlock(cfg).cuda().bfloat16()(x)
te_out, te_logits = TEMixtralSparseMoeBlock(cfg).cuda().bfloat16()(x)
assert hf_out.shape == te_out.shape
assert hf_logits.shape == te_logits.shape
print('PASS')"
Input shape : torch.Size([2, 8, 256])
Output shape : torch.Size([2, 8, 256]) (matches HF: True)
Logits shape : torch.Size([16, 4]) (matches HF: True)
Output dtype : torch.bfloat16
PASS
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
The docs CI was failing because nbsphinx auto mode tries to execute notebooks with no stored outputs. Add explicit execute:never metadata so the docs builder renders the notebook as-is without running cells. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
…otebook te_mixtral.py: - Replace per-expert .item() loop with torch.bincount + .tolist() (8 GPU syncs → 1), eliminating the main cause of GroupedLinear speedup regression - Remove unnecessary num_out_tokens/max_token_num args from moe_permute - Fix replace_params to use .copy_() instead of fragile .data[] assignment, and load_state_dict from fully-populated te_state in one shot - Add device_map="auto" support in from_hf_model via accelerate.dispatch_model - Rename replace_params -> _pack_expert_weights for clarity tutorial_accelerate_hf_mixtral_with_te.ipynb: - Remove shape-check section (redundant) - Add FP8 prefill and decode-regime benchmark cells with summary table - Restructure training sections to match te_llama pattern: restart at top of each cell, combined hyperparams + init + wrap + finetune, concise markdown - Bump benchmark SEQ 512 -> 2048 for realistic H100 workload utils.py: - Apply user batch size and sequence length adjustments Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
for more information, see https://pre-commit.ci Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
….10 compat for Unpack Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
- Add padding to next multiple of 32 for MXFP8 in two paths: 1. MoE expert FFN: pad dispatched tokens before GroupedLinear 2. THD attention: pad unpacked hidden states and adjust cu_seqlens - Update notebook to use batch=8, seq=256 with measured results: Tier 1 (HF BF16): 603ms, Tier 2 (NCCL EP): 365ms, Tier 3 (DeepEP): 357ms, Tier 4 (MXFP8): 349ms - Add HANDOFF.md with environment setup, run commands, and known issues Made-with: Cursor Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
- Update notebook to use batch=8/seq=256 with correct measured results - Add note recommending Megatron-Core for 4K+ sequence lengths - Add presentation.html with 13-slide technical overview - Update HANDOFF.md with latest benchmark tables Signed-off-by: Faradawn Yang <faradawny@nvidia.com> Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com> Made-with: Cursor Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
for more information, see https://pre-commit.ci Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
- nbsphinx: skip executing the Mixtral tutorial during docs build; cells require torch + 8 GPUs which the docs runner does not have. - te_mixtral.py: import Unpack from typing_extensions unconditionally so vermin does not flag Python 3.11 usage under the repo's 3.10 target. - te_mixtral.py: derive padded_seq_len from hidden_states, not input_ids, since input_ids is None when the caller passes inputs_embeds directly. - Tutorial notebook: align code-cell batch-size comments with the Python below, add explicit 'run via torchrun, not Jupyter' notes to Tier 2/3/4 cells, and add a References section pointing to the BioNeMo Mixtral recipes. - Remove scratch files presentation.html and example_xy_algorithm.py. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
c3a50c1 to
590814a
Compare
Sphinx was run with -W (warnings as errors) and flagged the Mixtral notebook as "document isn't included in any toctree", which failed the docs Build job. Add the notebook alongside the existing Llama and Gemma tutorials in the "Examples and Tutorials" toctree in docs/index.rst. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
|
Want your agent to iterate on Greptile's feedback? Try greploops. |
Accelerate validates log_with at construction time and raises "log_with=wandb requires wandb to be installed" when wandb is absent. The tutorial never calls accelerator.log() or init_trackers(), so the tracker was dead wiring. Drop it instead of adding wandb as a new requirement a tutorial user would never exercise. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
…ming te_mixtral.py: the InferenceParams branch dereferenced input_ids.shape, which crashes when the caller supplies inputs_embeds directly (the XOR guard at the top of forward allows that path). Fall back to inputs_embeds as the shape reference and compare shape[:2] so the check works for both 2D input_ids and 3D inputs_embeds. utils.py: replace the single-pair CUDA event around the whole training loop with per-step events. Each step's elapsed time is printed live from rank 0 as "[step N/total] XXXX ms", and the final summary reports a steady-state estimate (median of the last 5 steps) instead of the old mean over a 3-step burst. The old 3-sample mean was sensitive to first- step transients (cuBLAS heuristic warmup, allocator growth, NCCL plan caching) and produced a non-monotonic Tier-2 batch-size curve. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
|
Hi @pggPL I've fixed the doc build and refined the content - can you help check? |
|
I was not reviewing it in detail, I would like to discuss on high-level flow of the tutorial first.
Apart of that it would be nice to prepare documentation (not tutorial) explaining why grouped linear is needed and which permute kernels we provide. I will work on that in different PR and it will land in Features section in our docs. |
Pin transformers, accelerate, datasets, safetensors, huggingface_hub, and tokenizers in requirements.txt to the exact versions validated on 8x B300 inside NGC pytorch-25.12-py3. torch, torchao, and transformer_engine are left unpinned because they come from the NGC container and pinning them would tie the tutorial to a specific container build. Add a setup-cell note in the tutorial notebook calling out the tested GPU (8x Blackwell B300) and container (pytorch-25.12-py3). Addresses pggPL review comments on requirements.txt: - "pin the versions of some of the frameworks" - "specify GPU and maybe container to make sure the users can rerun it" Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
| if isinstance(past_key_values, InferenceParams): | ||
| # input_ids is None when the caller supplies inputs_embeds directly. | ||
| _ref = input_ids if input_ids is not None else inputs_embeds | ||
| lengths = ( | ||
| attention_mask.sum(dim=1).tolist() | ||
| if attention_mask.shape[:2] == _ref.shape[:2] | ||
| else [1] * _ref.shape[0] | ||
| ) | ||
| past_key_values.pre_step(OrderedDict(zip(list(range(len(lengths))), lengths))) |
There was a problem hiding this comment.
attention_mask is None crash inside InferenceParams branch
attention_mask is an optional parameter with no None-guard before line 852. In the decode step of model.generate() the mask is frequently omitted, and attention_mask.shape[:2] will raise AttributeError: 'NoneType' object has no attribute 'shape', crashing every auto-regressive generation call.
if isinstance(past_key_values, InferenceParams):
_ref = input_ids if input_ids is not None else inputs_embeds
lengths = (
attention_mask.sum(dim=1).tolist()
if attention_mask is not None and attention_mask.shape[:2] == _ref.shape[:2]
else [1] * _ref.shape[0]
)
past_key_values.pre_step(OrderedDict(zip(list(range(len(lengths))), lengths)))Adds a configurable expert FFN execution mode to NVMixtralSparseMoeBlock:
- "grouped" (default): existing TE GroupedLinear path
- "loop": Python loop over experts with one F.linear per expert
(HF-style), running against the same stacked weight tensor
so swapping between modes is a runtime config flag, no
checkpoint rework
Adds a new "Tier 2" to the tutorial's progressive optimization chain that
isolates EP+TE primitives without GroupedLinear, so the GroupedLinear
contribution can be measured independently. Renumbers run_finetune_ep.py:
1 = HF baseline BF16 (device_map="auto")
2 = TE EP BF16 -- Python loop, F.linear per expert [new]
3 = TE EP BF16 -- GroupedLinear (was Tier 2)
4 = TE EP BF16 -- GroupedLinear + Fused DeepEP (was Tier 3)
5 = TE EP FP8 -- Float8CurrentScaling + grouped + DeepEP
Tier 5 switches the FP8 recipe from MXFP8BlockScaling to
Float8CurrentScaling -- the per-tensor scaling avoids the 32-aligned
tile constraint that MXFP8 imposes on grouped-matmul, so Tier 5 runs
cleanly at EP < num_experts where MXFP8 trips the per-expert padding
assertion.
Also threads the configuration through the harness:
- HyperParameters.expert_ffn_mode
- --ep-size default changed from 8 to 2 (4 experts/rank gives a
non-degenerate loop-vs-grouped contrast; pass --ep-size 8 for the
original 1-expert/rank config)
- Allow WORLD_SIZE > EP via a 2D (DP, EP) DeviceMesh so EP=2 with 8
GPUs (DP=4) works end-to-end
- replace_params: use EP-rank-within-group (rank % ep_size) instead of
global rank for the expert slicing -- otherwise ranks beyond ep_size
slice past the end of the HF stacked weight tensor
Misc: cycle the dataloader (small benchmark datasets like
openassistant-guanaco exhaust at large batch * num_steps), and trim the
fine-tune summary output to median + last-step (was mean + median +
min/max/p99).
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
for more information, see https://pre-commit.ci
| """ | ||
| gate_up_w = self.experts_gate_up_weight | ||
| if isinstance(gate_up_w, DTensor): | ||
| gate_up_w = gate_up_w.to_local() | ||
| for i in range(self.num_local_experts): | ||
| object.__setattr__(self.experts_gate_up, f"weight{i}", gate_up_w[i]) | ||
|
|
||
| down_w = self.experts_down_weight | ||
| if isinstance(down_w, DTensor): | ||
| down_w = down_w.to_local() |
There was a problem hiding this comment.
Wrong DTensor check — GroupedLinear receives DTensor views under EP
self.experts_gate_up_weight is an nn.Parameter. After set_ep_group() wraps it with nn.Parameter(DTensor.from_local(...)), the parameter's type is still nn.Parameter (not DTensor), so isinstance(gate_up_w, DTensor) is always False. The guard never fires, to_local() is never called, and every subsequent gate_up_w[i] slice handed to object.__setattr__ is a DTensor. TE's GroupedLinear CUDA kernels cannot accept DTensor views and will either raise a type error or silently use incorrect data.
The _expert_ffn loop path already applies the correct pattern with .data:
if isinstance(gate_up_w.data, DTensor):
gate_up_w = gate_up_w.data.to_local()| """ | |
| gate_up_w = self.experts_gate_up_weight | |
| if isinstance(gate_up_w, DTensor): | |
| gate_up_w = gate_up_w.to_local() | |
| for i in range(self.num_local_experts): | |
| object.__setattr__(self.experts_gate_up, f"weight{i}", gate_up_w[i]) | |
| down_w = self.experts_down_weight | |
| if isinstance(down_w, DTensor): | |
| down_w = down_w.to_local() | |
| gate_up_w = self.experts_gate_up_weight | |
| if isinstance(gate_up_w.data, DTensor): | |
| gate_up_w = gate_up_w.data.to_local() | |
| for i in range(self.num_local_experts): | |
| object.__setattr__(self.experts_gate_up, f"weight{i}", gate_up_w[i]) | |
| down_w = self.experts_down_weight | |
| if isinstance(down_w.data, DTensor): | |
| down_w = down_w.data.to_local() | |
| for i in range(self.num_local_experts): | |
| object.__setattr__(self.experts_down, f"weight{i}", down_w[i]) |
Adds EXPERIMENT_LOG.md with the GroupedLinear vs naive-loop investigation
(EP=2 batch sweep, batch=8 nsys + NVTX per-phase breakdown) and the
my_expert_ffn_{loop|grouped} / my_grouped_linear_{gate_up,down} NVTX
ranges that produced the per-phase numbers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
for more information, see https://pre-commit.ci
| remainder = orig_num_tokens % 32 | ||
| if remainder != 0: | ||
| pad_count = 32 - remainder | ||
| expert_input = torch.nn.functional.pad(expert_input, (0, 0, 0, pad_count)) | ||
| tokens_per_expert = list(tokens_per_expert) | ||
| tokens_per_expert[-1] += pad_count |
There was a problem hiding this comment.
Padding appended to the last expert's slot regardless of which expert has remaining tokens
When orig_num_tokens % 32 != 0, pad_count tokens are appended to expert_input and tokens_per_expert[-1] is incremented by pad_count. This is only correct if the padded tokens happen to logically belong to the last expert. If the last expert already has 0 tokens on this rank (a routine EP scenario), GroupedLinear will schedule a GEMM for pad_count zero-padded tokens against that expert's weights, inflating compute and potentially triggering a misaligned-size error rather than a clean no-op. A safer approach is to distribute the padding tokens as a trailing dummy entry (tokens_per_expert.append(pad_count)) or to pad at batch-preparation time before dispatch so padding is always aligned with a real expert slot.
|
Will wait on this PR: #2923 |
…p_proj/down_proj naming Adds three drawio + SVG diagram pairs under docs/examples/te_mixtral/media/: - dense_to_sparse: dense Transformer block vs sparse MoE block (router + 8 experts). - mixtral_decoder_swap: Llama-tutorial-style API key->key map from HF MixtralDecoderLayer to TE NVMixtralDecoderLayer (fused QKV, packed expert tensors). - moe_loop_vs_grouped: zoom into the MoE block; HF per-expert loop vs TE GroupedLinear with variable per-expert token counts (m_splits). Adds a new "Architecture Overview" cell at the top of the tutorial notebook with the two main figures, and renames the MoE weight references throughout the notebook prose and tables from the older w1/w2/w3 form to the newer packed mlp.experts.gate_up_proj / mlp.experts.down_proj keys (the form used by BioNemo's bionemo-recipes/models/mixtral/ convert.py and accepted by replace_params() in te_mixtral.py). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| `/lustre/share/coreai_prod_infbench/exp3_nvtx_per_phase/`. | ||
|
|
||
| --- | ||
|
|
||
| All experiments run on a **SLURM cluster**. The login node has no CUDA, so any | ||
| experiment must execute inside an allocation on a compute node with 8 GPUs. | ||
|
|
||
| ### Reserving a fresh node (8x GPU, 4h, exclusive) | ||
|
|
||
| ```bash | ||
| srun -A coreai_prod_infbench \ | ||
| -p batch \ | ||
| -N 1 \ | ||
| -J myjob \ | ||
| -t 04:00:00 \ | ||
| --exclusive \ | ||
| --mpi=pmix \ | ||
| --container-image=/lustre/fsw/coreai_prod_infbench/faradawny/docker/pytorch-25.12-py3.sqsh \ | ||
| --container-save=/lustre/fsw/coreai_prod_infbench/faradawny/docker/pytorch-25.12-py3.sqsh \ | ||
| --container-name=mycontainer \ | ||
| --container-mounts=/lustre:/lustre \ | ||
| --pty bash | ||
| ``` | ||
|
|
||
| This drops you into an interactive shell inside the NGC `pytorch-25.12-py3` | ||
| container with `/lustre` mounted and the container saved as `mycontainer`. |
There was a problem hiding this comment.
Internal cluster details in a public repository
EXPERIMENT_LOG.md is a personal working-notes file that contains NVIDIA-internal infrastructure details that should not be committed to a public GitHub repo:
- A developer username embedded in paths (
faradawny):/lustre/fsw/coreai_prod_infbench/faradawny/docker/pytorch-25.12-py3.sqsh - An internal SLURM account:
coreai_prod_infbench - Internal shared storage paths:
/lustre/share/coreai_prod_infbench/exp3_nvtx_per_phase/
This file looks like a benchmark log intended for internal coordination, not documentation for external tutorial users. Consider removing it entirely or replacing it with a trimmed BENCHMARKS.md that shows only public results (median step times, speedup numbers) without cluster-specific details.
Summary
This PR adds a complete tutorial for integrating HuggingFace Mixtral (MoE) with Transformer Engine, addressing the gap identified in #2573.
What's included
te_mixtral.py— Drop-inTEMixtralSparseMoeBlockthat replaces HF's loop-over-experts with TE'sGroupedLinear(batched GEMM) +moe_permute/moe_unpermute. Includesreplace_moe_blockcontext manager,TEMixtralForCausalLMwith HF weight loading, andreplace_paramsfor expert weight packing.utils.py— Data loading, BF16/FP8 model init, Accelerate wrapping, fine-tuning loop — mirrorste_llama/utils.pystyle.requirements.txt— Pinned dependencies matching the Llama/Gemma tutorials.te_llamaandte_gemma, covering:Bug fix
Corrected the
m_splitscalculation flagged by the automated review:Scope
Covers all topics requested in #2573: