Add examples for MoE models - Mixtral in TE #2642

Open

faradawn wants to merge 35 commits into NVIDIA:main from faradawn:add-moe-example
Conversation

@faradawn (Contributor) commented Feb 2, 2026

Summary

This PR adds a complete tutorial for integrating HuggingFace Mixtral (MoE) with Transformer Engine, addressing the gap identified in #2573.

What's included

  • te_mixtral.py — Drop-in TEMixtralSparseMoeBlock that replaces HF's loop-over-experts with TE's GroupedLinear (batched GEMM) + moe_permute/moe_unpermute (see the sketch after this list). Includes a replace_moe_block context manager, TEMixtralForCausalLM with HF weight loading, and replace_params for expert weight packing.
  • utils.py — Data loading, BF16/FP8 model init, Accelerate wrapping, fine-tuning loop — mirrors te_llama/utils.py style.
  • requirements.txt — Pinned dependencies matching the Llama/Gemma tutorials.
  • Tutorial notebook — Full tutorial matching the quality bar of te_llama and te_gemma, covering:
    1. Architecture overview: Transformer → Mixtral MoE, HF bottleneck, TE approach
    2. Unit-test cell verifying output shape/dtype against the HF block
    3. [Baseline] HF Mixtral in BF16
    4. [Improvement 1] TE GroupedLinear MoE in BF16
    5. [Improvement 2] TE GroupedLinear MoE in FP8
    6. Expert routing considerations with mixed precision (m_splits, per-expert FP8 scaling, aux loss passthrough)
    7. Generalisation guide for other MoE architectures (DeepSeek, Grok-1, etc.)
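For orientation, here is a minimal sketch of that permute, grouped-GEMM, unpermute path from the te_mixtral.py bullet above. It is illustrative rather than the shipped code: the exact moe_permute/moe_unpermute signatures and return values are assumed from the kwargs settled on later in this PR (map_type="index", merging_probs=...), and the function name and arguments are hypothetical.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.pytorch.permutation import moe_permute, moe_unpermute

def moe_forward_sketch(hidden, gate, gate_up, down, num_experts, top_k):
    # hidden: [num_tokens, H]; gate: nn.Linear router;
    # gate_up / down: te.GroupedLinear modules with num_gemms=num_experts.
    probs = torch.softmax(gate(hidden), dim=-1, dtype=torch.float32)   # [N, E]
    routing_weights, selected = torch.topk(probs, top_k)               # [N, top_k]
    routing_weights = routing_weights / routing_weights.sum(-1, keepdim=True)

    # Sort tokens so each expert's tokens are contiguous (index map, not mask).
    permuted, row_id_map = moe_permute(
        hidden, selected.to(torch.int32), map_type="index"
    )
    # Per-expert token counts for the batched GEMM; a single .tolist() sync.
    m_splits = torch.bincount(selected.flatten(), minlength=num_experts).tolist()

    # One GroupedLinear call replaces HF's Python loop over experts.
    h = gate_up(permuted, m_splits)                                    # [T, 2F]
    g, u = h.chunk(2, dim=-1)
    h = down(torch.nn.functional.silu(g) * u, m_splits)                # [T, H]

    # Restore original token order and apply routing weights in one fused step.
    return moe_unpermute(
        h, row_id_map, merging_probs=routing_weights, map_type="index"
    )
```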

Bug fix

Corrected the m_splits calculation flagged by the automated review:

```python
# Before (wrong): double-counts tokens by reducing with .any() then multiplying by top_k
expert_mask = (selected_experts == expert_idx).any(dim=-1)
m_splits.append(expert_mask.sum().item() * self.top_k)

# After (correct): count the actual number of (token, top_k_slot) pairs per expert
m_splits = [(selected_experts == i).sum().item() for i in range(self.num_experts)]
```
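A later commit in this PR tightens this further: each .item() in the list comprehension is a separate GPU-to-CPU sync (one per expert), and a single torch.bincount + .tolist() collapses them into one. A self-contained sketch of that form:

```python
import torch

num_experts, top_k = 4, 2
selected_experts = torch.randint(0, num_experts, (16, top_k))  # [num_tokens, top_k]

# One kernel and one GPU->CPU sync (.tolist()) instead of num_experts .item() calls.
m_splits = torch.bincount(selected_experts.flatten(), minlength=num_experts).tolist()
assert sum(m_splits) == selected_experts.numel()  # every (token, slot) pair counted
```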

Scope

Covers all topics requested in #2573:

  • How to wrap MoE layers with TE modules ✓
  • FP8 training configuration for MoE ✓
  • Expert routing considerations with mixed precision ✓
  • Generalisation to arbitrary MoE architectures ✓

greptile-apps (Bot) commented Feb 2, 2026

Greptile Summary

This PR adds a complete TE Mixtral tutorial with NVMixtralSparseMoeBlock (TE GroupedLinear + moe_permute/moe_unpermute), an AllToAllTokenDispatcher for expert parallelism, optional DeepEP fused dispatch, FP8 support, and a fine-tuning harness. Several issues flagged in prior rounds have been fixed (class renames, wandb dep, collator shipping), but EXPERIMENT_LOG.md — a personal benchmark log containing NVIDIA-internal cluster paths and a developer username — has been added to the public repo and should be removed or sanitised before merge.

Confidence Score: 3/5

Not safe to merge: EXPERIMENT_LOG.md exposes internal NVIDIA cluster details in a public repo, and several critical runtime bugs in te_mixtral.py flagged in earlier review rounds remain unaddressed.

The P1 finding (EXPERIMENT_LOG.md with internal paths) pulls the ceiling to 4, but the overall state of te_mixtral.py — which still carries unresolved P1 bugs from prior rounds (wrong DTensor isinstance check in _sync_expert_views, null-pointer crashes in the InferenceParams branch, incorrect padding slot assignment) — warrants a lower score.

Flagged files: docs/examples/te_mixtral/EXPERIMENT_LOG.md (internal cluster details must be removed) and docs/examples/te_mixtral/te_mixtral.py (multiple unresolved runtime bugs from prior review rounds).

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/examples/te_mixtral/te_mixtral.py | Core TE Mixtral implementation: NVMixtralSparseMoeBlock, AllToAllTokenDispatcher, weight mapping. Multiple previously flagged P1 bugs remain unresolved (DTensor check in _sync_expert_views, null-pointer crashes in the InferenceParams branch, incorrect token padding slot assignment). |
| docs/examples/te_mixtral/EXPERIMENT_LOG.md | Personal benchmark log containing NVIDIA-internal cluster paths (SLURM account, developer username, Lustre paths). Should not be committed to a public repository. |
| docs/examples/te_mixtral/utils.py | Training utilities: model init, Accelerate wrapping, fine-tuning loop. Previously flagged issues (hardcoded warmup steps, missing wandb dep, broken class-name import) appear to be addressed in the current version. |
| docs/examples/te_mixtral/collator.py | THD sequence-packing collator (previously shipped as bionemo_mixtral, now local). Logic is correct; replaces the previously broken external dependency. |
| docs/examples/te_mixtral/fused_token_router.py | DeepEP-backed TokenDispatcher with correct map_type="mask" for the multihot path and the merging_probs kwarg. |
| docs/examples/te_mixtral/requirements.txt | Pinned HF stack dependencies; wandb (previously missing) is no longer needed since log_with was removed from the Accelerator call. |
| docs/examples/te_mixtral/install_hybridep.sh | DeepEP build script with GPU arch detection and NVSHMEM-disable patch. Contains an internal NVIDIA bug tracker URL (nvbugspro.nvidia.com) in a public-facing comment. |
| docs/examples/te_mixtral/test_accuracy.py | Weight-mapping parity test between HF and TE models. Clean and minimal; correctly filters _extra_state keys from the missing list. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Input as Input [N, H]
    participant Gate as Router (gate)
    participant Dispatch as TokenDispatcher<br/>(AllToAll / FusedDeepEP)
    participant GU as GroupedLinear<br/>gate_up [E, 2*F, H]
    participant Act as SiLU x up
    participant Down as GroupedLinear<br/>down [E, H, F]
    participant Combine as Combine<br/>(moe_unpermute + all_to_all)
    participant Output as Output [N, H]

    Input->>Gate: router_logits [N, num_experts]
    Gate->>Gate: softmax -> topk -> normalize
    Gate->>Dispatch: hidden_states, selected_experts [N, top_k], routing_weights
    Dispatch->>Dispatch: moe_permute(index) + all_to_all
    Dispatch->>GU: expert_input [T, H], m_splits
    GU->>Act: gate_up_output [T, 2F]
    Act->>Down: intermediate [T, F]
    Down->>Combine: expert_output [T, H]
    Combine->>Combine: moe_unpermute(index) + reverse all_to_all
    Combine->>Output: weighted sum [N, H]
```


@greptile-apps (Bot) left a comment

1 file reviewed, 2 comments

Comment thread docs/examples/te_mixtral/tutorial_accelerate_hf_mixtral_with_te.ipynb Outdated
Comment thread docs/examples/te_mixtral/tutorial_accelerate_hf_mixtral_with_te.ipynb Outdated
@sudhakarsingh27 (Collaborator) commented:

Thanks, @faradawn! Also adding @sbhavani to the discussion. Compared to the other llama/gemma tutorials, this one seems quite barebones and looks more like a code example than a tutorial. @sbhavani, do you think in its current form it covers the scope you requested in #2573?

Comment thread docs/examples/te_mixtral/tutorial_accelerate_hf_mixtral_with_te.ipynb Outdated
Comment thread docs/examples/te_mixtral/tutorial_accelerate_hf_mixtral_with_te.ipynb Outdated
Comment thread docs/examples/te_mixtral/te_mixtral.py Outdated
Comment thread docs/examples/te_mixtral/te_mixtral.py Outdated
Comment thread docs/examples/te_mixtral/utils.py
@faradawn (Author) commented Apr 2, 2026:

Hi @sudhakarsingh27, can you check if this addresses your comments? Tested on 2x H100.

@sbhavani (Collaborator) commented Apr 6, 2026:

> Thanks, @faradawn! Also adding @sbhavani to the discussion. Compared to the other llama/gemma tutorials, this one seems quite barebones and looks more like a code example than a tutorial. @sbhavani, do you think in its current form it covers the scope you requested in #2573?

Agreed! I think any example should show some perf gain and include the whole weight mapping so the user can run the example.

@pggPL pggPL self-assigned this Apr 13, 2026
@pggPL (Collaborator) commented Apr 13, 2026:

The documentation build is not working; once you fix it, please ping me and I'll review.

Comment thread docs/examples/te_mixtral/tutorial_accelerate_hf_mixtral_with_te.ipynb Outdated
Comment thread docs/examples/te_mixtral/utils.py Outdated
Comment thread docs/examples/te_mixtral/tutorial_accelerate_hf_mixtral_with_te.ipynb Outdated
Comment thread docs/examples/te_mixtral/utils.py Outdated
Comment thread docs/examples/te_mixtral/HANDOFF.md Outdated
Comment thread docs/examples/te_mixtral/te_mixtral.py Outdated
faradawn and others added 11 commits April 21, 2026 11:29
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
- Fix m_splits bug: use (selected_experts == i).sum() instead of
  .any(dim=-1).sum() * top_k, which caused dimension mismatches in
  GroupedLinear
- Add te_mixtral.py: TEMixtralSparseMoeBlock (GroupedLinear + moe_permute/
  unpermute), replace_moe_block context manager, TEMixtralForCausalLM with
  HF weight loading, and replace_params for expert weight packing
- Add utils.py: HyperParameters, data loading, BF16/FP8 model init,
  Accelerate wrapping, fine-tuning loop — mirrors te_llama/utils.py style
- Add requirements.txt matching te_llama versions
- Expand notebook from bare code snippet to full tutorial covering:
  architecture overview, HF vs TE comparison table, unit-test cell,
  baseline BF16 run, BF16 TE improvement, FP8 TE improvement, expert
  routing/scaling discussion, generalisation guide for other MoE models

Addresses reviewer feedback: fixes the critical runtime bug (greptile)
and expands to tutorial quality comparable to the Llama/Gemma examples
(sudhakarsingh27), covering the scope requested in issue NVIDIA#2573.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
for more information, see https://pre-commit.ci

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
A few bugs found during code review:

- moe_permute and moe_unpermute were missing map_type='index'. The
  selected_experts tensor is [num_tokens, top_k] indices, not a mask,
  so without this the routing is completely wrong at runtime.
- num_out_tokens=None should be -1 (the API expects an int).
- moe_unpermute: replaced deprecated probs= with merging_probs= and
  kept routing_weights in float32 as TE recommends.
- utils.py: num_warmup_steps was hardcoded to 100 instead of using
  hyperparams.num_warmup_steps, which made the benchmark meaningless.
- requirements.txt: transformers==4.57.0 doesn't exist, fixed to 4.47.1.
- Notebook generalisation guide: updated code template with the same fixes.

Tested on 2xH100 in nvcr.io/nvidia/pytorch:26.01-py3 (PyTorch 2.10.0, CUDA 13.1):

  $ python3 -c "
  from te_mixtral import TEMixtralSparseMoeBlock
  from transformers import MixtralConfig
  from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
  import torch

  cfg = MixtralConfig(hidden_size=256, intermediate_size=512,
                      num_local_experts=4, num_experts_per_tok=2)
  x = torch.randn(2, 8, cfg.hidden_size, device='cuda', dtype=torch.bfloat16)
  hf_out, hf_logits = MixtralSparseMoeBlock(cfg).cuda().bfloat16()(x)
  te_out, te_logits = TEMixtralSparseMoeBlock(cfg).cuda().bfloat16()(x)
  assert hf_out.shape == te_out.shape
  assert hf_logits.shape == te_logits.shape
  print('PASS')"

  Input  shape : torch.Size([2, 8, 256])
  Output shape : torch.Size([2, 8, 256])  (matches HF: True)
  Logits shape : torch.Size([16, 4])  (matches HF: True)
  Output dtype : torch.bfloat16
  PASS

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
The docs CI was failing because nbsphinx auto mode tries to execute
notebooks with no stored outputs. Add explicit execute:never metadata
so the docs builder renders the notebook as-is without running cells.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
…otebook

te_mixtral.py:
- Replace per-expert .item() loop with torch.bincount + .tolist() (8 GPU syncs
  → 1), eliminating the main cause of GroupedLinear speedup regression
- Remove unnecessary num_out_tokens/max_token_num args from moe_permute
- Fix replace_params to use .copy_() instead of fragile .data[] assignment,
  and load_state_dict from fully-populated te_state in one shot
- Add device_map="auto" support in from_hf_model via accelerate.dispatch_model
- Rename replace_params -> _pack_expert_weights for clarity

tutorial_accelerate_hf_mixtral_with_te.ipynb:
- Remove shape-check section (redundant)
- Add FP8 prefill and decode-regime benchmark cells with summary table
- Restructure training sections to match te_llama pattern: restart at top of
  each cell, combined hyperparams + init + wrap + finetune, concise markdown
- Bump benchmark SEQ 512 -> 2048 for realistic H100 workload

utils.py:
- Apply user batch size and sequence length adjustments

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
for more information, see https://pre-commit.ci

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
faradawn and others added 6 commits April 21, 2026 11:29
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
….10 compat for Unpack

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
- Add padding to next multiple of 32 for MXFP8 in two paths:
  1. MoE expert FFN: pad dispatched tokens before GroupedLinear
  2. THD attention: pad unpacked hidden states and adjust cu_seqlens
- Update notebook to use batch=8, seq=256 with measured results:
  Tier 1 (HF BF16): 603ms, Tier 2 (NCCL EP): 365ms,
  Tier 3 (DeepEP): 357ms, Tier 4 (MXFP8): 349ms
- Add HANDOFF.md with environment setup, run commands, and known issues

Made-with: Cursor
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
- Update notebook to use batch=8/seq=256 with correct measured results
- Add note recommending Megatron-Core for 4K+ sequence lengths
- Add presentation.html with 13-slide technical overview
- Update HANDOFF.md with latest benchmark tables

Signed-off-by: Faradawn Yang <faradawny@nvidia.com>
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
for more information, see https://pre-commit.ci

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
- nbsphinx: skip executing the Mixtral tutorial during docs build; cells require
  torch + 8 GPUs which the docs runner does not have.
- te_mixtral.py: import Unpack from typing_extensions unconditionally so vermin
  does not flag Python 3.11 usage under the repo's 3.10 target.
- te_mixtral.py: derive padded_seq_len from hidden_states, not input_ids, since
  input_ids is None when the caller passes inputs_embeds directly.
- Tutorial notebook: align code-cell batch-size comments with the Python below,
  add explicit 'run via torchrun, not Jupyter' notes to Tier 2/3/4 cells, and
  add a References section pointing to the BioNeMo Mixtral recipes.
- Remove scratch files presentation.html and example_xy_algorithm.py.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Comment thread docs/examples/te_mixtral/utils.py
Sphinx was run with -W (warnings as errors) and flagged the Mixtral
notebook as "document isn't included in any toctree", which failed the
docs Build job. Add the notebook alongside the existing Llama and Gemma
tutorials in the "Examples and Tutorials" toctree in docs/index.rst.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

Comment thread docs/examples/te_mixtral/requirements.txt Outdated
Accelerate validates log_with at construction time and raises
"log_with=wandb requires wandb to be installed" when wandb is absent.
The tutorial never calls accelerator.log() or init_trackers(), so the
tracker was dead wiring. Drop it instead of adding wandb as a new
requirement a tutorial user would never exercise.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
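The resulting construction is then just a bare Accelerator; a minimal sketch (the gradient_accumulation_steps value is illustrative, not taken from the tutorial):

```python
from accelerate import Accelerator

# No log_with tracker: the tutorial never calls accelerator.log() or
# init_trackers(), so omitting it removes the wandb install requirement.
accelerator = Accelerator(gradient_accumulation_steps=1)
```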
Comment thread docs/examples/te_mixtral/te_mixtral.py
…ming

te_mixtral.py: the InferenceParams branch dereferenced input_ids.shape,
which crashes when the caller supplies inputs_embeds directly (the XOR
guard at the top of forward allows that path). Fall back to inputs_embeds
as the shape reference and compare shape[:2] so the check works for both
2D input_ids and 3D inputs_embeds.

utils.py: replace the single-pair CUDA event around the whole training
loop with per-step events. Each step's elapsed time is printed live from
rank 0 as "[step N/total] XXXX ms", and the final summary reports a
steady-state estimate (median of the last 5 steps) instead of the old
mean over a 3-step burst. The old 3-sample mean was sensitive to first-
step transients (cuBLAS heuristic warmup, allocator growth, NCCL plan
caching) and produced a non-monotonic Tier-2 batch-size curve.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
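A minimal, runnable sketch of that per-step timing pattern; train_step and the matmul workload are hypothetical stand-ins for the tutorial's forward/backward/optimizer step:

```python
import statistics
import torch

num_steps = 20
x = torch.randn(4096, 4096, device="cuda")

def train_step(inp):  # stand-in for the real fwd/bwd/optimizer step
    return inp @ inp

step_ms = []
for step in range(num_steps):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    train_step(x)
    end.record()
    end.synchronize()  # elapsed_time is only valid after the end event completes
    step_ms.append(start.elapsed_time(end))
    print(f"[step {step + 1}/{num_steps}] {step_ms[-1]:.2f} ms")

# Steady-state estimate: median of the last 5 steps, robust to first-step
# transients (cuBLAS heuristic warmup, allocator growth, NCCL plan caching).
print(f"steady-state: {statistics.median(step_ms[-5:]):.2f} ms")
```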
@faradawn (Author) commented:

Hi @pggPL, I've fixed the docs build and refined the content; can you take a look?

Comment thread docs/examples/te_mixtral/requirements.txt
Comment thread docs/index.rst
@pggPL (Collaborator) commented Apr 22, 2026:

I haven't reviewed this in detail yet; I'd like to discuss the high-level flow of the tutorial first. I see some issues:

  1. We use GroupedLinear with EP=8 and 8 experts, so each GroupedLinear effectively holds a single expert and does not differ from a standard Linear. Can we do EP=2 with 4 experts per GPU, or something like that (see the sketch below)?
  2. We use different types of parallelism in different tiers and it was not clear to me why. Maybe we can also use EP in HF if it is supported.
  3. The gains from Tier 1 -> Tier 2 are very large, but Tier 2 -> Tier 3 -> Tier 4 are very small. Do you know why?

Apart from that, it would be nice to prepare documentation (not a tutorial) explaining why grouped linear is needed and which permute kernels we provide. I will work on that in a different PR and it will land in the Features section of our docs.
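For concreteness, the EP=2 layout from point 1, combined with the 2D (DP, EP) DeviceMesh a later commit in this PR describes, might be set up as below; a sketch with assumed names, not code from the PR:

```python
from torch.distributed.device_mesh import init_device_mesh

# 8 GPUs as DP=4 x EP=2: each EP group of 2 ranks holds 8 // 2 = 4 local
# experts, so GroupedLinear batches a non-degenerate set of expert GEMMs.
# (EP=8 with 8 experts leaves 1 expert per rank, i.e. a plain Linear.)
world_mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "ep"))
ep_group = world_mesh.get_group("ep")              # ProcessGroup for all-to-all
num_local_experts = 8 // world_mesh["ep"].size()   # -> 4
```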

Pin transformers, accelerate, datasets, safetensors, huggingface_hub, and
tokenizers in requirements.txt to the exact versions validated on 8x B300
inside NGC pytorch-25.12-py3. torch, torchao, and transformer_engine are
left unpinned because they come from the NGC container and pinning them
would tie the tutorial to a specific container build.

Add a setup-cell note in the tutorial notebook calling out the tested
GPU (8x Blackwell B300) and container (pytorch-25.12-py3).

Addresses pggPL review comments on requirements.txt:
- "pin the versions of some of the frameworks"
- "specify GPU and maybe container to make sure the users can rerun it"

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Comment on lines +847 to +855
```python
if isinstance(past_key_values, InferenceParams):
    # input_ids is None when the caller supplies inputs_embeds directly.
    _ref = input_ids if input_ids is not None else inputs_embeds
    lengths = (
        attention_mask.sum(dim=1).tolist()
        if attention_mask.shape[:2] == _ref.shape[:2]
        else [1] * _ref.shape[0]
    )
    past_key_values.pre_step(OrderedDict(zip(list(range(len(lengths))), lengths)))
```

P1 attention_mask is None crash inside InferenceParams branch

attention_mask is an optional parameter with no None-guard before line 852. In the decode step of model.generate() the mask is frequently omitted, and attention_mask.shape[:2] will raise AttributeError: 'NoneType' object has no attribute 'shape', crashing every auto-regressive generation call.

```python
if isinstance(past_key_values, InferenceParams):
    _ref = input_ids if input_ids is not None else inputs_embeds
    lengths = (
        attention_mask.sum(dim=1).tolist()
        if attention_mask is not None and attention_mask.shape[:2] == _ref.shape[:2]
        else [1] * _ref.shape[0]
    )
    past_key_values.pre_step(OrderedDict(zip(list(range(len(lengths))), lengths)))
```

faradawn and others added 2 commits April 28, 2026 09:26
Adds a configurable expert FFN execution mode to NVMixtralSparseMoeBlock:
  - "grouped" (default): existing TE GroupedLinear path
  - "loop":   Python loop over experts with one F.linear per expert
              (HF-style), running against the same stacked weight tensor
              so swapping between modes is a runtime config flag, no
              checkpoint rework

Adds a new "Tier 2" to the tutorial's progressive optimization chain that
isolates EP+TE primitives without GroupedLinear, so the GroupedLinear
contribution can be measured independently. Renumbers run_finetune_ep.py:

  1 = HF baseline BF16 (device_map="auto")
  2 = TE EP BF16 -- Python loop, F.linear per expert  [new]
  3 = TE EP BF16 -- GroupedLinear                     (was Tier 2)
  4 = TE EP BF16 -- GroupedLinear + Fused DeepEP      (was Tier 3)
  5 = TE EP FP8  -- Float8CurrentScaling + grouped + DeepEP

Tier 5 switches the FP8 recipe from MXFP8BlockScaling to
Float8CurrentScaling -- the per-tensor scaling avoids the 32-aligned
tile constraint that MXFP8 imposes on grouped-matmul, so Tier 5 runs
cleanly at EP < num_experts where MXFP8 trips the per-expert padding
assertion.

Also threads the configuration through the harness:
  - HyperParameters.expert_ffn_mode
  - --ep-size default changed from 8 to 2 (4 experts/rank gives a
    non-degenerate loop-vs-grouped contrast; pass --ep-size 8 for the
    original 1-expert/rank config)
  - Allow WORLD_SIZE > EP via a 2D (DP, EP) DeviceMesh so EP=2 with 8
    GPUs (DP=4) works end-to-end
  - replace_params: use EP-rank-within-group (rank % ep_size) instead of
    global rank for the expert slicing -- otherwise ranks beyond ep_size
    slice past the end of the HF stacked weight tensor

Misc: cycle the dataloader (small benchmark datasets like
openassistant-guanaco exhaust at large batch * num_steps), and trim the
fine-tune summary output to median + last-step (was mean + median +
min/max/p99).

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
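The recipe switch itself is a one-line change at the autocast boundary; a minimal sketch assuming TE's fp8_autocast API, with model and batch as hypothetical stand-ins for the tutorial harness:

```python
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Float8CurrentScaling, MXFP8BlockScaling

# Per-tensor current scaling sidesteps MXFP8's 32-aligned tile constraint on
# grouped matmuls, so Tier 5 runs at EP < num_experts without per-expert padding.
recipe = Float8CurrentScaling()  # Tier 5; previously MXFP8BlockScaling()
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    loss = model(batch).loss     # model / batch: stand-ins for the harness
loss.backward()
```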
Comment on lines +511 to +520
"""
gate_up_w = self.experts_gate_up_weight
if isinstance(gate_up_w, DTensor):
gate_up_w = gate_up_w.to_local()
for i in range(self.num_local_experts):
object.__setattr__(self.experts_gate_up, f"weight{i}", gate_up_w[i])

down_w = self.experts_down_weight
if isinstance(down_w, DTensor):
down_w = down_w.to_local()
P1 Wrong DTensor check — GroupedLinear receives DTensor views under EP

self.experts_gate_up_weight is an nn.Parameter. After set_ep_group() wraps it with nn.Parameter(DTensor.from_local(...)), the parameter's type is still nn.Parameter (not DTensor), so isinstance(gate_up_w, DTensor) is always False. The guard never fires, to_local() is never called, and every subsequent gate_up_w[i] slice handed to object.__setattr__ is a DTensor. TE's GroupedLinear CUDA kernels cannot accept DTensor views and will either raise a type error or silently use incorrect data.

The _expert_ffn loop path already applies the correct pattern with .data:

```python
if isinstance(gate_up_w.data, DTensor):
    gate_up_w = gate_up_w.data.to_local()
```
Suggested change (check .data, which holds the DTensor, rather than the Parameter wrapper):

```python
gate_up_w = self.experts_gate_up_weight
if isinstance(gate_up_w.data, DTensor):
    gate_up_w = gate_up_w.data.to_local()
for i in range(self.num_local_experts):
    object.__setattr__(self.experts_gate_up, f"weight{i}", gate_up_w[i])

down_w = self.experts_down_weight
if isinstance(down_w.data, DTensor):
    down_w = down_w.data.to_local()
for i in range(self.num_local_experts):
    object.__setattr__(self.experts_down, f"weight{i}", down_w[i])
```

faradawn and others added 2 commits April 30, 2026 08:04
Adds EXPERIMENT_LOG.md with the GroupedLinear vs naive-loop investigation
(EP=2 batch sweep, batch=8 nsys + NVTX per-phase breakdown) and the
my_expert_ffn_{loop|grouped} / my_grouped_linear_{gate_up,down} NVTX
ranges that produced the per-phase numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +643 to +648
```python
remainder = orig_num_tokens % 32
if remainder != 0:
    pad_count = 32 - remainder
    expert_input = torch.nn.functional.pad(expert_input, (0, 0, 0, pad_count))
    tokens_per_expert = list(tokens_per_expert)
    tokens_per_expert[-1] += pad_count
```

P1 Padding appended to the last expert's slot regardless of which expert has remaining tokens

When orig_num_tokens % 32 != 0, pad_count tokens are appended to expert_input and tokens_per_expert[-1] is incremented by pad_count. This is only correct if the padded tokens happen to logically belong to the last expert. If the last expert already has 0 tokens on this rank (a routine EP scenario), GroupedLinear will schedule a GEMM for pad_count zero-padded tokens against that expert's weights, inflating compute and potentially triggering a misaligned-size error rather than a clean no-op. A safer approach is to distribute the padding tokens as a trailing dummy entry (tokens_per_expert.append(pad_count)) or to pad at batch-preparation time before dispatch so padding is always aligned with a real expert slot.
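One shape the per-expert variant of that fix could take; a sketch under the assumption that aligning each expert's own segment to a multiple of 32 satisfies the MXFP8 constraint (helper name and arguments hypothetical):

```python
import torch
import torch.nn.functional as F

def pad_per_expert(expert_input, tokens_per_expert, align=32):
    """Pad each expert's contiguous token segment to a multiple of `align`.

    expert_input: [T, H] tokens already permuted into per-expert order;
    tokens_per_expert: list[int] summing to T. The pad rows must be stripped
    from each segment again after the GroupedLinear call, before moe_unpermute.
    """
    segments, padded_counts, offset = [], [], 0
    for count in tokens_per_expert:
        seg = expert_input[offset:offset + count]
        offset += count
        if count % align:
            pad = align - count % align
            seg = F.pad(seg, (0, 0, 0, pad))  # zero rows for this expert only
            count += pad
        segments.append(seg)
        padded_counts.append(count)
    return torch.cat(segments, dim=0), padded_counts
```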

@faradawn (Author) commented:

Will wait on this PR: #2923

…p_proj/down_proj naming

Adds three drawio + SVG diagram pairs under docs/examples/te_mixtral/media/:
- dense_to_sparse: dense Transformer block vs sparse MoE block (router + 8 experts).
- mixtral_decoder_swap: Llama-tutorial-style API key->key map from HF MixtralDecoderLayer
  to TE NVMixtralDecoderLayer (fused QKV, packed expert tensors).
- moe_loop_vs_grouped: zoom into the MoE block; HF per-expert loop vs TE GroupedLinear
  with variable per-expert token counts (m_splits).

Adds a new "Architecture Overview" cell at the top of the tutorial notebook with the
two main figures, and renames the MoE weight references throughout the notebook prose
and tables from the older w1/w2/w3 form to the newer packed mlp.experts.gate_up_proj /
mlp.experts.down_proj keys (the form used by BioNeMo's bionemo-recipes/models/mixtral/
convert.py and accepted by replace_params() in te_mixtral.py).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +60 to +85
`/lustre/share/coreai_prod_infbench/exp3_nvtx_per_phase/`.

---

All experiments run on a **SLURM cluster**. The login node has no CUDA, so any
experiment must execute inside an allocation on a compute node with 8 GPUs.

### Reserving a fresh node (8x GPU, 4h, exclusive)

```bash
srun -A coreai_prod_infbench \
-p batch \
-N 1 \
-J myjob \
-t 04:00:00 \
--exclusive \
--mpi=pmix \
--container-image=/lustre/fsw/coreai_prod_infbench/faradawny/docker/pytorch-25.12-py3.sqsh \
--container-save=/lustre/fsw/coreai_prod_infbench/faradawny/docker/pytorch-25.12-py3.sqsh \
--container-name=mycontainer \
--container-mounts=/lustre:/lustre \
--pty bash
```

This drops you into an interactive shell inside the NGC `pytorch-25.12-py3`
container with `/lustre` mounted and the container saved as `mycontainer`.

P1 Internal cluster details in a public repository

EXPERIMENT_LOG.md is a personal working-notes file that contains NVIDIA-internal infrastructure details that should not be committed to a public GitHub repo:

  • A developer username embedded in paths (faradawny): /lustre/fsw/coreai_prod_infbench/faradawny/docker/pytorch-25.12-py3.sqsh
  • An internal SLURM account: coreai_prod_infbench
  • Internal shared storage paths: /lustre/share/coreai_prod_infbench/exp3_nvtx_per_phase/

This file looks like a benchmark log intended for internal coordination, not documentation for external tutorial users. Consider removing it entirely or replacing it with a trimmed BENCHMARKS.md that shows only public results (median step times, speedup numbers) without cluster-specific details.
