Add examples for MoE models - Mixtral in TE by faradawn · Pull Request #2642 · NVIDIA/TransformerEngine

faradawn · 2026-02-02T18:06:27Z

Summary

This PR adds a complete tutorial for integrating HuggingFace Mixtral (MoE) with Transformer Engine, addressing the gap identified in #2573.

What's included

te_mixtral.py — Drop-in TEMixtralSparseMoeBlock that replaces HF's loop-over-experts with TE's GroupedLinear (batched GEMM) + moe_permute/moe_unpermute. Includes replace_moe_block context manager, TEMixtralForCausalLM with HF weight loading, and replace_params for expert weight packing.
utils.py — Data loading, BF16/FP8 model init, Accelerate wrapping, fine-tuning loop — mirrors te_llama/utils.py style.
requirements.txt — Pinned dependencies matching the Llama/Gemma tutorials.
Tutorial notebook — Full tutorial matching the quality bar of te_llama and te_gemma, covering:
1. Architecture overview: Transformer → Mixtral MoE, HF bottleneck, TE approach
2. Unit-test cell verifying output shape/dtype against the HF block
3. [Baseline] HF Mixtral in BF16
4. [Improvement 1] TE GroupedLinear MoE in BF16
5. [Improvement 2] TE GroupedLinear MoE in FP8
6. Expert routing considerations with mixed precision (m_splits, per-expert FP8 scaling, aux loss passthrough)
7. Generalisation guide for other MoE architectures (DeepSeek, Grok-1, etc.)

Bug fix

Corrected the m_splits calculation flagged by the automated review:

# Before (wrong): double-counts tokens by reducing with .any() then multiplying by top_k
expert_mask = (selected_experts == expert_idx).any(dim=-1)
m_splits.append(expert_mask.sum().item() * self.top_k)

# After (correct): count the actual number of (token, top_k_slot) pairs per expert
m_splits = [(selected_experts == i).sum().item() for i in range(self.num_experts)]

Scope

Covers all topics requested in #2573:

How to wrap MoE layers with TE modules ✓
FP8 training configuration for MoE ✓
Expert routing considerations with mixed precision ✓
Generalisation to arbitrary MoE architectures ✓

greptile-apps · 2026-02-02T18:09:49Z

Greptile Summary

This PR adds a complete TE Mixtral tutorial with NVMixtralSparseMoeBlock (TE GroupedLinear + moe_permute/moe_unpermute), an AllToAllTokenDispatcher for expert parallelism, optional DeepEP fused dispatch, FP8 support, and a fine-tuning harness. Several issues flagged in prior rounds have been fixed (class renames, wandb dep, collator shipping), but EXPERIMENT_LOG.md — a personal benchmark log containing NVIDIA-internal cluster paths and a developer username — has been added to the public repo and should be removed or sanitised before merge.

Confidence Score: 3/5

Not safe to merge: EXPERIMENT_LOG.md exposes internal NVIDIA cluster details in a public repo, and several critical runtime bugs in te_mixtral.py flagged in earlier review rounds remain unaddressed.

The P1 finding (EXPERIMENT_LOG.md with internal paths) pulls the ceiling to 4, but the overall state of te_mixtral.py — which still carries unresolved P1 bugs from prior rounds (wrong DTensor isinstance check in _sync_expert_views, null-pointer crashes in the InferenceParams branch, incorrect padding slot assignment) — warrants a lower score.

docs/examples/te_mixtral/EXPERIMENT_LOG.md (internal cluster details must be removed), docs/examples/te_mixtral/te_mixtral.py (multiple unresolved runtime bugs from prior review rounds).

Important Files Changed

Filename	Overview
docs/examples/te_mixtral/te_mixtral.py	Core TE Mixtral implementation: NVMixtralSparseMoeBlock, AllToAllTokenDispatcher, weight mapping. Multiple previously-flagged P1 bugs remain unresolved (DTensor check in _sync_expert_views, null-pointer crashes in InferenceParams branch, incorrect token padding slot assignment).
docs/examples/te_mixtral/EXPERIMENT_LOG.md	Personal benchmark log containing NVIDIA-internal cluster paths (SLURM account, developer username, lustre paths). Should not be committed to a public repository.
docs/examples/te_mixtral/utils.py	Training utilities: model init, Accelerate wrapping, fine-tuning loop. Previously-flagged issues (hardcoded warmup steps, missing wandb dep, broken class name import) appear to be addressed in current version.
docs/examples/te_mixtral/collator.py	THD sequence-packing collator (previously shipped as bionemo_mixtral, now local). Logic is correct; replaces the previously-broken external dependency.
docs/examples/te_mixtral/fused_token_router.py	DeepEP-backed TokenDispatcher with correct map_type="mask" for the multihot path and merging_probs kwarg.
docs/examples/te_mixtral/requirements.txt	Pinned HF stack dependencies; wandb (previously missing) no longer needed since log_with was removed from Accelerator call.
docs/examples/te_mixtral/install_hybridep.sh	DeepEP build script with GPU arch detection and NVSHMEM-disable patch. Contains an internal NVIDIA bug tracker URL (nvbugspro.nvidia.com) in a public-facing comment.
docs/examples/te_mixtral/test_accuracy.py	Weight-mapping parity test between HF and TE models. Clean and minimal; correctly filters _extra_state keys from missing list.

Sequence Diagram

sequenceDiagram
    participant Input as Input [N, H]
    participant Gate as Router (gate)
    participant Dispatch as TokenDispatcher<br/>(AllToAll / FusedDeepEP)
    participant GU as GroupedLinear<br/>gate_up [E, 2*F, H]
    participant Act as SiLU x up
    participant Down as GroupedLinear<br/>down [E, H, F]
    participant Combine as Combine<br/>(moe_unpermute + all_to_all)
    participant Output as Output [N, H]

    Input->>Gate: router_logits [N, num_experts]
    Gate->>Gate: softmax -> topk -> normalize
    Gate->>Dispatch: hidden_states, selected_experts [N, top_k], routing_weights
    Dispatch->>Dispatch: moe_permute(index) + all_to_all
    Dispatch->>GU: expert_input [T, H], m_splits
    GU->>Act: gate_up_output [T, 2F]
    Act->>Down: intermediate [T, F]
    Down->>Combine: expert_output [T, H]
    Combine->>Combine: moe_unpermute(index) + reverse all_to_all
    Combine->>Output: weighted sum [N, H]

_{Reviews (20): Last reviewed commit: "docs(te_mixtral): add architecture diagr..." | Re-trigger Greptile}

greptile-apps

_{1 file reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

sudhakarsingh27 · 2026-02-18T20:04:05Z

Thanks, @faradawn! Also adding @sbhavani to the discussion. Compared to other llama/gemma tutorials, this one seems a quite barebones and looks more like a code example than a tutorial. @sbhavani do you think in its current form, it covers the scope as you requested in #2573?

faradawn · 2026-04-02T23:11:10Z

Hi @sudhakarsingh27 can you check if this addresses your comments? Tested in 2x H100.

sbhavani · 2026-04-06T17:27:02Z

Thanks, @faradawn! Also adding @sbhavani to the discussion. Compared to other llama/gemma tutorials, this one seems a quite barebones and looks more like a code example than a tutorial. @sbhavani do you think in its current form, it covers the scope as you requested in #2573?

Agreed! I think any example should show some perf gain and include the whole weight mapping so the user can run the example.

pggPL · 2026-04-13T11:01:24Z

Documentation build is not working, if you fix it please ping me and I'll review.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

- Fix m_splits bug: use (selected_experts == i).sum() instead of .any(dim=-1).sum() * top_k, which caused dimension mismatches in GroupedLinear - Add te_mixtral.py: TEMixtralSparseMoeBlock (GroupedLinear + moe_permute/ unpermute), replace_moe_block context manager, TEMixtralForCausalLM with HF weight loading, and replace_params for expert weight packing - Add utils.py: HyperParameters, data loading, BF16/FP8 model init, Accelerate wrapping, fine-tuning loop — mirrors te_llama/utils.py style - Add requirements.txt matching te_llama versions - Expand notebook from bare code snippet to full tutorial covering: architecture overview, HF vs TE comparison table, unit-test cell, baseline BF16 run, BF16 TE improvement, FP8 TE improvement, expert routing/scaling discussion, generalisation guide for other MoE models Addresses reviewer feedback: fixes the critical runtime bug (greptile) and expands to tutorial quality comparable to the Llama/Gemma examples (sudhakarsingh27), covering the scope requested in issue NVIDIA#2573. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

for more information, see https://pre-commit.ci Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

A few bugs found during code review: - moe_permute and moe_unpermute were missing map_type='index'. The selected_experts tensor is [num_tokens, top_k] indices, not a mask, so without this the routing is completely wrong at runtime. - num_out_tokens=None should be -1 (the API expects an int). - moe_unpermute: replaced deprecated probs= with merging_probs= and kept routing_weights in float32 as TE recommends. - utils.py: num_warmup_steps was hardcoded to 100 instead of using hyperparams.num_warmup_steps, which made the benchmark meaningless. - requirements.txt: transformers==4.57.0 doesn't exist, fixed to 4.47.1. - Notebook generalisation guide: updated code template with the same fixes. Tested on 2xH100 in nvcr.io/nvidia/pytorch:26.01-py3 (PyTorch 2.10.0, CUDA 13.1): $ python3 -c " from te_mixtral import TEMixtralSparseMoeBlock from transformers import MixtralConfig from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock import torch cfg = MixtralConfig(hidden_size=256, intermediate_size=512, num_local_experts=4, num_experts_per_tok=2) x = torch.randn(2, 8, cfg.hidden_size, device='cuda', dtype=torch.bfloat16) hf_out, hf_logits = MixtralSparseMoeBlock(cfg).cuda().bfloat16()(x) te_out, te_logits = TEMixtralSparseMoeBlock(cfg).cuda().bfloat16()(x) assert hf_out.shape == te_out.shape assert hf_logits.shape == te_logits.shape print('PASS')" Input shape : torch.Size([2, 8, 256]) Output shape : torch.Size([2, 8, 256]) (matches HF: True) Logits shape : torch.Size([16, 4]) (matches HF: True) Output dtype : torch.bfloat16 PASS Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

The docs CI was failing because nbsphinx auto mode tries to execute notebooks with no stored outputs. Add explicit execute:never metadata so the docs builder renders the notebook as-is without running cells. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

…otebook te_mixtral.py: - Replace per-expert .item() loop with torch.bincount + .tolist() (8 GPU syncs → 1), eliminating the main cause of GroupedLinear speedup regression - Remove unnecessary num_out_tokens/max_token_num args from moe_permute - Fix replace_params to use .copy_() instead of fragile .data[] assignment, and load_state_dict from fully-populated te_state in one shot - Add device_map="auto" support in from_hf_model via accelerate.dispatch_model - Rename replace_params -> _pack_expert_weights for clarity tutorial_accelerate_hf_mixtral_with_te.ipynb: - Remove shape-check section (redundant) - Add FP8 prefill and decode-regime benchmark cells with summary table - Restructure training sections to match te_llama pattern: restart at top of each cell, combined hyperparams + init + wrap + finetune, concise markdown - Bump benchmark SEQ 512 -> 2048 for realistic H100 workload utils.py: - Apply user batch size and sequence length adjustments Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

for more information, see https://pre-commit.ci Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

….10 compat for Unpack Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

- Add padding to next multiple of 32 for MXFP8 in two paths: 1. MoE expert FFN: pad dispatched tokens before GroupedLinear 2. THD attention: pad unpacked hidden states and adjust cu_seqlens - Update notebook to use batch=8, seq=256 with measured results: Tier 1 (HF BF16): 603ms, Tier 2 (NCCL EP): 365ms, Tier 3 (DeepEP): 357ms, Tier 4 (MXFP8): 349ms - Add HANDOFF.md with environment setup, run commands, and known issues Made-with: Cursor Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

- Update notebook to use batch=8/seq=256 with correct measured results - Add note recommending Megatron-Core for 4K+ sequence lengths - Add presentation.html with 13-slide technical overview - Update HANDOFF.md with latest benchmark tables Signed-off-by: Faradawn Yang <faradawny@nvidia.com> Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com> Made-with: Cursor Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

for more information, see https://pre-commit.ci Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

- nbsphinx: skip executing the Mixtral tutorial during docs build; cells require torch + 8 GPUs which the docs runner does not have. - te_mixtral.py: import Unpack from typing_extensions unconditionally so vermin does not flag Python 3.11 usage under the repo's 3.10 target. - te_mixtral.py: derive padded_seq_len from hidden_states, not input_ids, since input_ids is None when the caller passes inputs_embeds directly. - Tutorial notebook: align code-cell batch-size comments with the Python below, add explicit 'run via torchrun, not Jupyter' notes to Tier 2/3/4 cells, and add a References section pointing to the BioNeMo Mixtral recipes. - Remove scratch files presentation.html and example_xy_algorithm.py. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

Sphinx was run with -W (warnings as errors) and flagged the Mixtral notebook as "document isn't included in any toctree", which failed the docs Build job. Add the notebook alongside the existing Llama and Gemma tutorials in the "Examples and Tutorials" toctree in docs/index.rst. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

greptile-apps · 2026-04-21T22:29:36Z

Want your agent to iterate on Greptile's feedback? Try greploops.

Accelerate validates log_with at construction time and raises "log_with=wandb requires wandb to be installed" when wandb is absent. The tutorial never calls accelerator.log() or init_trackers(), so the tracker was dead wiring. Drop it instead of adding wandb as a new requirement a tutorial user would never exercise. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

…ming te_mixtral.py: the InferenceParams branch dereferenced input_ids.shape, which crashes when the caller supplies inputs_embeds directly (the XOR guard at the top of forward allows that path). Fall back to inputs_embeds as the shape reference and compare shape[:2] so the check works for both 2D input_ids and 3D inputs_embeds. utils.py: replace the single-pair CUDA event around the whole training loop with per-step events. Each step's elapsed time is printed live from rank 0 as "[step N/total] XXXX ms", and the final summary reports a steady-state estimate (median of the last 5 steps) instead of the old mean over a 3-step burst. The old 3-sample mean was sensitive to first- step transients (cuBLAS heuristic warmup, allocator growth, NCCL plan caching) and produced a non-monotonic Tier-2 batch-size curve. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

faradawn · 2026-04-22T00:12:32Z

Hi @pggPL I've fixed the doc build and refined the content - can you help check?

pggPL · 2026-04-22T11:11:47Z

I was not reviewing it in detail, I would like to discuss on high-level flow of the tutorial first.
I see some issues:

We use GroupedLinear with EP=8 and 8 experts, so 1 GroupedLinear is essentially having 1 expert, so it does not differ from using standard linear. Can we do EP=2 with 4 experts per each GPU or something like that.
We use different types of parallelism in different tiers and it was not clear to me why. Maybe we can also use EP in HF if it is supported.
The gains from Tier 1 -> Tier 2 are very huge, but Tier 2 -> Tier 3 -> Tier 4 are very small. Do you know why?

Apart of that it would be nice to prepare documentation (not tutorial) explaining why grouped linear is needed and which permute kernels we provide. I will work on that in different PR and it will land in Features section in our docs.

Pin transformers, accelerate, datasets, safetensors, huggingface_hub, and tokenizers in requirements.txt to the exact versions validated on 8x B300 inside NGC pytorch-25.12-py3. torch, torchao, and transformer_engine are left unpinned because they come from the NGC container and pinning them would tie the tutorial to a specific container build. Add a setup-cell note in the tutorial notebook calling out the tested GPU (8x Blackwell B300) and container (pytorch-25.12-py3). Addresses pggPL review comments on requirements.txt: - "pin the versions of some of the frameworks" - "specify GPU and maybe container to make sure the users can rerun it" Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

greptile-apps · 2026-04-22T23:06:05Z

+        if isinstance(past_key_values, InferenceParams):
+            # input_ids is None when the caller supplies inputs_embeds directly.
+            _ref = input_ids if input_ids is not None else inputs_embeds
+            lengths = (
+                attention_mask.sum(dim=1).tolist()
+                if attention_mask.shape[:2] == _ref.shape[:2]
+                else [1] * _ref.shape[0]
+            )
+            past_key_values.pre_step(OrderedDict(zip(list(range(len(lengths))), lengths)))


attention_mask is None crash inside InferenceParams branch

attention_mask is an optional parameter with no None-guard before line 852. In the decode step of model.generate() the mask is frequently omitted, and attention_mask.shape[:2] will raise AttributeError: 'NoneType' object has no attribute 'shape', crashing every auto-regressive generation call.

if isinstance(past_key_values, InferenceParams): _ref = input_ids if input_ids is not None else inputs_embeds lengths = ( attention_mask.sum(dim=1).tolist() if attention_mask is not None and attention_mask.shape[:2] == _ref.shape[:2] else [1] * _ref.shape[0] ) past_key_values.pre_step(OrderedDict(zip(list(range(len(lengths))), lengths)))

Adds a configurable expert FFN execution mode to NVMixtralSparseMoeBlock: - "grouped" (default): existing TE GroupedLinear path - "loop": Python loop over experts with one F.linear per expert (HF-style), running against the same stacked weight tensor so swapping between modes is a runtime config flag, no checkpoint rework Adds a new "Tier 2" to the tutorial's progressive optimization chain that isolates EP+TE primitives without GroupedLinear, so the GroupedLinear contribution can be measured independently. Renumbers run_finetune_ep.py: 1 = HF baseline BF16 (device_map="auto") 2 = TE EP BF16 -- Python loop, F.linear per expert [new] 3 = TE EP BF16 -- GroupedLinear (was Tier 2) 4 = TE EP BF16 -- GroupedLinear + Fused DeepEP (was Tier 3) 5 = TE EP FP8 -- Float8CurrentScaling + grouped + DeepEP Tier 5 switches the FP8 recipe from MXFP8BlockScaling to Float8CurrentScaling -- the per-tensor scaling avoids the 32-aligned tile constraint that MXFP8 imposes on grouped-matmul, so Tier 5 runs cleanly at EP < num_experts where MXFP8 trips the per-expert padding assertion. Also threads the configuration through the harness: - HyperParameters.expert_ffn_mode - --ep-size default changed from 8 to 2 (4 experts/rank gives a non-degenerate loop-vs-grouped contrast; pass --ep-size 8 for the original 1-expert/rank config) - Allow WORLD_SIZE > EP via a 2D (DP, EP) DeviceMesh so EP=2 with 8 GPUs (DP=4) works end-to-end - replace_params: use EP-rank-within-group (rank % ep_size) instead of global rank for the expert slicing -- otherwise ranks beyond ep_size slice past the end of the HF stacked weight tensor Misc: cycle the dataloader (small benchmark datasets like openassistant-guanaco exhaust at large batch * num_steps), and trim the fine-tune summary output to median + last-step (was mean + median + min/max/p99). Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-04-28T16:42:04Z

+        """
+        gate_up_w = self.experts_gate_up_weight
+        if isinstance(gate_up_w, DTensor):
+            gate_up_w = gate_up_w.to_local()
+        for i in range(self.num_local_experts):
+            object.__setattr__(self.experts_gate_up, f"weight{i}", gate_up_w[i])
+
+        down_w = self.experts_down_weight
+        if isinstance(down_w, DTensor):
+            down_w = down_w.to_local()


Wrong DTensor check — GroupedLinear receives DTensor views under EP

self.experts_gate_up_weight is an nn.Parameter. After set_ep_group() wraps it with nn.Parameter(DTensor.from_local(...)), the parameter's type is still nn.Parameter (not DTensor), so isinstance(gate_up_w, DTensor) is always False. The guard never fires, to_local() is never called, and every subsequent gate_up_w[i] slice handed to object.__setattr__ is a DTensor. TE's GroupedLinear CUDA kernels cannot accept DTensor views and will either raise a type error or silently use incorrect data.

The _expert_ffn loop path already applies the correct pattern with .data:

if isinstance(gate_up_w.data, DTensor): gate_up_w = gate_up_w.data.to_local()

Suggested change

"""

gate_up_w = self.experts_gate_up_weight

if isinstance(gate_up_w, DTensor):

gate_up_w = gate_up_w.to_local()

for i in range(self.num_local_experts):

object.__setattr__(self.experts_gate_up, f"weight{i}", gate_up_w[i])

down_w = self.experts_down_weight

if isinstance(down_w, DTensor):

down_w = down_w.to_local()

gate_up_w = self.experts_gate_up_weight

if isinstance(gate_up_w.data, DTensor):

gate_up_w = gate_up_w.data.to_local()

for i in range(self.num_local_experts):

object.__setattr__(self.experts_gate_up, f"weight{i}", gate_up_w[i])

down_w = self.experts_down_weight

if isinstance(down_w.data, DTensor):

down_w = down_w.data.to_local()

for i in range(self.num_local_experts):

object.__setattr__(self.experts_down, f"weight{i}", down_w[i])

Adds EXPERIMENT_LOG.md with the GroupedLinear vs naive-loop investigation (EP=2 batch sweep, batch=8 nsys + NVTX per-phase breakdown) and the my_expert_ffn_{loop|grouped} / my_grouped_linear_{gate_up,down} NVTX ranges that produced the per-phase numbers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-04-30T15:20:34Z

+        remainder = orig_num_tokens % 32
+        if remainder != 0:
+            pad_count = 32 - remainder
+            expert_input = torch.nn.functional.pad(expert_input, (0, 0, 0, pad_count))
+            tokens_per_expert = list(tokens_per_expert)
+            tokens_per_expert[-1] += pad_count


Padding appended to the last expert's slot regardless of which expert has remaining tokens

When orig_num_tokens % 32 != 0, pad_count tokens are appended to expert_input and tokens_per_expert[-1] is incremented by pad_count. This is only correct if the padded tokens happen to logically belong to the last expert. If the last expert already has 0 tokens on this rank (a routine EP scenario), GroupedLinear will schedule a GEMM for pad_count zero-padded tokens against that expert's weights, inflating compute and potentially triggering a misaligned-size error rather than a clean no-op. A safer approach is to distribute the padding tokens as a trailing dummy entry (tokens_per_expert.append(pad_count)) or to pad at batch-preparation time before dispatch so padding is always aligned with a real expert slot.

faradawn · 2026-04-30T21:22:15Z

Will wait on this PR: #2923

…p_proj/down_proj naming Adds three drawio + SVG diagram pairs under docs/examples/te_mixtral/media/: - dense_to_sparse: dense Transformer block vs sparse MoE block (router + 8 experts). - mixtral_decoder_swap: Llama-tutorial-style API key->key map from HF MixtralDecoderLayer to TE NVMixtralDecoderLayer (fused QKV, packed expert tensors). - moe_loop_vs_grouped: zoom into the MoE block; HF per-expert loop vs TE GroupedLinear with variable per-expert token counts (m_splits). Adds a new "Architecture Overview" cell at the top of the tutorial notebook with the two main figures, and renames the MoE weight references throughout the notebook prose and tables from the older w1/w2/w3 form to the newer packed mlp.experts.gate_up_proj / mlp.experts.down_proj keys (the form used by BioNemo's bionemo-recipes/models/mixtral/ convert.py and accepted by replace_params() in te_mixtral.py). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-05-02T01:26:11Z

+`/lustre/share/coreai_prod_infbench/exp3_nvtx_per_phase/`.
+
+---
+
+All experiments run on a **SLURM cluster**. The login node has no CUDA, so any
+experiment must execute inside an allocation on a compute node with 8 GPUs.
+
+### Reserving a fresh node (8x GPU, 4h, exclusive)
+
+```bash
+srun -A coreai_prod_infbench \
+     -p batch \
+     -N 1 \
+     -J myjob \
+     -t 04:00:00 \
+     --exclusive \
+     --mpi=pmix \
+     --container-image=/lustre/fsw/coreai_prod_infbench/faradawny/docker/pytorch-25.12-py3.sqsh \
+     --container-save=/lustre/fsw/coreai_prod_infbench/faradawny/docker/pytorch-25.12-py3.sqsh \
+     --container-name=mycontainer \
+     --container-mounts=/lustre:/lustre \
+     --pty bash
+```
+
+This drops you into an interactive shell inside the NGC `pytorch-25.12-py3`
+container with `/lustre` mounted and the container saved as `mycontainer`.


Internal cluster details in a public repository

EXPERIMENT_LOG.md is a personal working-notes file that contains NVIDIA-internal infrastructure details that should not be committed to a public GitHub repo:

A developer username embedded in paths (faradawny): /lustre/fsw/coreai_prod_infbench/faradawny/docker/pytorch-25.12-py3.sqsh

An internal SLURM account: coreai_prod_infbench

Internal shared storage paths: /lustre/share/coreai_prod_infbench/exp3_nvtx_per_phase/

This file looks like a benchmark log intended for internal coordination, not documentation for external tutorial users. Consider removing it entirely or replacing it with a trimmed BENCHMARKS.md that shows only public results (median step times, speedup numbers) without cluster-specific details.

greptile-apps Bot reviewed Feb 2, 2026

View reviewed changes

Comment thread docs/examples/te_mixtral/tutorial_accelerate_hf_mixtral_with_te.ipynb Outdated

Comment thread docs/examples/te_mixtral/tutorial_accelerate_hf_mixtral_with_te.ipynb Outdated

ptrendx requested review from pggPL and sudhakarsingh27 February 3, 2026 01:48

greptile-apps Bot reviewed Apr 2, 2026

View reviewed changes

Comment thread docs/examples/te_mixtral/tutorial_accelerate_hf_mixtral_with_te.ipynb Outdated

Comment thread docs/examples/te_mixtral/tutorial_accelerate_hf_mixtral_with_te.ipynb Outdated

greptile-apps Bot reviewed Apr 2, 2026

View reviewed changes

Comment thread docs/examples/te_mixtral/te_mixtral.py Outdated

Comment thread docs/examples/te_mixtral/te_mixtral.py Outdated

Comment thread docs/examples/te_mixtral/utils.py

pggPL self-assigned this Apr 13, 2026

greptile-apps Bot reviewed Apr 13, 2026

View reviewed changes

Comment thread docs/examples/te_mixtral/tutorial_accelerate_hf_mixtral_with_te.ipynb Outdated

greptile-apps Bot reviewed Apr 14, 2026

View reviewed changes

Comment thread docs/examples/te_mixtral/utils.py Outdated

greptile-apps Bot reviewed Apr 14, 2026

View reviewed changes

Comment thread docs/examples/te_mixtral/tutorial_accelerate_hf_mixtral_with_te.ipynb Outdated

greptile-apps Bot reviewed Apr 16, 2026

View reviewed changes

Comment thread docs/examples/te_mixtral/utils.py Outdated

faradawn force-pushed the add-moe-example branch from d7301a6 to fcc7e35 Compare April 20, 2026 18:26

greptile-apps Bot reviewed Apr 20, 2026

View reviewed changes

Comment thread docs/examples/te_mixtral/HANDOFF.md Outdated

faradawn force-pushed the add-moe-example branch from 72e0e45 to 226173a Compare April 20, 2026 22:46

greptile-apps Bot reviewed Apr 20, 2026

View reviewed changes

Comment thread docs/examples/te_mixtral/te_mixtral.py Outdated

faradawn and others added 11 commits April 21, 2026 11:29

Add Mixtral MoE integration tutorial with TE GroupedLinear

456458e

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

d8a5ee2

for more information, see https://pre-commit.ci Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

83358cb

for more information, see https://pre-commit.ci Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

add memory calculation

c3c2e18

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

docs(te_mixtral): rename example script to test_accuracy

3497ebb

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

add replace param

37bb13f

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

smoke test of ep worked

cbddd45

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

faradawn and others added 6 commits April 21, 2026 11:29

docs(te_mixtral): add collator and update tutorial utils

f8b8a81

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

fix: update copyright headers to match repo standard and add Python 3…

4f49eaf

….10 compat for Unpack Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

ecf66e9

for more information, see https://pre-commit.ci Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

faradawn force-pushed the add-moe-example branch from c3a50c1 to 590814a Compare April 21, 2026 18:29

greptile-apps Bot reviewed Apr 21, 2026

View reviewed changes

Comment thread docs/examples/te_mixtral/utils.py

greptile-apps Bot reviewed Apr 21, 2026

View reviewed changes

Comment thread docs/examples/te_mixtral/requirements.txt Outdated

greptile-apps Bot reviewed Apr 21, 2026

View reviewed changes

Comment thread docs/examples/te_mixtral/te_mixtral.py

pggPL reviewed Apr 22, 2026

View reviewed changes

Comment thread docs/examples/te_mixtral/requirements.txt

pggPL reviewed Apr 22, 2026

View reviewed changes

Comment thread docs/index.rst

greptile-apps Bot reviewed Apr 22, 2026

View reviewed changes

faradawn and others added 2 commits April 28, 2026 09:26

[pre-commit.ci] auto fixes from pre-commit.com hooks

2449f1a

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Apr 28, 2026

View reviewed changes

faradawn mentioned this pull request Apr 28, 2026

GroupedLinear performance on Mixtral-8x7B at EP=2 #2939

Open

faradawn and others added 2 commits April 30, 2026 08:04

[pre-commit.ci] auto fixes from pre-commit.com hooks

6a69bed

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Apr 30, 2026

View reviewed changes

greptile-apps Bot reviewed May 2, 2026

View reviewed changes

Conversation

faradawn commented Feb 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's included

Bug fix

Scope

Uh oh!

greptile-apps Bot commented Feb 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

sudhakarsingh27 commented Feb 18, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

faradawn commented Apr 2, 2026

Uh oh!

sbhavani commented Apr 6, 2026

Uh oh!

pggPL commented Apr 13, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Apr 21, 2026

Uh oh!

Uh oh!

Uh oh!

faradawn commented Apr 22, 2026

Uh oh!

Uh oh!

Uh oh!

pggPL commented Apr 22, 2026

Uh oh!

greptile-apps Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

faradawn commented Apr 30, 2026

Uh oh!

greptile-apps Bot May 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

faradawn commented Feb 2, 2026 •

edited

Loading

greptile-apps Bot commented Feb 2, 2026 •

edited

Loading