Add Gemma 4 text-decoder export to CoreML by john-rocky · Pull Request #19253 · pytorch/executorch

john-rocky · 2026-05-01T06:03:17Z

Summary

The Gemma 4 text decoder shipped with examples/models/gemma4/text_decoder/
already implements hybrid sliding/full attention, partial RoPE,
per-layer head_dim (256 for sliding / 512 for full), MQA, and YOCO
KV sharing in plain PyTorch.

I checked, and that implementation lowers cleanly through
torch.export and CoreMLPartitioner today — for the synthetic
10-layer Gemma 4 used in the new test, the lowered edge program
contains exactly executorch_call_delegate and getitem at the top
level (1186 MIL ops fully delegated). No portable fallbacks, no
unsupported ops.
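
The "fully delegated" check described above can be sketched as a small helper. This is an illustration, not the PR's test code: it assumes the lowered program exposes a torch.fx-style graph whose nodes carry `op` and `target` attributes, and it matches targets by their `__name__`, which is an assumption about how the delegate call and getitem present themselves.

```python
from collections import Counter

# Names the PR reports at the top level of the lowered edge program.
ALLOWED_TARGETS = {"executorch_call_delegate", "getitem"}


def target_name(target):
    """Best-effort readable name for a node target (assumption: match by name)."""
    return getattr(target, "__name__", str(target))


def non_delegated_ops(graph_module):
    """Count top-level call_function nodes that are neither delegate calls nor getitem."""
    leftovers = Counter()
    for node in graph_module.graph.nodes:
        if node.op != "call_function":
            continue  # skip placeholder, get_attr, and output nodes
        name = target_name(node.target)
        if name not in ALLOWED_TARGETS:
            leftovers[name] += 1
    return leftovers
```

An empty result means nothing fell back to portable ops. With a real export, the argument would be something like `edge_manager.exported_program().graph_module`, where `edge_manager` is a hypothetical name for the lowered program manager.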

So the missing piece is not new modeling code — it is the small amount
of glue that turns "exportable in principle" into "exportable from one
shell command". This PR adds that glue:

* examples/apple/coreml/gemma4/export_gemma4_text_decoder_coreml.py,
  with sensible CoreML defaults: iOS18+ deployment target so the
  YOCO KV caches can be taken over as stateful tensors,
  compute_unit=CPU_AND_NE, and fp16 by default (the ANE requires fp16).
* A --random_weights mode for smoke-testing the export pipeline
  without a HuggingFace checkpoint, plus --config_json,
  --sliding_window, and --sliding_window_pattern overrides.
* A readme.md documenting the flags and the "everything delegates"
  property.
* A BUCK target so the script is buildable in fbcode the same way
  the existing CoreML llama scripts are.
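
As a rough sketch of the command-line surface listed above, the flags could be wired up with argparse as follows. Only the four flag names come from the PR description. The argument types, the help strings, and the `build_parser` / `config_overrides` helpers are hypothetical and may differ from the actual script.

```python
import argparse


def build_parser():
    """Hypothetical CLI mirroring the flags named in the PR description."""
    p = argparse.ArgumentParser(
        description="Export a Gemma 4 text decoder to CoreML (sketch)"
    )
    p.add_argument("--random_weights", action="store_true",
                   help="skip checkpoint loading; smoke-test the export only")
    p.add_argument("--config_json", type=str, default=None,
                   help="path to a JSON file with model config values")
    p.add_argument("--sliding_window", type=int, default=None,
                   help="override the sliding-attention window size")
    p.add_argument("--sliding_window_pattern", type=int, default=None,
                   help="override how often a full-attention layer appears")
    return p


def config_overrides(args):
    """Collect only the overrides the user set (None keeps the config default)."""
    candidates = {
        "sliding_window": args.sliding_window,
        "sliding_window_pattern": args.sliding_window_pattern,
    }
    return {k: v for k, v in candidates.items() if v is not None}
```

Leaving unset overrides out of the dictionary lets the model config keep its own defaults, which matches the description of these flags as overrides.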

The audio and vision encoders are intentionally out of scope — the
existing ATen pipeline in examples/models/gemma4 is more appropriate
for those.

Test plan

examples/apple/coreml/gemma4/test.py builds a 10-layer synthetic
Gemma 4 ((4 sliding + 1 full) × 2), the same hybrid pattern as
Gemma 4 E2B at smaller dimensions, and runs the full export pipeline,
asserting the resulting .pte is non-empty.

$ python -m pytest examples/apple/coreml/gemma4/test.py -v
test.py::TestGemma4CoreMLExport::test_eager_forward_runs PASSED
test.py::TestGemma4CoreMLExport::test_full_export_pipeline_lowers_to_coreml PASSED
============================== 2 passed in 15.32s ==============================

I also ran the export by hand and confirmed the resulting edge program
is fully delegated.
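
To make the 10-layer hybrid layout in the test plan concrete, here is a minimal sketch of how a repeating sliding/full pattern can be generated. The rule "every Nth layer is full attention" is an assumption chosen to reproduce the 4 sliding + 1 full grouping. It is not confirmed to be how Gemma4Config interprets --sliding_window_pattern. The head_dim values are the ones quoted in the summary for the real model, whereas the synthetic test model uses smaller dimensions.

```python
def layer_types(num_layers, sliding_window_pattern):
    """Every `sliding_window_pattern`-th layer is full attention; the rest slide."""
    return [
        "full" if (i + 1) % sliding_window_pattern == 0 else "sliding"
        for i in range(num_layers)
    ]


# Per-layer head_dim quoted in the PR summary (real model, not the synthetic one).
HEAD_DIM = {"sliding": 256, "full": 512}

types = layer_types(num_layers=10, sliding_window_pattern=5)
head_dims = [HEAD_DIM[t] for t in types]
```

Under this assumed rule, layers 5 and 10 are the full-attention layers and the other eight use sliding attention.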

Relationship to other open PRs

* #19250 (Add --sliding_window flag to CoreML static LLM export) and
  #19251 (Add per-layer hybrid sliding/full attention (Gemma 3 /
  Gemma 4) to CoreML static LLM export) add --sliding_window /
  --sliding_window_pattern for the static-LLM Llama path. Gemma 4's
  text decoder uses a different attention implementation (per-layer
  head_dim, partial RoPE, etc.) that already understands those
  concepts via Gemma4Config, so this PR doesn't depend on those; it
  just plumbs the equivalent overrides through to Gemma4Config directly.
* #19252 (Add coreml_compute_plan.py: report which CoreML ops dispatch
  to ANE / GPU / CPU) adds coreml_compute_plan.py, which is the natural
  next step for tuning a Gemma 4 export: run it against the produced
  .pte to see which ops the runtime would dispatch to the ANE vs the CPU.

Authored with Claude.

The Gemma 4 text decoder shipped with examples/models/gemma4 already implements hybrid sliding/full attention, partial RoPE, per-layer head_dim, MQA, and YOCO KV sharing in plain PyTorch. That implementation lowers cleanly through torch.export and CoreMLPartitioner: the top level of the resulting edge program contains only executorch_call_delegate and getitem nodes.

This script wires up the small amount of glue needed for an on-device-friendly default:

* compile_specs targeting iOS18+ so the YOCO KV caches can be taken over as stateful tensors.
* fp16 by default (the ANE requires fp16).
* compute_unit=CPU_AND_NE so the runtime is free to keep ops on the ANE.
* Optional --random_weights mode for smoke-testing the export without a HuggingFace checkpoint, plus --config_json / --sliding_window / --sliding_window_pattern overrides.

Audio and vision encoders are intentionally out of scope here; the existing ATen pipeline in examples/models/gemma4 is more appropriate for those.

### Test plan

`test.py` builds a 10-layer synthetic Gemma 4 ((4 sliding + 1 full) × 2) and runs the full export pipeline, asserting the resulting .pte exists.

$ python -m pytest examples/apple/coreml/gemma4/test.py -v
test.py::TestGemma4CoreMLExport::test_eager_forward_runs PASSED
test.py::TestGemma4CoreMLExport::test_full_export_pipeline_lowers_to_coreml PASSED
============================== 2 passed in 15.32s ==============================

I also ran the export by hand against the synthetic config and confirmed the lowered edge program contains only `executorch_call_delegate` and `getitem` at the top level.

Authored with Claude.

pytorch-bot · 2026-05-01T06:03:21Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19253

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⚠️ 11 Awaiting Approval

As of commit 4efa007 with merge base 94d2881:

AWAITING APPROVAL - The following workflows need approval before CI can run:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

shoumikhin · 2026-05-02T15:29:40Z

Thanks @john-rocky, really appreciate the CoreML batch. Linking the related PRs in this stack so reviewers can see the full picture: #19245, #19246, #19247, #19248, #19249, #19250, #19251, #19252.

@metascroy you're already on this one. Would you mind taking a sweep across the stack, or should we pull in another CoreML reviewer?

john-rocky · 2026-05-02T15:37:43Z

Thanks @shoumikhin! Quick orientation for whoever does the sweep:

* #19245 (Reject CoreML delegation for unsupported input dtypes) through #19249 are five independent partitioner / DX fixes touching coreml_partitioner.py, torch_ops.py, and the partition tests. Each one stands alone; merge order does not matter.
* #19250 (Add --sliding_window flag to CoreML static LLM export) → #19251 (Add per-layer hybrid sliding/full attention (Gemma 3 / Gemma 4) to CoreML static LLM export) is the only stack: #19251 was branched on top of #19250's commit, so once #19250 lands, #19251's diff collapses to just the per-layer commit (9bdf04e). I'll rebase as soon as that happens.
* #19252 (compute-plan analyzer) and this PR (#19253, Gemma 4) are standalone.

All nine have unit tests I ran on macOS 26 / Python 3.10 / coremltools 9.0; the test plan section in each PR body has the local pytest output.

Happy to split, squash, retitle, or add a "release notes:" label to any of them if that helps land the batch faster. Let me know what's most useful.

john-rocky · 2026-05-02T15:37:49Z

@pytorchbot label "release notes: apple"

john-rocky requested a review from metascroy as a code owner May 1, 2026 06:03

meta-cla Bot added the CLA Signed label May 1, 2026

pytorch-bot Bot added the release notes: apple label May 2, 2026

john-rocky wants to merge 1 commit into pytorch:main from john-rocky:coreml/gemma4-text-decoder
