Good First Issue: isolate the MLX submodule build (MLX backend) #20556

Description

@metascroy

Problem

MLX is a git submodule (backends/mlx/third-party/mlx) pulled into ET's build
with add_subdirectory, which drops MLX's whole CMake project into ET's own
target/option namespace:

# backends/mlx/CMakeLists.txt:239-242
add_subdirectory(${MLX_SOURCE_DIR} ${CMAKE_CURRENT_BINARY_DIR}/mlx)

Sharing the namespace causes a collision: MLX's upstream
FetchContent_MakeAvailable(json) clashes with the nlohmann_json target ET
already provides. We work around it by patching MLX's own CMakeLists.txt
at configure time (backends/mlx/CMakeLists.txt:216-238 applies
patches/mlx_json.patch, which wraps the fetch in if(NOT TARGET nlohmann_json)).
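To make the guard concrete, here is a minimal, self-contained sketch of its shape. It is not the patch itself: the project name is hypothetical, the INTERFACE library is only a stand-in for the nlohmann_json target ET provides, and the real fetch call is replaced by a message so the file configures without network access.

```cmake
cmake_minimum_required(VERSION 3.16)
project(json_guard_demo NONE)  # hypothetical demo project, no compiler needed

# Stand-in for the nlohmann_json target that ET already defines.
add_library(nlohmann_json INTERFACE)

# Shape of the guard that patches/mlx_json.patch wraps around MLX's fetch.
if(NOT TARGET nlohmann_json)
  # In MLX this branch would call FetchContent_MakeAvailable(json).
  message(STATUS "nlohmann_json missing: MLX would fetch it")
else()
  message(STATUS "nlohmann_json already defined: fetch skipped")
endif()
```

Because the guard keys on a target name in the shared scope, it only helps while both projects configure in the same CMake invocation.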

Why this is bad: patching a submodule is fragile. The patch is pinned to
specific lines of MLX's CMakeLists.txt and silently stops applying when an
MLX bump touches that region (git apply --check just reports "not
applicable") — and the json collision returns with no clear signal. More
broadly, add_subdirectory gives MLX zero isolation: every shared dep MLX
fetches (today json; tomorrow maybe fmt, gguf, …) is a latent collision,
and MLX's MLX_BUILD_* options leak into ET's cache.

Fix: build MLX in its own isolated CMake scope (via ExternalProject)
and consume it as a prebuilt static lib + headers + metallib through an
imported mlx target. MLX then runs its FetchContent in its own
namespace, so the patch becomes unnecessary and is deleted.

How the build works today

In ET's build (backends/mlx/CMakeLists.txt):

  1. Guards: submodule present (:113-122); deployment target ≥ macOS 14 / iOS 17 (:123-164).
  2. MLX options (:165-214): force-set MLX_BUILD_* (METAL ON; CPU/CUDA/python/tests/gguf/safetensors OFF; static lib; JIT). These work only because add_subdirectory shares the cache.
  3. Patch (:216-238) → deleted by this issue.
  4. add_subdirectory (:239-242) → replaced by this issue.
  5. mlxdelegate (:259-297): links MLX as $<BUILD_INTERFACE:mlx>, so mlx is not re-exported.
  6. Install (:299-340): installs mlxdelegate/mlx_schema/mlx, MLX headers, and mlx.metallib to cmake-out/lib/; caches MLX_METALLIB_PATH.

Downstream consumers then read those installed artifacts — none of them
build MLX themselves:

| Consumer | How it gets MLX | Source |
|---|---|---|
| qwen / gemma4 runners | find_package(executorch) → imported mlx target + MLX_METALLIB_PATH | examples/models/{qwen3_5_moe,gemma4_31b}/CMakeLists.txt |
| package config | recreates imported mlx (find_library + Metal/Foundation) and MLX_METALLIB_PATH (find_file) from cmake-out/lib/ | tools/cmake/executorch-config.cmake:124-165 |
| metallib copy helper | copies ${MLX_METALLIB_PATH} next to a binary | tools/cmake/Utils.cmake:195-213 |
| pybindings wheel | copies metallib from a hardcoded build path into the wheel | setup.py:1080-1085 |
| ET delegate tests | link mlx directly | backends/mlx/test/CMakeLists.txt |

Key takeaway for scoping: the runners and package config only depend on the
installed cmake-out/lib/{libmlx.a, mlx.metallib} + the exported
mlxdelegate. As long as those keep landing in lib/, ExternalProject vs
add_subdirectory is invisible to them. The real work is confined to ET's
in-tree build (plus the one hardcoded setup.py path).

Proposed design

Part 1 — Build MLX as an ExternalProject

Replace the patch + add_subdirectory (:216-242) with an ExternalProject_Add
that configures MLX in its own binary dir / CMake invocation. Forward the
options force-set today as -D args, plus toolchain/deployment settings, so the
Metal-only static build is identical:

include(ExternalProject)

set(_mlx_install_dir ${CMAKE_CURRENT_BINARY_DIR}/mlx-install)
set(_mlx_static_lib  ${_mlx_install_dir}/lib/libmlx.a)
set(_mlx_metallib    ${_mlx_install_dir}/lib/mlx.metallib)

ExternalProject_Add(
  mlx_external
  SOURCE_DIR  ${MLX_SOURCE_DIR}                # submodule, unmodified
  BINARY_DIR  ${CMAKE_CURRENT_BINARY_DIR}/mlx  # isolated FetchContent scope
  INSTALL_DIR ${_mlx_install_dir}
  CMAKE_ARGS
    -DCMAKE_INSTALL_PREFIX=${_mlx_install_dir}
    -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE}
    -DCMAKE_CXX_STANDARD=${CMAKE_CXX_STANDARD}
    -DCMAKE_OSX_DEPLOYMENT_TARGET=${CMAKE_OSX_DEPLOYMENT_TARGET}
    -DCMAKE_TOOLCHAIN_FILE=${CMAKE_TOOLCHAIN_FILE}   # forward iOS toolchain if set
    -DPLATFORM=${PLATFORM}
    -DMLX_BUILD_METAL=ON -DMLX_BUILD_CPU=OFF -DMLX_BUILD_CUDA=OFF
    -DMLX_BUILD_SHARED_LIBS=OFF -DMLX_BUILD_PYTHON_BINDINGS=OFF
    -DMLX_BUILD_TESTS=OFF -DMLX_BUILD_EXAMPLES=OFF -DMLX_BUILD_BENCHMARKS=OFF
    -DMLX_BUILD_GGUF=OFF -DMLX_BUILD_SAFETENSORS=OFF -DMLX_METAL_JIT=ON
  BUILD_BYPRODUCTS ${_mlx_static_lib} ${_mlx_metallib}   # required for Ninja
)

This is what removes the patch: MLX configures in a separate CMake process where
nlohmann_json does not pre-exist, so the if(NOT TARGET nlohmann_json) guard
is moot.

Part 2 — Expose an imported mlx target (ET in-tree)

mlxdelegate and the ET tests link mlx. With ExternalProject the library is
produced at build time, so define an imported target:

add_library(mlx STATIC IMPORTED GLOBAL)
set_target_properties(mlx PROPERTIES IMPORTED_LOCATION ${_mlx_static_lib})
target_include_directories(mlx INTERFACE ${MLX_SOURCE_DIR})   # headers from source tree
# A static libmlx.a carries no transitive deps — re-add MLX's frameworks
# (mirrors third-party/mlx/CMakeLists.txt:209,253).
find_library(_metal Metal)
find_library(_foundation Foundation)
find_library(_quartz QuartzCore)
set_property(TARGET mlx PROPERTY INTERFACE_LINK_LIBRARIES
            ${_metal} ${_foundation} ${_quartz})   # + Accelerate iff MLX_BUILD_CPU
add_dependencies(mlx mlx_external)

mlxdelegate then links plain mlx (drop the $<BUILD_INTERFACE:mlx> wrapper
at :289-291). Three things to get right:

  1. Frameworks: a static libmlx.a exports no transitive deps, so the
    imported target must re-add Metal/Foundation/QuartzCore, or mlxdelegate,
    portable_lib, and the tests fail with undefined-symbol errors. (ET's
    package config does the downstream half of this at
    executorch-config.cmake:138-147, Metal+Foundation only.)
  2. Compile order: add_dependencies(mlx mlx_external) builds nothing by
    itself, because an imported target has no build step; CMake only follows
    that dependency transitively for targets that link mlx. Also add
    add_dependencies(mlxdelegate mlx_external) so the ordering is explicit and
    libmlx.a exists before mlxdelegate links.
  3. Headers: point the include dir at the submodule source tree
    (${MLX_SOURCE_DIR}, always present), not the ExternalProject install dir
    (only materializes after the external build), so mlxdelegate compilation
    can't race the build.
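The link and ordering changes for the delegate could look roughly like the fragment below. It is a sketch that assumes the imported mlx target and the mlx_external ExternalProject target from Parts 1 and 2 already exist in the same file; the PRIVATE keyword is an assumption, so keep whatever visibility mlxdelegate uses today.

```cmake
# Link the imported target directly instead of $<BUILD_INTERFACE:mlx>.
# PRIVATE is an assumption; match the existing target_link_libraries call.
target_link_libraries(mlxdelegate PRIVATE mlx)

# Explicit ordering: finish the external MLX build before mlxdelegate
# starts building, so libmlx.a exists by link time.
add_dependencies(mlxdelegate mlx_external)

# The test targets that link mlx directly would get the same line, e.g.:
# add_dependencies(op_test_runner mlx_external)
```

The explicit add_dependencies call is redundant when the transitive dependency through mlx is honored, but it is harmless and documents the intent.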

The runners need nothing here — they get mlx from the package config (Part 3),
not this in-tree target.

Part 3 — Keep install rules + MLX_METALLIB_PATH working

Repoint the in-tree sources at the ExternalProject output; keep installing
to CMAKE_INSTALL_LIBDIR (= lib) so the downstream consumers keep finding
everything unchanged:

  • MLX_METALLIB_PATH cache var (:337-340)
  • install(... mlx.metallib ...) (:330-333) and libmlx.a / headers installs (:306-314)

The package config (executorch-config.cmake:124-165) and the copy helper
(Utils.cmake:195-213) then need no edits, since they only read the
installed lib/ or the cache var.
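A hedged sketch of the repointed install rule and cache variable follows. It reuses the _mlx_static_lib and _mlx_metallib variables from Part 1; the FILEPATH type, the doc string, and FORCE are assumptions and should mirror the existing definition at :337-340. To my knowledge install(TARGETS ...) does not accept imported targets, which is why the sketch installs the files directly.

```cmake
# Install the ExternalProject outputs into the same lib/ directory as before.
# The files only need to exist at install time, not at configure time.
install(FILES ${_mlx_static_lib} ${_mlx_metallib}
        DESTINATION ${CMAKE_INSTALL_LIBDIR})

# Cache variable read by the in-tree metallib copy helper.
# Type, doc string, and FORCE are assumptions; copy the current definition.
set(MLX_METALLIB_PATH "${_mlx_metallib}"
    CACHE FILEPATH "Path to mlx.metallib" FORCE)
```

The header install rules are left out here because their current form is not shown in this issue.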

One downstream fix is required — the pybindings wheel uses a hardcoded
build-tree path that the move breaks:

# setup.py:1080-1085 — copies the metallib into the wheel
BuiltFile(
    src_dir="%CMAKE_CACHE_DIR%/backends/mlx/mlx/mlx/backend/metal/kernels/",  # add_subdirectory layout
    src_name="mlx.metallib",
    dst="executorch/extension/pybindings/",
    dependent_cmake_flags=["EXECUTORCH_BUILD_MLX"],
),

Repoint src_dir to the new metallib location. (This is a separate mechanism
from the in-CMake copy at CMakeLists.txt:1161-1165, which uses
MLX_METALLIB_PATH and is fine.)

Confirm the metallib location. Today we copy from MLX's build tree
(mlx/mlx/backend/metal/kernels/mlx.metallib). Check whether MLX's own
install() emits the metallib: if so, point MLX_METALLIB_PATH,
BUILD_BYPRODUCTS, and setup.py src_dir at the install dir; if not, at
the build-tree path inside BINARY_DIR.
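One way to run that check after a first ExternalProject build is a small script for cmake -P. This is a sketch: the script name and the two -D variables are hypothetical, and the two candidate paths are simply the install-dir and build-tree locations discussed above.

```cmake
# check_metallib.cmake (hypothetical helper script)
# Usage:
#   cmake -DMLX_INSTALL_DIR=<dir> -DMLX_BINARY_DIR=<dir> -P check_metallib.cmake
foreach(required_var MLX_INSTALL_DIR MLX_BINARY_DIR)
  if(NOT DEFINED ${required_var})
    message(FATAL_ERROR "Pass -D${required_var}=<dir> before -P")
  endif()
endforeach()

set(candidates
    "${MLX_INSTALL_DIR}/lib/mlx.metallib"                        # if MLX's install() emits it
    "${MLX_BINARY_DIR}/mlx/backend/metal/kernels/mlx.metallib")  # build-tree location

# Report which candidate locations actually contain the metallib.
foreach(candidate IN LISTS candidates)
  if(EXISTS "${candidate}")
    message(STATUS "found:   ${candidate}")
  else()
    message(STATUS "missing: ${candidate}")
  endif()
endforeach()
```

Whichever path is reported as found is the one to use for MLX_METALLIB_PATH, BUILD_BYPRODUCTS, and the setup.py src_dir.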

Part 4 — Delete the patch machinery

Remove backends/mlx/patches/mlx_json.patch, the _mlx_patches loop
(:216-238), and the patches/ dir if empty.

Alternatives considered

  • FetchContent with EXCLUDE_FROM_ALL / scoped vars — doesn't fix the json
    collision (targets still land in ET's namespace), so the patch would remain.
  • Pre-providing nlohmann_json to MLX — effectively today's patch; keeps the
    coupling we want gone.
  • ExternalProject (recommended) — the only option giving MLX a fully
    separate configure/build, deleting the patch and immunizing ET against future
    shared-dep collisions. Costs: imported-target plumbing (incl. frameworks),
    build-time (not configure-time) availability of libmlx.a, and weaker
    incremental rebuilds (won't rebuild MLX on source change without
    BUILD_ALWAYS) — fine for a pinned submodule contributors don't edit.

Acceptance criteria

  • cmake --workflow --preset mlx-release builds with the MLX submodule
    pristine (git -C backends/mlx/third-party/mlx status reports a clean tree).
  • patches/mlx_json.patch and the patch loop are deleted.
  • mlxdelegate and mlx are valid, linkable targets.
  • ET delegate tests build/pass (-DEXECUTORCH_BUILD_TESTS=ON):
    op_test_runner, multi_thread_test_runner, mlx_mutable_state_test,
    strict_compile_test (they link mlx directly).
  • mlx.metallib lands next to _portable_lib.so and the qwen/gemma runners
    (CI check .github/workflows/mlx.yml:185-186 passes).
  • The pybindings wheel ships mlx.metallib next to _portable_lib.so:
    after python install_executorch.py, running an MLX .pte works.
  • find_package(executorch) consumers still resolve mlx +
    MLX_METALLIB_PATH (i.e. both files install to cmake-out/lib/).
  • make qwen3_5_moe-mlx and make gemma4_31b-mlx build/run unchanged.
  • MLX CI (.github/workflows/mlx.yml) is green; bumping the submodule needs
    no patch.
  • Delegate binary size is unchanged.

Out of scope (follow-ups)

  • Caching/prebuilt MLX artifacts to speed clean builds.
  • Documenting the new flow in backends/mlx/README.md.
  • Applying the same isolation to other backends that vendor shared deps.

Pointers

  • MLX integration: backends/mlx/CMakeLists.txt:113-340 — patch loop :216-238,
    add_subdirectory :239-242, options :165-214, mlx link :289-291,
    install/metallib :299-340
  • Patch: backends/mlx/patches/mlx_json.patch · submodule: .gitmodules
  • Root hookup: CMakeLists.txt:724-727, :1052-1053, :1161-1165
  • Metallib helper: tools/cmake/Utils.cmake:195-213
  • Package config (downstream mlx + MLX_METALLIB_PATH):
    tools/cmake/executorch-config.cmake:124-165
  • Pybindings wheel metallib copy (must repoint): setup.py:1080-1085
  • Presets: CMakePresets.json:339-377; tools/cmake/preset/mlx.cmake
  • Tests: backends/mlx/test/CMakeLists.txt
  • Runners (no change expected): examples/models/{qwen3_5_moe,gemma4_31b,llama,voxtral,voxtral_realtime,parakeet}/CMakeLists.txt
  • CI: .github/workflows/mlx.yml
