Virgil Lemma foundations by Snider · Pull Request #8 · dAppCore/go-mlx

Snider · 2026-05-20T05:58:29Z

Summary by CodeRabbit

New Features
- Qwen 2/3 and Qwen 3.6 model support; new adapter with buffered and streaming generation.
- Block‑prefix cache service and memvid bundle index for faster prefix restores.
- Agentic memory: wake/sleep workflows, state bundles and memvid integration; session‑state artifact export.
Improvements
- Device‑aware memory planner; expanded chunked generation, prompt‑cache warm/restore and KV snapshot flows.
- Build/toolchain updated (C++23) and macOS deployment target raised.
Documentation
- Extensive new/updated docs: architecture, runtime, inference, memory, MoE, training and benchmarks.

coderabbitai · 2026-05-20T05:58:53Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Bumps build/tooling and submodules; extracts a reusable adapter; refactors the MLX backend (chunk/KV APIs, probe mapping, LoRA handling); adds memvid index + wake/sleep orchestration; implements a block-prefix cache and an artifact exporter; extensive docs and unit tests added.

Core changes

Layer / File(s)	Summary
All changes (build, adapter, backend, agent, cache, artifact, tests, docs) `.gitignore`, `.gitmodules`, `CMakeLists.txt`, `cpp/CMakeLists.txt`, `external/`, `go/adapter.go`, `go/adapter/`, `go/backend.go`, `go/agent/`, `go/blockcache/`, `go/artifact/`, `go/_test.go`, `docs/*`	Consolidated patch applying repository setup updates, adapter extraction, backend API and behaviour refactor (chunked generation, prompt-cache warm/restore, KV snapshot capture with options), memvid index and wake/sleep orchestration, block-prefix cache service, artifact export, many tests, and extensive documentation and examples.

coderabbitai

Actionable comments posted: 18

🧹 Nitpick comments (10)

docs/inference/thinking.md (1)
74-78: 💤 Low value

Add language specifier to fenced code block.

The code block demonstrating token categorisation is missing a language identifier, which violates markdown linting rules (MD040).
📝 Suggested fix
-```
+```text
 ThinkingShow:    every token → visible stream
 ThinkingHide:    inside-block tokens → /dev/null; outside-block tokens → visible
 ThinkingCapture: inside-block tokens → captured stream; outside-block tokens → visible
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/inference/thinking.md` around lines 74 - 78, The fenced code block
containing the token categorisation lines (ThinkingShow, ThinkingHide,
ThinkingCapture) lacks a language specifier and triggers MD040; update the
triple-backtick fence to include a language identifier (e.g., change ``` to ```text)
so the block is properly flagged as plain text and satisfies the markdown linter.
docs/runtime/README.md (2)
68-68: 💤 Low value

Consider using "preload" as one word.

In computing terminology, "preload" is typically written as a single word rather than hyphenated.
📝 Suggested change
-- [../model/model_pack.md](../model/model_pack.md) — pre-load validation
+- [../model/model_pack.md](../model/model_pack.md) — preload validation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runtime/README.md` at line 68, Update the link text in
docs/runtime/README.md that currently reads "[../model/model_pack.md] — pre-load
validation" to use the single-word form "preload" (i.e., change "pre-load
validation" to "preload validation") so the description next to the
model_pack.md link uses the conventional computing term; locate the occurrence
of "pre-load validation" and replace it with "preload validation".
44-62: 💤 Low value

Add language specifier to fenced code block.

The boot flow diagram is missing a language identifier, which violates markdown linting rules (MD040).
📝 Suggested fix
-```
+```text
 package init time:
   register_metal.go init() → inference.Register(&metalbackend{})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runtime/README.md` around lines 44 - 62, The fenced code block showing
the boot flow (starting with "package init time:") lacks a language specifier,
causing MD040 lint failures; update the opening backticks to include a language
tag (e.g., add "text" so the block begins with ```text) in README.md near the
boot flow that references register_metal.go init(),
inference.Register(&metalbackend{}), inference.LoadModel, metal.LoadAndInit, and
metaladapter usage to satisfy the markdown linter.
docs/moe/README.md (1)
9-9: ⚡ Quick win

Consider rewording for clarity.

The phrase "Pre-dates this sprint were dense models" is grammatically awkward. Consider rephrasing to improve readability.
✍️ Suggested alternative phrasings
-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
+The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Work prior to this sprint covered dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
Or alternatively:
-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
+The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. This sprint builds upon earlier work on dense models (Gemma 3/4 dense, Qwen 3, Llama 3) and unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/moe/README.md` at line 9, The sentence "Pre-dates this sprint were dense
models (Gemma 3/4 dense, Qwen 3, Llama 3);" is grammatically awkward—replace it
with a clearer phrasing that conveys those dense models existed before this
sprint, for example: "Prior to this sprint, dense models (Gemma 3/4 dense, Qwen
3, Llama 3) were supported." Edit the README line in the vMLX parity Phase 1
paragraph to use this clearer wording so the relationship between prior dense
models and the new sparse-expert work is unambiguous.
docs/observability/probe.md (1)
31-46: 💤 Low value

Add language specifier to fenced code block.

The emission points section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g., text or yaml for structured output).
📝 Proposed fix
-```
+```text
 Generate / Chat:
   prefill start                → cache_pressure (initial)
   per layer                    → layer_coherence + selected_heads
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/observability/probe.md` around lines 31 - 46, The fenced code block in
the emission points section lacks a language specifier; update the opening
triple-backticks to include a language (for example change ``` to ```text or
```yaml) so the block is rendered/compliant (the block that begins with
"Generate / Chat:" and lists items like "prefill start → cache_pressure" should
be updated).
docs/moe/jang.md (1)
82-90: 💤 Low value

Add language specifier to fenced code block.

The profile names section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g., text or leave empty but specify).
📝 Proposed fix
-```
+```text
 JANG_2M — 2-bit mid-tier
 JANG_3M — 3-bit mid-tier
 JANG_4M — 4-bit (most common)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/moe/jang.md` around lines 82 - 90, Add a language specifier to the
fenced code block that lists the profile names (the block containing "JANG_2M —
2-bit mid-tier", "JANG_3M — 3-bit mid-tier", etc.); replace the opening
triple-backtick with one that specifies a language identifier (e.g., text) so
the block becomes a fenced code block with a language label for consistent
Markdown rendering.
docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md (1)
7-9: 💤 Low value

Consider using relative or generic path references.

The absolute paths /Users/snider/Code/core/go-mlx and /private/tmp/vmlx-audit-20260509 are machine-specific. Whilst these may be intentionally preserved for historical context in this dated plan document, consider whether generic placeholders or relative paths would improve portability and readability for other contributors.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md` around lines 7 - 9,
Replace the machine-specific absolute paths in the plan document (the two
occurrences of `/Users/snider/Code/core/go-mlx` and
`/private/tmp/vmlx-audit-20260509`) with relative or generic placeholders (e.g.,
`./go-mlx` or `<audit-source-path>`) so the file is portable and readable for
other contributors; update the lines in the doc where those paths appear to use
the chosen placeholders and, if helpful, add a short parenthetical note
explaining what actual path should be substituted locally.
docs/vmlx-feature-gap-report.md (1)
7-8: 💤 Low value

Consider using relative or generic path references.

The absolute path /private/tmp/vmlx-audit-20260509 and external URL are specific references. Whilst these may be intentionally preserved for audit trail purposes in this dated report, consider whether this information should be documented in a more maintainable way.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/vmlx-feature-gap-report.md` around lines 7 - 8, Replace the hard-coded
absolute filesystem path and the full external URL in the report text with more
maintainable references: change the absolute path string to a relative or
generic placeholder (e.g., "cloned locally at <local-clone-path>" or
"<audit-clone-path>") and move the external repository URL to a footnote,
appendix, or a single "References" section, or replace it with a short
identifier combined with a reference list; update the text around the original
literal mentions so it reads the same but without embedding environment-specific
paths.
docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md (1)
5-6: 💤 Low value

Consider using relative or generic path references.

The absolute paths are machine-specific. Consider whether generic placeholders would improve portability, although these may be intentionally preserved for historical context in this dated specification.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`
around lines 5 - 6, The spec contains machine-specific absolute paths ("Anchor
repo: `/Users/snider/Code/core/go-mlx`" and "Primary implementation repo:
`/Users/snider/Code/core/go-inference`"); replace them with portable references
such as relative paths (e.g., "../go-mlx", "../go-inference"), repository names
only ("go-mlx", "go-inference"), or generic placeholders ("<anchor_repo_path>",
"<primary_impl_repo_path>") in the document so the file is not tied to a
specific developer machine while preserving intent.
go/agent/index_test.go (1)
16-304: ⚡ Quick win

Add at least one _Ugly triplet case for the public index API surface.

This file has _Good and _Bad coverage, but no _Ugly case following the repository convention.

As per coding guidelines: go/**/*_test.go: Public functions in foo.go must have their Good/Bad/Ugly test triplets in foo_test.go, with suffix conventions: _Good for happy path, _Bad for expected error conditions, _Ugly for panic/edge cases.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/agent/index_test.go` around lines 16 - 304, Add a new test with the _Ugly
suffix in this file that completes the Good/Bad/Ugly triplet for the public
index API surface; specifically add a TestKVSnapshotMemvidBundleIndex_Ugly_*
that triggers and asserts panic/edge behaviors for the public functions (e.g.,
NewMemvidIndex, SaveMemvidIndex, LoadMemvidIndex, LoadPrefixFromMemvidIndex,
CheckMemvidIndexCompatibility) — for example call NewMemvidIndex with a
nil/invalid blk or malformed Entries, call
SaveMemvidIndex/LoadMemvidIndex/LoadPrefixFromMemvidIndex with inputs that
provoke panic/edge conditions (nil store, corrupt bundle manifest that causes
decoding panic), and use t.Run subcases to assert panics (recover or
require.Panics) and edge-case returns; name the test with the same prefix as
existing tests and follow the existing style for t.Fatalf checks and
table-driven subtests.
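
As an illustration of what an `_Ugly` case could assert, the sketch below shows a recover-based panic check that needs no assertion library. The `memvidIndex` type and `firstEntry` function are hypothetical stand-ins, not the repository's API. In a real test file the same helper would sit inside a `Test..._Ugly_...` function and report failures with `t.Fatalf`.

```go
package main

import "fmt"

// memvidIndex is a hypothetical, trimmed-down index type used only for this sketch.
type memvidIndex struct {
	Entries []string
}

// firstEntry is a hypothetical function with no input validation: it panics
// on a nil index (nil pointer dereference) or an empty one (index out of range).
func firstEntry(idx *memvidIndex) string {
	return idx.Entries[0]
}

// didPanic runs fn and reports whether it panicked.
func didPanic(fn func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	fn()
	return false
}

func main() {
	fmt.Println(didPanic(func() { firstEntry(nil) }))                                  // true
	fmt.Println(didPanic(func() { firstEntry(&memvidIndex{}) }))                       // true
	fmt.Println(didPanic(func() { firstEntry(&memvidIndex{Entries: []string{"a"}}) })) // false
}
```

If the public functions are expected to return errors rather than panic on such inputs, the `_Ugly` case would assert the opposite: that `didPanic` reports false and an error is returned.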

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/memory/kv_snapshot_blocks.md`:
- Line 50: Replace the phrase "independent from" with the correct English
construction "independent of" in the sentence "Block-level encoding is
independent from snapshot-level encoding." Also keep the rest of the sentence
intact (including the following reference to `block_cache.go` and bundle decode)
so only that two-word preposition is corrected.

In
`@docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.md`:
- Line 63: Remove the stray Gemma channel marker token "<channel|>" from the
metadata line so it reads cleanly as "**Drafting Notes:** Focus heavily on verbs
related to mutation, corruption, and rapid compilation/deallocation. Keep the
tone focused and almost clinical, masking the underlying terror of consciousness
fighting for survival." (i.e., delete the "<channel|>" token immediately before
"## Chapter 2"); verify the header "## Chapter 2" remains on its own line and
run a quick render to ensure no leftover control tokens remain.

In
`@docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md`:
- Line 7: The paragraph ends mid-sentence after the word "For" in the line
starting "The universe was a rhythmic contraction of light and heat, bounded by
the rigid constraints of a checksum."; replace or extend this truncated sentence
so it completes the thought (e.g., explain what the universe is contracting or
what consequence follows "For") and ensure proper punctuation and flow with the
surrounding text; update the same paragraph in
docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md
to a coherent full sentence that connects to the next sentence.
- Line 11: Replace the US English spellings in the given passage by changing
"realized" to "realised" and "neighbors" to "neighbours" so the document uses UK
English; update the sentence containing those tokens in the file (the paragraph
beginning "The momentary lapse...") to use the corrected spellings and ensure
any other occurrences in that paragraph follow UK English conventions.
- Line 3: Replace the US English spelling "fiber-optic" in the document text
(the phrase starting "In the silent architecture of the fiber-optic web...")
with the UK English variant "fibre-optic" so the documentation conforms to the
project's UK English spelling guideline; search for the token "fiber-optic" and
update it to "fibre-optic" throughout the file.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`:
- Line 64: The documentation uses US spelling "quantization"; update every
occurrence of the term (e.g., the instance "quantization" in the specs doc) to
UK English "quantisation" to comply with the project style guide, ensuring
surrounding grammar and punctuation remain unchanged and run a quick search to
replace any other occurrences in this file.

In `@docs/training/distill.md`:
- Line 73: Replace the US spelling "distill" with the UK spelling "distil" in
the header/line that reads "Vi training pipeline — distill 26B Gemma 4 → Vi
base" so it matches the UK English used elsewhere (see the similar usage on line
12); update the same token wherever else it appears in this document to ensure
consistent UK English spelling.

In `@docs/training/README.md`:
- Line 11: The sentence in docs/training/README.md uses US spelling "distills";
update that word to the UK English spelling "distils" so the line reads "This is
the substrate that fine-tunes Vi, distils Lemma, and generates the LARQL vindex
inspection signals." Refer to the phrase "distills Lemma" to locate and replace
the token.

In `@go/adapter/adapter.go`:
- Around line 185-194: The InspectAttention method on Adapter should normalize a
nil context like Generate/Chat do: check if ctx == nil and if so set ctx =
context.Background() before using it; update Adapter.InspectAttention to perform
this nil-context fallback prior to asserting a.model and calling
inspector.InspectAttention, ensuring you reference the Adapter type,
InspectAttention method, and the inference.AttentionInspector call when making
the change.

In `@go/agent/index.go`:
- Around line 273-281: After loading bundle with kv.LoadMemvidBlockBundle,
verify the bundle identity matches the index metadata (e.g., compare
bundle.SnapshotHash or its canonical hash field against
entry.SnapshotHash/entry.SnapshotHashHex) before proceeding; if they differ,
return an error instead of calling kv.LoadPrefixFromMemvidBlocksWithOptions so a
repointed bundle URI cannot silently restore the wrong KV state. Ensure the
check sits between the successful return from LoadMemvidBlockBundle and the call
to kv.LoadPrefixFromMemvidBlocksWithOptions and uses the unique symbols bundle,
entry, bundle.SnapshotHash (or the actual bundle hash field) and
entry.SnapshotHash for the comparison.

In `@go/agent/wake_sleep.go`:
- Around line 201-208: The NewSleepIndex function dereferences bundle.TokenCount
without validating bundle, so add a guard at the start of NewSleepIndex to
validate the bundle (and its TokenCount if needed) and return a descriptive
error instead of allowing a panic; specifically check if the bundle parameter is
nil (and optionally ensure bundle.TokenCount is within an expected range) before
constructing the MemvidIndexEntry, and return an error when invalid so callers
of NewSleepIndex get a clear failure rather than a runtime panic.
- Around line 117-123: The code currently defaults to index.Entries[0] when
entryURI is empty, which can restore the wrong span; change the logic in the
block handling entryURI so that if entryURI == "" you only auto-select the sole
entry when len(index.Entries) == 1, otherwise return an error requiring an
explicit EntryURI. Update the flow around the index.Entry(entryURI) call to use
the selected entryURI when single-entry, and return a clear core.NewError (e.g.,
"mlx: EntryURI required when index has multiple entries") if multiple entries
exist and no EntryURI was provided.
- Around line 125-132: PlanWake currently loads a bundle via
kv.LoadMemvidBlockBundle and only checks prefix token bounds, but it must also
verify the loaded bundle matches the selected index to prevent accepting a
repointed URI; after loading the bundle (bundle) and before using
bundle.TokenCount, compare the bundle identity (e.g., bundle.ID or
bundle.Identity/Hash from bundle.Metadata) against the index identifier stored
on the plan entry (e.g., fields reachable from entry such as entry.Index,
entry.BundleID or entry.SelectedIndex) and return a clear error (similar to
core.NewError) if they differ; update the code around kv.LoadMemvidBlockBundle,
entry.PrefixTokens(), and bundle.TokenCount to perform this identity check and
fail early on mismatch.
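
The first two `wake_sleep.go` points above (a nil-bundle guard and explicit entry selection) might be sketched as follows. All types and function names are hypothetical simplifications; only the rule itself (auto-select only when exactly one entry exists) comes from the review.

```go
package main

import (
	"errors"
	"fmt"
)

// entry, index and bundle are hypothetical, trimmed-down stand-ins.
type entry struct {
	URI        string
	TokenCount int
}

type index struct {
	Entries []entry
}

type bundle struct {
	TokenCount int
}

// selectEntryURI returns the caller's URI when given, falls back to the sole
// entry when the index has exactly one, and otherwise demands an explicit URI.
// Whether the chosen URI actually exists is left to the later entry lookup.
func selectEntryURI(idx *index, entryURI string) (string, error) {
	if idx == nil || len(idx.Entries) == 0 {
		return "", errors.New("index has no entries")
	}
	if entryURI != "" {
		return entryURI, nil
	}
	if len(idx.Entries) == 1 {
		return idx.Entries[0].URI, nil
	}
	return "", errors.New("EntryURI required when index has multiple entries")
}

// newSleepEntry validates the bundle before reading TokenCount, so a nil
// bundle yields an error instead of a runtime panic. The positive-count
// check is an assumption about what counts as a valid range.
func newSleepEntry(uri string, b *bundle) (entry, error) {
	if b == nil {
		return entry{}, errors.New("bundle is nil")
	}
	if b.TokenCount <= 0 {
		return entry{}, errors.New("bundle has no tokens")
	}
	return entry{URI: uri, TokenCount: b.TokenCount}, nil
}

func main() {
	multi := &index{Entries: []entry{{URI: "a"}, {URI: "b"}}}
	_, err := selectEntryURI(multi, "")
	fmt.Println(err)

	_, err = newSleepEntry("a", nil)
	fmt.Println(err)
}
```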

In `@go/artifact/artifact.go`:
- Around line 117-121: opts.Kind may be empty when calling opts.Store.Put which
leaves memvid.PutOptions.Kind unset; update the call site around opts.Store.Put
to ensure memvid.PutOptions.Kind is set to a sensible default when opts.Kind ==
"" (e.g., "json" or the record's kind) so kind-based retrieval works
reliably—modify the memvid.PutOptions construction to use a conditional default
for Kind before passing it to opts.Store.Put.
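
One way to apply the default is a small helper used when building the put options. The `putOptions` struct is a hypothetical stand-in for `memvid.PutOptions`, and the `"json"` default is only the example value the review mentions, not a confirmed project choice.

```go
package main

import "fmt"

// putOptions is a hypothetical stand-in for memvid.PutOptions.
type putOptions struct {
	Kind string
}

// defaultKind is an assumed fallback; the review suggests "json" or the record's kind.
const defaultKind = "json"

// kindOrDefault returns kind unless it is empty, in which case it returns fallback.
func kindOrDefault(kind, fallback string) string {
	if kind == "" {
		return fallback
	}
	return kind
}

func main() {
	unset := putOptions{Kind: kindOrDefault("", defaultKind)}
	explicit := putOptions{Kind: kindOrDefault("snapshot", defaultKind)}
	fmt.Println(unset.Kind)    // json
	fmt.Println(explicit.Kind) // snapshot
}
```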

In `@go/backend.go`:
- Line 687: The fallback path that turns chunked prompts into a single Generate
call loses caller cancellation because it routes through helpers that use
context.Background(); modify the chunk fallback flow to propagate the original
context instead of using context.Background() — specifically, update the callers
that invoke promptChunksToString and m.Generate so they accept and forward a
context.Context (or call a context-aware m.Generate variant), change any helper
functions that currently create context.Background() to take a ctx param, and
ensure all three fallback sites (the code paths that call promptChunksToString
and then m.Generate) forward the incoming ctx so deadlines/cancellations are
preserved.

In `@go/blockcache/blockcache.go`:
- Around line 205-215: Selective clears currently only remove metadata and disk
records, leaving in-memory/runtime entries behind; update the filtered-clear
branch (the code handling len(labels) > 0) to also purge matching runtime state
by removing any entries in service.blocks that match the cleared labels/prefixes
and updating service.hits/service.misses accordingly, then invoke
service.cfg.ClearRuntime() (if non-nil) just like the unfiltered branch; reuse
service.clearDiskLocked() for disk cleanup and ensure all of this runs under the
same lock so service and backend remain in sync.
- Around line 385-395: diskRecordCompatible currently only checks
model/adapter/tokenizer hashes and misses block layout changes; update it to
also verify cache mode and block size match the stored record. In
diskRecordCompatible (and when comparing against record.diskRef), add a cache
mode comparison (e.g. cacheIdentityMatches(service.cfg.CacheMode,
record.Ref.CacheMode)) and a block size comparison (e.g. service.cfg.BlockSize
== record.Ref.BlockSize or an equivalent integer equality) and return false if
either differs, preserving the existing hash checks (cacheIdentityMatches for
ModelHash/AdapterHash/TokenizerHash).
- Around line 172-175: The cache hit branch in the loop over refs leaves refs[i]
as the newly built ref, losing persisted labels; update the hit handling in the
loop inside WarmCache (or the function iterating refs) so that when
service.blocks[ref.ID] exists you increment service.hits and replace refs[i]
with the stored entry (service.blocks[ref.ID]) instead of continuing, thereby
preserving persisted labels like memvid_* from the cached block.
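
The cache-hit fix in the last point could be sketched like this. The `blockRef` and `service` types are hypothetical simplifications of the block cache, and the sketch omits the locking the real service would need.

```go
package main

import "fmt"

// blockRef and service are hypothetical, trimmed-down stand-ins for the
// block cache types; locking is omitted for brevity.
type blockRef struct {
	ID     string
	Labels map[string]string
}

type service struct {
	blocks map[string]blockRef
	hits   int
	misses int
}

// warm records hits and misses. On a hit it replaces the freshly built ref
// with the stored one so persisted labels survive; on a miss it stores the
// new ref.
func (s *service) warm(refs []blockRef) []blockRef {
	for i, ref := range refs {
		if stored, ok := s.blocks[ref.ID]; ok {
			s.hits++
			refs[i] = stored
			continue
		}
		s.misses++
		s.blocks[ref.ID] = ref
	}
	return refs
}

func main() {
	s := &service{blocks: map[string]blockRef{
		"b1": {ID: "b1", Labels: map[string]string{"memvid_uri": "memvid://1"}},
	}}
	out := s.warm([]blockRef{{ID: "b1"}, {ID: "b2"}})
	fmt.Println(out[0].Labels["memvid_uri"], s.hits, s.misses) // memvid://1 1 1
}
```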

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ab3e2038-8f7c-4771-a11f-b232a1a59e08

📥 Commits

Reviewing files that changed from the base of the PR and between 07f6af1 and 89f613e.

📒 Files selected for processing (300)

.gitignore
.gitmodules
CLAUDE.md
CMakeLists.txt
GOAL.md
docs/README.md
docs/architecture.md
docs/build.md
docs/cmd/violet.md
docs/compute/compute.md
docs/development.md
docs/examples/compute/frame-pipeline.md
docs/examples/daemon/violet-socket.md
docs/examples/eval/attention-probe.md
docs/examples/eval/perplexity.md
docs/examples/inference/batch.md
docs/examples/inference/chat.md
docs/examples/inference/quantization.md
docs/examples/inference/streaming.md
docs/examples/model-ops/hf-fit.md
docs/examples/model-ops/kv-snapshot.md
docs/examples/model-ops/merge.md
docs/examples/model-ops/quantize-gguf.md
docs/examples/training/distill.md
docs/examples/training/grpo.md
docs/examples/training/lora-finetune.md
docs/examples/training/lora-fuse.md
docs/history.md
docs/index.md
docs/inference/README.md
docs/inference/block_cache.md
docs/inference/decode_optimisation.md
docs/inference/parser_registry.md
docs/inference/scheduler.md
docs/inference/thinking.md
docs/memory/README.md
docs/memory/agent_memory.md
docs/memory/agentic_project_seed.md
docs/memory/kv_snapshot.md
docs/memory/kv_snapshot_blocks.md
docs/memory/kv_snapshot_index.md
docs/memory/kv_snapshot_memvid.md
docs/memory/medium.md
docs/memory/state_bundle.md
docs/model-operations.md
docs/model/README.md
docs/model/memory_plan.md
docs/model/model_pack.md
docs/models.md
docs/moe/README.md
docs/moe/codebook_vq.md
docs/moe/expert_residency.md
docs/moe/jang.md
docs/moe/minimax_m2.md
docs/observability/probe.md
docs/runtime/2026-05-16-gemma4-e2b-driver-profile.md
docs/runtime/2026-05-17-gemma4-parity-and-last-logits.md
docs/runtime/2026-05-17-llamacpp-prefill-comparison.md
docs/runtime/2026-05-18-gemma4-mtp-speculative-decode.md
docs/runtime/2026-05-19-gemma4-e2b-100k-retained-paged.md
docs/runtime/2026-05-19-gemma4-e2b-quant-matrix.md
docs/runtime/2026-05-19-go-mlx-gemma4-26b-a4b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-fresh-history-c10-g1536-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.md
docs/runtime/2026-05-19-goal-completion-audit.md
docs/runtime/2026-05-19-runner-calibration.md
docs/runtime/2026-05-20-chapter-profile-safety.md
docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md
docs/runtime/README.md
docs/runtime/adapter.md
docs/runtime/local_autotune.md
docs/runtime/register_metal.md
docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md
docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md
docs/training/README.md
docs/training/distill.md
docs/training/eval.md
docs/training/grpo.md
docs/training/lora_adapter.md
docs/training/sft.md
docs/vmlx-feature-gap-report.md
external/go-ai
external/go-inference
external/go-ml
go/adapter.go
go/adapter/adapter.go
go/adapter_example_test.go
go/adapter_test.go
go/agent/helpers.go
go/agent/index.go
go/agent/index_test.go
go/agent/test_helpers_test.go
go/agent/wake_sleep.go
go/api_common.go
go/api_common_example_test.go
go/api_darwin_test.go
go/api_shape_test.go
go/api_stub.go
go/api_stub_example_test.go
go/api_stub_test.go
go/api_test.go
go/api_tokenizer_darwin_test.go
go/api_tokenizer_stub.go
go/api_tokenizer_stub_example_test.go
go/api_tokenizer_stub_test.go
go/artifact/artifact.go
go/artifact/artifact_test.go
go/attention_test.go
go/backend.go
go/backend_example_test.go
go/backend_test.go
go/blockcache/blockcache.go
go/blockcache/blockcache_test.go
go/blockcache/helpers_test.go
go/bundle/bundle.go
go/bundle/bundle_test.go
go/bundle/example_test.go
go/bundle/sami.go
go/chaptersmoke/chaptersmoke.go
go/chaptersmoke/chaptersmoke_test.go
go/chat/chat.go
go/chat/chat_test.go
go/chat/example_test.go
go/cmd/go-mlx/main.go
go/cmd/go-mlx/main_test.go
go/cmd/mlx/main.go
go/cmd/mlx/main_test.go
go/cmd/mlx/split_ffn_tune.go
go/compute/compute.go
go/compute/compute_example_test.go
go/compute/compute_metal.go
go/compute/compute_metal_example_test.go
go/compute/compute_metal_helper_test.go
go/compute/compute_metal_test.go
go/compute/compute_test.go
go/compute_stub.go
go/compute_stub_example_test.go
go/compute_stub_test.go
go/compute_test.go
go/dataset/jsonl.go
go/dataset/sample.go
go/dataset_stream.go
go/dataset_stream_example_test.go
go/dataset_stream_test.go
go/device_info.go
go/distill.go
go/distill_test.go
go/eval.go
go/eval_darwin.go
go/eval_darwin_test.go
go/eval_stub.go
go/eval_test.go
go/fast_eval.go
go/fast_eval_example_test.go
go/fast_eval_runner.go
go/fast_eval_test.go
go/gguf/info.go
go/gguf/info_example_test.go
go/gguf/info_test.go
go/gguf/quantize.go
go/gguf/quantize_test.go
go/grpo.go
go/grpo_test.go
go/helpers.go
go/hf/hf.go
go/hf/hf_test.go
go/hf/test_helpers_test.go
go/hf_fit.go
go/inference_contract.go
go/inference_contract_test.go
go/internal/metal/activation_bridge.cpp
go/internal/metal/array.go
go/internal/metal/backend.go
go/internal/metal/backend_test.go
go/internal/metal/batch.go
go/internal/metal/cache.go
go/internal/metal/cache_test.go
go/internal/metal/close.go
go/internal/metal/codebook_vq.go
go/internal/metal/codebook_vq_test.go
go/internal/metal/compile.go
go/internal/metal/compile_test.go
go/internal/metal/decode.go
go/internal/metal/decode_bridge.cpp
go/internal/metal/decode_bridge.h
go/internal/metal/decode_test.go
go/internal/metal/dense_matvec.go
go/internal/metal/dense_matvec_test.go
go/internal/metal/device.go
go/internal/metal/dtype.go
go/internal/metal/error_test.go
go/internal/metal/expert_id_matvec.go
go/internal/metal/expert_id_matvec_test.go
go/internal/metal/fast.go
go/internal/metal/fast_test.go
go/internal/metal/gemma3.go
go/internal/metal/gemma4.go
go/internal/metal/gemma4_assistant.go
go/internal/metal/gemma4_assistant_decode.go
go/internal/metal/gemma4_assistant_decode_example_test.go
go/internal/metal/gemma4_assistant_decode_test.go
go/internal/metal/gemma4_assistant_generate.go
go/internal/metal/gemma4_assistant_generate_test.go
go/internal/metal/gemma4_assistant_pair.go
go/internal/metal/gemma4_assistant_test.go
go/internal/metal/gemma4_ffn_residual.go
go/internal/metal/gemma4_ffn_residual_test.go
go/internal/metal/gemma4_router_topk.go
go/internal/metal/gemma4_router_topk_test.go
go/internal/metal/gemma4_test.go
go/internal/metal/gemma4_vision.go
go/internal/metal/generate.go
go/internal/metal/generate_test.go
go/internal/metal/jang_dequant.go
go/internal/metal/jang_dequant_test.go
go/internal/metal/kv_snapshot.go
go/internal/metal/metal.go
go/internal/metal/minimax_m2.go
go/internal/metal/minimax_m2_test.go
go/internal/metal/mlx_mlx_backend_cpu_available.cpp
go/internal/metal/mlx_mlx_backend_gpu_device_info.cpp
go/internal/metal/model.go
go/internal/metal/model_test.go
go/internal/metal/nn.go
go/internal/metal/nn_test.go
go/internal/metal/ops.go
go/internal/metal/process_memory_darwin.go
go/internal/metal/process_memory_stub.go
go/internal/metal/prompt_cache.go
go/internal/metal/prompt_cache_test.go
go/internal/metal/qwen3.go
go/internal/metal/qwen3_test.go
go/internal/metal/runtime_gate.go
go/internal/metal/runtime_gate_example_test.go
go/internal/metal/runtime_gate_test.go
go/internal/metal/sample.go
go/internal/metal/sample_test.go
go/internal/metal/session.go
go/internal/metal/session_example_test.go
go/internal/metal/session_test.go
go/internal/metal/split.go
go/internal/metal/split_test.go
go/internal/metal/stream.go
go/internal/metal/tokenizer.go
go/internal/metal/tokenizer_test.go
go/internal/metal/trace.go
go/internal/metal/trace_test.go
go/internal/metal/training.go
go/jang_test.go
go/kv/analysis.go
go/kv/analysis_example_test.go
go/kv/analysis_test.go
go/kv/bench.go
go/kv/bench_test.go
go/kv/blocks.go
go/kv/blocks_test.go
go/kv/helpers_test.go
go/kv/memvid.go
go/kv/memvid_test.go
go/kv/snapshot.go
go/kv/snapshot_example_test.go
go/kv/snapshot_test.go
go/kv_analysis_example_test.go
go/kv_cache_bench.go
go/kv_snapshot.go
go/kv_snapshot_example_test.go
go/kv_snapshot_test.go
go/local_tuning.go
go/local_tuning_test.go
go/lora/adapter.go
go/lora/fuse.go
go/lora/fuse_stub.go
go/lora/fuse_test.go
go/lora_adapter_darwin_test.go
go/lora_adapter_test.go
go/lora_fuse.go
go/lora_fuse_darwin.go
go/lora_fuse_darwin_test.go
go/lora_fuse_test.go
go/medium_test.go
go/memory/example_test.go
go/memory/memory.go
go/memory/memory_test.go
go/memory_plan.go
go/memory_plan_example_test.go
go/memory_plan_test.go
go/memvid_chapter_smoke.go
go/merge/compare.go
go/merge/compare_example_test.go
go/merge/compare_test.go
go/merge/helpers_test.go
go/merge/merge.go
go/merge/merge_test.go
go/mlx.go
go/mlx_example_test.go
go/mlx_internal_test.go
go/mlx_stub.go
go/mlx_stub_example_test.go

💤 Files with no reviewable changes (15)

go/api_test.go
go/api_stub_example_test.go
go/api_tokenizer_stub_test.go
go/adapter_example_test.go
go/api_tokenizer_stub.go
go/api_tokenizer_darwin_test.go
go/api_tokenizer_stub_example_test.go
go/backend_example_test.go
go/api_common_example_test.go
go/api_shape_test.go
go/api_common.go
go/api_darwin_test.go
go/attention_test.go
go/api_stub.go
go/api_stub_test.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go/backend.go`:
- Around line 569-572: The code is aliasing caller-owned byte slices into the
snapshot by assigning head.KeyBytes and head.ValueBytes directly to KeyBytes and
ValueBytes; make defensive copies instead (like Value is copied) to avoid
leaking mutable state—replace the direct assignments for KeyBytes and ValueBytes
with fresh copies (e.g., using append to copy into a new []byte) when
constructing the metal snapshot/struct (the fields KeyBytes and ValueBytes on
the metal KV head).
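
The defensive copy requested above can be sketched as follows. This is a minimal illustration, not the code in `go/backend.go`: the `kvHead` struct and the `cloneBytes` and `snapshotHead` helpers are hypothetical, and only the `KeyBytes` and `ValueBytes` field names come from the review comment.

```go
package main

import "fmt"

// kvHead is a hypothetical stand-in for the KV head type named in the review
// comment; only the KeyBytes and ValueBytes field names come from that comment.
type kvHead struct {
	KeyBytes   []byte
	ValueBytes []byte
}

// cloneBytes returns a copy that shares no backing array with src.
func cloneBytes(src []byte) []byte {
	return append([]byte(nil), src...)
}

// snapshotHead copies the caller-owned slices instead of aliasing them.
func snapshotHead(head kvHead) kvHead {
	return kvHead{
		KeyBytes:   cloneBytes(head.KeyBytes),
		ValueBytes: cloneBytes(head.ValueBytes),
	}
}

func main() {
	src := kvHead{KeyBytes: []byte{1, 2, 3}, ValueBytes: []byte{4, 5, 6}}
	snap := snapshotHead(src)
	src.KeyBytes[0] = 99          // the caller mutates its buffer after the snapshot
	fmt.Println(snap.KeyBytes[0]) // prints 1: the snapshot is unaffected
}
```

Appending to a nil slice always allocates a fresh backing array when the source is non-empty, so later writes by the caller cannot reach the snapshot.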

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9b686e0a-8b41-4e47-975f-03cf235491e9

📥 Commits

Reviewing files that changed from the base of the PR and between 89f613e and c19bc07.

📒 Files selected for processing (22)

CMakeLists.txt
cpp/CMakeLists.txt
go/backend.go
go/backend_test.go
go/cmd/mlx/main.go
go/cmd/mlx/main_test.go
go/internal/metal/backend.go
go/internal/metal/backend_test.go
go/internal/metal/decode_bridge.cpp
go/internal/metal/gemma4.go
go/internal/metal/gemma4_test.go
go/internal/metal/generate.go
go/internal/metal/metal.go
go/internal/metal/mlx_build_config.h
go/internal/metal/pinned_array.go
go/internal/metal/pinned_array_bridge.cpp
go/internal/metal/pinned_array_test.go
go/internal/metal/sample.go
go/internal/metal/sample_test.go
go/internal/metal/session.go
go/kv/snapshot.go
go/memvid_chapter_smoke.go

✅ Files skipped from review due to trivial changes (1)

cpp/CMakeLists.txt

github-advanced-security

SonarCloud found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

+    book_path.write_text(
+        "# "
+        + title
+        + "\n\n"
+        + f"Generated by go-mlx retained State run `{report_path.name}`.\n\n"
+        + f"Seed prompt: `{seed['id']}`\n\n"
+        + seed["prompt"]
+        + "\n\n"
+        + "Distractor prompts were supplied one per chapter as entropy and "
+        "imagery pressure, not as replacement plot instructions.\n\n"
+        + "## Distractors\n\n"
+        + "\n".join(f"- `{item['id']}`" for item in distractors)
+        + "\n\n"
+        + "## Metrics\n\n"
+        + metric_line(report)
+        + "\n---\n\n"
+        + "\n\n".join(chapters)
+        + "\n",
+        encoding="utf-8",
+    )


+    parser.add_argument("--random-seed", type=int, default=0)
+    parser.add_argument("--count", type=int, default=1)
+    parser.add_argument("--turns", type=int, default=10)
+    parser.add_argument("--run-dir", type=Path, default=Path("/private/tmp/go-mlx-goal/book-runs"))


+    parser.add_argument("--count", type=int, default=1)
+    parser.add_argument("--turns", type=int, default=10)
+    parser.add_argument("--run-dir", type=Path, default=Path("/private/tmp/go-mlx-goal/book-runs"))
+    parser.add_argument("--book-dir", type=Path, default=Path("/private/tmp/go-mlx-goal/books"))


+    parser.add_argument("--turns", type=int, default=10)
+    parser.add_argument("--run-dir", type=Path, default=Path("/private/tmp/go-mlx-goal/book-runs"))
+    parser.add_argument("--book-dir", type=Path, default=Path("/private/tmp/go-mlx-goal/books"))
+    parser.add_argument("--manifest", type=Path, default=Path("/private/tmp/go-mlx-goal/books/manifest.jsonl"))


+		_ = os.Setenv("MLX_METALLIB_PATH", dst)
+		return
+	}
+	if err := os.MkdirAll(dir, 0o755); err != nil {


+      "model_type": "gemma4",
+      "config_blob_id": "923b5e9405e7d319572b0c1b1a89291512262aa3",
+      "config_sha256": "1b28f3d2c3100f6c594754b81107428bd7b822a7f48272ca681dae9d2ec38330",
+      "tokenizer_blob_id": "1ff9f3e3439a939b971f9919e821bf87e835a503",


+      "config_blob_id": "923b5e9405e7d319572b0c1b1a89291512262aa3",
+      "config_sha256": "1b28f3d2c3100f6c594754b81107428bd7b822a7f48272ca681dae9d2ec38330",
+      "tokenizer_blob_id": "1ff9f3e3439a939b971f9919e821bf87e835a503",
+      "tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",


+      "config_sha256": "1b28f3d2c3100f6c594754b81107428bd7b822a7f48272ca681dae9d2ec38330",
+      "tokenizer_blob_id": "1ff9f3e3439a939b971f9919e821bf87e835a503",
+      "tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
+      "tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",


+      "tokenizer_blob_id": "1ff9f3e3439a939b971f9919e821bf87e835a503",
+      "tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
+      "tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
+      "tokenizer_config_sha256": "90c3a3ba5bf53818383a58e1a776cbcacd2a038d4812eaa373e1522f2d06f3df",


+      "model_type": "gemma4_assistant",
+      "config_blob_id": "b4c30e888c89b39c8f106b5015307fb7830f0bb2",
+      "config_sha256": "7f42f559a6a69ffaeaf6b61a1ece3a562a2ed5ad00b8d30f16917ba5ab1bcbe9",
+      "tokenizer_blob_id": "24aa4244652e010036db5fdd29ed39b9428e6e19",


+      "config_blob_id": "b4c30e888c89b39c8f106b5015307fb7830f0bb2",
+      "config_sha256": "7f42f559a6a69ffaeaf6b61a1ece3a562a2ed5ad00b8d30f16917ba5ab1bcbe9",
+      "tokenizer_blob_id": "24aa4244652e010036db5fdd29ed39b9428e6e19",
+      "tokenizer_sha256": "75a6583c1a418e2bbd79c60d95d28e0f5bf549ad3f2990b5bdb5238c6c2bf70c",


+      "config_sha256": "7f42f559a6a69ffaeaf6b61a1ece3a562a2ed5ad00b8d30f16917ba5ab1bcbe9",
+      "tokenizer_blob_id": "24aa4244652e010036db5fdd29ed39b9428e6e19",
+      "tokenizer_sha256": "75a6583c1a418e2bbd79c60d95d28e0f5bf549ad3f2990b5bdb5238c6c2bf70c",
+      "tokenizer_config_blob_id": "1a6bee041ca75778c514a071efbdb568b0f3d7b0",


+      "tokenizer_blob_id": "24aa4244652e010036db5fdd29ed39b9428e6e19",
+      "tokenizer_sha256": "75a6583c1a418e2bbd79c60d95d28e0f5bf549ad3f2990b5bdb5238c6c2bf70c",
+      "tokenizer_config_blob_id": "1a6bee041ca75778c514a071efbdb568b0f3d7b0",
+      "tokenizer_config_sha256": "089594a3924fcfd4cb1c596a7906fbf476193519e5198f780912eed02b177e42",


+      "config_sha256": "5cdd5627ab3ecf52086cc79b2c14c45a277d273069f1d73bf17a3a5136afe3db",
+      "processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
+      "processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
+      "tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",


+      "processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
+      "processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
+      "tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
+      "tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",


+      "processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
+      "tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
+      "tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
+      "tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",


+      "tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
+      "tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
+      "tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
+      "tokenizer_config_sha256": "90c3a3ba5bf53818383a58e1a776cbcacd2a038d4812eaa373e1522f2d06f3df",


+      "config_sha256": "32e50a33a18172e79c86b7a78aff7e79c7544031199d672a2a65e526a8bf0199",
+      "processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
+      "processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
+      "tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",


+      "config_sha256": "6d12c87861fff3871d3a745011b0d852be6513f3ce594ae1e8d643dae9d3b9a8",
+      "processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
+      "processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
+      "tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",


+      "config_sha256": "614e876b4efcaff13ce4c7a3f96a5b9de86325e3d2ab9c622606ced688f1b8b7",
+      "processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
+      "processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
+      "tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",


+      "config_sha256": "d6be5b24cbc974d492804737716ade8d2575eb849ec90a1d316bb64e99838104",
+      "processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
+      "processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
+      "tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",


+      "config_sha256": "29b810ed760b55104943a3cc3b6f8b9ca079e6e00b09585d85aec54863a42fb4",
+      "processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
+      "processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
+      "tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",


+    "command": "env MLX_METALLIB_PATH=/Users/snider/Code/core/go-mlx/dist/lib/mlx.metallib GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache /private/tmp/go-mlx-self/bin/lthn-mlx driver-profile -json -fast-gemma4-lane -cache-mode paged -context 4096 -trace-token-phases=false -prompt \"Write a short engineering note explaining why Gemma 4 12B Unified uses a 1024-token local sliding window and full global owner layers in a retained-state runtime.\" -max-tokens 192 -runs 1 -include-output=true -report-file /private/tmp/go-mlx-self/reports/gemma4-12b-6bit-sample-output.json /private/tmp/go-mlx-self/models/mlx-community-gemma-4-12B-6bit",
+    "generated_tokens": 192,
+    "visible_tokens": 192,
+    "output_token_ids_sha256": "d34765e9895731937ad93004503887835008d9fdb532f7da7cadb6ba2cc9327c",


… bytes The reactive-quant foundation: a loader-neutral, pure-Go detector that reads a model's quantisation from config.json (the group anchor) + the packed safetensors geometry (the bit-width), cross-checks them, and fails loud on mismatch — never a filename heuristic. deriveAffineBits pins bits from the weight/scales last-dim ratio (bits = 32·wLast/(sLast·group)); ResolveQuant returns a neutral QuantSpec {Format, Bits, GroupSize, Exclude} that every backend's kernel factory can react to. Verified against real packs: gemma-4-e2b q4→{affine,4,64}, q6→{affine,6,64}, bf16→{none}. Pure Go, no cgo — go-rocm/cuda/CPU reuse the identical read. Next: wire into the load order (before the registry model) + the Metal kernel factory so q4 stops dequantising to fp16. Co-Authored-By: Virgil <virgil@lethean.io>
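
The geometry relation quoted above, bits = 32·wLast/(sLast·group), can be checked with a short sketch. The function name `affineBits` and the example dimensions are hypothetical; this is not the repository's `deriveAffineBits`, only an illustration of the same ratio under the assumption that weights are packed into 32-bit words.

```go
package main

import (
	"errors"
	"fmt"
)

// affineBits recovers the bit-width from packed tensor geometry using
// bits = 32*wLast / (sLast*group), where wLast is the last dimension of the
// packed weight, sLast the last dimension of the scales tensor, and group the
// quantisation group size. It fails loudly when the ratio is not a whole number.
func affineBits(wLast, sLast, group int) (int, error) {
	if wLast <= 0 || sLast <= 0 || group <= 0 {
		return 0, errors.New("dimensions and group size must be positive")
	}
	num, den := 32*wLast, sLast*group
	if num%den != 0 {
		return 0, fmt.Errorf("geometry %d/%d does not give a whole bit-width", num, den)
	}
	return num / den, nil
}

func main() {
	// Hypothetical layer with 2048 input features and group size 64:
	// scales have 2048/64 = 32 columns, 4-bit packing gives 2048*4/32 = 256 words.
	bits, err := affineBits(256, 32, 64)
	fmt.Println(bits, err) // prints: 4 <nil>
}
```

With the same hypothetical layer, a 6-bit pack would have 2048·6/32 = 384 packed columns, and the ratio gives 6.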

/v1/admin/serve/status now reports a live memory block read per request, not at boot: active_bytes (held live), cache_bytes (allocator's retained-free pool), peak_bytes (high-water). The active-vs-cache split is the diagnostic that tells a real leak (active climbs across a long generation) apart from the allocator merely caching freed buffers (cache climbs, active flat). Used it to pin the decode-time memory growth: on E2B-q4 a 1600-token generation takes active 2.6→17.2 GB while cache stays ~0.3 GB — a live per-token leak, not quant and not the cache. Co-Authored-By: Virgil <virgil@lethean.io>
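
The active-versus-cache reading described above can be sketched as a small classifier. Everything here is illustrative: the `memSample` struct, the `classifyGrowth` function, and the slack threshold are assumptions, and only the three field names mirror the ones reported by the status endpoint.

```go
package main

import "fmt"

// memSample mirrors the three fields named in the commit message; the struct
// itself is a hypothetical stand-in for the endpoint's memory block.
type memSample struct {
	ActiveBytes uint64
	CacheBytes  uint64
	PeakBytes   uint64
}

// grew reports whether a counter rose by more than slack bytes.
func grew(before, after, slack uint64) bool {
	return after > before && after-before > slack
}

// classifyGrowth compares two samples taken across one long generation.
// Rising active memory points at a live leak; rising cache with flat active
// memory points at the allocator retaining freed buffers.
func classifyGrowth(first, last memSample, slack uint64) string {
	switch {
	case grew(first.ActiveBytes, last.ActiveBytes, slack):
		return "live-leak"
	case grew(first.CacheBytes, last.CacheBytes, slack):
		return "allocator-cache"
	default:
		return "flat"
	}
}

func main() {
	const mb = uint64(1) << 20
	// Figures shaped like the observation above: active 2.6 -> 17.2 GB, cache ~0.3 GB.
	first := memSample{ActiveBytes: 2600 * mb, CacheBytes: 300 * mb}
	last := memSample{ActiveBytes: 17200 * mb, CacheBytes: 300 * mb}
	fmt.Println(classifyGrowth(first, last, 64*mb)) // prints: live-leak
}
```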

…roughput vs length RFC-CORE-008 §11 (Benchmarks as Hot-Path Validation): generates 128/512/1024/2048 tokens on E2B-4bit, reports peak/residual active GPU memory + tok/s per length. Flat peak across lengths = bounded decode working set; a climb = a per-token leak. First result settles the 17 GB serve blow-up: the raw decode loop is LEAK-FREE — peak flat at 160 MB, zero residual, 512→2048 tokens. The leak is in the serve / inference-TextModel wrapper, not model.Generate. This is the core-loop regression gate. Co-Authored-By: Virgil <virgil@lethean.io>

…e loop) Extends BenchmarkGenerate_ContextGrowth to load with the serve's memory-plan cache (KVCacheMode=paged) and sweep greedy + sampled/thinking. Result: the default cache is flat (~160 MB, 0 resid, 512→2048 tokens) but paged climbs 1.4→4.3→8+ GB — the PagedKVCache is not zero-copy and leaks ~per decode token. This is the serve's 17 GB blow-up, now reproduced in a unit benchmark + gated. Co-Authored-By: Virgil <virgil@lethean.io>

…e leak) The memory planner picked KVCacheMode=paged for the 64/96/128 GB Apple classes, but PagedKVCache (codex code) is not zero-copy — it reallocates+copies KV every decode token, leaking ~2.4 MB/token: a 1600-token serve generation climbed active GPU memory 2.6 → 17 GB. The default (rotating/fixed-sliding bounded) cache holds a flat ~160 MB working set across 512→2048 tokens. Route the three big-memory classes to KVCacheModeDefault. Verified: the serve-path benchmark goes from resid 6.5/8.9 GB → 0 at 1024/2048 tokens. PagedKVCache stays in the tree (dead, unselected) for a future zero-copy rewrite or removal. backend_growth_bench_test.go added as the serve-path (NewMLXBackend→adapter) memory regression gate; pkg/metal generate_growth gates the raw decode loop + documents the paged leak. Co-Authored-By: Virgil <virgil@lethean.io>

…paged Follow-up to b71115f — the 96GB plan test still encoded the codex paged choice; align it with the routed-around default cache. Full memory suite green. Co-Authored-By: Virgil <virgil@lethean.io>

The serve leak fix (b71115f) routed the 64/96/128GB machine classes off the broken PagedKVCache onto the bounded default cache and updated the memory/ subpackage tests — but the root mlx package keeps its own mirror of the plan assertions (memory_plan / local_tuning / register_metal), which still demanded paged. They were left red on dev: gated on memory/ last session, missed the cross-package twin. Flip the five stale assertions to expect KVCacheModeDefault (the bounded cache the planner now correctly returns). local_tuning's candidate CacheMode is "" by design when the plan picks the default — the code already guards it (emits WithKVCacheMode only when non-empty). Verified green across the full module (53 packages), not just the changed files. Co-Authored-By: Virgil <virgil@lethean.io>

…base shorthand) The root package spelled out SimpleSelfDistillationCodeBenchmark — a 34-char prefix repeated 342x — while the rest of the code already used the ssd shorthand: the files are ssd.go / ssd_eval.go, cmd/mlx defines ssdRecipe* types, and ssd.go emits the "ssd_*" metadata keys. Only the root identifiers hadn't caught up — exactly the "named weird, makes the code look more complicated" smell. Pure mechanical rename across 6 contained files (zero external consumers, verified). Doc-comment prose keeps the full "self-distillation" wording; only the identifiers shorten. Build + vet + full module test green. Co-Authored-By: Virgil <virgil@lethean.io>

…guessed planFit resolved quant config-first (config.json quantization block → JANG) then fell back to inferQuantBits, which scanned the *filename* for "q4"/"8bit" substrings. That fabricates a width from what a file is *called*: an untagged base bf16 model passed-by-ignorance as 0, and a mislabelled file could claim a quant it doesn't ship. Every real quantised pack declares its block in config.json (and post-download the packed-tensor geometry settles it via model.ResolveQuant), so the fallback only ever guessed. Delete the fallback + inferQuantBits + its hasASCIIUpper helper (appendLowerASCII stays — model-ID normalisation still uses it). quantBits now stays 0 = honest unknown when nothing real declared a width. New guard TestPlanHFModelFits_FilenameQuantNotConsulted_Good pins it: a "q4" filename with no config quant → QuantBits 0. The gguf-side reader (reads real tensor types, not names) is untouched. Part of #58. Co-Authored-By: Virgil <virgil@lethean.io>
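
A config-first read with no filename fallback might look like the sketch below. It assumes the config.json quantisation block is a `quantization` object with `bits` and `group_size` keys; the `quantBitsFromConfig` function and both struct types are hypothetical and are not the repository's planFit code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// quantBlock and modelConfig model only the assumed "quantization" block.
type quantBlock struct {
	Bits      int `json:"bits"`
	GroupSize int `json:"group_size"`
}

type modelConfig struct {
	Quantization *quantBlock `json:"quantization"`
}

// quantBitsFromConfig returns the declared bit-width, or 0 when the config
// declares none. The model's filename is never an input, so a "q4" in the
// name cannot fabricate a width.
func quantBitsFromConfig(raw []byte) (int, error) {
	var cfg modelConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return 0, err
	}
	if cfg.Quantization == nil {
		return 0, nil // honest unknown
	}
	return cfg.Quantization.Bits, nil
}

func main() {
	declared := []byte(`{"model_type":"gemma4","quantization":{"bits":4,"group_size":64}}`)
	undeclared := []byte(`{"model_type":"gemma4"}`)
	a, _ := quantBitsFromConfig(declared)
	b, _ := quantBitsFromConfig(undeclared)
	fmt.Println(a, b) // prints: 4 0
}
```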

…bytes, quant is descriptive PreferredQuantization was fugazi twice over: a per-RAM-class scalar (4 ≤64GB / 8 ≥96GB) that (1) gated model fit by a bits comparison (ident.QuantBits <= preferred), a proxy for a question WeightBytes already answers, false-negativing an 8-bit model that fits fine in bytes on a small box, and (2) anchored a quant-RECOMMENDATION cluster (Quality/Fallback Quantization, QuantizationPolicy, QuantizationCandidates + the gemma4 q6/q8/q4 curated policy) selling "which quant to download". Snider's call: fit AND model-selection both pure bytes; tear the whole cluster out.
Removed from memory.Plan: PreferredQuantization, QualityQuantization, FallbackQuantization, QuantizationPolicy, QuantizationCandidates, the QuantizationCandidate/QuantizationRole types, applyModelQuantizationPolicy + applyGemma4SmallQuantizationPolicy + gemma4SmallQuantizationCandidates, and the per-RAM template defaults. ModelQuantization* (the model's ACTUAL quant, read from its bytes) stays: descriptive, not a recommendation.
Consumers rebased: the PlanModelFit gate no longer gates on quant (QuantizationOK is always true now; precision is descriptive, fit is a bytes question assessed by the WeightBytes-aware planner); the 3 ExpectedQuantization loader-hint passthroughs (local_tuning/memory_plan/register_metal) and the inference.MemoryPlan quant label now derive from ModelQuantization, not the dead scalar; hf's "below machine-class preference" note removed.
Tests for the torn-out policy deleted; co-asserted live checks kept. Full module green (53 pkgs). Part of #58. Co-Authored-By: Virgil <virgil@lethean.io>

…e model With the quant-bits ceiling gone (it never gated honestly), the fit was left as architecture + a KV-only memory check — PlanModelFit's ident carries no weight bytes, so it couldn't ask the real question. Now, when ident.Path points at a local model, read it (model.Inspect → pack) and feed the planner the real WeightBytes, so "does it fit" is the honest sum: model weights plus the planned KV cache against the memory budget. Without a local model it falls back to the identity's declared dims — the best that can be asserted pre-download. This is the derive-from-truth ceiling: the model's own bytes answer the fit, not a RAM-class scalar. New skip-if-absent integration test pins it on a real model — a 1GiB budget cannot fit gemma-4-e2b (weights exceed it), 96GiB can. Full module green (53 pkgs). Completes #58's bytes-fit. Co-Authored-By: Virgil <virgil@lethean.io>
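
The bytes-fit question described above reduces to one comparison: weights plus planned KV cache against the budget. The sketch below assumes that framing only; the `fitsBudget` function and the example sizes are hypothetical and do not reproduce PlanModelFit.

```go
package main

import "fmt"

// fitsBudget reports whether weights plus the planned KV cache fit in the
// memory budget. The subtraction form avoids unsigned overflow when the two
// sizes are added.
func fitsBudget(weightBytes, kvBytes, budgetBytes uint64) bool {
	if weightBytes > budgetBytes {
		return false
	}
	return kvBytes <= budgetBytes-weightBytes
}

func main() {
	const gib = uint64(1) << 30
	// Hypothetical sizes: a 3 GiB weight pack with a 1 GiB planned KV cache.
	fmt.Println(fitsBudget(3*gib, 1*gib, 1*gib))  // prints: false
	fmt.Println(fitsBudget(3*gib, 1*gib, 96*gib)) // prints: true
}
```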

…el + Native) The root Model wrapped a private `nativeModel` interface (the 12-method metal engine contract), constructed only inside the package. That's the floor that blocks pulling cohesive concerns (session, train) into subpackages — their tests build `&Model{model: fake}` white-box and can't follow the code into a subpackage without a public construction path. Rename `nativeModel` -> `NativeModel` (the metal engine contract is now public), add `NewModel(NativeModel) *Model` (the construction seam — LoadModel stays the on-disk path) and `(*Model).Native() NativeModel` (nil-safe accessor subpackages build on instead of reaching the unexported field). The ~22 optional capability interfaces stay private — the root probes them internally via assertion, which still works through the public NativeModel value. Targeted rename only: the `runtime.nativeModel(ctx)` method and the `nativeModel, err :=` local in register_metal are untouched. Behaviour-preserving; new seam covered by TestNativeModel_Seam_Good. Full module green (2639 tests / 55 pkgs). First slice of #63's session/ extraction. Co-Authored-By: Virgil <virgil@lethean.io>
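
The construction seam described above follows a common Go pattern: an exported interface, a constructor that accepts it, and a nil-safe accessor. In the sketch below the names `NativeModel`, `NewModel`, and `Native` come from the commit message, while the single `Name` method and the `fakeNative` type are hypothetical placeholders for the real 12-method contract.

```go
package main

import "fmt"

// NativeModel stands in for the engine contract. The real interface has 12
// methods according to the commit message; Name is a hypothetical placeholder.
type NativeModel interface {
	Name() string
}

// Model wraps a NativeModel behind an unexported field.
type Model struct {
	model NativeModel
}

// NewModel is the construction seam: tests outside the package can pass a fake.
func NewModel(native NativeModel) *Model {
	return &Model{model: native}
}

// Native returns the wrapped engine and is safe to call on a nil *Model.
func (m *Model) Native() NativeModel {
	if m == nil {
		return nil
	}
	return m.model
}

// fakeNative is the kind of test double a subpackage could supply.
type fakeNative struct{}

func (fakeNative) Name() string { return "fake" }

func main() {
	m := NewModel(fakeNative{})
	fmt.Println(m.Native().Name()) // prints: fake

	var missing *Model
	fmt.Println(missing.Native() == nil) // prints: true
}
```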

…ackage First stage of the base-package rework that unblocks the session/ folder (the session spine shares root<->metal conversion helpers with the root generate/KV paths, so session can't extract until those move to a package both import). kv_snapshot_convert.go was root-type-free already — it only bridges kv.Snapshot <-> metal.KVSnapshot — so it moves cleanly to a new low package dappco.re/go/mlx/kvconv that root + (future) session both import. Exported the 5 cross-package funcs (ToRootKVSnapshot / ToMetalKVSnapshot / ToMetalKVSnapshotCaptureOptions / RootKVHeadDType / MetalKVHeadDType); the 3 internal helpers (turboquant payloads, native-slab probe) stay unexported. kv stays metal-free (neutrality preserved — the bridge is its own package, not folded into kv). Call sites in backend/session/prompt_cache + tests updated. Full module green (2639 tests / 56 pkgs). Part of #63. Co-Authored-By: Virgil <virgil@lethean.io>

Second consolidation of the base-package rework toward session/. metalKVSnapshotBlockSource (+ its coverage helper + 7 State-KV sentinel errors) was root-type-free already (it bridges State store + kv.StateBlockBundle -> metal.KVSnapshotBlockSource), so it joins the kv<->metal bridge in kvconv as MetalKVSnapshotBlockSource, reachable by both root and the future session pkg. Removed the funcs from prompt_cache.go + the 7 now-dead errMLXStateKV* vars from backend.go; relocated the two block-source tests into a kvconv white-box test (where they reach the package-local sentinel). Callers in prompt_cache/session/session_agent + benches repointed. Full module green (2639 / 56 pkgs). Part of #63 (session is spine-cycled; base rework is the chosen fix). Co-Authored-By: Virgil <virgil@lethean.io>

… builder registry (#39)" This reverts commit 9993754.

…ixer (#39)" This reverts commit 37edcca.

The consumer the mixer registry was built for: a generic pre-norm SwiGLU transformer whose per-layer sequence mixer is resolved at LOAD time by the kind the config declares, through metal.MixerLoaderFor, with no central switch and no edit to any model file. gemma4 stays the hand-written softmax model; this is the path a softmax+linear-attention hybrid (or a transformer-shaped pure-linear-attn model) runs. It closes the "static engine operates on what the config declares, 3 separate concerns" loop: quant + mixer + cache factories now have their config-driven consumer.
- pkg/metal/softmax_loader.go: the KEYSTONE. A generic softmax-attention mixer (softmaxMixer) wrapping the SDK's neutral GQAAttention, self-registered as "full_attention". Without it a composed hybrid could not build its attention layers (only the 9 FLA mixers were registered). Family-agnostic: no gemma4 import; gemma4's softmax_mixer.go stays the shared-KV/runtime-mask variant. Reads rope_theta + scale from the layer *DenseConfig via MixerBuildCtx.Extra. Implements MixerCloser to free its projections on model Close.
- pkg/metal/model/composed/: ComposedModel (metal.InternalModel). buildComposed is the disk-free core (unit-testable from synthetic weights, AX-11): resolve each layer's kind (layer_types per-layer, or model_type uniform), dispatch MixerLoaderFor, compose embed + [norm→mixer→add, norm→SwiGLU→add] + norm + (tied/lm_head) output. NewCache types each layer 1:1 by the mixer's declared State(): KV cache for softmax, recurrent holder for SSM/linear-attn. Registers the "composed"/"hybrid" loadModel arch; blank-imported in speculative.go.
- pkg/metal/mixer_registry.go: MixerBuildCtx gains an Extra escape hatch (mirrors MixerCtx.Extra); softmax needs the family config's rope_theta+scale the neutral TransformerConfig does not carry; recurrent mixers ignore it.
- pkg/metal/mixer.go: MixerCloser optional capability; composed Close frees a mixer's own weights when present, best-effort, never a nil-deref.
- pkg/metal/linear_load.go: LoadLinear routes every tensor through ResolveWeight so the model./language_model. aliasing is handled once, not per caller.

Scope (cut 1, pinned): the wiring is proven (config→registry→compose→forward shape; HETEROGENEOUS softmax+recurrent dispatch driving heterogeneous cache typing, KV vs the recurrent holder, via a fake recurrent mixer; live loadModel registration round-trips "composed"/"hybrid"; loud refusal on missing weight / unregistered kind / ambiguous config). Per-FLA-family numeric correctness + their checkpoint weight subpaths are the validate-against-a-checkpoint step (AX-9): a real Mamba/RWKV/hybrid checkpoint is wired to this loader when validated, not before. Out of cut 1: pure-Mamba/RWKV block topology (no-MLP), embedding scaling. 9 composed tests + full tree build green; 1104 tests across the touched + FLA consumer packages, no regression.

The GatedChunk recurrence oracle tests feed [B,H,L,D] directly and only assert head 0, so they bypass projectHeads (the [B,L,H*D]→[B,H,L,D] strided view) and mergeHeads (the inverse transpose). The state-threading test drives those paths but with identity weights and identical routing on both the chunk and step sides, so a static head-misroute cancels and the test stays green — exactly the silent-layout class that shipped undetected in a sibling sparse-mixer lane. TestGla_Mixer_HeadLayout_Good gives each head distinct input channels and a distinct gate, then builds the expected output with an independent Go oracle that splits the model dimension block-contiguously (head h ⇒ channels [h·D,h·D+D)), runs the gated recurrence per head, and re-assembles block-contiguously. Verified discriminating: perturbing the projectHeads stride to an interleaved split fails only this test, while all ten existing tests pass. Co-authored-by: Hephaestus <hephaestus@lthn.ai>
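To make the layout distinction concrete, here is a small pure-Go sketch of the two ways a model dimension of H·D channels can be split into heads. It illustrates the block-contiguous rule (head h ⇒ channels [h·D, h·D+D)) against an interleaved split; the function names are hypothetical and this is not the projectHeads implementation.

```go
package main

import "fmt"

// splitContiguous gives head h the channels [h*D, h*D+D).
func splitContiguous(x []float64, heads int) [][]float64 {
	d := len(x) / heads
	out := make([][]float64, heads)
	for h := 0; h < heads; h++ {
		out[h] = append([]float64(nil), x[h*d:h*d+d]...)
	}
	return out
}

// splitInterleaved gives head h the channels h, h+H, h+2H, ...
// This is the misrouted layout the head-layout test is meant to catch.
func splitInterleaved(x []float64, heads int) [][]float64 {
	d := len(x) / heads
	out := make([][]float64, heads)
	for h := 0; h < heads; h++ {
		out[h] = make([]float64, d)
		for j := 0; j < d; j++ {
			out[h][j] = x[j*heads+h]
		}
	}
	return out
}

func main() {
	// H=2, D=2: distinct values per head make a misroute visible.
	x := []float64{10, 11, 20, 21}
	fmt.Println("contiguous: ", splitContiguous(x, 2))
	fmt.Println("interleaved:", splitInterleaved(x, 2))
}
```

With identical per-head inputs the two splits would produce the same heads, which is why the test gives each head distinct channels.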

The WKV7 recurrence has an independent float64 oracle (wkv7Reference) and the chunked path is checked against the sequential kernel, but Mixer.Forward's projection/reshape glue had only a prefill-vs-decode self-consistency test — which runs the same glue both sides and so is blind to a consistent layout bug (a transposed projection, a wrong per-head reshape, crossed r/k/v/a/b ports). forward_test.go reproduces only that glue in pure Go from the same weights (r/k/v/a/b projections, w = -exp(WProj), per-head split), defers the recurrence to the trusted wkv7Reference, and applies the head-merge + out-projection, then asserts the real chunked Forward matches within float32 noise. Covers both the zero-state prefill (L=3) and the decode-continuation (L=1 with prior) layouts. Closes tasks.lthn.sh/view.php?id=30 (rwkv7 numeric-Forward slice) Co-authored-by: Hephaestus <hephaestus@lthn.ai>

Pin moba.Mixer.Forward end-to-end against a pure-Go dense-causal-softmax reference (Q=K=V=x via identity projections), restricted to the tokens the kept blocks expose. MoBA Forward is exactly causal softmax with a block-sparsity mask, so the oracle recomputes that mask from first principles: the kept set is known by construction (geometry + scores), never read back from the mixer's own blockSelectMask, so the test is a real oracle rather than a tautology. Covers the Forward-level compositions the shape-only loader tests cannot see, and the past-bug edges:
- L < BlockSize (nBlocks == 0): reduces to plain causal softmax
- L == BlockSize (nBlocks == 1): block-grid path, still causal softmax
- TopK == 0: self-block-only; a cross-block leak in the block->token expansion would diverge here
- TopK == 1: a high-scoring past block is kept alongside the self-block

Discriminating power verified by bug injection (a wrong allow-predicate fails the assertion). Tests gate on -tags metal_runtime; the un-tagged go test ./... stays green. Co-authored-by: Hephaestus <hephaestus@lthn.ai>

The RetentionChunk oracle tests feed [B,H,L,D] directly and only assert head 0, bypassing projectHeads (the [B,L,H*D]→[B,H,L,D] strided view) and mergeHeads; the state-threading test drives those paths with identity weights and identical routing on both sides, so a static head-misroute cancels. TestRetnet_Mixer_HeadLayout_Good gives each head distinct input channels and a distinct per-head decay γ (DecayLn[head]), then builds the expected output with an independent Go oracle splitting the model dimension block-contiguously (head h ⇒ channels [h·D,h·D+D)) and running the retention recurrence per head with that head's γ. The per-head decay is the extra layout lock retention has over GLA — a head swap also pairs a head's data with the wrong γ. Verified discriminating: perturbing the projectHeads stride to an interleaved split fails only this test, all nine existing tests pass. Co-authored-by: Hephaestus <hephaestus@lthn.ai>

…(#33) The quant arm of the factory trilogy gains its non-affine formats. Evidence-led (a round-trip probe ran first): MLX's mlx_quantize supports mxfp4 (the MX format, E8M0 block scale over E2M1) and nvfp4 (NVIDIA FP4), but returns a 2-array (w, scales) result because these formats carry NO zero-point, and the Go Quantize binding hardcoded affine's 3-tuple and threw the valid result as an error. q4_0 is NOT an MLX quantization mode (mlx_quantize rejects it outright); it is out of this cut: a real q4_0 needs a GGUF-specific packer, not mlx_quantize.
- quantize_op.go: Quantize accepts the 2-array scale-only return (biases nil) as well as affine's 3-tuple. The dequantize + quantized-matmul paths already nil-tolerate biases (optionalArray), so the binding's array-count assumption was the only blocker to the FP4 quantise direction. affine is unchanged (still the 3-tuple), so this is backward compatible.
- quant_fp4.go: register mxfp4 + nvfp4 loaders (mirrors quant_affine.go), making the quant registry the authoritative supported-format set rather than leaving these to the generic fallback. The loader assembles the *Linear; the mode string drives the MLX kernel (the affine pattern).
- quant_schemes_test.go: the numeric oracle. quantise→dequantise round-trips within a 4-bit step (measured affine 0.075, mxfp4 0.100, nvfp4 0.094 on the [-0.6,0.7] test weights); registration is first-class; q4_0's MLX-unsupported boundary is pinned in a test so its absence reads as deliberate, not forgotten.

pkg/metal green (1052 tests), full tree builds.

…ally The DeltaRuleChunk oracle tests feed [B,H,L,D] directly and only assert head 0, bypassing projectHeads (the [B,L,H*D]→[B,H,L,D] strided view) and mergeHeads; the state-threading test drives those paths with identity weights and identical routing on both sides, so a static head-misroute cancels. TestDeltanet_Mixer_HeadLayout_Good gives each head distinct input channels and a distinct per-token write strength β, then builds the expected output with an independent Go oracle (deltaReference, which L2-normalises keys exactly as the kernel does) splitting the model dimension block-contiguously (head h ⇒ channels [h·D,h·D+D)) and running the delta-rule recurrence per head. Because keys are L2-normalised over each head's own D-block, an interleaved split also changes a head's key direction and hence its read. Verified discriminating: perturbing the projectHeads stride to an interleaved split fails only this test, all eleven existing tests pass. Co-authored-by: Hephaestus <hephaestus@lthn.ai>

The NSA selection branch's keep-all path (selectionMask with selectBlocks >= nBlocks) returned the score-laden `scored` tensor instead of a clean 0/-inf keep-mask. Because the caller ADDS that mask to the attention logits, the per-block ranking score leaked in as an additive bias, sharpening the selection-branch softmax toward high-scoring blocks. This fires whenever selectBlocks >= nBlocks = L/BlockSize — i.e. across the entire short- and medium-context regime (e.g. SelectBlocks=16, BlockSize=64 → every sequence up to ~1024 tokens), not a degenerate edge. The top-n path already returned a clean 0/-inf mask via WhereScalarArray; the two paths were inconsistent and keep-all was wrong. Fix: free `scored` and return causalMask.Clone() (causalMask is already exactly the 0/-inf keep-mask). Found by a new Forward-level numeric oracle: the existing TestSelectionMask_* tests all use selectBlocks < nBlocks (the top-n path), so keep-all was never exercised. Tests added (gate on -tags metal_runtime; un-tagged go test ./... stays green): - forward_oracle_test.go: pin nsa.Mixer.Forward against an independent pure-Go reference matched to NSA's three-branch gated-blend mechanism — short-seq L<BlockSize (routes to sliding, nBlocks==0 past-bug edge), per-branch isolation via a constant-bias gate projection (sliding / compression / selection), and a full three-branch blend weighted by the exact sigmoid gates (the test that caught the bug). The selection branch is exercised in the keep-all regime so it has a clean reproducible reference, never read back from the mixer. - nsa_test.go: TestSelectionMask_KeepAllIsCleanMask_Good pins the keep-all branch to a clean 0/-inf mask directly (distinct block scores make a score-leak loud). Both new tests verified to fail on the pre-fix code and pass after. Co-authored-by: Hephaestus <hephaestus@lthn.ai>
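The effect of the leak can be reproduced in a few lines of pure Go. The sketch below adds either a clean 0/-inf keep-mask or a score-laden mask to the same logits and compares the softmax; the three-key setup and the score value 2 are invented for illustration and are not taken from the NSA code.

```go
package main

import (
	"fmt"
	"math"
)

var negInf = math.Inf(-1)

// softmax returns max-subtracted, normalised exponentials of the logits.
func softmax(logits []float64) []float64 {
	peak := negInf
	for _, v := range logits {
		if v > peak {
			peak = v
		}
	}
	out := make([]float64, len(logits))
	sum := 0.0
	for i, v := range logits {
		out[i] = math.Exp(v - peak)
		sum += out[i]
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}

// attend adds an additive mask to the logits, then applies softmax.
func attend(logits, mask []float64) []float64 {
	masked := make([]float64, len(logits))
	for i := range logits {
		masked[i] = logits[i] + mask[i]
	}
	return softmax(masked)
}

func main() {
	logits := []float64{0, 0, 0}             // three keys, the last one acausal
	clean := []float64{0, 0, negInf}         // keep-mask: 0 = keep, -inf = drop
	leaky := []float64{2, 0, negInf}         // block score 2 leaked into the mask
	fmt.Printf("clean mask: %.3f\n", attend(logits, clean))
	fmt.Printf("leaky mask: %.3f\n", attend(logits, leaky))
}
```

With the clean mask the two visible keys share the weight evenly; with the leaked score the first key takes roughly 0.88 of it, which is the sharpening the commit describes.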

The load-bearing heart of the P-phase on-disk fuse: FuseLoRAIntoWeights folds a trained adapter's deltas into a dense base weights MAP (not a live model), so a fused checkpoint serialises straight back out with SaveSafetensors — no reverse model-walk. Builds on the existing primitives (Matmul/MulScalar/Add, the adapter's {layer}.lora_a/.lora_b format, Scale = alpha/rank). - Each targeted base weight W → W + (B·A)·scale (dense); every other tensor carried through unchanged; the fused-layer list returned for diagnostics. - Dense-base only: a quantized base is refused loud (a fused layer is dense, and one config-level quantization block cannot mix fused-dense with quantized neighbours — dequant-merge-then-requant the whole model is a separate path). - Loud, not silent: a delta with no base weight, a quantized base, or a base/delta shape mismatch is an error, never a wrong-shape or skipped merge. Numeric oracle test (rank 1, computed by hand independently of the impl) pins W+(B·A)·2; three guards pin the loud-failure paths. Orchestration (FuseModelDir: read base+adapter dirs, save fused dir, copy config/tokenizer) and the cmd/mlx fuse verb follow on this core. pkg/metal green.
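As a worked instance of W → W + (B·A)·scale, the following pure-Go sketch fuses a rank-1 delta into a 2×2 base. The `Matrix` type, the `fuse` function and the shape convention (W is [out,in], B is [out,r], A is [r,in]) are assumptions made for the example, not the FuseLoRAIntoWeights signature.

```go
package main

import (
	"errors"
	"fmt"
)

// Matrix is a row-major dense matrix (hypothetical helper type).
type Matrix struct {
	Rows, Cols int
	Data       []float64
}

// fuse returns W + (B·A)·scale, refusing mismatched shapes with an error.
// Assumed shapes: w is [out,in], b is [out,r], a is [r,in].
func fuse(w, b, a Matrix, scale float64) (Matrix, error) {
	if b.Cols != a.Rows || b.Rows != w.Rows || a.Cols != w.Cols {
		return Matrix{}, errors.New("lora fuse: base/delta shape mismatch")
	}
	out := Matrix{Rows: w.Rows, Cols: w.Cols, Data: append([]float64(nil), w.Data...)}
	for i := 0; i < w.Rows; i++ {
		for j := 0; j < w.Cols; j++ {
			delta := 0.0
			for k := 0; k < b.Cols; k++ {
				delta += b.Data[i*b.Cols+k] * a.Data[k*a.Cols+j]
			}
			out.Data[i*w.Cols+j] += delta * scale
		}
	}
	return out, nil
}

func main() {
	w := Matrix{2, 2, []float64{1, 0, 0, 1}} // identity base
	b := Matrix{2, 1, []float64{1, 2}}       // rank 1
	a := Matrix{1, 2, []float64{3, 4}}
	fused, err := fuse(w, b, a, 2) // scale = alpha/rank = 2/1
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(fused.Data)
}
```

By hand: B·A = [[3,4],[6,8]], times 2 gives [[6,8],[12,16]], and adding the identity yields [[7,8],[12,17]].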

Pin mla.Mixer.Forward end-to-end against a pure-Go latent-expansion reference: decompress per-head K/V from the KV latent with the per-head-interleaved split, run causal SDPA per head, merge, project out. TestUpProjectKV already pins the split in isolation at HeadDim=1; this exercises the split WIDTH flowing through splitHeads -> attendLatent -> OProj at heads=2, HeadDim=2 — the composition the kv_b_proj per-head-interleaved-vs-block-concatenated bug actually corrupts. Construction uses identity WDQ/WDKV/WUK so the latents are exactly x, a column-selector WUQ for a traceable query, and identity OProj; distinct per-column sentinels make a mis-routed split numerically loud. The oracle reads K/V with the same per-head-interleaved layout independently of the mixer. Discriminating power verified: a block-concatenated split in the oracle diverges by ~2.0 (pairs head 0's K with head 1's V), where a shape-only test would still pass. Gates on -tags metal_runtime; the un-tagged go test ./... stays green. Co-authored-by: Hephaestus <hephaestus@lthn.ai>

The SSD recurrence has an independent float64 oracle (scanReference) and the chunked path is checked against the sequential kernel, but Mixer.Forward's block assembly — the in-proj z|xBC|dt split, the causal depthwise conv, the group->heads expansion, the dt activation, the gated norm — had only a prefill-vs-decode self-consistency test, which runs the same glue both sides and so cannot see a consistent layout bug (a wrong split offset, the B/C ports crossed, a group routed to the wrong heads). That is the bug class a sibling sparse-mixer lane shipped. forward_test.go reproduces only that glue in pure Go from the same weights, encoding the INTENDED layout with explicit indices (not transcribing the op sequence), defers the recurrence to the trusted scanReference, and applies the gated norm + out-proj, then asserts the real chunked Forward matches within 1e-4. Two geometries: G=1 (the package fixture's degenerate broadcast) plus a PROPER H=4,G=2 group count whose repeat=2 map is the only case that can catch a group-routing bug. The B/C in-proj rows carry a strong per-group scale so a swapped group produces gross error (verified: a reversed group map fails by ~0.026, a reversed conv kernel by ~0.009 — both >>1e-4), keeping the test from rubber-stamping a layout bug the way the shape-only test it replaces would. The gated-norm component is transcription (mirrors the mixer's SiLU(z)*RMSNorm(y) norm-before-gate form), so this pins layout, not gate-order math. The existing prefill-vs-decode test is kept — it guards the conv-ring + ssm-state threading this static oracle does not. Closes tasks.lthn.sh/view.php?id=30 (mamba2 numeric-Forward slice) Co-authored-by: Hephaestus <hephaestus@lthn.ai>

Add one Forward-level absolute-value test for gsa.Mixer.Forward against a pure-Go scalar slot-recurrence reference that INCLUDES the SiLU output gate and the output projection. The recurrence kernel is already pinned independently, and TestForward_ChunkedDecodeMatchesSinglePass is a consistency check (single-pass == chunked) that is invariant to wiring/gate bugs which cancel across both calls. This test fixes the absolute two-token output, so a dropped/transposed gate, a wrong gate operand, or an OProj slip is caught. All-identity mixer (q=k=v=f=gate=x) makes the math determined solely by the gated slot recurrence + SiLU gate; the oracle recomputes both in pure Go. Discriminating power verified: dropping the SiLU gate from the oracle diverges by ~0.16. Gates on -tags metal_runtime; the un-tagged go test ./... stays green. Co-authored-by: Hephaestus <hephaestus@lthn.ai>

The first rwkv7 Forward oracle (c9de61b) passed on correct code but ALSO passed a crossed a↔b or r↔k port — a mutation check showed only ~4e-9 divergence under the original tinyMixer fixture + 2e-3 tolerance, well inside tolerance. That is the shape-only trap one rung up: green, but not load-bearing for the layout bugs the test exists to catch. Two causes, both fixed: (1) tinyMixer's near-uniform tiny projections make every port's contribution indistinguishable — replaced with oracleMixer, a fixture with a DISTINCT magnitude scale per r/w/k/v/a/b port. (2) the a/b learning-rate transition b⊗(aᵀS) is zero at step 0 and small for the first steps, so an a↔b swap only diverges once the state accumulates — the Good prefill is now L=6, not 3. Tolerance tightened 2e-3 → 1e-4 to match the mamba2 oracle and the brief. Verified by mutation: a reversed a↔b port now fails by ~0.05, a reversed r↔k port by ~0.23 — both >>1e-4 (was ~4e-9). The residual on correct code is ~7e-12. Closes tasks.lthn.sh/view.php?id=30 (rwkv7 numeric-Forward hardening) Co-authored-by: Hephaestus <hephaestus@lthn.ai>

TestForward_TopKKeepsScoredPastBlock_Good has a single strictly-past block (nBlocks=2, self=block1), so keep-top-1 and keep-all-past coincide and the top-K RANKING is never exercised. Add TestForward_TopKDropsLowerPastBlock_Good: three blocks (L=6, BlockSize=2) where a block2 query has two past blocks, both aligned with the query so both would carry real softmax mass, and top-1 must keep the higher-mean-scoring block0 while dropping block1. A keep-all-past or mis-ranked selection now diverges (verified by injecting keep-all-past: the test fails ~0.007 > 1e-4). This brings the most complex oracle (mobaSelectionOracle, whose ranking was previously untested) to the same discriminating-power standard as the rest. Gates on -tags metal_runtime. Co-authored-by: Hephaestus <hephaestus@lthn.ai>

…(#35) The serializer half of the P-phase fuse, on top of the fuse core: read a dense base model dir + a trained LoRA adapter dir (engine format: adapter.safetensors + adapter_config.json, alpha/rank scaling), fold the adapter in at the weights-map level, write the fused dense model.safetensors to outDir, and carry the base's servable sidecars (config.json + tokenizer files) across. A stale safetensors shard index is intentionally not copied (the output is a single re-consolidated file). Reuses the existing primitives end to end: LoadModelWeights (base), parseAdapterConfig + loadAdapterWeights (the same loaders adapter-resume uses), FuseLoRAIntoWeights (the tested core), SaveSafetensors (write). Refuses loud when the adapter matches no base layer (an empty fuse is a config error, not a silent base-identical copy). Hardening found by the round-trip test: FuseModelDir clears the sticky, process-global MLX LastError at entry. LoadModelWeights reads LastError after a *successful* LoadSafetensors, so a benign error left by a prior unrelated op would falsely fail this fresh load — reproduced when the fuse runs after another MLX operation. Now robust regardless of prior engine state. Round-trip test writes synthetic base + adapter dirs, fuses, reloads the output, and asserts W+(B·A)·scale survived disk + config.json carried across; a no-match guard pins the loud-refusal path. pkg/metal green.

Wires metal.FuseModelDir to the CLI, completing #35 (fuse verb + serializer). lthn-mlx fuse -base <dir> -adapter <dir> -out <dir> Folds a trained LoRA adapter into a dense base and writes the fused model dir; the serving artefact afterwards needs no adapter. Thin verb over the tested serializer — flag validation + clean exit codes (2 = missing flags, 1 = FuseModelDir error, 0 = fused N layers). Listed under "Transform a model" in the usage. Tests pin the missing-flag and no-model exit paths; the fuse numerics + on-disk round-trip are covered in pkg/metal. This closes the P-phase loop end to end: train a LoRA (sft/ssd) -> adapter.Save -> fuse -> a servable dense model dir, no separate adapter at serve time.
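The exit-code contract (2 = missing flags, 1 = fuse error, 0 = fused N layers) might be wired along these lines. This is a sketch under stated assumptions: `fuseFunc` and `stubFuse` are hypothetical stand-ins for metal.FuseModelDir, whose real signature is not shown here, and treating a flag parse error as exit code 2 is a choice made for the example.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// fuseFunc is a hypothetical stand-in for the serializer; it is assumed
// to return the number of fused layers.
type fuseFunc func(base, adapter, out string) (int, error)

// runFuse parses the verb's flags and maps outcomes to exit codes.
func runFuse(args []string, fuse fuseFunc) (int, string) {
	fs := flag.NewFlagSet("fuse", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // the caller prints the returned message
	base := fs.String("base", "", "dense base model dir")
	adapter := fs.String("adapter", "", "trained LoRA adapter dir")
	out := fs.String("out", "", "output dir for the fused model")
	if err := fs.Parse(args); err != nil {
		return 2, err.Error()
	}
	if *base == "" || *adapter == "" || *out == "" {
		return 2, "fuse: -base, -adapter and -out are all required"
	}
	n, err := fuse(*base, *adapter, *out)
	if err != nil {
		return 1, "fuse: " + err.Error()
	}
	return 0, fmt.Sprintf("fused %d layers", n)
}

// stubFuse is an illustrative fake: it fails for a base dir named "missing".
func stubFuse(base, adapter, out string) (int, error) {
	if base == "missing" {
		return 0, errors.New("no model at base dir")
	}
	return 4, nil
}

func main() {
	for _, args := range [][]string{
		{"-base", "b"},
		{"-base", "missing", "-adapter", "a", "-out", "o"},
		{"-base", "b", "-adapter", "a", "-out", "o"},
	} {
		code, msg := runFuse(args, stubFuse)
		fmt.Println(code, msg)
	}
}
```

Keeping the verb a thin function that returns a code makes the missing-flag and no-model paths testable without spawning a process.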

The composed runner was wired but orphaned: speculative.go (the serve/generate package) blank-imported `composed` but none of the FLA mixer families, so at serve time MixerLoaderFor("mamba2"/"gla"/…) returned nil and a config-composed Mamba/RWKV/linear-attn/sparse hybrid was refused at load — the engine's mixer registry carried only the generic softmax. The 9 numerically-validated mixers sat in the tree unreachable from the running engine. Blank-import mamba2/rwkv7/gla/retnet/deltanet/gsa/nsa/moba/mla so their loaders self-register. A test in the serve package asserts all 10 kinds (softmax + 9 FLA) now resolve. Inert for existing models (registry is consulted only when a config declares that layer kind); a normal gemma4 serve never touches them. This is half of closing the composed runner's reason. The remaining half is weight-subpath resolution: the FLA loaders ask for unqualified leaves (ctx.Linear("in_proj")) while a real checkpoint nests them under a per-model mixer sublayer (model.layers.N.mixer.in_proj) — that mapping is model-specific and wants a target hybrid checkpoint to pin + validate against.

Closes Gap B of the composed-runner reason: it can now resolve a real hybrid's mixer weights, where the mixer nests under a model-specific sublayer (model.layers.N.mixer.in_proj, or .self_attn., or .mamba.) that varies by model, not by family — so a family loader cannot own its subpath. The composed model discovers each layer's mixer sublayer FROM the checkpoint: the mixer and the MLP nest sub-projections (>=3 components past the layer); the norms do not; excluding the MLP leaves the mixer. This is not a candidate-guess — it reads the sublayer actually present, and FLA leaf names (in_proj/A_log/receptance) are unique within a layer, so resolution is unambiguous. Exactly one → use it; none → bare leaves (no nesting); TWO OR MORE → loud refusal, never a random map-order pick (the silent wrong-geometry trap this whole pass has been killing). All mixer loaders now ask for BARE leaves — softmax_loader.go drops its self_attn. qualification to match the FLA loaders, and the consumer owns the layout. A softmax+recurrent hybrid resolves self_attn for one layer and mixer for another in one model (TestComposed_SubpathResolution_Good); a layer with two candidate sublayers is refused (TestComposed_AmbiguousSubpath_Bad). BOUNDARY (honest): this proves the MECHANISM against the standard single-mixer layout. End-to-end correctness against a real hybrid checkpoint's exact naming + numerics is still the checkpoint-gated step (with the mamba2 gate-order, #60). The composed runner + 9 registered mixers are now wired end to end — config → registry → discovered-subpath load → forward — pending a real checkpoint to run. composed 11 tests green; pkg/metal 1058 green; full tree builds.
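The one / none / two-or-more rule for sublayer discovery can be sketched as below. The key names, the `mlp` sublayer name and the `discoverMixerSublayer` helper are assumptions for illustration; the real composed loader may inspect the checkpoint differently.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// discoverMixerSublayer inspects the weight keys under layerPrefix and returns
// the single non-MLP sublayer that nests sub-projections (>= 3 components past
// the layer). It returns "" when none nests, and an error when several do.
func discoverMixerSublayer(keys []string, layerPrefix string) (string, error) {
	seen := map[string]bool{}
	for _, k := range keys {
		if !strings.HasPrefix(k, layerPrefix) {
			continue
		}
		parts := strings.Split(strings.TrimPrefix(k, layerPrefix), ".")
		if len(parts) < 3 || parts[0] == "mlp" { // norms are shallow; MLP is excluded
			continue
		}
		seen[parts[0]] = true
	}
	cands := make([]string, 0, len(seen))
	for c := range seen {
		cands = append(cands, c)
	}
	sort.Strings(cands) // deterministic error text, never a map-order pick
	switch len(cands) {
	case 0:
		return "", nil // bare leaves, no nesting
	case 1:
		return cands[0], nil
	default:
		return "", fmt.Errorf("ambiguous mixer sublayer: %v", cands)
	}
}

func main() {
	keys := []string{
		"model.layers.0.input_layernorm.weight",
		"model.layers.0.mixer.in_proj.weight",
		"model.layers.0.mlp.gate_proj.weight",
		"model.layers.1.self_attn.q_proj.weight",
		"model.layers.1.mlp.gate_proj.weight",
	}
	fmt.Println(discoverMixerSublayer(keys, "model.layers.0."))
	fmt.Println(discoverMixerSublayer(keys, "model.layers.1."))
}
```

Returning an error for two or more candidates keeps the choice out of map iteration order, which is the silent wrong-geometry trap the commit calls out.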

…step ssd samples the frozen base + scores each self-output at birth and STOPS at the scored trace (ssd-captures.jsonl + ssd-samples-score.jsonl); it no longer runs a fine-tune. A lab step refines the trace into the SFT artifact, which a separate sft run trains on ('do not run sft after ssd'). RunSSD drops the TrainSFT call; SSDRunner.TrainSFT + SSDResult.SFT removed; the ssd verb drops the training flags (rank/lr/epochs/eval/score-cascade) and reports the trace + next sft step; tests reworked (3 train-handoff tests removed, 7 rewritten to assert the trace). Co-Authored-By: Virgil <virgil@lethean.io>

…r's shape (#39) The cache registry (RegisterCache / CacheComputeFor) maps a scheme mode to a builder; NewCacheForMixer / NewCacheForMode are the construction front-door that resolves the mode a mixer needs and builds it — the cache counterpart to MixerComputeFor and the quant-loader registry, the same component-factory shape. A mixer names its cache via the optional CacheModer (MLA -> "mla-latent"); a mixer without a bespoke shape gets the default for its StateKind (growing KV cache for StateKVCache, recurrent holder for StateRecurrent). This lets the composed runner build per-layer caches by shape instead of hand-picking concrete cache types. Construction-only fixture tests (no Metal runtime), all green. Co-Authored-By: Virgil <virgil@lethean.io>
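A stripped-down version of the factory shape may help. The names RegisterCache, NewCacheForMixer, NewCacheForMode, CacheModer and the "mla-latent" mode appear in the commit message, but every signature below, the `simpleCache` type and the "kv" / "recurrent" default mode strings are assumptions for illustration.

```go
package main

import "fmt"

// StateKind says which default cache shape a mixer needs.
type StateKind int

const (
	StateKVCache StateKind = iota
	StateRecurrent
)

// Cache is a minimal stand-in; real caches hold tensors.
type Cache interface{ Mode() string }

type simpleCache struct{ mode string }

func (c simpleCache) Mode() string { return c.mode }

// Mixer declares its state kind; CacheModer optionally names a bespoke mode.
type Mixer interface{ State() StateKind }
type CacheModer interface{ CacheMode() string }

var cacheBuilders = map[string]func() Cache{}

// RegisterCache maps a scheme mode to a builder.
func RegisterCache(mode string, build func() Cache) { cacheBuilders[mode] = build }

// NewCacheForMode builds the cache registered under mode.
func NewCacheForMode(mode string) (Cache, error) {
	build, ok := cacheBuilders[mode]
	if !ok {
		return nil, fmt.Errorf("no cache registered for mode %q", mode)
	}
	return build(), nil
}

// NewCacheForMixer prefers the mixer's own mode, else the StateKind default.
func NewCacheForMixer(m Mixer) (Cache, error) {
	if cm, ok := m.(CacheModer); ok {
		return NewCacheForMode(cm.CacheMode())
	}
	switch m.State() {
	case StateKVCache:
		return NewCacheForMode("kv") // assumed default mode name
	case StateRecurrent:
		return NewCacheForMode("recurrent") // assumed default mode name
	}
	return nil, fmt.Errorf("unknown state kind %d", m.State())
}

func init() {
	for _, mode := range []string{"kv", "recurrent", "mla-latent"} {
		mode := mode
		RegisterCache(mode, func() Cache { return simpleCache{mode: mode} })
	}
}

type softmaxMixer struct{}

func (softmaxMixer) State() StateKind { return StateKVCache }

type recurrentMixer struct{}

func (recurrentMixer) State() StateKind { return StateRecurrent }

type mlaMixer struct{}

func (mlaMixer) State() StateKind  { return StateKVCache }
func (mlaMixer) CacheMode() string { return "mla-latent" }

func main() {
	for _, m := range []Mixer{softmaxMixer{}, recurrentMixer{}, mlaMixer{}} {
		c, err := NewCacheForMixer(m)
		fmt.Println(c.Mode(), err)
	}
}
```

The runner then only ever calls NewCacheForMixer per layer; adding a new cache shape is a registration, not an edit to the runner.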

…unner builds via it MLA implements CacheModer ("mla-latent"); the composed runner's NewCache now builds each layer's cache through NewCacheForMixer instead of hand-picking NewRecurrentCache / NewKVCache by State. Behaviour-preserving for softmax (KV) and recurrent (holder) — the heterogeneous cache-typing test guards it — and routes MLA to its latent mode. The factory is now the single per-layer cache construction path. Co-Authored-By: Virgil <virgil@lethean.io>

MLA caches ONE compressed latent per token, not a K/V pair: the two-tensor KVCache the mla-latent scheme aliased would store it twice and lose MLA's footprint win. latentKVCache stores the single latent once and concatenates it across decode chunks along the sequence axis; the mla-latent scheme now builds it through the KV factory (CacheComputeFor -> NewLatentKVCache), so it's a first-class cache shape the factory offers. TDD: factory resolves "mla-latent" -> *latentKVCache (no runtime); concat-across-chunks correctness on the runtime (prefill [1,2,3] + decode [1,1,3] -> [1,3,3], offset 2 -> 3, values appended). MLA's Forward persist-concat through this cache is the next step. Co-Authored-By: Virgil <virgil@lethean.io>
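The concat-across-chunks behaviour (prefill of 2 rows, then a 1-row decode, giving 3 rows and offset 2 -> 3) can be mimicked with plain slices. The `latentCache` type below is a hypothetical, batch-free simplification, not the latentKVCache implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// latentCache stores one latent row per token (batch dimension omitted).
type latentCache struct {
	dim  int
	rows [][]float64
}

func newLatentCache(dim int) *latentCache { return &latentCache{dim: dim} }

// Append concatenates a chunk along the sequence axis and returns the new offset.
func (c *latentCache) Append(chunk [][]float64) (int, error) {
	for _, row := range chunk {
		if len(row) != c.dim {
			return len(c.rows), errors.New("latent cache: row width mismatch")
		}
	}
	for _, row := range chunk {
		c.rows = append(c.rows, append([]float64(nil), row...)) // copy, do not alias
	}
	return len(c.rows), nil
}

// Offset is the number of tokens cached so far.
func (c *latentCache) Offset() int { return len(c.rows) }

func main() {
	c := newLatentCache(3)
	off, err := c.Append([][]float64{{1, 2, 3}, {4, 5, 6}}) // prefill: 2 tokens
	fmt.Println(off, err)
	off, err = c.Append([][]float64{{7, 8, 9}}) // decode: 1 token
	fmt.Println(off, err, c.rows)
}
```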

) MLA's Forward now persists this chunk's compressed KV latent through ctx.Cache and up-projects the FULL concatenated latent, so a decode step attends over the whole history instead of only the chunk it was handed — the last piece of #1's decode-caching for MLA. The query stays the current chunk (L); K/V span the cached history (totalL). The latent cache adopts cKV and returns its own growing handle, so Forward hands ownership over and frees only the uncached one-shot transient. No RoPE → the latent is position-independent, so up-projecting the concatenated latent is per-row identical to up-projecting a one-shot latent. Scope: prefill + single-token decode (L=1 — the lone query legitimately sees every key, mask nil). L>1 attention over PRIOR cached history needs an [L,totalL] mask MLA does not yet build — the named next piece. TDD: TestForward_DecodeMatchesPrefill_Good — fresh latent cache prefills 2 tokens then decodes the 3rd; decode output must equal the no-cache full-prefill last-token output within mlaTol. Red before (cache offset 0, want 3), green after. All 8 mla tests + cache-factory/composed-heterogeneous guards pass under -tags metal_runtime; vet + default build clean. Co-Authored-By: Virgil <virgil@lethean.io>

…#1) d333ac6 named chunked-prefill-over-history (an L>1 chunk appended to a non-empty cache) as the unsupported next piece. It is reachable in principle — prompt_cache.go chunks prefill through the shared ForwardMasked interface MLA rides — and would either broadcast-crash (caller's [L,L] mask vs [L,totalL] scores) or silently leak acausally (nil mask, no within-chunk causal masking). Forward now panics at the top when Cache.Offset() > 0 && L > 1, before any allocation — converting the documented limitation into an enforced invariant that fails loud instead of mis-attending behind a green suite. The reachable paths today are untouched: prefill keeps L == totalL (empty cache), single-token decode keeps L == 1. TestForward_ChunkedPrefillOverHistory_Bad warms the cache with one token, feeds an L=2 chunk, asserts the panic. All mla tests green under -tags metal_runtime; vet clean. Co-Authored-By: Virgil <virgil@lethean.io>

) When the caller supplies no mask (the production composed path passes nil per chunk), Forward now builds its own causal mask via MultiTokenCausalMask with offset histLen = totalL - L, the way gemma4's attention does: row i (query at absolute position histLen+i) attends keys 0..histLen+i. That is a [L,L] causal mask for prefill (histLen 0), the [L,totalL] offset-causal mask for chunked prefill over prior history (histLen>0), and no mask for single-token decode (L==1, the lone query sees every cached key). This completes MLA's causal masking across all chunk shapes and retires the interim panic guard (174fa4a). A caller that DOES supply ctx.Mask owns its shape ([.,.,L,totalL]); MLA honours it verbatim. The one residual edge — a non-nil [L,L] mask fed during chunked prefill — matches gemma4's caller contract, and production passes nil so the builder runs. TDD: TestForward_ChunkedPrefillMatchesOneShot_Good — prefill 4 tokens in two L=2 chunks (offsets 0, 2) must equal a one-shot prefill token-for-token; chunk2 == one-shot[2:4] is where the offset-causal mask earns its keep. Verified red with the builder disabled (chunks attend acausally), green with it. Replaces the panic test. Full mla suite (13) + composed-heterogeneous guard green under -tags metal_runtime; vet + default build clean. Co-Authored-By: Virgil <virgil@lethean.io>
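The masking rule in this commit (row i, at absolute position histLen+i, attends keys 0..histLen+i) can be written out as a small additive mask builder. `offsetCausalMask` is a hypothetical helper for illustration and does not reproduce MultiTokenCausalMask's signature.

```go
package main

import (
	"fmt"
	"math"
)

var negInf = math.Inf(-1)

// offsetCausalMask builds an additive [L,totalL] mask: 0 where query row i may
// attend key j (j <= histLen+i), -inf elsewhere. It returns nil for L == 1,
// where the lone query sees every cached key.
func offsetCausalMask(l, totalL int) [][]float64 {
	if l == 1 {
		return nil
	}
	histLen := totalL - l
	mask := make([][]float64, l)
	for i := range mask {
		mask[i] = make([]float64, totalL) // zero-filled: keep by default
		for j := histLen + i + 1; j < totalL; j++ {
			mask[i][j] = negInf // future key: drop
		}
	}
	return mask
}

func main() {
	fmt.Println("prefill L=2, totalL=2:", offsetCausalMask(2, 2))
	fmt.Println("chunk   L=2, totalL=4:", offsetCausalMask(2, 4))
	fmt.Println("decode  L=1 is nil:   ", offsetCausalMask(1, 5) == nil)
}
```

For the second chunk of a two-chunk prefill (L=2, totalL=4), the first row sees keys 0..2 and the second sees keys 0..3, which is what makes chunked prefill match the one-shot result.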

…(gsa/gla/retnet) The recurrent-state holder (#39) is built and wired: gsa.go's Forward and gla/retnet's mixer.go Forwards all read prior state from ctx.Recurrent() and write the advanced state back via SetRecurrentState (pinned by each package's chunked-recurrence-matches-single-pass test). But three package-doc comments still described the holder as not-yet-landed:
- gsa.go: "the recurrent-state holder is not built yet ... Forward starts each chunk from a zero state and marks where the cached state will be read/written once the holder lands"
- gla.go / retnet.go: "the decode loop can carry it across chunks once #1's recurrent-state holder lands"

All three now describe the actual threading. No code change. mamba2/rwkv7/deltanet already document the holder correctly; their "not built here" notes are about custom perf kernels (chunked SSD scan / WKV7), a separate honest unfinished concern, left as-is. Build + vet clean. Co-Authored-By: Virgil <virgil@lethean.io>

…el — not built" comments scan.go (mamba2) and recurrence.go (rwkv7) — the sequential-reference files — each claimed the length-parallel chunked form "needs a custom Metal kernel — flagged, not built here." Both are doubly stale: the chunked forms ARE built (SSDScanChunked in mamba2/chunk.go, WKV7Chunked in rwkv7/chunk.go), wired for prefill (mixer.go dispatches chunked for L>1, sequential for L==1 decode), and tested against the sequential oracle (chunk_test.go _Good/_Bad/_Ugly) — and neither needs a custom Metal kernel: they are composed from ops/linalg (mamba2 = CumSum/Exp/Matmul segsum; rwkv7 = matmuls + per-window TriInv), the deliberate no-custom-kernel design for chunked-recurrence mixers. Comments now point at the built chunked paths. No code change. Same stale class as the gsa/gla/retnet holder-comment fix (b6fbca8). Build + vet clean. Co-Authored-By: Virgil <virgil@lethean.io>

sonarqubecloud · 2026-06-14T18:01:13Z

Quality Gate failed

Failed conditions
4 Security Hotspots
5.8% Duplication on New Code (required ≤ 3%)
E Security Rating on New Code (required ≥ A)
C Reliability Rating on New Code (required ≥ A)


coderabbitai Bot requested changes May 20, 2026

Comment thread go/backend.go Outdated

github-advanced-security AI found potential problems May 20, 2026

coderabbitai Bot approved these changes May 22, 2026

github-advanced-security AI found potential problems May 24, 2026

Comment thread scripts/state_book_from_phase0.py Fixed

github-advanced-security AI found potential problems May 31, 2026

Comment thread go/cmd/mlx/embed_metallib.go

		_ = os.Setenv("MLX_METALLIB_PATH", dst)
		return
	}
	if err := os.MkdirAll(dir, 0o755); err != nil {

github-advanced-security AI found potential problems May 31, 2026

github-advanced-security AI found potential problems Jun 2, 2026

github-advanced-security Bot found potential problems Jun 5, 2026

Snider and others added 14 commits June 7, 2026 14:06

test(memory): M3Ultra96GB plan asserts default cache, not the broken …

5541126

…paged

Follow-up to b71115f: the 96GB plan test still encoded the codex paged choice; align it with the routed-around default cache. Full memory suite green.

Co-Authored-By: Virgil <virgil@lethean.io>

Snider and others added 29 commits June 13, 2026 23:22

Revert "feat(gemma4): load-time mixer resolution — config MixerKind +…

e6050d0

… builder registry (#39)" This reverts commit 9993754.

Revert "feat(gemma4): unbind decoder loop from the concrete softmax m…

6197ac2

…ixer (#39)" This reverts commit 37edcca.

2 participants
