Skip to content

feat(llama-cpp-localai-paged): paged KV cache llama.cpp backend + cross-request prefix sharing + GB10 decode optimization [WIP]#10462

Open
localai-bot wants to merge 180 commits into
masterfrom
worktree-feat+paged-attention
Open

feat(llama-cpp-localai-paged): paged KV cache llama.cpp backend + cross-request prefix sharing + GB10 decode optimization [WIP]#10462
localai-bot wants to merge 180 commits into
masterfrom
worktree-feat+paged-attention

Conversation

@localai-bot

Copy link
Copy Markdown
Collaborator

Status: draft / WIP - opened to track ongoing GB10 enterprise-serving work. Large branch (kernel experiments + analysis + the shippable feature); will be curated before any merge.

What this is

Vendored, opt-in paged KV cache + cross-request prefix sharing for the llama.cpp backend, plus GB10 (consumer Blackwell, sm_121) decode-path optimization and the supporting analysis. All paged behaviour is gated by LLAMA_KV_PAGED (env) / the kv_paged server option and is off by default - stock builds are byte-identical.

Shippable feature pieces

  • backend/cpp/llama-cpp/patches/paged/0001-0011 - vendored llama.cpp patch series, applied behind the LLAMA_PAGED build flag (patches/paged/, default on; LLAMA_PAGED=off gives a clean upstream checkout). Isolated in prepare.sh + Makefile with a sentinel guard against double-apply.
  • grpc-server.cpp - kv_paged per-server option (0005) + cross-request prefix share wired into update_slots (0008).
  • core/backend/hardware_defaults.go, pkg/xsysinfo/gpu.go - hardware-aware default consolidation.

Key results (measured on DGX Spark / GB10, Qwen3-32B NVFP4)

  • Prefix sharing (RAG / system-prompt fan-out): the cross-request cache reaches the server - concurrent shared-prefix requests skip recompute, 15-25x burst-wall prefill reduction (K=16/32), reuse confirmed (ref_cnt=K, suffix-only prefill).
  • Decode kernel: an in-kernel block-table read (0009) deletes the per-step gather regression; routing the GQA-grouped tile kernel by default (0011) brings paged decode to stock parity (within 1.8%), growing to -6.1% at 16k ctx. Token-correct (CPU byte-identical, GPU within the CUDA batch-shape non-determinism band).
  • Honest framing: the earlier "6x decode gap vs vLLM" was a measurement artifact (the pre-0009 gather regression + a warmup/server number). The real steady-state decode is GPU-bound at the LPDDR5x bandwidth wall (~50% irreducible weight-read floor), ~1.2x from vLLM at the kernel level. Remaining throughput headroom is at the scheduler/serving layer, not the kernel.

Analysis docs live under backend/cpp/llama-cpp/patches/paged/*.md and backend/cpp/llama-cpp/paged/*.md.

Next

  • Scheduler/serving lever (continuous batching + chunked prefill + paged-KV capacity) for aggregate throughput.
  • Correctness hardening (mask-pad invariant assert) + CUDA-graphs confirmation (graphs are already ON in serving).

Not for merge as-is

This branch also contains banked W4A16/Marlin kernel experiments and NVFP4/MXFP4 quality analysis that informed the direction but are not part of the feature. Those will be dropped/split before merge.

mudler added 30 commits June 19, 2026 08:26
Host-side paged-attention block manager ported faithfully from vLLM V1
(block_pool.py, kv_cache_utils.py, single_type_kv_cache_manager.py):

- KVCacheBlock + intrusive LRU FreeBlockQueue (O(1) middle removal)
- BlockPool: get_new_blocks / touch / free_blocks eviction ordering /
  cache_full_blocks / lazy eviction on reuse
- PagedKVManager: on-demand allocate, block_table, slot arithmetic
  (slot = block_id*block_size + offset), free
- Prefix caching: chained block hashing + find_longest_cache_hit
  (first-miss stop), enabling automatic cross-tenant prefix sharing

Pure C++17, zero ggml/llama.cpp dependency, unit-tested to vLLM behavioral
parity (4/4 suites green). Parity is on algorithm/behavior, not hash bytes.

Phase 0 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.
Phases 1-5 (ggml storage, gather-to-scratch read path, Gate 0 correctness,
benchmark wins, prefix-share serving) follow.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Validate the paged KV read/write path at the ggml-op level, driven by
PagedKVManager:

- write: ggml_set_rows(pool, k_src, slot_mapping)  scatter K rows by slot
- read:  ggml_get_rows(pool, gather_idx)           gather a seq's slots into
         contiguous scratch (the tensor an attention kernel consumes)

The test forces a non-contiguous, out-of-order physical block layout
(allocate seqA+seqB, free seqA, reallocate seqC -> blocks [2,1,5]) and
proves gather(write(x)) == x plus cross-sequence isolation in the shared
pool. This de-risks the central question (does slot-addressed paged storage
round-trip correctly through ggml) before the llama-graph integration.

Pool is statically allocated via ggml_backend_alloc_ctx_tensors, mirroring
how llama.cpp allocates its KV cache. CPU backend, no new ggml op.
Built against ggml from the vendored llama.cpp checkout.

Phase 1 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Retire the central numeric risk from the design: feeding gather-to-scratch
KV (a sequence whose blocks are non-contiguous in the shared pool, [2,1,5])
into ggml's standard attention ops produces correct attention.

Path under test: set_rows write -> get_rows gather (K and V) ->
mul_mat(K,Q) -> soft_max_ext -> mul_mat(V^T, probs). Result is compared
against an independent host-computed softmax attention over the same K/V/Q.
Max abs error ~7.5e-08 (n_kv=48, d=8, n_q=4).

This proves the paged read path is numerically sound on CPU with no new
ggml op. Remaining: wire build_attn_paged into llama-graph.cpp and validate
Gate 0 (token-identical greedy generation in a real model).

Phase 2 (core) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Quantify the two multi-tenant wins that are properties of the host-side
block model (vLLM-parity), independent of the in-model compute path:

  WIN 1 concurrency capacity @ 512-block budget
    contiguous (reserve n_ctx/seq): 4 sequences
    paged (on-demand blocks):       37 sequences
    --> 9.2x more concurrent sequences

  WIN 3 cross-tenant prefix sharing (32 tenants, 1024-tok shared prefix)
    prefix-cache OFF: 2176 physical blocks
    prefix-cache ON:  192 physical blocks
    --> 11.3x less KV memory

WIN 2 (throughput) is deliberately reported as PENDING: it requires the
paged gather-read path wired into llama-graph.cpp (Gate 0) and is not
measurable at the allocation layer. The win-1 baseline is per-sequence
n_ctx reservation (stream mode); llama.cpp's unified cache already shares
one pool, so the honest win there is on-demand sizing + prefix dedup.

Phase 3 (partial) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Capture verified state (P0 manager parity, P1 ggml write/gather, P2 attention
numerics 7.5e-08, P3 capacity 9.2x + prefix-sharing 11.3x) and the exact
remaining work: wire build_attn_paged into llama-graph.cpp and validate
token-identical generation on Qwen3-0.6B (Gate 0), then win-2 throughput.

Records the integration seams (create_memory, find_slot, get_k/get_v,
build_attn, mask) and the honest caveats (unified cache already shares a
pool; vLLM's classic kernel is deprecated) so the next session starts warm.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…KV placement

Wire paged, non-contiguous fixed-size BLOCK placement into the real
llama.cpp KV cache (find_slot), behind env LLAMA_KV_PAGED, and validate
Gate 0 on a real GGUF: Qwen3-0.6B greedy generation is TOKEN-IDENTICAL to
the contiguous cache while its KV is physically scattered across permuted
blocks (cells 0-15, 144-159, 32-47, ...). Proven non-contiguous via
LLAMA_KV_PAGED_DEBUG, not a silent fallback.

This retires the correctness premise of paged attention IN THE MODEL (not
just at the ggml-op level): attention is invariant to physical KV placement,
because reads use per-cell pos/seq metadata for masking. The patch lives at
patches/0001-paged-kv-block-placement.patch (against llama.cpp 0253fb21f).

Scope: storage/placement layer, single sequence. Remaining (P4): the
gather-read compute path (attend only a seq's own blocks) for the throughput
win, and the multi-sequence driver. README updated with repro + status.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Captures the full dgx.casa investigation: Q8/F16/vLLM baselines, concurrency
sweeps, paged-patch (no concurrency effect), nsys+code root-cause (MoE int8
MMQ on Ampere-class tensor cores = 74.5% compute, no FP8 path), and the
lever plan.

Measured wins:
- Lever 1 (MXFP4 / Blackwell FP4 path): decode +50-66% over Q8, prefill
  plateau +66% (2200->3650). MXFP4 decode beats vLLM FP8 at B=1 (83 vs 48),
  near-parity B=8. Prefill still plateaus (fused-MoE-GEMM gap).
- Lever 2 (ubatch): saturates at 2048; ceiling is the kernel, not batch.

Designed (not built): Lever 3 fused FP4/FP8 MoE grouped GEMM, Lever 4 FP8
GEMM (needs ggml_mul_mat_ext scale plumbing), Lever 5 tcgen05 kernels, and
the complete paged attention (on-demand alloc + gather-read + continuous
batching + prefix sharing). Honest scope: each is multi-week kernel/systems
work.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
On NVIDIA Blackwell consumer GPUs (sm_120/121, incl. GB10/DGX Spark) a larger
physical batch (n_ubatch) materially lifts MoE prefill throughput - measured on
a GB10 with Qwen3-30B-A3B to lift the prefill ceiling and saturate at ~2048.

When a model config leaves `batch:` unset, EffectiveBatchSize now picks 2048 on
Blackwell instead of 512; explicit `batch:` always overrides. Detection is a
shared, cached Go helper (xsysinfo.IsNVIDIABlackwell, nvidia-smi compute_cap
>= 12). Logic is isolated in core/backend/hardware_defaults.go and applied at
the common ModelOptions builder, so it covers the C++ llama.cpp backend too.

Measured (GB10, Qwen3-Coder-30B-A3B MXFP4): prefill ub512 2994 -> ub2048 3316
t/s; saturates past 2048. Also recorded in the DGX gap plan: 4-bit quant alone
captures the decode win (Q4_K_M 93.5 >= MXFP4 86.4 t/s), MXFP4's only edge is
prefill via Blackwell FP4 tensor cores.

Tests: hardware_defaults_internal_test.go; existing NBatch specs pinned to the
no-Blackwell branch for determinism.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Prefill doesn't scale with bigger single prompts (attention O(N^2)); real gap
is batched MoE prefill (B=32: 27x vs vLLM, ~22 effective TFLOP/s). nsys pins
Lever 3 target: mul_mat_q<MXFP4> MoE GEMM 37% + un-fused act-quant 8%; native
FP4 MMA already engaged, inefficiency is the per-expert thin-tile scheduler.
Q4_K_M matches MXFP4 on decode (decode win is generic 4-bit); MXFP4's only edge
is prefill. Auto-ubatch=2048 on Blackwell shipped (PR #10411).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…m ggml issue draft

Plan A (Lever 3): phased path to FP4 MoE GEMM parity — cheap tweaks, act-quant
fusion, then the real lever (tcgen05/CUTLASS grouped GEMM), full-model FP4.
Plan B (paged attention): on-demand pool, gather-read + Gate 0, continuous
batching, prefix sharing; benchmark in memory-pressured/mixed-length regimes.
Upstream issue draft: GB10 numbers, nsys profile, ruled-out config knobs,
tcgen05 proposal.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
static_assert(nwarps*tile_C::I == mmq_y) locks nwarps=8 for mmq_y=128; can't
raise occupancy without co-scaling mmq_y (blows Blackwell smem). MMQ kernel is
not freely tunable -> parity needs the tcgen05/CUTLASS rewrite, not knobs.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…s from-scratch

No tcgen05/CUTLASS grouped-GEMM MoE kernel exists upstream (merged/in-flight/
draft); CUTLASS not a dep; no fork has one; activation-quant gather already
fused. Matching vLLM needs a from-scratch tcgen05 grouped GEMM (months,
maintainers deferring to cuTile). No tractable patch closes the 27x.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ttention

Numbered patches under backend/cpp/llama-cpp/patches/ applied in order against
the pinned LLAMA_VERSION (build hook in the llama.cpp: target). Each phase is one
small, independently-buildable patch so the work rebases cleanly across llama.cpp
bumps (anti-drift). README defines the series (0001 vendor manager -> 0006 prefix
caching) + the regen workflow.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
First patch of the stacking series. Adds src/paged-kv-manager.{h,cpp} (the
CPU-verified vLLM-parity block manager) + CMake entry. No behavior change.
Generated against the pinned LLAMA_VERSION; applies clean.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ical

find_slot places a sequence's tokens at permuted non-contiguous blocks; greedy
generation is token-identical to stock (verified on Qwen3-0.6B at the pin),
branch confirmed firing. Default off. The placement substrate for the gather-read.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Every edit mapped (gather-index graph input mirroring k_idxs; gather K/V/mask by
one aligned index; n_kv compaction; gated so stock stays byte-identical) with
the token-identical gate and the known risks (mask transpose layout, v_trans).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… single-stream first

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Prefill 6-48x behind and does NOT scale with B (kernel-bound, paging can't fix).
Decode: we win at B=1; 2.5-3.7x behind at B>=8 - THAT concurrency gap is the
engine's domain (0004 pool + 0005 continuous batching target it). Baseline for
the series to improve on.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…is 54.6% MoE GEMM too

Decode-dominated B=64 nsys: mul_mat_q<MXFP4> 54.6%, attention only 19.8%. Both
phases are FP4-MoE-kernel-bound (Lever 3). The paged series cannot close the vLLM
gap in either phase; its real value is capacity + prefix-sharing, not tok/s parity.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…Lever 3)

The only work that closes the vLLM gap on Blackwell: mul_mat_q<MXFP4> is 37%
prefill + 54.6% decode-B64 GPU time; paged attention can't touch it (proven).
Scaffold (builds clean on GB10, default byte-identical): fp4-grouped-moe.{cuh,cu}
entry + gated hook in ggml_cuda_mul_mat_id (env GGML_CUDA_FP4_GROUPED), always
falls back to MMQ for now. Design doc has the CUTLASS/tcgen05 implementation
phases + parity harness + the dense-path follow-up (#28).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…PP 7.6-32x)

vLLM W4A16 vs llama Q4_K_M dense: prefill 7.6-32x behind (llama plateaus ~765,
vLLM scales to 24.4k); decode ~parity at B=1 (weight-bandwidth-bound), 2.2x at
B=64. Full NVFP4 (W4A4) hangs on this vLLM/GB10 stack - W4A16 used. Decision:
the Lever-3 kernel track must ALSO deliver a non-grouped FP4 dense GEMM, not just
the MoE grouped GEMM (dense GEMM is the simpler first kernel to land).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…n grouped)

Benchmark confirms dense prefill 7.6-32x behind too, so the kernel track needs a
non-grouped FP4 dense GEMM (simpler, land first) + the MoE grouped GEMM. Both
share the e2m1 block-scaled collective; dense is grouped-with-one-group.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ry flag lever exhausted

Confirms parity (dense+MoE, both phases) is strictly the FP4 tensor-core kernel;
no config/flag shortcut remains.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…line

Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for
sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on
consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward
pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16
(4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to-
apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would
be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer
#2577/#3294, cutlass #3096.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Key corrections: (1) vLLM 24k is AGGREGATE; single-stream roofline ~3300 t/s
(BF16) / 6600 (FP4). (2) GB10 is 1:1:2 BF16:INT8:FP4 - INT8 == BF16, only FP4 is
2x. (3) Measured: dense int8-MMQ at 21% of ceiling, MoE FP4-MMQ at ~5% - both
EXIST, just untuned for Blackwell. Strategy: to MATCH vLLM, tune MMQ or build a
Marlin-style W4A16 BF16 GEMM (FP4 NOT required); to BEAT, fix the existing FP4
MMA on sm_121 (build/miscompile, not greenfield). Dropped the tcgen05 grouped
GEMM rewrite. Cheap next test: dense MXFP4 quant + existing FP4-MMA.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… (~17% of ceiling)

MXFP4 dense moves prefill off int8-MMQ onto the FP4-MMA path (existing kernel) for
a free 1.44x - shippable as a Blackwell dense-quant recommendation. But it's ~17%
of the FP4 roofline, so the FP4-MMA kernel is itself untuned: ~4-6x still in the
kernel. Sharpens the target to TUNING the FP4-MMA (serves dense+MoE, only path to
beat vLLM). Marlin-style W4A16 BF16 is the alt to match on the BF16 ceiling.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ever

Per-user decode is at parity without spec-dec (10.2 vs 11.7, bandwidth-bound).
vLLM's per-user speed = speculative decoding (lossless, target-verified). GB10 is
best-case (bandwidth-bound + idle compute); llama.cpp spec-dec measured 2.9x on
dense Qwen2.5-32B. Qwen3-32B has no native MTP - use Qwen3-1.7B draft or EAGLE3
head. Recommendation: make spec-dec easy for dense >=14B on Blackwell (keeps
Q4_K_M quality, no kernel). Prefill-kernel + continuous-batching are separate
(TTFT / aggregate). Our own DGX run pending (box rebooted, llama-cli hangs).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Phase 1 (config, PR #10411, DONE): VRAM-scaled n_parallel + Blackwell batch.
Phase 2: paged KV (PR #22569, ~9.5x concurrency). Phase 3: chunked prefill +
n_batch/ubatch split. Phase 4: batched-GEMM kernel tuning. Phase 5: backend
sampling. Cross-cutting: spec-dec for dense.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… plan

Decisive DGX experiment: rebuilt with -DGGML_CUDA_FORCE_CUBLAS (it's a compile
#ifdef, not the runtime env we'd been setting - so prior 'cuBLAS no-op' tests
never engaged it). Real result: cuBLAS is SLOWER than MMQ for dense Q4 (pp2048
690 vs 750) and runs an Ampere cutlass_80_tensorop kernel - CUDA-13 has no sm_121
GEMM, falls back to sm_80. So both MMQ and cuBLAS sit at ~46 TFLOP/s; no library
shortcut to the 213 ceiling on GB10. Confirms a hand-tuned sm_120a kernel is
required. Added the phased W4A16 Marlin-style implementation plan (P0 harness ->
P5 enable) as the committed multi-week build; corrected the cuBLAS note.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
mudler added 20 commits June 27, 2026 07:19
The six LocalAI-paged NVFP4 entries advertised GB10 throughput figures with
no machine-readable hardware signal, and the four qwopus/MTP entries lacked
the nvfp4 tag entirely (not discoverable as NVFP4). Per the cross-arch audit
(ARCH_GENERALITY_AUDIT.md section gallery-targeting), NVFP4 GGUFs run
everywhere via dequant (never fail), so the gap is performance-expectation,
not correctness; the only available lever is description + tags.

- Add the nvfp4 tag to the four qwopus/MTP entries that lacked it; the two
  base qwen3.6 entries already had it.
- Add a blackwell tag to all six (precedent: the nvidia hardware tag is
  already used on many gallery entries as a filter chip).
- Lead each of the six descriptions with a one-line Blackwell-recommended /
  runs-slower-off-Blackwell caveat.
- Scope the qwen3.6-27b 90-117% of vLLM claim explicitly to GB10 / DGX Spark
  (consumer Blackwell) so it is not read as a universal figure.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…patch 0030)

Closes audit RISKY-1 (the one latent silent-miscompute hazard). The fused/in-place
Gated Delta Net op (0018/0019/0026) and the discriminated SSM_CONV decode op
(0021/0028, which REUSE GGML_OP_SSM_CONV / GGML_OP_GATED_DELTA_NET via a non-null
src[3]/src[4] discriminator) are CUDA+CPU-only but were emitted DEFAULT-ON
(cparams.fused_gdn_ar/ch=true, auto_fgdn=true) with no backend guard. A backend
that supports plain SSM_CONV but ignores the discriminator (Vulkan/SYCL/Metal)
would run the wrong plain conv => silent corruption.

Fix: in llama_context::sched_reserve(), before the auto_fgdn resolution, force
fused_gdn_ar = fused_gdn_ch = auto_fgdn = false when any non-CPU compute backend
is not CUDA-family (reg name not "CUDA"/"ROCm"/"MUSA"). Every emission site keys
off these flags, so the graph falls back to the upstream non-fused plain
ggml_ssm_conv + ggml_silu path that every backend handles. On CUDA the reg name is
"CUDA", the flags are left untouched, and the decode graph is byte-identical.

Mirror of DGX paged patch 0030; adds FUSED_OP_BACKEND_GATE_RESULTS.md.

Verified GPU-free: reconstructed pin 9d5d882d + paged 0001-0029 + 0030, CPU-only
build (GGML_CUDA=OFF) of libllama + test-backend-ops links with 0 errors; 0030
applies cleanly via git apply and patch -p1. test-backend-ops correctness for
SSM_CONV/SSM_CONV_UPDATE(_IDS)/GATED_DELTA_NET is CUDA0-vs-CPU (pending DGX,
tunnel offline this session); registered test cases will exercise it.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Close the single build-targeting gap the cross-arch audit (ARCH_GENERALITY_AUDIT.md
section 6, item 2) flagged: the paged backend had no Metal/darwin variant and no
metal: capability key, so a Mac user selecting llama-cpp-localai-paged fell back to
default=cpu (a Linux image) that does not run, with no fallthrough to stock llama-cpp.

Mirror exactly how stock llama-cpp does darwin:

- .github/backend-matrix.yml: add the includeDarwin row
  (-metal-darwin-arm64-llama-cpp-localai-paged, arch arm64, lang go) next to the
  stock llama-cpp darwin row.
- backend/index.yaml: add the metal: capability key to the
  llama-cpp-localai-paged meta-backend plus the metal-llama-cpp-localai-paged and
  -development variant entries (URIs match the matrix tag-suffix); add Metal to tags.
- scripts/build/llama-cpp-localai-paged-darwin.sh: new bespoke darwin build,
  a line-for-line mirror of llama-cpp-darwin.sh swapping the paged wrapper dir,
  binary names, ggml-shared-libs dir and output tar. Same CPU_ALL_VARIANTS + Metal
  path (GGML_METAL=ON via the reused llama-cpp Makefile when OS=Darwin; --target ggml
  pulls in ggml-metal via add_dependencies) with LLAMA_PAGED=on.
- Makefile: add backends/llama-cpp-localai-paged-darwin target (+ .NOTPARALLEL).
- .github/workflows/backend_build_darwin.yml: give the paged backend the same
  bespoke darwin build step as stock llama-cpp, share the llama ccache restore (save
  stays stock-only to avoid a same-run key collision), and exclude it from the
  generic build-darwin-go-backend step.
- scripts/changed-backends.js: comment-only - the paged darwin path mapping was
  already present (forward-looking); update the stale "if a metal row is ever added"
  note now that the row exists.

Metal delivers paged-KV only (NVFP4 FP4-MMA is CUDA/Blackwell-only); the GDN/conv
fused ops have no Metal kernel, so a gated-DeltaNet (qwen35) model falls back to the
CPU reference op at runtime - made SAFE by the fused-op backend gate (patch 0030).
This is config; the Metal build runs in CI on the next push and is runtime-tested on
the M4 Mac.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…-attention

# Conflicts:
#	gallery/index.yaml
The llama-cpp-localai-paged backend reused backend/cpp/llama-cpp's LLAMA_VERSION,
which .github/workflows/bump_deps.yaml auto-bumps nightly to the latest
ggml-org/llama.cpp master tip. The stock backend is patch-free so that bump is
safe, but the paged backend applies a vendored patch series
(backend/cpp/llama-cpp/patches/paged/) hand-verified bit-exact against ONE
specific tip. A naive bump moves the tip out from under the patches and breaks
'git apply' at build time - a dep-bump PR would go red (or, worse, the break
surfaces later in a release build).

Mirror the turboquant precedent: give the paged wrapper its OWN LLAMA_VERSION
pin (the verified 9d5d882d) and force it into every copied build via
LLAMA_VERSION=$(LLAMA_VERSION), so the nightly stock bump no longer drags the
paged build to an unverified tip. Unlike turboquant (whose fork branch carries
the patches and is safe to auto-bump), the paged series is vendored, so it gets
NO bump_deps.yaml entry: it is advanced only by the manual PIN_SYNC process.
Add cross-referencing comments in both Makefiles and bump_deps.yaml.

Also add PIN_BUMP_APPLY_CHECK.md: an apply-feasibility report for the latest tip
(c299a92c, 23 commits ahead). The full series applies CLEAN under 'git apply'
with only benign line offsets and zero conflicts; the lone failure (0019) is a
pre-existing stray dev-doc hunk, identical on the current pin, not a bump
regression.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The paged backend (backend/cpp/llama-cpp-localai-paged) pins its own verified
llama.cpp tip and is excluded from the nightly auto-bumper so a naive bump can
never silently break the shipped build. That exclusion also removed the early
warning of upstream drift. This restores the signal without touching the pin.

Add .github/workflows/llama-cpp-paged-canary.yml (weekly + workflow_dispatch):

- apply-check job (ubuntu-latest, toolchain-free): resolve the latest
  ggml-org/llama.cpp master tip, shallow-checkout it, and apply the full paged
  series 0001-0030 in order with the build's own git-apply method via the new
  shared helper .github/scripts/paged-canary-apply.sh. Red on any apply break.
- compile job (needs apply-check): on the exact tip it validated, build the
  paged backend (cublas) inside the same base-grpc-cuda-12 toolchain and the
  same `make grpc-server` target the shipped build uses, so a red means upstream
  drift, not toolchain noise. nvcc compiles the kernels with no GPU present.

Red here = run a PIN_SYNC (rebase + bit-exact gate + re-export), then bump the
paged Makefile pin. The canary is signal-only: it opens no PR and never moves
the pin, so the shipped build and the dep-bump PRs stay green regardless. It is
fully separate from bump_deps.

The lone pre-existing quirk in the series (patch 0019 carries a stray modify
hunk against the dev-only doc SSM_DECODE_FIX_RESULTS.md, absent from any clean
upstream checkout; git apply is atomic so it rejects the whole patch and
cascades to 0021/0022/0026/0028) is handled path-scoped: the helper excludes
only that dev-doc and still applies 0019's real code hunks atomically, mirroring
prepare.sh's tolerance, so the quirk never false-positives the canary but a
genuine code break in 0019 still turns it red.

Point the existing pin comments in backend/cpp/llama-cpp-localai-paged/Makefile
and .github/workflows/bump_deps.yaml at this canary as the drift signal, and
document it in the PIN_SYNC doc: canary red -> do a pin-sync.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ean checkout

The shipped from-patches build applies the paged series with strict `git apply`
(backend/cpp/llama-cpp/Makefile `llama.cpp` target:
`git apply --verbose "$p" || { ...; exit 1; }`), which is atomic: a hunk against
a file missing from the tree rejects the whole patch and fails the build. Four
patches carried hunks against dev-only docs that live in the DGX dev tree but are
absent from a clean ggml-org/llama.cpp checkout, so the build only succeeded on
the DGX and FAILED on CI / any clean checkout:

  0019 -> SSM_DECODE_FIX_RESULTS.md   (modify hunk = the root reject)
  0020 -> LEVER1_OPROJ_MMQ_RESULTS.md (create)
  0021 -> CONV_STATE_FUSION_RESULTS.md (create)
  0028 -> LEVER1_GATHER_PROGRESS.md, LEVER1_GATHER_RESULTS.md (create)

0019's reject cascaded to 0021/0022/0026/0028 (which build on 0019's code). Strip
each `diff --git a/<devdoc>` section plus its diffstat line, `create mode`
trailer, and correct the summary count. Every llama.cpp SOURCE hunk is left
byte-identical (verified by sha256 of each patch's source-diff tail).

Verified on a fresh clone of ggml-org/llama.cpp at the pin 9d5d882d: BEFORE,
strict `git apply` failed at 0019 (cascade 0019/0021/0022/0026/0028); AFTER, the
full series 0001-0030 applies with exit 0 (sentinel created, zero stray docs).
The tolerant `patch -p1` fallback in prepare.sh also applies with zero rejects.

PIN_SYNC_9d5d882d.md documents the durable fix: re-exports/pin-syncs must keep
patches source-only (export with a source pathspec / `:!*.md`, gate with a strict
`git apply` on a clean checkout). The upcoming c299a92c pin-bump re-export must
produce source-only patches too.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ied)

Advance the paged-attention backend's owned llama.cpp pin by 23 upstream
commits. The shipped source-only patch series (0001-0030, 28 patches) applies
strict-clean (git apply, exit 0) on a fresh c299a92c checkout with no re-export
needed, and the bit-exact gate is GREEN on every path on GB10 (CUDA sm_121):

- md5 greedy decode (-ngl 99 -fa on -n 48 --temp 0 --seed 1): dense
  non-paged/paged 5951a5b4, MoE non-paged 07db32c2, MoE paged 8cb0ce23; all
  match the established baselines.
- test-backend-ops CUDA0: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16,
  SSM_CONV_UPDATE_IDS 16/16, GATED_DELTA_NET 84/84, MUL_MAT 1146/1146,
  MUL_MAT_ID 806/806; all OK.

The 23-commit upstream jump did not change our decode output. The .patch files
are kept byte-identical (they already apply strict-clean at the new pin); only
the pin, the PIN_SYNC evidence doc, and the canary/gallery doc references change.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The paged-attention patch directory had accumulated ~55 scattered dev docs
(results, progress, scope, lever, and gap-analysis notes). Consolidate the
durable content of all of them into one canonical
backend/cpp/llama-cpp/patches/paged/README.md covering: what the patchset is,
the architecture (paged KV + block-table flash-attn, the gated-DeltaNet SSM
decode path, NVFP4 FP4-MMA, the decode-first scheduler), the full 0001-0030
patch series table with bit-exact status, the GB10 benchmarks
(patched-vs-stock-vs-vLLM + the Apple M4 architectural note), the dev notes
(bit-exact methodology, the per-path gate, the MoE-parity conclusion, the
rejected/flat levers, the opt-in bf16-SSM mode), arch+quant generality, the
pin + canary maintenance policy, and the published NVFP4 gallery models.

Delete the consolidated-away dev trail. Keep the three operational docs the
README links to: PIN_SYNC_c299a92c.md (canary reference), PAGED_BITEXACT_NOTE.md
(per-path gate reference) and LOCALAI_LLAMACPP_BACKEND_PLAN.md (the
ship-as-own-backend design-of-record), plus the benchmark plots + csv. The
.patch files and the unit/bench .cpp are untouched.

Repoint every external reference to a deleted doc at the new README:
grpc-server.cpp, docs/content/features/backends.md, gallery/index.yaml, the
canary apply script (PIN_BUMP_APPLY_CHECK.md -> README), and the base
patches/README.md (ADDITIVE_DESIGN.md -> README). The canary's PIN_SYNC
reference still resolves; its inert SSM_DECODE_FIX_RESULTS.md glob (a
patch-internal path matcher, not a repo-doc link) is left intact.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…s patch series

Move ALL paged-attention content out of the stock backend/cpp/llama-cpp
backend and into backend/cpp/llama-cpp-localai-paged, so the stock backend is
pure upstream llama.cpp and the paged backend owns and applies its own vendored
patch series.

- Delete the dead early-exploration scaffold backend/cpp/llama-cpp/paged/
  (kernel/w4a16 Marlin scaffold, standalone paged_kv_manager, bench/loadgen,
  its own 0001-0002 patches, dense-era design docs, tests). Zero references
  repo-wide.
- Move backend/cpp/llama-cpp/patches/ (the 28-patch paged series + paged/README
  + 3 operational docs, plus the kernel/ scaffold patch and the top-level paged
  README/BENCHMARKS) to backend/cpp/llama-cpp-localai-paged/patches/. The stock
  backend keeps no patches/ dir; it had no non-paged base patches.
- Purify the stock backend: remove the LLAMA_PAGED make variable, the
  patches/paged apply loop, and the LLAMA_PAGED passthrough to prepare.sh;
  remove the paged-series handling from prepare.sh. The stock llama.cpp target
  now only clones the pin and applies its own (currently empty) base patches/
  series. The runtime paged option hooks in the shared grpc-server.cpp are
  untouched (inert without the patches).
- The paged backend's Makefile now applies its OWN patches/paged/0*.patch onto
  each freshly cloned tree via strict git apply (apply-paged-patches), after the
  copied stock infra clones the pin and applies base patches.
- Repoint every reference to the old patches/paged path: the upstream canary
  workflow + apply script, bump_deps.yaml, gallery/index.yaml, the docs,
  backend/index.yaml, backend-matrix.yml, the top-level Makefile comments, and
  the moved PIN_SYNC / README docs. Drop the now-removed LLAMA_PAGED=on
  build-toggle from comments.

Verified: the full 28-patch series applies strict-clean (git apply, exit 0) to
a clean ggml-org/llama.cpp checkout at the pinned c299a92c, and the repointed
canary apply script resolves and applies the series end to end.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… learnings

Section 4(c): real Apple M4/Metal numbers (Qwen3-8B Q4_K_M, stock vs patched) -
patchset is neutral-to-slightly-negative on Metal (the in-kernel block-table read
is CUDA-only; NVFP4/GDN-fusions inert), so prefer stock llama-cpp on Apple Silicon.
Vulkan: same picture, worse (no upstream GDN op). Section 6: cross-backend learnings
+ upstream candidates (the GDN decode-plumbing fusions are the portable, bit-exact,
CPU-mirrored win worth upstreaming).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…al/Vulkan/SYCL)

Source-only analysis of what it would take to give the gated-DeltaNet decode
fusions (0018 in-place state write-back, 0019 fused recurrent-state gather,
0021 ssm_conv_update_inplace, 0028 conv-tap gather fusion) native kernels on
the non-CUDA compute backends, so the patch-series decode win extends past
CUDA-family hardware.

Key findings:
- The base GGML_OP_GATED_DELTA_NET and GGML_OP_SSM_CONV kernels ALREADY exist
  upstream on Metal, Vulkan AND SYCL (the README's no-Vulkan-kernel line is
  stale). The Qwen3.6 hybrids run on all three today via the non-fused path;
  Layer-2 is the decode SPEEDUP, not enabling the model to run.
- Per backend the new work is only the FUSION plumbing: redirect the GDN state
  write (in-place), add the ids read, write one new conv-update kernel + its
  ids variant, two tiny gather kernels, plus supports_op + op-handler + (Vulkan)
  pipeline/push-constant/descriptor wiring. Builders, CPU refs, model graph and
  test-backend-ops cases are shared and already done.
- Bit-exactness is feasible per backend by construction (the fusions redirect
  addresses, not the f32 reduction order); test-backend-ops (backendX-vs-CPU)
  is the gate.
- The 0030 name allow-list should become capability-driven (make supports_op
  authoritative for the discriminated src slots).
- Ranked: ops-first PR, then Metal (highest value/effort, fixed simdgroup =
  simplest bit-exactness), then SYCL (near-verbatim CUDA mirror, cheapest to
  author), then Vulkan (widest hardware reach but the shader-gen + variant
  matrix + subgroup variance make it the capstone).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…y rationale

The gated-DeltaNet + SSM_CONV ops have upstream Metal/Vulkan/SYCL kernels, so the
Qwen3.6 hybrids run there (non-fused) - the earlier 'no Vulkan kernel' note was
wrong. The patchset's fusions are gated off off-CUDA, so the backend ships
CUDA-only; non-CUDA users use stock llama-cpp.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The paged backend previously built for cublas/cuda, cpu, vulkan, sycl,
hipblas and darwin/metal. On non-CUDA the patchset's wins are inert: the
GDN fusions are gated off (patch 0030) and NVFP4 falls back to dequant,
so the backend is neutral-to-negative there (README section 4c). The
darwin grpc-server link also fails on undefined upstream server symbols,
turning CI red. Both broken and pointless off-CUDA, so ship CUDA-only.

- backend-matrix.yml: drop the hipblas, sycl f32/f16, cpu amd64/arm64,
  vulkan amd64/arm64 and metal-darwin rows for this backend; keep the
  four cublas rows (cuda-12, cuda-13, nvidia-l4t cuda-12 and cuda-13).
- index.yaml: meta-backend (and -development) capabilities are now
  CUDA-only with default pointing at cuda12 (mirrors faster-qwen3-tts);
  removed the orphaned cpu/rocm/sycl/vulkan/metal variant entries.
- Removed the now-unused darwin build script and its Makefile target /
  .NOTPARALLEL entry / backend_build_darwin.yml step.
- Documented the CUDA-only build coverage in the patch README and plan.

Non-CUDA users should use the stock llama-cpp backend.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… skills

Two .agents guides (indexed in AGENTS.md):
- llama-cpp-localai-paged-backend.md: what the CUDA-only paged backend is, the
  patchset scope, the bit-exact gate, the manual pin-sync + weekly canary, the
  CUDA-only / stock-stays-pure invariants, and the Metal/SYCL/Vulkan follow-up scope.
- vllm-parity-methodology.md: the decode-parity playbook (bit-exact gating,
  profile-don't-assume, both-engine ground-truth, per-lever A/B, recording rejected
  levers, multi-agent GPU orchestration).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…to docs/

The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv,
dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv.
Restore the invariant that patches/ holds only the .patch series.

Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md,
  final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)

Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)

Repoint every reference to the moved files: README internal links (docs/ + the
.github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md,
.github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml,
the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml,
docs/content/features/backends.md, gallery/index.yaml.

The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged)
is unchanged and still resolves to the 28 patches.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…rver link

The c299a92c bump diverged 23 commits ahead of the stock llama-cpp pin.
grpc-server.cpp is SHARED with the stock backend and tracks the stock pin;
c299a92c's upstream server-API refactor pulled stream_* helpers into the headers
grpc-server.cpp includes, whose definitions the stock-aligned build does not
compile -> every paged variant failed to LINK (undefined reference to
stream_aware_should_stop / stream_pipe_producer::cleanup /
stream_session_attach_pipe). The bump was greedy-md5 bit-exact, but the bit-exact
gate never exercises the full grpc-server build, so it slipped through.

Revert LLAMA_VERSION to 9d5d882d (== stock pin, where the patches are bit-exact
AND grpc-server links - the original DGX-proven baseline). Document the hard
constraint in the Makefile, README, PIN_SYNC record, and the .agents guide: the
paged pin must track the stock pin, and a pin-sync must pass the full CI
grpc-server build, not only the bit-exact gate.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The paged backend's llama.cpp pin was reverted from c299a92c back to
9d5d882d (== stock), so docs/PIN_SYNC_c299a92c.md (a blow-by-blow of the
reverted sync) is dead weight. The pin-sync PROCESS stays documented in
the three live places: the Makefile comment, README section 7 (Pin +
maintenance policy), and .agents/llama-cpp-localai-paged-backend.md.

Delete the doc and repoint every reference to it (Makefile, README,
.agents, canary script + workflow) at README section 7. No functional
paths change: the canary's patches-dir glob (patches/paged/0*.patch)
is untouched.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…bf16-tau)

Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE
q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and
README section 4 carry a single consistent set of llama numbers with all three
configs:

- stock: separately-built unpatched llama.cpp at this backend's exact pin
  9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce
  stock - the SSM decode fusions are compiled in, not env-gated).
- patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE).
- patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact,
  ~91% same-top-p).

final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both
models at all four widths (the prior CSV had no stock and no bf16-tau rows).
peak_gb is dropped: the GB10's unified LPDDR5x reports [N/A] to nvidia-smi and
the bench does not print it, so per-run peak could not be captured this session.

Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in
bf16-tau adds a further +3% to +17% on top of patched (growing with width).
vLLM column is kept from the prior session (not re-run) and labeled as such.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…-in wins

The DGX re-run showed toggling LLAMA_KV_PAGED on/off on the patched binary does
NOT reproduce stock: the dominant SSM decode fusions are compiled in, not
runtime-gated, so the toggle measures only the (here ~neutral) paged-KV part.
True stock needs a separately-built unpatched binary at the same pin. Correct the
methodology skill's per-lever discipline + apples-to-apples rule accordingly.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler mudler marked this pull request as ready for review June 27, 2026 22:19
mudler added 9 commits June 27, 2026 22:20
…overview

Rebuild the two committed decode plots from the re-measured CSV and add a combined
overview. Three series per the comparison that matters: llama.cpp (standard) vs
vLLM vs LocalAI's llama.cpp patches; x-over-standard called out at npl128. bf16-tau
stays out of the plot (it remains in the CSV + the README table as the opt-in row).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Per request, the plots now show all four series: llama.cpp (standard), vLLM,
LocalAI's llama.cpp patches (bit-exact hero), and LocalAI's patches + bf16-tau
(opt-in ceiling, +3% to +17% over the patches, ahead of vLLM at every dense width
and MoE npl>=32). Subtitle flags bf16-tau as opt-in / not bit-exact.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…CUDA-13 only

The paged backend targets Blackwell sm_121a, which CUDA 12.0 cannot target
at all, so the CUDA-12 variants were pointless. They were also broken: the
cublas-12 / nvidia-l4t / arm64 build failed to compile paged-kv-manager.cpp
("no declaration matches ...", a ~10-function mismatch the older
cuda-12-base gcc rejects). CUDA-13 compiles it fine (confirmed on GB10).

Removed (config-only, scoped to the paged backend):
- backend-matrix.yml: the two CUDA-12 paged rows
  (-gpu-nvidia-cuda-12-llama-cpp-localai-paged,
   -nvidia-l4t-arm64-llama-cpp-localai-paged)
- backend/index.yaml: CUDA-12 capability keys (nvidia-cuda-12,
  nvidia-l4t-cuda-12, nvidia-l4t) on both meta-backends, repointed
  default/nvidia to the cuda13 amd64 variant, and dropped the orphaned
  cuda12-* / nvidia-l4t-arm64-* variant definitions (latest + -development).

Kept CUDA-13 only: cuda13-llama-cpp-localai-paged (amd64) and
cuda13-nvidia-l4t-arm64-llama-cpp-localai-paged (l4t arm64). Matrix
tag-suffixes <-> index variant URIs form a clean 2:2 bijection.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Fixes cuda-13 amd64 / non-arm64 build where size_t was used without the
header (arm64 cuda-13 pulled it in transitively; amd64/cuda-12 toolchains
do not). Compile-only change, bit-exactness unaffected.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ix amd64/non-arm64 build; compile-only)

Vendored paged headers used size_t / uintN_t without including <cstddef> /
<cstdint>. The arm64 DGX toolchain provides them transitively so the build
passed there, but amd64/older toolchains do not, failing the CI amd64 build one
header at a time ('size_t' does not name a type -> cascade).

paged-kv-manager.h was already fixed. This adds the missing includes to the
remaining vendored headers at the point each is created/rewritten in the patch
series so every src/paged*.h self-includes both:

  * paged-attn.h     (0003): add <cstddef> (had <cstdint>)
  * paged-alloc.h    (0007): add <cstddef> (had <cstdint>)
  * paged-prefix-api.h (0007): add <cstddef> + <cstdint> (had only llama.h)

The .cpp units include their own paged header, so they inherit the includes
transitively. Whole series still applies clean on the pinned llama.cpp.

Compile-only change: no runtime behavior change, bit-exactness unaffected.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ve/restore (patch 0026)

The opt-in ssm_bf16_tau hybrid mode splits a gated-DeltaNet layer's
recurrent SSM state into an f32 partition (s_l) and a bf16 partition
(s_l_bf16). The recurrent state serialization paths (state_write_data /
state_read_data) were never updated for the split: they read/wrote s_l
using the FULL hparams.n_embd_s() (S_v*S_v*H) row width, but a split
layer's s_l only holds S_v*S_v*n_f32, so the access overruns the smaller
tensor (a ggml_backend tensor read out of bounds), and the bf16
fast-head partition was never persisted at all.

This is what broke high-concurrency serving with --ssm-bf16-tau: the
server's context-checkpoint feature serializes per-sequence state via
state_seq_get_data. With a checkpoint enabled, even a single request
triggered the out-of-bounds read; at higher concurrency the cell range
starts at a higher base slot so the overrun reaches further (hard abort
in a debug build, silent state corruption then 1-token-then-EOS on
restore in a release build). The static batched-bench never exercises
save/restore so it did not catch it; the GDN decode kernel and per-head
partition offsets were already correct (decode with checkpoints disabled
is fine at N=8/16/32).

Fix: serialize the f32 partition and, when the layer is split, the bf16
partition right after it, each with its OWN row width (tensor ne[0]).
head_slot is rebuilt deterministically at load (same model + tau), so it
is not serialized. Non-split layers have ne[0] == n_embd_s() and no bf16
partition, so their on-disk format and behavior are byte-identical (the
default f32 path and the bit-exact gate are unaffected).

Verified on GB10/DGX with Qwen3.6-35B-A3B-NVFP4 + --ssm-bf16-tau 64 via a
continuous-batching llama-server: with context checkpoints enabled, N=8,
N=16 and N=32 (slot reuse + restore) all now produce full coherent
128-token output and the server stays up; pre-fix the same config
aborted on the first checkpoint.

Assisted-by: Claude:claude-opus-4-8[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
master auto-bumped the stock llama-cpp pin 9d5d882d -> 0ed235ea and updated the
shared grpc-server.cpp. The paged backend's pin must track the stock pin (the
grpc-server.cpp is shared), so bump its LLAMA_VERSION to match. All 28 paged
patches apply clean on 0ed235ea (verified against a fresh upstream clone). The
bf16-tau state-serialization fix (patch 0026) is included. Bit-exact gate + full
grpc-server build verify on GPU/CI to follow.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… (+ROCm)

Add ACCELERATOR_PORTING_SCOPE.md, the umbrella scope for taking the paged
backend's accelerator-portable wins off the CUDA family. It builds on (does
not duplicate) UPSTREAM_LAYER2_SCOPE.md, which stays the GDN/SSM-fusion
detail (benefit #1), and adds:

- Benefit #2 (paged KV in-kernel block-table flash-attn read, 0009-0011):
  new per-backend feasibility from source analysis of the Metal/SYCL/Vulkan
  flash-attn kernels. SYCL EASY (near line-for-line CUDA mirror), Metal
  EASY-MEDIUM (decode already routes to the vec kernel), Vulkan MEDIUM (the
  fast coopmat2 NVIDIA decode path cannot do the indexed read; push-constants
  are full). Universal constraint: only the vec/scalar decode kernel admits
  the per-cell indexed read, so route block-table ops onto vec (as CUDA's
  0009-0010 dispatch guard already does) and leave the fast MM/coopmat2 path
  contiguous-only. This is the lever that flips paged KV from
  neutral-to-slightly-negative to non-negative off CUDA.
- Benefit #3 (decode-first scheduler, 0013/0016): confirmed a free portable
  win - host-side update_slots() policy, zero kernel work, runs on any
  accelerator as-is.
- Benefit #4 (NVFP4 FP4-MMA, 0017/0023/0025): out of scope (Blackwell only);
  flags the backend-agnostic analogues of the act-quant dedup and the
  graph-coverage lever without over-claiming a port.
- A ROCm note: ROCm rides the CUDA/HIP path (validate, don't re-port);
  FP4-MMA stays Blackwell-only.

Benefits #1 and #2 share the port shape and rank Metal->SYCL->Vulkan, so they
bundle into one per-backend PR behind a shared ops-first PR. Cross-link added
from UPSTREAM_LAYER2_SCOPE.md. All gates are test-backend-ops on-target (no
Metal/SYCL/Vulkan/ROCm hardware here).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… pin 0ed235ea

llama.cpp renamed the RPC tool target (tools/rpc/CMakeLists.txt: set(TARGET
ggml-rpc-server)) at the 0ed235ea pin. master already updated the stock
llama-cpp Makefile to match (--target ggml-rpc-server, cp bin/ggml-rpc-server);
the paged backend's separate Makefile copy was left stale and its -grpc (RPC)
variant failed with 'No rule to make target rpc-server' (grpc-server itself
built to 100%). Mirror the stock rename in the paged Makefile.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants