Post-0.6.0 perf track: the batched AVX-512 sign kernel measures ~1.0 ns/(query·doc) vs ~0.3-0.35 theoretical ALU floor; the transpose-tree reduction experiment measured neutral (+3% within noise, branch perf/kernel-transpose-reduce, parked) — the remaining win requires doc-major score layout (contiguous stores per doc-chunk) with a tile transpose or strided collector reads, plus caller-owned parallel single-query via top_m_candidates_range (greenlit design: boring exact partitioning, deterministic merge). Evidence and analysis in ordinaldb's perf-train PR trail.
Post-0.6.0 perf track: the batched AVX-512 sign kernel measures ~1.0 ns/(query·doc) vs ~0.3-0.35 theoretical ALU floor; the transpose-tree reduction experiment measured neutral (+3% within noise, branch perf/kernel-transpose-reduce, parked) — the remaining win requires doc-major score layout (contiguous stores per doc-chunk) with a tile transpose or strided collector reads, plus caller-owned parallel single-query via top_m_candidates_range (greenlit design: boring exact partitioning, deterministic merge). Evidence and analysis in ordinaldb's perf-train PR trail.