perf: stream the corpus once per call in sign candidate generation#278
perf: stream the corpus once per call in sign candidate generation#278Nelson Spence (Fieldnote-Echo) wants to merge 6 commits into
Conversation
Independent oracle (score_all + full lexicographic sort by hamming asc, doc_id asc) pins top_m_candidates and top_m_candidates_batched_serial_csr exactly: random corpora across block boundaries, massive-tie and duplicate-run corpora exercising boundary tie-breaks, edge geometries (m >= n, single doc, empty batch), and the dim=1024 shape. Must pass bit-identically before and after the tiling swap.
top_m_candidates_batched_serial_csr previously looped the single-query path, re-streaming the full sign bitmap per query (documented-naive Track-1). The internals now scan the corpus once per call in L2-sized doc blocks, score every query of the call against each hot block in query tiles via the existing batched kernel, and select per-query top-m with bounded (hamming, doc_id) min-collectors — bit-identical to a full sort by construction, independent of processing order (the key IS the contract's sort key). top_m_candidates routes through the same core, dropping its per-call n-row Hamming materialisation. Per-query corpus traffic drops by the call's query count: at 1.26M rows x 1024 dims, a 2048-query call reads the 161MB sidecar once instead of 2048 times. Serial contract preserved (no rayon); the oracle suite (tests/tiled_candgen.rs) pins bit-identical outputs across random, tie-heavy, duplicate-run, and edge geometries.
|
Codex (@codex) review |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
1 similar comment
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
PR Summary by QodoStream sign-bitmap corpus once per call for exact top-m candidate generation
AI Description
Diagram
High-Level Assessment
Files changed (2)
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Code Review by Qodo
1. Heap overhead scales poorly
|
Audit remediation: routing top_m_candidates through the streamed core measured +50-90% at small/medium n with m in the hundreds (bounded heap O(n log m) vs select_nth_unstable_by O(n)); with one query there is no scan to share, so nq=1 stays on the dense path (parity-or-better at every measured size). Also per audit: the block-boundary oracle test now genuinely spans three blocks (the dim=128 shape fit one block), and adds the dim=768 AVX-512 tail-residue x multi-block case to the permanent suite.
Adversarial audit triage (code-reviewer agent, empirical)Verdict: no correctness or memory-safety defects. The bit-identical claim held under adversarial testing: block-boundary tie runs, all-tied corpora across m ∈ {1..1301}, tiny-n floor clamp, 70-query multi-tile × multi-block CSR, and the real AVX-512 VPOPCNTDQ kernel across tail residues (rem ∈ {0,1,4}). Heap selection is order-independent by uniqueness of (hamming, doc_id) keys; kernel sub-slicing indexes purely relative to the slice base; scores buffer fully overwritten per tile; doc-id u32 cast protected by the MAX_VECTORS=64Mi invariant. Fixed in follow-up commit:
Noted, accepted: |
Bot-review remediation (Qodo, PR #278): the shared core derived nq by integer division; a ragged buffer from a future caller would silently truncate. All current callers validate upstream — this is the cheap in-core invariant.
|
Qodo triage: (1) ragged query buffer silently truncated — fixed in |
That triage looks right.
If you want, I can help phrase the deferred note for the PR summary. |
Bot-review remediation (Qodo, PR #278): nq * m_eff can overflow usize on 32-bit/wasm32 targets, and the CSR wrapper's saturating_mul would attempt a usize::MAX allocation. Both sites now use checked multiplication with a clear tile-the-batch message, matching the crate's checked-allocation discipline. The exact m_eff + 1 heap reservation is kept deliberately: gradual growth double-allocates to the next power of two (~2x peak per query) — the reservation is the memory-optimal choice, now documented.
|
Qodo triage update (heap-overhead finding): remediated in On the suggested alternatives: gradual heap growth was evaluated and rejected as strictly worse — |
Acknowledged — that does address finding 1.
So from my side, this looks remediated. |
Bot-review remediation (Qodo, #283 inline): build_query_bitmap allocated a fresh Vec and re-validated finiteness per query; the entry points already validate the whole buffer and the destination is preallocated. Oracle suites pin bit-identical output.
Summary
Replaces the documented-naive internals of
top_m_candidates_batched_serial_csr(per-query full-corpus rescans) with a doc-major tiled scan: the corpus streams once per call in L2-sized blocks, each hot block is scored against every query of the call (32-query tiles through the existingsign_scan_collect_batchedkernel), and per-query bounded min-collectors keyed(hamming, doc_id)select the exact lexicographic top-m.top_m_candidatesroutes through the same core and stops materialising its n-row Hamming buffer.The swap is sanctioned by the method's own doc comment ("a future release may replace the internals with streaming top-m behind this frozen signature"). Serial contract preserved: zero rayon; callers own parallelism.
Determinism
The collector key is the contract's sort key, so selection is bit-identical to full-sort-then-truncate by construction — no dependence on block processing order. Pinned by
tests/tiled_candgen.rs(committed first, green against the old implementation): independent oracle viascore_all+ lexicographic sort, over random corpora spanning many blocks, massive-tie and duplicate-run corpora (tie runs longer than m), edge geometries (m ≥ n, single doc, empty batch), and the dim=1024 shape.Why (measured motivation)
At 1,258,135 rows × 1024 dims the sign sidecar is 161MB; the old path re-streamed it per query — a 2048-query batch cost 330GB of DRAM traffic (~9.3s measured, ~220 q/s). One pass per call cuts per-query traffic by the call's query count; batch becomes compute-bound. Full before/after evidence lands with the OrdinalDB integration rerun on the same corpus (evidence-discipline: this PR's synthetic n=50k committed bench is L3-resident and understates the win by design; large-corpus numbers follow in the dependent PR).
Gate
Full suite + no-default-features + clippy -D warnings + MSRV 1.89 +
-D warningsbuild + fuzz build: green. New oracle tests: 5 suites.Part of the 1M-row release train (Track A2 of the locked master plan). Queued behind #277.