perf: stream the corpus once per call in sign candidate generation by Fieldnote-Echo · Pull Request #278 · Project-Navi/ordvec

Fieldnote-Echo · 2026-07-03T18:11:52Z

Summary

Replaces the documented-naive internals of top_m_candidates_batched_serial_csr (per-query full-corpus rescans) with a doc-major tiled scan: the corpus streams once per call in L2-sized blocks, each hot block is scored against every query of the call (32-query tiles through the existing sign_scan_collect_batched kernel), and per-query bounded min-collectors keyed (hamming, doc_id) select the exact lexicographic top-m. top_m_candidates routes through the same core and stops materialising its n-row Hamming buffer.

The swap is sanctioned by the method's own doc comment ("a future release may replace the internals with streaming top-m behind this frozen signature"). Serial contract preserved: zero rayon; callers own parallelism.

Determinism

The collector key is the contract's sort key, so selection is bit-identical to full-sort-then-truncate by construction — no dependence on block processing order. Pinned by tests/tiled_candgen.rs (committed first, green against the old implementation): independent oracle via score_all + lexicographic sort, over random corpora spanning many blocks, massive-tie and duplicate-run corpora (tie runs longer than m), edge geometries (m ≥ n, single doc, empty batch), and the dim=1024 shape.

Why (measured motivation)

At 1,258,135 rows × 1024 dims the sign sidecar is 161MB; the old path re-streamed it per query — a 2048-query batch cost 330GB of DRAM traffic (~9.3s measured, ~220 q/s). One pass per call cuts per-query traffic by the call's query count; batch becomes compute-bound. Full before/after evidence lands with the OrdinalDB integration rerun on the same corpus (evidence-discipline: this PR's synthetic n=50k committed bench is L3-resident and understates the win by design; large-corpus numbers follow in the dependent PR).

Gate

Full suite + no-default-features + clippy -D warnings + MSRV 1.89 + -D warnings build + fuzz build: green. New oracle tests: 5 suites.

Part of the 1M-row release train (Track A2 of the locked master plan). Queued behind #277.

Independent oracle (score_all + full lexicographic sort by hamming asc, doc_id asc) pins top_m_candidates and top_m_candidates_batched_serial_csr exactly: random corpora across block boundaries, massive-tie and duplicate-run corpora exercising boundary tie-breaks, edge geometries (m >= n, single doc, empty batch), and the dim=1024 shape. Must pass bit-identically before and after the tiling swap.

top_m_candidates_batched_serial_csr previously looped the single-query path, re-streaming the full sign bitmap per query (documented-naive Track-1). The internals now scan the corpus once per call in L2-sized doc blocks, score every query of the call against each hot block in query tiles via the existing batched kernel, and select per-query top-m with bounded (hamming, doc_id) min-collectors — bit-identical to a full sort by construction, independent of processing order (the key IS the contract's sort key). top_m_candidates routes through the same core, dropping its per-call n-row Hamming materialisation. Per-query corpus traffic drops by the call's query count: at 1.26M rows x 1024 dims, a 2048-query call reads the 161MB sidecar once instead of 2048 times. Serial contract preserved (no rayon); the oracle suite (tests/tiled_candgen.rs) pins bit-identical outputs across random, tie-heavy, duplicate-run, and edge geometries.

Fieldnote-Echo · 2026-07-03T18:11:54Z

Codex (@codex) review

chatgpt-codex-connector · 2026-07-03T18:11:58Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

chatgpt-codex-connector · 2026-07-03T18:12:04Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

qodo-code-review · 2026-07-03T18:13:21Z

PR Summary by Qodo

Stream sign-bitmap corpus once per call for exact top-m candidate generation

✨ Enhancement 🧪 Tests 🕐 40+ Minutes

AI Description

• Replace per-query corpus rescans with a doc-major tiled scan shared by single and batched APIs.
• Select exact lexicographic top-m via per-query bounded (hamming, doc_id) collectors for
 determinism.
• Add oracle-based tests pinning outputs across random, tie-heavy, duplicate, and edge geometries.

Diagram

graph TD
A["SignBitmap candidate APIs"] --> F{"Empty batch or m==0"} -->|"no"| B["top_m_candidates_streamed"] --> C["sign_scan_collect_batched"] --> D["Per-query BinaryHeap top-m"] --> E["Top-m results (Vec/CSR)"]
F -->|"yes"| E
T["tests/tiled_candgen.rs oracle"] --> A

High-Level Assessment

The following are alternative approaches to this PR:

1. Bucketed top-m by Hamming distance (counting-style)

➕ Hamming is bounded (0..dim), enabling deterministic selection with low per-item overhead.
➕ Potentially faster than BinaryHeap for large n with small dim (e.g., 1024).
➖ More complex to implement correctly with doc_id tie-break and streaming blocks.
➖ Memory can spike if many docs share low Hamming buckets; still need doc_id materialization per bucket.

2. Fixed-capacity partial selection (array + linear scan)

➕ Avoids BinaryHeap’s log(m) and allocation overhead; good when m is small/medium.
➕ Can be SIMD/cache friendly with tight loops.
➖ O(m) update per candidate unless carefully optimized; may regress for larger m.
➖ More bespoke code and harder to reason about than standard heap semantics.

3. Two-stage approximate then refine (coarse prefilter)

➕ Could further reduce compute by avoiding full scoring for obviously-bad docs.
➕ May open room for future ANN-like shortcuts.
➖ Violates the PR’s stated goal of exact, bit-identical top-m.
➖ Adds tuning knobs and risk of recall/quality regressions.

Recommendation: Keep the PR’s current streamed exact approach: it preserves the frozen ordering contract by using the contract key (hamming, doc_id) as the collector key, and it eliminates the dominant bandwidth waste (per-query rescans) without changing API semantics. If further speed is needed later, the most promising follow-up is swapping BinaryHeap for a fixed-capacity or bucketed collector while keeping the same streamed scan structure and tests as a safety net.

Files changed (2) +246 / -28

Enhancement (1) +89 / -28

sign_bitmap.rsAdd streamed/tiled exact top-m core and route single/batched CSR through it +89/-28

Add streamed/tiled exact top-m core and route single/batched CSR through it

• Introduces 'top_m_candidates_streamed', which scans the corpus once per call in L2-sized doc blocks and scores queries in small tiles via the existing batched kernel. Replaces the single-query 'top_m_candidates' full-materialization + selection with the streamed collector-based selection, and updates 'top_m_candidates_batched_serial_csr' to use the same core (including explicit empty-batch / m==0 handling) while preserving deterministic '(hamming, doc_id)' ordering.

src/sign_bitmap.rs

Tests (1) +157 / -0

tiled_candgen.rsOracle contract tests for sign candidate generation determinism and ties +157/-0

Oracle contract tests for sign candidate generation determinism and ties

• Adds an implementation-independent oracle (dense 'score_all' + full lexicographic sort) and asserts bit-identical results for both single-query and batched-CSR APIs. Covers random corpora across block boundaries, tie-heavy and duplicate-document cases, and edge geometries including empty query batches and dim=1024.

tests/tiled_candgen.rs

codecov · 2026-07-03T18:14:44Z

Codecov Report

❌ Patch coverage is 97.01493% with 2 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/sign_bitmap.rs	97.01%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

qodo-code-review · 2026-07-03T18:19:40Z

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0) 📜 Skill insights (0)

1. Heap overhead scales poorly 🐞 Bug ☼ Reliability

Description

top_m_candidates_streamed allocates one BinaryHeap per query and stores (hamming, doc_id) for each
retained candidate, adding O(nq*m_eff) extra state on top of the CSR output. For large batches
and/or large m, this can significantly raise peak memory and increase the risk of allocator aborts
or severe memory pressure.

Code

src/sign_bitmap.rs[R253-255]

+        let mut heaps: Vec<BinaryHeap<(u32, u32)>> = (0..nq)
+            .map(|_| BinaryHeap::with_capacity(m_eff + 1))
+            .collect();

Relevance

⭐⭐ Medium
Mixed history: they accept allocation reductions, but have rejected some “avoid allocation/OOM”
refactors as out-of-scope.
PR-#96
PR-#12
PR-#224

ⓘ Recommendations generated based on similar findings in past PRs

Evidence

The streamed implementation constructs a per-query BinaryHeap reserved to m_eff + 1, and the
serial CSR API now routes batched candidate generation through this helper, making the additional
per-query heap memory part of the normal execution path.

src/sign_bitmap.rs[247-256]
src/sign_bitmap.rs[391-414]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`top_m_candidates_streamed` allocates `nq` heaps and eagerly reserves `m_eff + 1` capacity for each, storing `(hamming, doc_id)` keys. This increases peak memory by roughly `nq * m_eff * size_of::<(u32,u32)>()` in addition to the CSR output buffer, which can become substantial.

## Issue Context
This method is now the shared core for both `top_m_candidates` and `top_m_candidates_batched_serial_csr`, so its memory behavior directly affects large-batch call sites.

## Fix Focus Areas
- Add an explicit size/memory guard (with a clear panic message) before heap construction for extreme `nq*m_eff`.
- Consider avoiding eager `with_capacity(m_eff + 1)` for every heap (allow gradual growth), or implement a fallback strategy for extreme sizes (e.g., multi-pass over query tiles) where trading extra corpus passes for lower peak memory is acceptable.

### References
- src/sign_bitmap.rs[247-256]
- src/sign_bitmap.rs[391-414]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. ~~Missing queries-length assert~~ ✓ Resolved 🐞 Bug ≡ Correctness

Description

top_m_candidates_streamed derives nq via integer division (queries.len() / dim) and does not
assert queries.len() is a multiple of dim, so a ragged slice would silently ignore the tail.
Adding an invariant check here keeps the helper self-contained and consistent with the public
batched APIs’ validation.

Code

src/sign_bitmap.rs[R235-237]

+        let dim = self.dim;
+        let nq = queries.len() / dim;
+        let qpv = self.qwords_per_vec;

Relevance

⭐⭐⭐ High
Team consistently adds fail-loud shape invariants (len multiple of dim) in similar APIs; see
SignBitmap/RankQuant asserts.
PR-#96
PR-#30
PR-#12

ⓘ Recommendations generated based on similar findings in past PRs

Evidence

The helper currently floors nq by division and then slices per-query ranges, while the public
serial-batched entry point explicitly rejects ragged query buffers, demonstrating the intended
invariant.

src/sign_bitmap.rs[231-246]
src/sign_bitmap.rs[391-399]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`top_m_candidates_streamed` computes `nq = queries.len() / dim` without validating the slice shape, which would silently truncate ragged inputs (and skip finiteness checks on the trailing remainder) if the helper is called without prior validation.

## Issue Context
Today, the public callers validate (`top_m_candidates` asserts `q.len()==dim`; `top_m_candidates_batched_serial_csr` asserts `is_multiple_of(dim)`), but this helper is now a shared core and should defend its own invariants to prevent accidental internal misuse.

## Fix Focus Areas
- Add `debug_assert!(queries.len().is_multiple_of(dim))` (or `assert!` if you want release-mode enforcement) near the start of `top_m_candidates_streamed`.

### References
- src/sign_bitmap.rs[231-246]
- src/sign_bitmap.rs[391-399]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Audit remediation: routing top_m_candidates through the streamed core measured +50-90% at small/medium n with m in the hundreds (bounded heap O(n log m) vs select_nth_unstable_by O(n)); with one query there is no scan to share, so nq=1 stays on the dense path (parity-or-better at every measured size). Also per audit: the block-boundary oracle test now genuinely spans three blocks (the dim=128 shape fit one block), and adds the dim=768 AVX-512 tail-residue x multi-block case to the permanent suite.

Fieldnote-Echo · 2026-07-03T18:23:33Z

Adversarial audit triage (code-reviewer agent, empirical)

Verdict: no correctness or memory-safety defects. The bit-identical claim held under adversarial testing: block-boundary tie runs, all-tied corpora across m ∈ {1..1301}, tiny-n floor clamp, 70-query multi-tile × multi-block CSR, and the real AVX-512 VPOPCNTDQ kernel across tail residues (rem ∈ {0,1,4}). Heap selection is order-independent by uniqueness of (hamming, doc_id) keys; kernel sub-slicing indexes purely relative to the slice base; scores buffer fully overwritten per tile; doc-id u32 cast protected by the MAX_VECTORS=64Mi invariant.

Fixed in follow-up commit:

MEDIUM: measured +50–90% single-query regression at small/medium n (heap vs partition) — top_m_candidates (nq=1) restored to the dense select_nth_unstable_by path; only batched CSR calls use the streamed core. Parity-or-better everywhere measured.
LOW: the 'across_block_boundaries' oracle test actually fit in one block (dim=128 → 16384-doc blocks > n) — now dim=512 spanning three blocks with a partial tail; added the auditor's dim=768 tail-residue × multi-block case permanently.

Noted, accepted: q_bitmaps sizing skips checked_mul — unreachable overflow (slice length ≤ isize::MAX bounds it), stylistic inconsistency with the file's checked-allocation discipline; will fold into any future touch of that block.

Bot-review remediation (Qodo, PR #278): the shared core derived nq by integer division; a ragged buffer from a future caller would silently truncate. All current callers validate upstream — this is the cheap in-core invariant.

Fieldnote-Echo · 2026-07-04T00:10:05Z

Qodo triage: (1) ragged query buffer silently truncated — fixed in 9350b41 (debug_assert on whole-row buffers). (2) O(nq·m_eff) eager heap capacity at extreme sizes — deferred: no correctness defect (bit-identical pinned by oracle suites); a follow-up issue will track a size guard for pathological nq×m.

qodo-code-review · 2026-07-04T00:10:18Z

triage: (1) ragged query buffer silently truncated — fixed in 9350b41 (debug_assert on whole-row buffers). (2) O(nq·m_eff) eager heap capacity at extreme sizes — deferred: no correctness defect (bit-identical pinned by oracle suites); a follow-up issue will track a size guard for pathological nq×m.

That triage looks right.

finding 2 is already fixed in 9350b41.
finding 1 is a performance / capacity concern, not a correctness bug, so deferring it is reasonable. The oracle tests do pin the bit-identical contract, so a follow-up size guard for pathological nq × m_eff is enough.

If you want, I can help phrase the deferred note for the PR summary.

Bot-review remediation (Qodo, PR #278): nq * m_eff can overflow usize on 32-bit/wasm32 targets, and the CSR wrapper's saturating_mul would attempt a usize::MAX allocation. Both sites now use checked multiplication with a clear tile-the-batch message, matching the crate's checked-allocation discipline. The exact m_eff + 1 heap reservation is kept deliberately: gradual growth double-allocates to the next power of two (~2x peak per query) — the reservation is the memory-optimal choice, now documented.

Fieldnote-Echo · 2026-07-04T03:42:15Z

Qodo triage update (heap-overhead finding): remediated in cf357e5 — the real defect inside this finding was the unchecked nq × m_eff (overflowable on 32-bit/wasm32; the CSR wrapper's saturating_mul would attempt a usize::MAX allocation). Both allocation sites now use checked multiplication with a clear tile-the-batch panic, matching the crate's checked-allocation discipline.

On the suggested alternatives: gradual heap growth was evaluated and rejected as strictly worse — BinaryHeap doubles to the next power of two (~2× m peak per query) where the exact m_eff + 1 reservation is memory-optimal; this is now documented at the site. Multi-pass tiling for extreme sizes is the caller's lever by design (OrdinalDB's chunk scheduler already bounds nq × m per call via its score-cells cap); direct callers get the documented guidance. Selection state remains ~2× the CSR output the API contract already requires.

qodo-code-review · 2026-07-04T03:42:28Z

triage update (heap-overhead finding): remediated in cf357e5 — the real defect inside this finding was the unchecked nq × m_eff (overflowable on 32-bit/wasm32; the CSR wrapper's saturating_mul would attempt a usize::MAX allocation). Both allocation sites now use checked multiplication with a clear tile-the-batch panic, matching the crate's checked-allocation discipline.

On the suggested alternatives: gradual heap growth was evaluated and rejected as strictly worse — BinaryHeap doubles to the next power of two (~2× m peak per query) where the exact m_eff + 1 reservation is memory-optimal; this is now documented at the site. Multi-pass tiling for extreme sizes is the caller's lever by design (OrdinalDB's chunk scheduler already bounds nq × m per call via its score-cells cap); direct callers get the documented guidance. Selection state remains ~2× the CSR output the API contract already requires.

Acknowledged — that does address finding 1.

The real issue was the unchecked nq × m_eff allocation path, and the checked-multiplication fix with a tile-the-batch panic is the right remediation.
I also agree with the heap tradeoff conclusion: reserving m_eff + 1 is the memory-optimal choice here, and documenting the tiling guidance makes the behavior clear for extreme batch sizes.

So from my side, this looks remediated.

Bot-review remediation (Qodo, #283 inline): build_query_bitmap allocated a fresh Vec and re-validated finiteness per query; the entry points already validate the whole buffer and the destination is preallocated. Oracle suites pin bit-identical output.

Nelson Spence (Fieldnote-Echo) added 2 commits July 3, 2026 12:55

Nelson Spence (Fieldnote-Echo) requested review from Navi Bot (project-navi-bot) and Todd Baur (toadkicker) as code owners July 3, 2026 18:11

qodo-code-review Bot reviewed Jul 3, 2026

View reviewed changes

Comment thread src/sign_bitmap.rs

Comment thread src/sign_bitmap.rs

This was referenced Jul 3, 2026

perf: parallel finite validation and scratch-based rank encode #279

Open

perf: reduce collector boundary test to a cached worst-bound compare #280

Open

fix: assert whole-row query buffers in the streamed core

9350b41

Bot-review remediation (Qodo, PR #278): the shared core derived nq by integer division; a ragged buffer from a future caller would silently truncate. All current callers validate upstream — this is the cheap in-core invariant.

Nelson Spence (Fieldnote-Echo) mentioned this pull request Jul 4, 2026

docs: changelog perf entries + 0.6.0 un-patch checklist #283

Open

project-navi-bot previously approved these changes Jul 4, 2026

View reviewed changes

Nelson Spence (Fieldnote-Echo) dismissed Navi Bot (project-navi-bot)’s stale review via cf357e5 July 4, 2026 03:41

Navi Bot (project-navi-bot) self-requested a review July 4, 2026 03:46

project-navi-bot previously approved these changes Jul 4, 2026

View reviewed changes

Nelson Spence (Fieldnote-Echo) dismissed Navi Bot (project-navi-bot)’s stale review via e091df5 July 4, 2026 04:12

Uh oh!

Uh oh!

Conversation

Fieldnote-Echo commented Jul 3, 2026

Summary

Determinism

Why (measured motivation)

Gate

Uh oh!

Fieldnote-Echo commented Jul 3, 2026

Uh oh!

chatgpt-codex-connector Bot commented Jul 3, 2026

Uh oh!

chatgpt-codex-connector Bot commented Jul 3, 2026

Uh oh!

qodo-code-review Bot commented Jul 3, 2026

PR Summary by Qodo

Uh oh!

codecov Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

qodo-code-review Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review by Qodo

Uh oh!

Uh oh!

Uh oh!

Fieldnote-Echo commented Jul 3, 2026

Adversarial audit triage (code-reviewer agent, empirical)

Uh oh!

Fieldnote-Echo commented Jul 4, 2026

Uh oh!

qodo-code-review Bot commented Jul 4, 2026

Uh oh!

Fieldnote-Echo commented Jul 4, 2026

Uh oh!

qodo-code-review Bot commented Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Jul 3, 2026 •

edited

Loading

qodo-code-review Bot commented Jul 3, 2026 •

edited

Loading