Skip to content

perf: stream the corpus once per call in sign candidate generation#278

Open
Nelson Spence (Fieldnote-Echo) wants to merge 6 commits into
mainfrom
perf/tiled-batch-candgen
Open

perf: stream the corpus once per call in sign candidate generation#278
Nelson Spence (Fieldnote-Echo) wants to merge 6 commits into
mainfrom
perf/tiled-batch-candgen

Conversation

@Fieldnote-Echo

Copy link
Copy Markdown
Member

Summary

Replaces the documented-naive internals of top_m_candidates_batched_serial_csr (per-query full-corpus rescans) with a doc-major tiled scan: the corpus streams once per call in L2-sized blocks, each hot block is scored against every query of the call (32-query tiles through the existing sign_scan_collect_batched kernel), and per-query bounded min-collectors keyed (hamming, doc_id) select the exact lexicographic top-m. top_m_candidates routes through the same core and stops materialising its n-row Hamming buffer.

The swap is sanctioned by the method's own doc comment ("a future release may replace the internals with streaming top-m behind this frozen signature"). Serial contract preserved: zero rayon; callers own parallelism.

Determinism

The collector key is the contract's sort key, so selection is bit-identical to full-sort-then-truncate by construction — no dependence on block processing order. Pinned by tests/tiled_candgen.rs (committed first, green against the old implementation): independent oracle via score_all + lexicographic sort, over random corpora spanning many blocks, massive-tie and duplicate-run corpora (tie runs longer than m), edge geometries (m ≥ n, single doc, empty batch), and the dim=1024 shape.

Why (measured motivation)

At 1,258,135 rows × 1024 dims the sign sidecar is 161MB; the old path re-streamed it per query — a 2048-query batch cost 330GB of DRAM traffic (~9.3s measured, ~220 q/s). One pass per call cuts per-query traffic by the call's query count; batch becomes compute-bound. Full before/after evidence lands with the OrdinalDB integration rerun on the same corpus (evidence-discipline: this PR's synthetic n=50k committed bench is L3-resident and understates the win by design; large-corpus numbers follow in the dependent PR).

Gate

Full suite + no-default-features + clippy -D warnings + MSRV 1.89 + -D warnings build + fuzz build: green. New oracle tests: 5 suites.

Part of the 1M-row release train (Track A2 of the locked master plan). Queued behind #277.

Independent oracle (score_all + full lexicographic sort by hamming asc,
doc_id asc) pins top_m_candidates and top_m_candidates_batched_serial_csr
exactly: random corpora across block boundaries, massive-tie and
duplicate-run corpora exercising boundary tie-breaks, edge geometries
(m >= n, single doc, empty batch), and the dim=1024 shape. Must pass
bit-identically before and after the tiling swap.
top_m_candidates_batched_serial_csr previously looped the single-query
path, re-streaming the full sign bitmap per query (documented-naive
Track-1). The internals now scan the corpus once per call in L2-sized
doc blocks, score every query of the call against each hot block in
query tiles via the existing batched kernel, and select per-query
top-m with bounded (hamming, doc_id) min-collectors — bit-identical
to a full sort by construction, independent of processing order (the
key IS the contract's sort key). top_m_candidates routes through the
same core, dropping its per-call n-row Hamming materialisation.

Per-query corpus traffic drops by the call's query count: at 1.26M
rows x 1024 dims, a 2048-query call reads the 161MB sidecar once
instead of 2048 times. Serial contract preserved (no rayon); the
oracle suite (tests/tiled_candgen.rs) pins bit-identical outputs
across random, tie-heavy, duplicate-run, and edge geometries.
@Fieldnote-Echo

Copy link
Copy Markdown
Member Author

Codex (@codex) review

@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

1 similar comment
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@qodo-code-review

Copy link
Copy Markdown

PR Summary by Qodo

Stream sign-bitmap corpus once per call for exact top-m candidate generation

✨ Enhancement 🧪 Tests 🕐 40+ Minutes

Grey Divider

AI Description

• Replace per-query corpus rescans with a doc-major tiled scan shared by single and batched APIs.
• Select exact lexicographic top-m via per-query bounded (hamming, doc_id) collectors for
 determinism.
• Add oracle-based tests pinning outputs across random, tie-heavy, duplicate, and edge geometries.
Diagram

graph TD
A["SignBitmap candidate APIs"] --> F{"Empty batch or m==0"} -->|"no"| B["top_m_candidates_streamed"] --> C["sign_scan_collect_batched"] --> D["Per-query BinaryHeap top-m"] --> E["Top-m results (Vec/CSR)"]
F -->|"yes"| E
T["tests/tiled_candgen.rs oracle"] --> A
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Bucketed top-m by Hamming distance (counting-style)
  • ➕ Hamming is bounded (0..dim), enabling deterministic selection with low per-item overhead.
  • ➕ Potentially faster than BinaryHeap for large n with small dim (e.g., 1024).
  • ➖ More complex to implement correctly with doc_id tie-break and streaming blocks.
  • ➖ Memory can spike if many docs share low Hamming buckets; still need doc_id materialization per bucket.
2. Fixed-capacity partial selection (array + linear scan)
  • ➕ Avoids BinaryHeap’s log(m) and allocation overhead; good when m is small/medium.
  • ➕ Can be SIMD/cache friendly with tight loops.
  • ➖ O(m) update per candidate unless carefully optimized; may regress for larger m.
  • ➖ More bespoke code and harder to reason about than standard heap semantics.
3. Two-stage approximate then refine (coarse prefilter)
  • ➕ Could further reduce compute by avoiding full scoring for obviously-bad docs.
  • ➕ May open room for future ANN-like shortcuts.
  • ➖ Violates the PR’s stated goal of exact, bit-identical top-m.
  • ➖ Adds tuning knobs and risk of recall/quality regressions.

Recommendation: Keep the PR’s current streamed exact approach: it preserves the frozen ordering contract by using the contract key (hamming, doc_id) as the collector key, and it eliminates the dominant bandwidth waste (per-query rescans) without changing API semantics. If further speed is needed later, the most promising follow-up is swapping BinaryHeap for a fixed-capacity or bucketed collector while keeping the same streamed scan structure and tests as a safety net.

Files changed (2) +246 / -28

Enhancement (1) +89 / -28
sign_bitmap.rsAdd streamed/tiled exact top-m core and route single/batched CSR through it +89/-28

Add streamed/tiled exact top-m core and route single/batched CSR through it

• Introduces 'top_m_candidates_streamed', which scans the corpus once per call in L2-sized doc blocks and scores queries in small tiles via the existing batched kernel. Replaces the single-query 'top_m_candidates' full-materialization + selection with the streamed collector-based selection, and updates 'top_m_candidates_batched_serial_csr' to use the same core (including explicit empty-batch / m==0 handling) while preserving deterministic '(hamming, doc_id)' ordering.

src/sign_bitmap.rs

Tests (1) +157 / -0
tiled_candgen.rsOracle contract tests for sign candidate generation determinism and ties +157/-0

Oracle contract tests for sign candidate generation determinism and ties

• Adds an implementation-independent oracle (dense 'score_all' + full lexicographic sort) and asserts bit-identical results for both single-query and batched-CSR APIs. Covers random corpora across block boundaries, tie-heavy and duplicate-document cases, and edge geometries including empty query batches and dim=1024.

tests/tiled_candgen.rs

@codecov

codecov Bot commented Jul 3, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 97.01493% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/sign_bitmap.rs 97.01% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@qodo-code-review

qodo-code-review Bot commented Jul 3, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0) 📜 Skill insights (0)

Grey Divider


Remediation recommended

1. Heap overhead scales poorly 🐞 Bug ☼ Reliability
Description
top_m_candidates_streamed allocates one BinaryHeap per query and stores (hamming, doc_id) for each
retained candidate, adding O(nq*m_eff) extra state on top of the CSR output. For large batches
and/or large m, this can significantly raise peak memory and increase the risk of allocator aborts
or severe memory pressure.
Code

src/sign_bitmap.rs[R253-255]

+        let mut heaps: Vec<BinaryHeap<(u32, u32)>> = (0..nq)
+            .map(|_| BinaryHeap::with_capacity(m_eff + 1))
+            .collect();
Relevance

⭐⭐ Medium

Mixed history: they accept allocation reductions, but have rejected some “avoid allocation/OOM”
refactors as out-of-scope.

PR-#96
PR-#12
PR-#224

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The streamed implementation constructs a per-query BinaryHeap reserved to m_eff + 1, and the
serial CSR API now routes batched candidate generation through this helper, making the additional
per-query heap memory part of the normal execution path.

src/sign_bitmap.rs[247-256]
src/sign_bitmap.rs[391-414]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`top_m_candidates_streamed` allocates `nq` heaps and eagerly reserves `m_eff + 1` capacity for each, storing `(hamming, doc_id)` keys. This increases peak memory by roughly `nq * m_eff * size_of::<(u32,u32)>()` in addition to the CSR output buffer, which can become substantial.

## Issue Context
This method is now the shared core for both `top_m_candidates` and `top_m_candidates_batched_serial_csr`, so its memory behavior directly affects large-batch call sites.

## Fix Focus Areas
- Add an explicit size/memory guard (with a clear panic message) before heap construction for extreme `nq*m_eff`.
- Consider avoiding eager `with_capacity(m_eff + 1)` for every heap (allow gradual growth), or implement a fallback strategy for extreme sizes (e.g., multi-pass over query tiles) where trading extra corpus passes for lower peak memory is acceptable.

### References
- src/sign_bitmap.rs[247-256]
- src/sign_bitmap.rs[391-414]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Informational

2. Missing queries-length assert ✓ Resolved 🐞 Bug ≡ Correctness
Description
top_m_candidates_streamed derives nq via integer division (queries.len() / dim) and does not
assert queries.len() is a multiple of dim, so a ragged slice would silently ignore the tail.
Adding an invariant check here keeps the helper self-contained and consistent with the public
batched APIs’ validation.
Code

src/sign_bitmap.rs[R235-237]

+        let dim = self.dim;
+        let nq = queries.len() / dim;
+        let qpv = self.qwords_per_vec;
Relevance

⭐⭐⭐ High

Team consistently adds fail-loud shape invariants (len multiple of dim) in similar APIs; see
SignBitmap/RankQuant asserts.

PR-#96
PR-#30
PR-#12

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The helper currently floors nq by division and then slices per-query ranges, while the public
serial-batched entry point explicitly rejects ragged query buffers, demonstrating the intended
invariant.

src/sign_bitmap.rs[231-246]
src/sign_bitmap.rs[391-399]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`top_m_candidates_streamed` computes `nq = queries.len() / dim` without validating the slice shape, which would silently truncate ragged inputs (and skip finiteness checks on the trailing remainder) if the helper is called without prior validation.

## Issue Context
Today, the public callers validate (`top_m_candidates` asserts `q.len()==dim`; `top_m_candidates_batched_serial_csr` asserts `is_multiple_of(dim)`), but this helper is now a shared core and should defend its own invariants to prevent accidental internal misuse.

## Fix Focus Areas
- Add `debug_assert!(queries.len().is_multiple_of(dim))` (or `assert!` if you want release-mode enforcement) near the start of `top_m_candidates_streamed`.

### References
- src/sign_bitmap.rs[231-246]
- src/sign_bitmap.rs[391-399]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread src/sign_bitmap.rs
Comment thread src/sign_bitmap.rs
Audit remediation: routing top_m_candidates through the streamed core
measured +50-90% at small/medium n with m in the hundreds (bounded
heap O(n log m) vs select_nth_unstable_by O(n)); with one query there
is no scan to share, so nq=1 stays on the dense path (parity-or-better
at every measured size). Also per audit: the block-boundary oracle
test now genuinely spans three blocks (the dim=128 shape fit one
block), and adds the dim=768 AVX-512 tail-residue x multi-block case
to the permanent suite.
@Fieldnote-Echo

Copy link
Copy Markdown
Member Author

Adversarial audit triage (code-reviewer agent, empirical)

Verdict: no correctness or memory-safety defects. The bit-identical claim held under adversarial testing: block-boundary tie runs, all-tied corpora across m ∈ {1..1301}, tiny-n floor clamp, 70-query multi-tile × multi-block CSR, and the real AVX-512 VPOPCNTDQ kernel across tail residues (rem ∈ {0,1,4}). Heap selection is order-independent by uniqueness of (hamming, doc_id) keys; kernel sub-slicing indexes purely relative to the slice base; scores buffer fully overwritten per tile; doc-id u32 cast protected by the MAX_VECTORS=64Mi invariant.

Fixed in follow-up commit:

  • MEDIUM: measured +50–90% single-query regression at small/medium n (heap vs partition) — top_m_candidates (nq=1) restored to the dense select_nth_unstable_by path; only batched CSR calls use the streamed core. Parity-or-better everywhere measured.
  • LOW: the 'across_block_boundaries' oracle test actually fit in one block (dim=128 → 16384-doc blocks > n) — now dim=512 spanning three blocks with a partial tail; added the auditor's dim=768 tail-residue × multi-block case permanently.

Noted, accepted: q_bitmaps sizing skips checked_mul — unreachable overflow (slice length ≤ isize::MAX bounds it), stylistic inconsistency with the file's checked-allocation discipline; will fold into any future touch of that block.

Bot-review remediation (Qodo, PR #278): the shared core derived nq by
integer division; a ragged buffer from a future caller would silently
truncate. All current callers validate upstream — this is the cheap
in-core invariant.
@Fieldnote-Echo

Copy link
Copy Markdown
Member Author

Qodo triage: (1) ragged query buffer silently truncated — fixed in 9350b41 (debug_assert on whole-row buffers). (2) O(nq·m_eff) eager heap capacity at extreme sizes — deferred: no correctness defect (bit-identical pinned by oracle suites); a follow-up issue will track a size guard for pathological nq×m.

@qodo-code-review

Copy link
Copy Markdown

triage: (1) ragged query buffer silently truncated — fixed in 9350b41 (debug_assert on whole-row buffers). (2) O(nq·m_eff) eager heap capacity at extreme sizes — deferred: no correctness defect (bit-identical pinned by oracle suites); a follow-up issue will track a size guard for pathological nq×m.

That triage looks right.

  • finding 2 is already fixed in 9350b41.
  • finding 1 is a performance / capacity concern, not a correctness bug, so deferring it is reasonable. The oracle tests do pin the bit-identical contract, so a follow-up size guard for pathological nq × m_eff is enough.

If you want, I can help phrase the deferred note for the PR summary.

Bot-review remediation (Qodo, PR #278): nq * m_eff can overflow usize
on 32-bit/wasm32 targets, and the CSR wrapper's saturating_mul would
attempt a usize::MAX allocation. Both sites now use checked
multiplication with a clear tile-the-batch message, matching the
crate's checked-allocation discipline. The exact m_eff + 1 heap
reservation is kept deliberately: gradual growth double-allocates to
the next power of two (~2x peak per query) — the reservation is the
memory-optimal choice, now documented.
@Fieldnote-Echo

Copy link
Copy Markdown
Member Author

Qodo triage update (heap-overhead finding): remediated in cf357e5 — the real defect inside this finding was the unchecked nq × m_eff (overflowable on 32-bit/wasm32; the CSR wrapper's saturating_mul would attempt a usize::MAX allocation). Both allocation sites now use checked multiplication with a clear tile-the-batch panic, matching the crate's checked-allocation discipline.

On the suggested alternatives: gradual heap growth was evaluated and rejected as strictly worseBinaryHeap doubles to the next power of two (~2× m peak per query) where the exact m_eff + 1 reservation is memory-optimal; this is now documented at the site. Multi-pass tiling for extreme sizes is the caller's lever by design (OrdinalDB's chunk scheduler already bounds nq × m per call via its score-cells cap); direct callers get the documented guidance. Selection state remains ~2× the CSR output the API contract already requires.

@qodo-code-review

Copy link
Copy Markdown

triage update (heap-overhead finding): remediated in cf357e5 — the real defect inside this finding was the unchecked nq × m_eff (overflowable on 32-bit/wasm32; the CSR wrapper's saturating_mul would attempt a usize::MAX allocation). Both allocation sites now use checked multiplication with a clear tile-the-batch panic, matching the crate's checked-allocation discipline.

On the suggested alternatives: gradual heap growth was evaluated and rejected as strictly worseBinaryHeap doubles to the next power of two (~2× m peak per query) where the exact m_eff + 1 reservation is memory-optimal; this is now documented at the site. Multi-pass tiling for extreme sizes is the caller's lever by design (OrdinalDB's chunk scheduler already bounds nq × m per call via its score-cells cap); direct callers get the documented guidance. Selection state remains ~2× the CSR output the API contract already requires.

Acknowledged — that does address finding 1.

  • The real issue was the unchecked nq × m_eff allocation path, and the checked-multiplication fix with a tile-the-batch panic is the right remediation.
  • I also agree with the heap tradeoff conclusion: reserving m_eff + 1 is the memory-optimal choice here, and documenting the tiling guidance makes the behavior clear for extreme batch sizes.

So from my side, this looks remediated.

Bot-review remediation (Qodo, #283 inline): build_query_bitmap
allocated a fresh Vec and re-validated finiteness per query; the
entry points already validate the whole buffer and the destination
is preallocated. Oracle suites pin bit-identical output.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants