diff --git a/CHANGELOG.md b/CHANGELOG.md
index b75da59d..1888c618 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,88 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 _No unreleased changes._
 
+## 0.6.0 - 2026-07-04
+
+### Performance
+
+- **Batched sign candidate generation now streams the corpus once per call.**
+  `SignBitmap::top_m_candidates_batched_serial_csr` previously looped the
+  single-query path, re-streaming the full sign bitmap per query (the
+  documented-naive first cut). The internals now scan the corpus once per call
+  in L2-sized doc blocks, score every query of the call against each hot block
+  in query tiles via the existing batched kernel, and select per-query top-m
+  with bounded `(hamming, doc_id)` min-collectors — bit-identical to a full
+  sort by construction, independent of processing order (the key *is* the
+  contract's sort key), pinned by an independent oracle suite
+  (`tests/tiled_candgen.rs`) across random, tie-heavy, duplicate-run, and edge
+  geometries. Per-query corpus traffic falls by a factor of the query count: at
+  1.26M rows × dim=1024, a 2048-query call reads the 161 MB sign sidecar once
+  instead of 2048 times. `top_m_candidates` routes through the same core
+  (dropping its per-call n-row Hamming materialisation) except at `nq=1`,
+  which keeps the dense partition path — the streamed core measured +50–90%
+  single-query time at small/medium `n` with `m` in the hundreds (bounded heap
+  `O(n log m)` vs `select_nth_unstable_by` `O(n)`), while the dense path is
+  parity-or-better at every measured size. The serial contract covers the
+  candidate scan and selection (no rayon there; callers own that
+  parallelism) — input finite-validation on large query buffers may
+  briefly use the global rayon pool (order-independent, deterministic).
+  `top_m_candidates_batched` (the internally-parallel convenience) is
+  unchanged by this work. Combined with the collector worst-bound change below,
+  a downstream two-stage retrieval stack at 1.26M × 1024 measured batched search
+  throughput rising from 220 to 10.2k queries/s, with bit-identical results.
+- **Candidate-collector accept test reduced to a cached worst-bound compare.**
+  Doc ids reach each per-query heap in strictly ascending order, so a
+  candidate tying the worst kept hamming always loses the `(hamming, doc_id)`
+  tie-break: once the collector is full, the accept test is exactly
+  `hamming < worst kept hamming`. That bound is now cached in a
+  register-friendly `u32` (`u32::MAX` while filling), skipping the heap peek
+  and tuple compare on the ~99.8% reject path. Bit-identical by construction;
+  pinned by the tie-heavy and duplicate-run oracle suites.
+- **Parallel finite-input validation and scratch-based rank encode.**
+  `assert_all_finite` paid a full serial pass per add/search batch — measured
+  ~0.1 s per GiB, twice per ingest batch counting the caller layer. Scans of
+  1M+ floats now split across the rayon pool (4.4× measured).
+  `RankQuant::add`'s per-row closure allocated a fresh ranks `Vec` per vector
+  inside the parallel loop; it now reuses a per-worker scratch via
+  `rank_transform_into`. Measured on a 1.26M × 1024 corpus slice: validation
+  time on the encode path fell from 0.097 s (serial scan) to 0.022 s (parallel),
+  with the per-vector allocation churn removed from the hot loop.
+- **LUT + parallel constant-composition check on `RankQuant` load.**
+  `load_rankquant`'s forged-buffer defense histogrammed every packed code
+  serially — 1.29 billion shift/mask ops at 1.26M × 1024, ~1 s of the 1.27 s
+  verified open. A 4 KB per-byte bucket-count LUT replaces the per-code inner
+  loop and rows validate in parallel; `find_first` keeps the
+  lowest-offending-row error contract, with a scalar recheck producing the
+  identical message. The security property is unchanged: every row still
+  proves uniform composition before the index is usable. Measured verified
+  open at 1.26M × 1024: 1.27 s → 0.38 s.
+
+### Changed
+
+- **ordvec-manifest: derived artifact size bounds.** Verification now bounds
+  every artifact read by its manifest-declared `file_size_bytes` (the manifest
+  itself remains hard-capped at 1 MiB and SHA-256 pins content); manifest
+  creation bounds reads by the artifact's observed size. The flat
+  `ResourceLimits` byte caps (`max_auxiliary_artifact_bytes`,
+  `max_calibration_profile_bytes`, `max_encoder_distortion_profile_bytes`)
+  are now explicit opt-in ceilings and default to unbounded — previously the
+  64 MiB auxiliary default made legitimate large sign sidecars (>524,288 rows
+  at dim=1024) impossible to write with default options.
+- **ordvec-manifest: primary artifact reads are now bounded.** The primary
+  index artifact is hashed under its declared size (new
+  `artifact_file_too_large` reason code); previously this read was unbounded.
+  An artifact grown past its declaration now fails fast at the read bound
+  instead of surfacing as a digest mismatch after hashing the excess.
+- **ordvec-manifest: primary index artifact gains an opt-in ceiling.** New
+  `ResourceLimits::max_index_artifact_bytes` (default unbounded) mirrors the
+  auxiliary/profile ceilings; the create path also bounds the primary read by
+  its observed size. Note: a grown artifact now surfaces as
+  `*_file_too_large` (fail-fast) rather than `*_file_size_mismatch`, which
+  now indicates truncation only.
+- **ordvec-manifest: bounded hashing streams with constant memory.**
+  `sha256_file_bounded` no longer materialises the file in memory before
+  hashing.
+
 ## 0.5.0 - 2026-06-19
 
 ### Security
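Review note on the `RankQuant` load entry above: the per-byte bucket-count LUT can be sketched as follows. This is a minimal illustration, not the crate's code. It assumes 2-bit codes packed four to a byte (which makes a 256 × 4 × `u32` table exactly 4 KB) and that "uniform composition" means every code value occurs equally often in a row. The names `build_lut`, `histogram_lut`, `histogram_scalar`, `row_is_uniform`, and `first_bad_row` are hypothetical, and the serial `position` call stands in for the parallel `find_first` the changelog mentions.

```rust
/// Number of distinct code values for 2-bit packing (assumption for this sketch).
const BUCKETS: usize = 4;

/// lut[byte][v] = how many of the four 2-bit fields in `byte` equal `v`.
fn build_lut() -> [[u32; BUCKETS]; 256] {
    let mut lut = [[0u32; BUCKETS]; 256];
    for byte in 0..256usize {
        for slot in 0..4 {
            let code = (byte >> (2 * slot)) & 0b11;
            lut[byte][code] += 1;
        }
    }
    lut
}

/// Histogram a packed row with one table lookup per byte.
fn histogram_lut(row: &[u8], lut: &[[u32; BUCKETS]; 256]) -> [u32; BUCKETS] {
    let mut hist = [0u32; BUCKETS];
    for &b in row {
        let counts = &lut[b as usize];
        for v in 0..BUCKETS {
            hist[v] += counts[v];
        }
    }
    hist
}

/// Reference path: one shift/mask per code, used here only to cross-check the LUT.
fn histogram_scalar(row: &[u8]) -> [u32; BUCKETS] {
    let mut hist = [0u32; BUCKETS];
    for &b in row {
        for slot in 0..4 {
            hist[((b >> (2 * slot)) & 0b11) as usize] += 1;
        }
    }
    hist
}

/// A row passes when every code value appears the same number of times.
fn row_is_uniform(row: &[u8], lut: &[[u32; BUCKETS]; 256]) -> bool {
    let expected = (row.len() * 4 / BUCKETS) as u32;
    histogram_lut(row, lut).iter().all(|&c| c == expected)
}

/// Lowest offending row index, or None when every row is uniform.
fn first_bad_row(packed: &[u8], row_bytes: usize, lut: &[[u32; BUCKETS]; 256]) -> Option<usize> {
    packed
        .chunks_exact(row_bytes)
        .position(|row| !row_is_uniform(row, lut))
}

fn main() {
    let lut = build_lut();
    assert_eq!(std::mem::size_of_val(&lut), 4096);

    // Row 0 and row 2 hold each code once; row 1 is all zeros (forged).
    let packed = [0b11_10_01_00u8, 0b00_00_00_00, 0b00_01_10_11];
    assert_eq!(histogram_lut(&packed[..1], &lut), histogram_scalar(&packed[..1]));
    println!("first bad row: {:?}", first_bad_row(&packed, 1, &lut));
}
```

Reporting the lowest failing index keeps the error deterministic regardless of how rows are scheduled, which is the property the changelog attributes to `find_first`.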
diff --git a/Cargo.lock b/Cargo.lock
index ddec2dc5..0fe3b9f2 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -844,7 +844,7 @@ checksum = "384b8ab6d37215f3c5301a95a4accb5d64aa607f1fcb26a11b5303878451b4fe"
 
 [[package]]
 name = "ordvec"
-version = "0.5.0"
+version = "0.6.0"
 dependencies = [
  "rand 0.10.1",
  "rand_chacha 0.10.0",
@@ -854,14 +854,14 @@ dependencies = [
 
 [[package]]
 name = "ordvec-ffi"
-version = "0.5.0"
+version = "0.6.0"
 dependencies = [
  "ordvec",
 ]
 
 [[package]]
 name = "ordvec-manifest"
-version = "0.5.0"
+version = "0.6.0"
 dependencies = [
  "chrono",
  "clap",
@@ -877,7 +877,7 @@ dependencies = [
 
 [[package]]
 name = "ordvec-manifest-python"
-version = "0.5.0"
+version = "0.6.0"
 dependencies = [
  "ordvec-manifest",
  "pyo3",
@@ -887,7 +887,7 @@ dependencies = [
 
 [[package]]
 name = "ordvec-python"
-version = "0.5.0"
+version = "0.6.0"
 dependencies = [
  "numpy",
  "ordvec",
diff --git a/Cargo.toml b/Cargo.toml
index 065d8e2c..4c3ee3f9 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "ordvec"
-version = "0.5.0"
+version = "0.6.0"
 edition = "2021"
 rust-version = "1.89" # AVX-512 intrinsics stabilized in 1.89.0; also clears the 1.87 floor from u64::is_multiple_of
 description = "Training-free ordinal & sign quantization for vector retrieval"
diff --git a/README.md b/README.md
index 76910e5d..0442cd96 100644
--- a/README.md
+++ b/README.md
@@ -38,9 +38,10 @@ append-friendly, and graph-optional.
 > trec-covid run below; the harness also supports nfcorpus and fiqa. ordvec wins
 > single-query latency against exact `flat` on the committed 171K-doc run and on
 > operability (no build, no tuning, append-only); in the committed default-method
-> threaded view, HNSW still wins highly-parallel batched serving. Larger-corpus
-> and alternate-encoder studies are active research, not public release claims
-> until their artifacts land in this repository.**
+> threaded view, HNSW still leads highly-parallel batched serving, though 0.6.0's
+> once-per-call corpus streaming narrowed that margin (see the threaded view
+> below). Larger-corpus and alternate-encoder studies are active research, not
+> public release claims until their artifacts land in this repository.**
 
 **Public evidence snapshot.** The load-bearing result in this README is narrower
 than the research backlog: Harrier-Q8 embeddings on public BEIR data, scored
@@ -60,8 +61,8 @@ and the gap widens over the committed subsampling sweep:
 ![ordvec speedup over exact search grows with corpus size](https://raw.githubusercontent.com/Project-Navi/ordvec/main/benchmarks/beir/figures/scaling_curve.png)
 
 - **~100× faster than exact `flat`, single query, at 171K docs.** Single-query
-  latency: exact `flat` 56 ms vs ordvec `Sign→rq2` **0.53 ms** — the gap over `flat`
-  grows with the corpus (it is ~5× at 1K docs).
+  latency: exact `flat` 52.4 ms vs ordvec `Sign→rq2` **0.52 ms (≈101×)** — the gap
+  over `flat` grows with the corpus (it is ~4.4× at 1K docs).
 - **8–16× smaller for the reported qrel rows.** The b=2 rank code is 256 B/vector
   (16× smaller than 4096 B floats), b=4 is 512 B (8×), and the reported two-stage
   `sign→rq2` row accounts for both stage-1 sign codes and the RankQuant reranker
@@ -70,8 +71,10 @@ and the gap widens over the committed subsampling sweep:
   followed by RankQuant b=2 rerank. At **nDCG@10 within bootstrap noise of exact**
   (on trec-covid the ordinal rows even edge ahead; see [Benchmarks](#benchmarks)).
 - **vs HNSW (the honest public scale story).** On the committed trec-covid run,
-  ordvec wins single-query latency while HNSW wins the highly-parallel threaded
-  view. That is the public comparison here. At larger corpora, graph or shard
+  ordvec wins single-query latency (≈3× at batch 1) while HNSW leads the
+  highly-parallel threaded view — by 1.6× over `sign→rq2` and 1.2× over
+  `bitmap→rq2` after 0.6.0's batched candidate generation (previously ≈2.3×).
+  That is the public comparison here. At larger corpora, graph or shard
   layers are the right comparison target; this README does not claim public
   million-scale HNSW crossover or GPU bandwidth numbers until the underlying run
   artifacts are committed.
@@ -209,7 +212,7 @@ Details in [`docs/RANK_MODES.md`](docs/RANK_MODES.md).
 
 ```toml
 [dependencies]
-ordvec = "0.5"
+ordvec = "0.6"
 
 # Or, to track unreleased `main`, use a git dependency instead:
 # ordvec = { git = "https://github.com/Project-Navi/ordvec" }
@@ -384,13 +387,13 @@ run; regenerate your own with `make benchmark-beir`.
 
 | Dataset | Method | Bytes/vec | nDCG@10 | Δ vs flat (95% CI) |
 |---|---|--:|--:|---|
-| scifact (5,183) | `flat` (exact) | 4096 | 0.7551 | (baseline) |
-| | `hnsw` M=32 | 4096 + graph | 0.7554 | +0.0003 * |
-| | **ordvec rq4** | **512** | **0.7549** | −0.0003 * |
-| | ordvec rq2 | 256 | 0.7471 | −0.0080 * |
-| | ordvec sign→rq2 | 384 | 0.7471 | −0.0080 * |
+| scifact (5,183) | `flat` (exact) | 4096 | 0.7559 | (baseline) |
+| | `hnsw` M=32 | 4096 + graph | 0.7573 | +0.0014 * |
+| | **ordvec rq4** | **512** | **0.7580** | +0.0021 * |
+| | ordvec rq2 | 256 | 0.7484 | −0.0075 * |
+| | ordvec sign→rq2 | 384 | 0.7484 | −0.0075 * |
 | trec-covid (171,332) | `flat` (exact) | 4096 | 0.7574 | (baseline) |
-| | `hnsw` M=32 | 4096 + graph | 0.7555 | −0.0019 * |
+| | `hnsw` M=32 | 4096 + graph | 0.7600 | +0.0026 * |
 | | ordvec rq2 | 256 | 0.7632 | +0.0057 * |
 | | **ordvec rq4** | **512** | **0.7636** | +0.0062 * |
 | | ordvec sign→rq2 | 384 | 0.7638 | +0.0064 * |
@@ -411,34 +414,38 @@ views (trec-covid, 171,332 docs, 1024-d):
 
 ![single-query latency bars](https://raw.githubusercontent.com/Project-Navi/ordvec/main/benchmarks/beir/figures/bars_single_thread.png)
 
-`flat` 56 ms → ordvec `sign→rq2` **0.53 ms (≈106×)**, `bitmap→rq2` 0.62 ms (≈91×),
-`hnsw` 1.5 ms (37×). The scaling curve [above](#benchmark-at-a-glance) is this
+`flat` 52.4 ms → ordvec `sign→rq2` **0.52 ms (≈101×)**, `bitmap→rq2` 0.58 ms (≈90×),
+`hnsw` 1.5 ms (≈34×). The scaling curve [above](#benchmark-at-a-glance) is this
 view swept over the committed subsamples — the speedup over `flat` grows across
 that public sweep.
 
 **2. Batched throughput (batch = 32, 1 thread)** — when many queries arrive at
-once, `flat`'s GEMM amortizes the corpus stream across the batch (56→4 ms),
-narrowing the gap: ordvec `sign→rq2`/`bitmap→rq2` stay ≈8–9.5× ahead.
+once, `flat`'s GEMM amortizes the corpus stream across the batch (52→3.8 ms).
+Since 0.6.0, ordvec's batched candidate generation amortizes the same way — the
+serial CSR path streams the corpus **once per call** instead of once per query
+(1.69× on the committed synthetic two-stage bench) — so `sign→rq2` 0.33 ms /
+`bitmap→rq2` 0.38 ms stay **≈10–12× ahead** of batched `flat`.
 
 **3. Many cores (batch = 32, 32 threads)** — everything parallelizes and the
 field compresses; `hnsw` threads best:
 
 ![threaded throughput bars](https://raw.githubusercontent.com/Project-Navi/ordvec/main/benchmarks/beir/figures/bars_threaded.png)
 
-`hnsw` 4.8× vs `flat`, ordvec `bitmap→rq2` 3.7×, `rq2` 2.5×, `sign→rq2` 2.1×.
+`hnsw` 4.9× vs `flat`, ordvec `bitmap→rq2` 4.0×, `sign→rq2` 3.1×, `rq2` 2.2×.
 This committed chart uses the default `sign-rq2` row, not the newer
 within-query-threaded `sign-rq2-threaded` probe row; regenerate public figures
 before using that probe for release claims. In this default-method view,
-**HNSW wins this regime** — by a hair on threaded throughput. The honest
+**HNSW still leads this regime** — 1.6× over `sign→rq2` (≈2.3× before 0.6.0's
+once-per-call corpus streaming) and 1.2× over `bitmap→rq2`. The honest
 ordvec-vs-HNSW tradeoff, all from this same run (trec-covid, 171,332 docs):
 
 | | HNSW M=32 | ordvec `sign→rq2` |
 |---|---|---|
-| threaded latency (32 threads, batch 32) | **0.23 ms** ✅ | 0.52 ms |
-| single-query latency (batch 1) | 1.52 ms | **0.53 ms** ✅ (~3×) |
+| threaded latency (32 threads, batch 32) | **0.20 ms** ✅ | 0.32 ms |
+| single-query latency (batch 1) | 1.52 ms | **0.52 ms** ✅ (~3×) |
 | index size / vector | 4096 B + graph | **256–384 B** ✅ (8–16× less) |
-| build time, 171K docs | **51.4 s** | **0.26 s** ✅ (training-free) |
-| nDCG@10 (trec-covid) | 0.7555 | **0.7638** ✅ |
+| build time, 171K docs | **47.1 s** | **0.21 s** ✅ (training-free) |
+| nDCG@10 (trec-covid) | 0.7600 | **0.7638** ✅ |
 
 So even where HNSW edges ahead on threaded latency, ordvec gets there with **no
 graph to build** (instant, training-free, and rebuilt for free when the corpus
diff --git a/RELEASING.md b/RELEASING.md
index 6cac74e3..f56df20c 100644
--- a/RELEASING.md
+++ b/RELEASING.md
@@ -173,6 +173,11 @@ the OIDC exchange (no risk of a bad publish; just a failed run).
      lockstep versions, MSRV/docs drift, registry metadata parity, Python
      classifier/URL parity, docs.rs feature policy, package contents, and
      release workflow invariants.
+   - **Downstream un-patch (one-time, 0.6.0):** OrdinalDB's workspace
+     `Cargo.toml` carries a `[patch.crates-io]` block pointing `ordvec` and
+     `ordvec-manifest` at this repo's `integration/full-stack` git branch.
+     When 0.6.0 publishes, that block must be removed so OrdinalDB consumes
+     the published crates.io releases instead of the pre-release git branch.
 4. Confirm CI is **green for current `main` HEAD**. `require-ci-green` checks
    `main` HEAD's SHA — which needs a **completed, successful** (not
    `cancelled`, not in-progress) run of `ci.yml`, `python.yml`, `fuzz.yml`,
diff --git a/THREAT_MODEL.md b/THREAT_MODEL.md
index 9fa2344b..a096f292 100644
--- a/THREAT_MODEL.md
+++ b/THREAT_MODEL.md
@@ -1,6 +1,6 @@
 # Threat Model — `ordvec`
 
-> **Status:** v0.5.0 (pre-1.0), 2026-06-15. This is the maintained threat model
+> **Status:** v0.6.0 (pre-1.0), 2026-06-15. This is the maintained threat model
 > for the `ordvec` Rust crate, C ABI, Go wrapper, PyO3/maturin Python bindings,
 > and the `ordvec-manifest` sidecar verifier. It is reviewed when the
 > attack surface changes (new persistence formats, new `unsafe` kernels, new
@@ -397,6 +397,19 @@ enforce service-level quotas — by design (it is a library, not a server).
 batch size, `k`, request rate, and corpus size; a configurable `max_nq` /
 `max_k` at the binding level is a possible future convenience.
 
+**THREAT-QUERY-003 (P2): Artifact read bounds are derived, not flat.**
+Verification bounds every artifact read by its manifest-declared
+`file_size_bytes` (the manifest itself is hard-capped at 1 MiB before JSON
+parsing, and SHA-256 pins artifact content); manifest creation bounds reads
+by the artifact's observed size. Bounded hashing streams with constant
+memory, so a hostile manifest cannot cause unbounded memory growth — but it
+CAN still cause I/O and CPU proportional to the byte size it declares and
+actually supplies on disk. The flat `ResourceLimits` byte caps are opt-in
+ceilings (unbounded by default) for deployments that must bound worst-case
+verification time on attacker-supplied bundles. A `VerifiedLoadPlan` remains
+a verification snapshot, not a byte pin: bytes can change between
+verification and use by a local actor with write access (see scope).
+
 **THREAT-QUERY-002 (P3): Panic on contract violation in Rust server contexts.**
 Rust APIs fail fast on invalid contract input (non-finite floats, dimension /
 shape violations) via `assert!` / `expect`. In a Rust-native server an
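Review note on THREAT-QUERY-003 above: the bounded, constant-memory read can be sketched as below. This is an assumption-laden illustration, not `sha256_file_bounded` itself. The standard library's `DefaultHasher` stands in for SHA-256 so the block needs no external crate, and `hash_bounded` and `BoundedHashError` are hypothetical names. The point is the shape of the loop: a fixed buffer, a read limit of one byte past the declared size, and a fail-fast exit when that extra byte shows up.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Read};

#[derive(Debug, PartialEq)]
enum BoundedHashError {
    /// The source holds more bytes than the declared size.
    TooLarge,
    /// The source ended before the declared size was reached.
    SizeMismatch { read: u64 },
    Io(io::ErrorKind),
}

/// Hash at most `declared` bytes using a fixed 8 KiB buffer.
/// DefaultHasher is a stand-in for SHA-256 (assumption for this sketch).
fn hash_bounded<R: Read>(reader: R, declared: u64) -> Result<u64, BoundedHashError> {
    // Allow one byte past the declaration so growth is detectable
    // without reading the whole excess.
    let mut limited = reader.take(declared.saturating_add(1));
    let mut hasher = DefaultHasher::new();
    let mut buf = [0u8; 8192];
    let mut total: u64 = 0;
    loop {
        let n = match limited.read(&mut buf) {
            Ok(0) => break,
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(BoundedHashError::Io(e.kind())),
        };
        total += n as u64;
        if total > declared {
            return Err(BoundedHashError::TooLarge);
        }
        hasher.write(&buf[..n]);
    }
    if total < declared {
        return Err(BoundedHashError::SizeMismatch { read: total });
    }
    Ok(hasher.finish())
}

fn main() {
    let data = b"sign sidecar bytes";
    let declared = data.len() as u64;
    println!("exact:     {:?}", hash_bounded(&data[..], declared).is_ok());
    println!("grown:     {:?}", hash_bounded(&data[..], declared - 1));
    println!("truncated: {:?}", hash_bounded(&data[..], declared + 1));
}
```

Memory stays at the buffer size whatever the manifest declares, while I/O and hashing work remain proportional to the declared size, matching the residual risk the threat entry describes.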
diff --git a/benchmarks/beir/figures/bars_single_thread.png b/benchmarks/beir/figures/bars_single_thread.png
index 5fb4b371..f989c60c 100644
Binary files a/benchmarks/beir/figures/bars_single_thread.png and b/benchmarks/beir/figures/bars_single_thread.png differ
diff --git a/benchmarks/beir/figures/bars_threaded.png b/benchmarks/beir/figures/bars_threaded.png
index 8f14a291..0b1c2bcc 100644
Binary files a/benchmarks/beir/figures/bars_threaded.png and b/benchmarks/beir/figures/bars_threaded.png differ
diff --git a/benchmarks/beir/figures/scaling_curve.png b/benchmarks/beir/figures/scaling_curve.png
index b771a452..cae50363 100644
Binary files a/benchmarks/beir/figures/scaling_curve.png and b/benchmarks/beir/figures/scaling_curve.png differ
diff --git a/benchmarks/rank_modes_results.txt b/benchmarks/rank_modes_results.txt
index da96d2d5..8afe66e9 100644
--- a/benchmarks/rank_modes_results.txt
+++ b/benchmarks/rank_modes_results.txt
@@ -12,7 +12,7 @@
 # Corpus: SYNTHETIC low-rank clustered corpus, seed = 1 (CORPUS_SEED), in-process.
 # Config: dim=256  n=30000  queries=200  k=10  (the self-contained default).
 # Hardware class: x86_64 desktop, AMD Ryzen 9 9950X (AVX-512), 32 rayon threads.
-# Toolchain: rustc 1.95.0, release profile (opt-level 3 + LTO, codegen-units 1).
+# Toolchain: rustc 1.95.0 (59807616e 2026-04-14), release profile (opt-level 3 + LTO, codegen-units 1).
 #
 # DETERMINISM: the QUALITY columns are seeded and bit-identical run-to-run on
 # the same machine — verified by two back-to-back runs (R@10, CR, bytes/vec,
@@ -51,34 +51,38 @@
 #         --corpus-npy /path/to/corpus.npy --queries-npy /path/to/queries.npy
 # ===========================================================================
 
+# Refresh note (0.6.0 batch work): the per-query latency rows in this table
+# measure SINGLE-QUERY single-thread scans — paths intentionally unchanged
+# by the 0.6.0 batched candidate-generation rework (verified: identical
+# within noise between the pre- and post-rework code on the same toolchain
+# and machine). The encode columns improved (parallel finite validation +
+# scratch-based rank encode). Batched candidate generation improvements are
+# measured by examples/two_stage_bench (see
+# two_stage_caller_owned_dim1024.txt: stage-1 1.69x on the committed
+# workload) — they do not appear in this single-query table by design.
+
 target arch x86_64 / opt-level 3 + lto (release profile)
-x86_64 features detected: sse4.2, avx2, fma, avx512f, avx512bw, avx512vl
-rayon threads = 32 (encode + brute-force GT are parallelised; per-query latency rows measure single-thread scan)
-generating low-rank clustered corpus (clusters=200, latent=64) ...
-  done in 0.17s (seed=1, self-contained)
-bench_rank: dim=256 n=30000 queries=200 k=10
-FP32 brute-force ground truth ...
-  done in 0.03s
+
 
 mode                              bytes/vec  total MiB    encode v/s    p50 ms    p99 ms    GiB/s   ns/dim   Mdocs/s scan     R@10
 ------------------------------------------------------------------------------------------------------------------------------------
-RankIndex sym                           512       14.6       4559550     3.959     4.379     3.61    0.515           7.58   0.7825
-RankIndex asym                          512       14.6       4559550     3.712     4.012     3.85    0.483           8.08   0.8450
-RankQuant b=2 sym                        64        1.8       5251083     2.534     2.761     0.71    0.330          11.84   0.4660
-RankQuant b=2 asym                       64        1.8       5251083     0.238     0.245     7.51    0.031         125.94   0.5715
-RankQuant b=2 asym byte-LUT              64        1.8       5095754     0.754     0.764     2.37    0.098          39.78   0.5715
-RankQuant b=2 fastscan                  128        3.7        283630     0.090     0.093    39.69    0.012         332.93   0.5700
-RankQuant b=4 sym                       128        3.7       5205223     2.634     2.885     1.36    0.343          11.39   0.7475
-RankQuant b=4 asym                      128        3.7       5205223     0.313     0.317    11.42    0.041          95.79   0.8055
-RankQuant b=4 asym byte-LUT             128        3.7       5324938     1.644     1.662     2.18    0.214          18.25   0.8055
-RankQuant b=1 sym                        32        0.9       5523695     2.467     2.745     0.36    0.321          12.16   0.2785
-RankQuant b=1 asym                       32        0.9       5523695     2.446     2.478     0.37    0.318          12.26   0.3470
-Bitmap n_top=64                          32        0.9       5576810     0.081     0.084    11.02    0.011         369.67   0.2480
-SignBitmap probe                         32        0.9      19641040     0.091     0.099     9.81    0.012         329.12   0.2880
-TwoStage b=2 M=100 CR=0.976              96        2.7       2689552     0.098     0.107    27.45    0.013         306.99   0.5700
-TwoStage b=2 M=500 CR=1.000              96        2.7       2669862     0.109     0.122    24.62    0.014         275.39   0.5715
-TwoStage b=2 M=1000 CR=1.000             96        2.7       2742585     0.122     0.135    21.90    0.016         244.94   0.5715
-TwoStage b=2 M=5000 CR=1.000             96        2.7       2674849     0.240     0.253    11.19    0.031         125.10   0.5715
-SignTwoStage b=2 M=500 CR=1.000          96        2.7       4038493     0.106     0.114    25.37    0.014         283.74   0.5715
+Rank sym                                512       14.6       4175858     3.727     4.330     3.84    0.485           8.05   0.7805
+Rank asym                               512       14.6       4175858     3.537     4.008     4.04    0.461           8.48   0.8330
+RankQuant b=2 sym                        64        1.8       4246863     2.535     3.063     0.71    0.330          11.84   0.4555
+RankQuant b=2 asym                       64        1.8       4246863     0.297     0.313     6.01    0.039         100.85   0.5785
+RankQuant b=2 asym byte-LUT              64        1.8       4399340     0.630     0.785     2.84    0.082          47.65   0.5785
+RankQuant b=2 fastscan                  128        3.7        249604     0.109     0.112    32.90    0.014         275.98   0.5845
+RankQuant b=4 sym                       128        3.7       4221293     2.602     3.182     1.37    0.339          11.53   0.7425
+RankQuant b=4 asym                      128        3.7       4221293     0.373     0.508     9.59    0.049          80.48   0.8095
+RankQuant b=4 asym byte-LUT             128        3.7       4299550     1.217     1.693     2.94    0.158          24.65   0.8095
+RankQuant b=1 sym                        32        0.9       4435777     2.400     2.815     0.37    0.313          12.50   0.2890
+RankQuant b=1 asym                       32        0.9       4435777     2.394     2.801     0.37    0.312          12.53   0.3790
+Bitmap n_top=64                          32        0.9       4260526     0.064     0.066    14.02    0.008         470.51   0.2495
+SignBitmap probe                         32        0.9      14021070     0.045     0.053    19.81    0.006         664.67   0.2745
+TwoStage b=2 M=100 CR=0.978              96        2.7       2356573     0.053     0.059    51.04    0.007         570.90   0.5795
+TwoStage b=2 M=500 CR=1.000              96        2.7       2462261     0.067     0.081    39.87    0.009         445.98   0.5785
+TwoStage b=2 M=1000 CR=1.000             96        2.7       2256172     0.082     0.091    32.78    0.011         366.64   0.5785
+TwoStage b=2 M=5000 CR=1.000             96        2.7       2339041     0.186     0.198    14.39    0.024         160.97   0.5785
+SignTwoStage b=2 M=500 CR=1.000          96        2.7       3605670     0.064     0.074    41.71    0.008         466.56   0.5785
 
-{"dim":256,"n":30000,"queries":200,"k":10,"rows":[{"name":"RankIndex sym","bytes_per_vec":512,"total_mib":14.648,"encode_vps":4559549.8,"p50_ms":3.9589,"p99_ms":4.3786,"gib_per_sec":3.613,"ns_per_dim":0.5155,"docs_per_sec":7577944.8,"recall_at_10_vs_fp32":0.7825},{"name":"RankIndex asym","bytes_per_vec":512,"total_mib":14.648,"encode_vps":4559549.8,"p50_ms":3.7118,"p99_ms":4.0118,"gib_per_sec":3.854,"ns_per_dim":0.4833,"docs_per_sec":8082284.1,"recall_at_10_vs_fp32":0.8450},{"name":"RankQuant b=2 sym","bytes_per_vec":64,"total_mib":1.831,"encode_vps":5251083.2,"p50_ms":2.5342,"p99_ms":2.7609,"gib_per_sec":0.706,"ns_per_dim":0.3300,"docs_per_sec":11837877.9,"recall_at_10_vs_fp32":0.4660},{"name":"RankQuant b=2 asym","bytes_per_vec":64,"total_mib":1.831,"encode_vps":5251083.2,"p50_ms":0.2382,"p99_ms":0.2448,"gib_per_sec":7.507,"ns_per_dim":0.0310,"docs_per_sec":125940354.6,"recall_at_10_vs_fp32":0.5715},{"name":"RankQuant b=2 asym byte-LUT","bytes_per_vec":64,"total_mib":1.831,"encode_vps":5095754.3,"p50_ms":0.7542,"p99_ms":0.7642,"gib_per_sec":2.371,"ns_per_dim":0.0982,"docs_per_sec":39777827.6,"recall_at_10_vs_fp32":0.5715},{"name":"RankQuant b=2 fastscan","bytes_per_vec":128,"total_mib":3.664,"encode_vps":283630.2,"p50_ms":0.0901,"p99_ms":0.0926,"gib_per_sec":39.688,"ns_per_dim":0.0117,"docs_per_sec":332930118.0,"recall_at_10_vs_fp32":0.5700},{"name":"RankQuant b=4 sym","bytes_per_vec":128,"total_mib":3.662,"encode_vps":5205222.9,"p50_ms":2.6344,"p99_ms":2.8850,"gib_per_sec":1.358,"ns_per_dim":0.3430,"docs_per_sec":11387896.0,"recall_at_10_vs_fp32":0.7475},{"name":"RankQuant b=4 asym","bytes_per_vec":128,"total_mib":3.662,"encode_vps":5205222.9,"p50_ms":0.3132,"p99_ms":0.3165,"gib_per_sec":11.419,"ns_per_dim":0.0408,"docs_per_sec":95788804.8,"recall_at_10_vs_fp32":0.8055},{"name":"RankQuant b=4 asym byte-LUT","bytes_per_vec":128,"total_mib":3.662,"encode_vps":5324938.4,"p50_ms":1.6437,"p99_ms":1.6621,"gib_per_sec":2.176,"ns_per_dim":0.2140,"docs_per_sec":18251816.7,"recall_at_10_vs_fp32":0.8055},{"name":"RankQuant b=1 sym","bytes_per_vec":32,"total_mib":0.916,"encode_vps":5523695.1,"p50_ms":2.4667,"p99_ms":2.7455,"gib_per_sec":0.362,"ns_per_dim":0.3212,"docs_per_sec":12161849.9,"recall_at_10_vs_fp32":0.2785},{"name":"RankQuant b=1 asym","bytes_per_vec":32,"total_mib":0.916,"encode_vps":5523695.1,"p50_ms":2.4461,"p99_ms":2.4776,"gib_per_sec":0.366,"ns_per_dim":0.3185,"docs_per_sec":12264561.3,"recall_at_10_vs_fp32":0.3470},{"name":"Bitmap n_top=64","bytes_per_vec":32,"total_mib":0.916,"encode_vps":5576810.4,"p50_ms":0.0812,"p99_ms":0.0838,"gib_per_sec":11.017,"ns_per_dim":0.0106,"docs_per_sec":369672100.8,"recall_at_10_vs_fp32":0.2480},{"name":"SignBitmap probe","bytes_per_vec":32,"total_mib":0.916,"encode_vps":19641040.3,"p50_ms":0.0912,"p99_ms":0.0985,"gib_per_sec":9.809,"ns_per_dim":0.0119,"docs_per_sec":329124200.5,"recall_at_10_vs_fp32":0.2880},{"name":"TwoStage b=2 M=100 CR=0.976","bytes_per_vec":96,"total_mib":2.747,"encode_vps":2689552.2,"p50_ms":0.0977,"p99_ms":0.1074,"gib_per_sec":27.447,"ns_per_dim":0.0127,"docs_per_sec":306987024.7,"recall_at_10_vs_fp32":0.5700},{"name":"TwoStage b=2 M=500 CR=1.000","bytes_per_vec":96,"total_mib":2.747,"encode_vps":2669861.7,"p50_ms":0.1089,"p99_ms":0.1216,"gib_per_sec":24.622,"ns_per_dim":0.0142,"docs_per_sec":275393583.3,"recall_at_10_vs_fp32":0.5715},{"name":"TwoStage b=2 M=1000 CR=1.000","bytes_per_vec":96,"total_mib":2.747,"encode_vps":2742584.6,"p50_ms":0.1225,"p99_ms":0.1347,"gib_per_sec":21.899,"ns_per_dim":0.0159,"docs_per_sec":244937949.1,"recall_at_10_vs_fp32":0.5715},{"name":"TwoStage b=2 M=5000 CR=1.000","bytes_per_vec":96,"total_mib":2.747,"encode_vps":2674848.6,"p50_ms":0.2398,"p99_ms":0.2534,"gib_per_sec":11.185,"ns_per_dim":0.0312,"docs_per_sec":125103731.8,"recall_at_10_vs_fp32":0.5715},{"name":"SignTwoStage b=2 M=500 CR=1.000","bytes_per_vec":96,"total_mib":2.747,"encode_vps":4038492.8,"p50_ms":0.1057,"p99_ms":0.1143,"gib_per_sec":25.369,"ns_per_dim":0.0138,"docs_per_sec":283744289.6,"recall_at_10_vs_fp32":0.5715}]}
+JSON:
diff --git a/benchmarks/two_stage_caller_owned_dim1024.txt b/benchmarks/two_stage_caller_owned_dim1024.txt
index 765a3a65..f4399ebd 100644
--- a/benchmarks/two_stage_caller_owned_dim1024.txt
+++ b/benchmarks/two_stage_caller_owned_dim1024.txt
@@ -2,20 +2,26 @@ Caller-owned serial two-stage decomposition — Harrier-1024 shape (SYNTHETIC co
 Reproduce:
   cargo run --release --example two_stage_bench -- --dim 1024 --n 50000 --queries 200 --m 256 --k 10 --reps 15
 Host: AMD Ryzen 9 9950X (Zen5), AVX-512 VPOPCNTDQ, single core (taskset -c 12), single-thread.
+Toolchain: rustc 1.95.0 (59807616e 2026-04-14), release profile.
 
   dim=1024 n=50000 queries=200 m=256 k=10 bits=2 out_k=10 candidates=51200 reps=15
-  1. stage-1 candidate gen (CSR)        31.920 ms      6265.59 q/s      159.60 us/query
-  2. single-query rerank loop            2.086 ms     95858.02 q/s       10.43 us/query
-  3. batched rerank _into                2.031 ms     98463.67 q/s       10.16 us/query
-  4. full two-stage (1+3)               34.485 ms      5799.70 q/s      172.42 us/query
-  rerank speedup (batched _into vs single-query loop): 1.03x
+  (dim % 64 == 0: AVX-512 tier eligible when supported)
+  1. stage-1 candidate gen (CSR)        18.920 ms     10570.68 q/s       94.60 us/query
+  2. single-query rerank loop            1.807 ms    110656.07 q/s        9.04 us/query
+  3. batched rerank _into                1.780 ms    112367.69 q/s        8.90 us/query
+  4. full two-stage (1+3)               20.750 ms      9638.68 q/s      103.75 us/query
+  rerank speedup (batched _into vs single-query loop): 1.02x
 
-Interpretation (no-fiction): at dim=1024 the rerank stage is a small slice
-(~10 us/query) of an already-stage-1-dominated two-stage cost (~160 us/query);
-the batched _into form is on par with the single-query loop SINGLE-THREADED
-(~1.03x). The caller-owned serial primitives are NOT a single-thread speedup —
-their value is (a) allocation-free steady state (tests/alloc_free.rs proves 0
-heap allocations on a warmed _into call) and (b) caller-owned parallelism: no
-internal rayon, so a DB/runtime can drive the _into form across its own bounded
-pool (GIL released) one query-range per worker. This dim=1024 result is its own
-mechanism; it is NOT explained by the SignBitmap AVX-tail dim=768 result.
+Interpretation (no-fiction): stage-1 candidate generation now streams the
+corpus ONCE per call in L2-sized doc blocks with bounded (hamming, doc_id)
+collectors — 94.60 us/query vs 159.60 us/query for the same command on the
+same host and pinning before the 0.6.0 batch work (1.69x; full two-stage
+1.66x). Output is bit-identical (oracle-pinned in tests/tiled_candgen.rs).
+The rerank stage is unchanged in design and remains a small slice
+(~9 us/query). The caller-owned serial primitives still do NOT enter rayon
+for scan/selection — a DB/runtime drives the _into form across its own pool
+(input finite-validation of large query buffers may briefly use the global
+pool; order-independent and deterministic). Their value remains (a)
+allocation-free steady state and (b) caller-owned parallelism; at dim=1024
+the call-level scan sharing is now the dominant win and grows with the
+batch size per call.
diff --git a/fuzz/Cargo.lock b/fuzz/Cargo.lock
index 46d6639e..2f9e2cde 100644
--- a/fuzz/Cargo.lock
+++ b/fuzz/Cargo.lock
@@ -231,7 +231,7 @@ checksum = "9f7c3e4beb33f85d45ae3e3a1792185706c8e16d043238c593331cc7cd313b50"
 
 [[package]]
 name = "ordvec"
-version = "0.5.0"
+version = "0.6.0"
 dependencies = [
  "rayon",
 ]
diff --git a/ordvec-ffi/Cargo.toml b/ordvec-ffi/Cargo.toml
index 5dc9b78c..177a92ba 100644
--- a/ordvec-ffi/Cargo.toml
+++ b/ordvec-ffi/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "ordvec-ffi"
-version = "0.5.0"
+version = "0.6.0"
 edition = "2021"
 rust-version = "1.89"
 publish = false
diff --git a/ordvec-manifest-python/Cargo.toml b/ordvec-manifest-python/Cargo.toml
index 6b70394c..490ef708 100644
--- a/ordvec-manifest-python/Cargo.toml
+++ b/ordvec-manifest-python/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "ordvec-manifest-python"
-version = "0.5.0"
+version = "0.6.0"
 edition = "2021"
 rust-version = "1.89"
 description = "Python bindings for ordvec-manifest index provenance verification"
diff --git a/ordvec-manifest-python/pyproject.toml b/ordvec-manifest-python/pyproject.toml
index 090b4b75..1e2f0998 100644
--- a/ordvec-manifest-python/pyproject.toml
+++ b/ordvec-manifest-python/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "maturin"
 
 [project]
 name = "ordvec-manifest"
-version = "0.5.0"
+version = "0.6.0"
 description = "Python bindings for ordvec index manifest verification"
 readme = "README.md"
 requires-python = ">=3.10"
diff --git a/ordvec-manifest-python/python/ordvec_manifest/__init__.py b/ordvec-manifest-python/python/ordvec_manifest/__init__.py
index 20e77dd9..4d9790c5 100644
--- a/ordvec-manifest-python/python/ordvec_manifest/__init__.py
+++ b/ordvec-manifest-python/python/ordvec_manifest/__init__.py
@@ -50,4 +50,4 @@
     "create_manifest",
 ]
 
-__version__ = "0.5.0"
+__version__ = "0.6.0"
diff --git a/ordvec-manifest/Cargo.toml b/ordvec-manifest/Cargo.toml
index b00029a7..88c6de0b 100644
--- a/ordvec-manifest/Cargo.toml
+++ b/ordvec-manifest/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "ordvec-manifest"
-version = "0.5.0"
+version = "0.6.0"
 edition = "2021"
 rust-version = "1.89"
 license = "MIT OR Apache-2.0"
@@ -29,7 +29,7 @@ required-features = ["cli"]
 chrono = { version = "0.4.44", default-features = false, features = ["clock", "std"] }
 clap = { version = "4.6.1", features = ["derive"], optional = true }
 hex = "0.4.3"
-ordvec = { version = "0.5.0", path = ".." }
+ordvec = { version = "0.6.0", path = ".." }
 rusqlite = { version = "0.40.0", optional = true }
 serde = { version = "1.0", features = ["derive"] }
 serde_json = "1.0"
diff --git a/ordvec-manifest/README.md b/ordvec-manifest/README.md
index 1e3952bb..6a58c54e 100644
--- a/ordvec-manifest/README.md
+++ b/ordvec-manifest/README.md
@@ -154,11 +154,18 @@ Stable limit codes are part of the contract:
   (`row_identity_duplicate_tracking_limit_exceeded`);
 - auxiliary artifact declarations: 1,024
   (`auxiliary_artifact_count_limit_exceeded`);
-- auxiliary artifact bytes per declared file: 64 MiB
+- auxiliary artifact bytes per declared file: bounded by the
+  manifest-declared `file_size_bytes` on verify and by the observed file
+  size on create; the flat cap is an opt-in ceiling, unbounded by default
   (`auxiliary_artifact_file_too_large`);
-- calibration profile artifact bytes: 64 MiB
+- primary index artifact bytes: bounded by the manifest-declared
+  `file_size_bytes` on verify; the flat cap is an opt-in ceiling, unbounded
+  by default (`artifact_file_too_large`);
+- calibration profile artifact bytes: bounded by the declared
+  `file_size_bytes`; flat cap opt-in, unbounded by default
   (`calibration_profile_too_large`);
-- encoder distortion profile artifact bytes: 64 MiB
+- encoder distortion profile artifact bytes: bounded by the declared
+  `file_size_bytes`; flat cap opt-in, unbounded by default
   (`encoder_distortion_profile_too_large`);
 - collected report issues: 1,024, after which a
   `verification_report_issue_limit_exceeded` issue is emitted;
@@ -168,7 +175,7 @@ The CLI exposes matching override flags on `inspect`, `verify`, `create`,
 `sqlite verify`, and `sqlite activate`: `--max-manifest-bytes`,
 `--max-row-map-line-bytes`, `--max-row-map-rows`,
 `--max-row-map-tracked-id-bytes`, `--max-auxiliary-artifacts`,
-`--max-auxiliary-artifact-bytes`,
+`--max-auxiliary-artifact-bytes`, `--max-index-artifact-bytes`,
 `--max-calibration-profile-bytes`,
 `--max-encoder-distortion-profile-bytes`, `--max-report-issues`, and
 `--max-cached-report-bytes`. Library callers can override the same ceilings
@@ -184,6 +191,7 @@ Stable limit codes:
 | row-identity duplicate-tracking `db_id` bytes | `row_identity_duplicate_tracking_limit_exceeded` | `row_identity_duplicate_tracking_limit_exceeded` |
 | auxiliary artifact declarations | `auxiliary_artifact_count_limit_exceeded` | n/a |
 | auxiliary artifact bytes per declared file | `auxiliary_artifact_file_too_large` | n/a |
+| primary index artifact bytes | `artifact_file_too_large` | n/a |
 | calibration profile artifact bytes | `calibration_profile_too_large` | n/a |
 | encoder distortion profile artifact bytes | `encoder_distortion_profile_too_large` | n/a |
 | collected verification report issues | `verification_report_issue_limit_exceeded` | n/a |
diff --git a/ordvec-manifest/src/lib.rs b/ordvec-manifest/src/lib.rs
index be25a5f1..5a5ec4de 100644
--- a/ordvec-manifest/src/lib.rs
+++ b/ordvec-manifest/src/lib.rs
@@ -36,9 +36,14 @@ pub const DEFAULT_MAX_ROW_IDENTITY_JSONL_LINE_BYTES: usize = 64 * 1024;
 pub const DEFAULT_MAX_ROW_IDENTITY_ROWS: usize = 10_000_000;
 pub const DEFAULT_MAX_ROW_IDENTITY_TRACKED_DB_ID_BYTES: usize = 64 * 1024 * 1024;
 pub const DEFAULT_MAX_AUXILIARY_ARTIFACTS: usize = 1024;
-pub const DEFAULT_MAX_AUXILIARY_ARTIFACT_BYTES: u64 = 64 * 1024 * 1024;
-pub const DEFAULT_MAX_CALIBRATION_PROFILE_BYTES: u64 = 64 * 1024 * 1024;
-pub const DEFAULT_MAX_ENCODER_DISTORTION_PROFILE_BYTES: u64 = 64 * 1024 * 1024;
+/// Artifact-file reads are bounded by the manifest-declared size on verify
+/// and by the observed file size on create; these flat caps are opt-in
+/// ceilings and default to unbounded. Streaming hashing keeps memory
+/// constant regardless of artifact size.
+pub const DEFAULT_MAX_AUXILIARY_ARTIFACT_BYTES: u64 = u64::MAX;
+pub const DEFAULT_MAX_INDEX_ARTIFACT_BYTES: u64 = u64::MAX;
+pub const DEFAULT_MAX_CALIBRATION_PROFILE_BYTES: u64 = u64::MAX;
+pub const DEFAULT_MAX_ENCODER_DISTORTION_PROFILE_BYTES: u64 = u64::MAX;
 pub const DEFAULT_MAX_REPORT_ISSUES: usize = 1024;
 pub const DEFAULT_MAX_CACHED_REPORT_BYTES: u64 = 4 * 1024 * 1024;
 
@@ -253,7 +258,19 @@ fn verify_manifest_with_path_capture(
     ) {
         paths.artifact_path = Some(resolved.canonical_path.clone());
         report.artifact.canonical_path = Some(path_to_display(&resolved.canonical_path));
-        match sha256_file(&resolved.canonical_path) {
+        // Bound the read by the manifest-declared size: a primary artifact
+        // larger than its declaration fails fast instead of being hashed in
+        // full (the read was previously unbounded).
+        match sha256_file_bounded(
+            &resolved.canonical_path,
+            document
+                .manifest
+                .artifact
+                .file_size_bytes
+                .min(options.limits.max_index_artifact_bytes),
+            "artifact_file_too_large",
+            "index artifact",
+        ) {
             Ok(hash) => {
                 report.artifact.sha256 = Some(hash.sha256.clone());
                 report.artifact.size_bytes = Some(hash.size_bytes);
@@ -276,6 +293,7 @@ fn verify_manifest_with_path_capture(
                     );
                 }
             }
+            Err(ManifestError::LimitExceeded { code, message }) => report.error(code, message),
             Err(err) => report.error(
                 "artifact_hash_failed",
                 format!("failed to hash artifact: {err}"),
@@ -345,6 +363,12 @@ fn validate_manifest_shape(
             "artifact.sha256 must be a lowercase 64-character hex SHA-256 digest",
         );
     }
+    if manifest.artifact.file_size_bytes == 0 {
+        report.error(
+            "artifact_file_size_zero",
+            "artifact.file_size_bytes must be greater than zero",
+        );
+    }
     if manifest.artifact.bytes_per_vec == 0 {
         report.error(
             "artifact_bytes_per_vec_zero",
@@ -547,6 +571,17 @@ fn validate_auxiliary_artifact_shape(
                 ),
             );
         }
+        // Optional artifacts may legitimately be declared absent with a
+        // zero-size placeholder (see `AuxiliaryArtifactState::OptionalAbsent`);
+        // only required declarations must carry a real size.
+        if artifact.required && artifact.file_size_bytes == 0 {
+            report.error(
+                "auxiliary_artifact_file_size_zero",
+                format!(
+                    "required auxiliary artifact {name:?} file_size_bytes must be greater than zero"
+                ),
+            );
+        }
     }
 }
 
@@ -1223,7 +1258,9 @@ fn validate_encoder_distortion_profile_artifact(
                 Some(path_to_display(&resolved.canonical_path));
             match sha256_file_bounded(
                 &resolved.canonical_path,
-                options.limits.max_encoder_distortion_profile_bytes,
+                profile
+                    .file_size_bytes
+                    .min(options.limits.max_encoder_distortion_profile_bytes),
                 "encoder_distortion_profile_too_large",
                 "encoder distortion profile",
             ) {
@@ -1669,7 +1706,9 @@ fn validate_calibration_profile(
                 Some(path_to_display(&resolved.canonical_path));
             match sha256_file_bounded(
                 &resolved.canonical_path,
-                options.limits.max_calibration_profile_bytes,
+                profile
+                    .file_size_bytes
+                    .min(options.limits.max_calibration_profile_bytes),
                 "calibration_profile_too_large",
                 "calibration profile",
             ) {
@@ -1936,9 +1975,14 @@ fn verify_auxiliary_artifacts(
             AuxiliaryPathResolution::Resolved(resolved) => {
                 captured_path = Some(resolved.canonical_path.clone());
                 entry.canonical_path = Some(path_to_display(&resolved.canonical_path));
+                // Bound the read by the manifest-declared size (the manifest
+                // is the trust anchor; the SHA-256 pins content). A flat
+                // limit, when explicitly configured, remains a ceiling.
                 match sha256_file_bounded(
                     &resolved.canonical_path,
-                    options.limits.max_auxiliary_artifact_bytes,
+                    artifact
+                        .file_size_bytes
+                        .min(options.limits.max_auxiliary_artifact_bytes),
                     "auxiliary_artifact_file_too_large",
                     "auxiliary artifact",
                 ) {
@@ -2261,6 +2305,9 @@ pub struct ResourceLimits {
     pub max_row_identity_tracked_db_id_bytes: usize,
     pub max_auxiliary_artifacts: usize,
     pub max_auxiliary_artifact_bytes: u64,
+    /// Opt-in ceiling for the primary index artifact read; unbounded by
+    /// default. The effective bound is min(declared size, this ceiling).
+    pub max_index_artifact_bytes: u64,
     pub max_calibration_profile_bytes: u64,
     pub max_encoder_distortion_profile_bytes: u64,
     pub max_report_issues: usize,
@@ -2276,6 +2323,7 @@ impl Default for ResourceLimits {
             max_row_identity_tracked_db_id_bytes: DEFAULT_MAX_ROW_IDENTITY_TRACKED_DB_ID_BYTES,
             max_auxiliary_artifacts: DEFAULT_MAX_AUXILIARY_ARTIFACTS,
             max_auxiliary_artifact_bytes: DEFAULT_MAX_AUXILIARY_ARTIFACT_BYTES,
+            max_index_artifact_bytes: DEFAULT_MAX_INDEX_ARTIFACT_BYTES,
             max_calibration_profile_bytes: DEFAULT_MAX_CALIBRATION_PROFILE_BYTES,
             max_encoder_distortion_profile_bytes: DEFAULT_MAX_ENCODER_DISTORTION_PROFILE_BYTES,
             max_report_issues: DEFAULT_MAX_REPORT_ISSUES,
@@ -3432,7 +3480,11 @@ pub fn sha256_file(path: impl AsRef<Path>) -> io::Result<FileHash> {
     let mut size_bytes = 0u64;
     let mut buf = [0u8; 64 * 1024];
     loop {
-        let n = file.read(&mut buf)?;
+        let n = match file.read(&mut buf) {
+            Ok(n) => n,
+            Err(err) if err.kind() == io::ErrorKind::Interrupted => continue,
+            Err(err) => return Err(err),
+        };
         if n == 0 {
             break;
         }
@@ -3452,12 +3504,54 @@ pub fn sha256_file_bounded(
     context: &'static str,
 ) -> Result<FileHash, ManifestError> {
     let path = path.as_ref();
-    let bytes = read_bounded_file(path, max_bytes, code, context)?;
+    // Refuse non-regular files BEFORE opening: opening a FIFO read-only
+    // blocks until a writer connects, and a device node would stream
+    // forever under a large declared-size bound. Regular files terminate
+    // at EOF and are post-checked against the declaration. (A path swapped
+    // to a special file after this check is local-actor mutation, out of
+    // scope per the threat model.)
+    let metadata = fs::metadata(path)?;
+    if !metadata.is_file() {
+        return Err(ManifestError::limit_exceeded(
+            code,
+            format!("{context} is not a regular file: {}", path.display()),
+        ));
+    }
+    let mut file = File::open(path)?;
     let mut hasher = Sha256::new();
-    hasher.update(&bytes);
+    let mut size_bytes = 0u64;
+    let mut buf = [0u8; 64 * 1024];
+    loop {
+        // Strict bound: never request bytes past max_bytes + 1 (the +1
+        // detects exceedance), mirroring read_bounded_file's take() pattern.
+        let allowance = max_bytes.saturating_add(1) - size_bytes;
+        if allowance == 0 {
+            break;
+        }
+        let want = allowance.min(buf.len() as u64) as usize;
+        let n = match file.read(&mut buf[..want]) {
+            Ok(n) => n,
+            Err(err) if err.kind() == io::ErrorKind::Interrupted => continue,
+            Err(err) => return Err(err.into()),
+        };
+        if n == 0 {
+            break;
+        }
+        size_bytes += n as u64;
+        if size_bytes > max_bytes {
+            return Err(ManifestError::limit_exceeded(
+                code,
+                format!(
+                    "{context} exceeds {max_bytes} bytes while reading {}",
+                    path.display()
+                ),
+            ));
+        }
+        hasher.update(&buf[..n]);
+    }
     Ok(FileHash {
         sha256: hex::encode(hasher.finalize()),
-        size_bytes: bytes.len() as u64,
+        size_bytes,
     })
 }
 
@@ -3514,7 +3608,24 @@ pub fn create_manifest_for_index_with_options(
         fs::create_dir_all(out_base)?;
     }
     let metadata = probe_index_metadata(index_path)?;
-    let index_hash = sha256_file(index_path)?;
+    let index_hash = sha256_file_bounded(
+        index_path,
+        metadata
+            .file_size_bytes
+            .min(options.limits.max_index_artifact_bytes),
+        "artifact_file_too_large",
+        "index artifact",
+    )?;
+    // One consistent snapshot: the manifest records the byte count that was
+    // actually hashed, and any change between the metadata probe and the
+    // hash (concurrent writer) fails loudly instead of embedding a
+    // size/digest pair describing different bytes.
+    if index_hash.size_bytes != metadata.file_size_bytes {
+        return Err(ManifestError::invalid(format!(
+            "index artifact changed during manifest creation: probed {} bytes, hashed {} bytes",
+            metadata.file_size_bytes, index_hash.size_bytes
+        )));
+    }
     let kind = ManifestIndexKind::try_from_core(metadata.kind)
         .map_err(|err| ManifestError::invalid(err.message()))?;
     let params = ManifestIndexParams::try_from_core(metadata.params)
@@ -3528,7 +3639,7 @@ pub fn create_manifest_for_index_with_options(
         vector_count: metadata.vector_count,
         bytes_per_vec: metadata.bytes_per_vec,
         params,
-        file_size_bytes: metadata.file_size_bytes,
+        file_size_bytes: index_hash.size_bytes,
     };
 
     let row_identity = match row_identity {
@@ -3648,9 +3759,15 @@ fn create_auxiliary_artifacts(
                 "auxiliary artifact name {name:?} is duplicated"
             )));
         }
+        // Create is a trusted context: bound the read by the artifact's own
+        // observed size (catching mid-hash growth), not a flat cap. An
+        // explicitly configured flat limit still applies as a ceiling.
+        let observed_len = fs::metadata(&artifact.path)
+            .map_err(ManifestError::from)?
+            .len();
         let hash = sha256_file_bounded(
             &artifact.path,
-            options.limits.max_auxiliary_artifact_bytes,
+            observed_len.min(options.limits.max_auxiliary_artifact_bytes),
             "auxiliary_artifact_file_too_large",
             "auxiliary artifact",
         )?;
diff --git a/ordvec-manifest/src/main.rs b/ordvec-manifest/src/main.rs
index 6236878e..02df85c1 100644
--- a/ordvec-manifest/src/main.rs
+++ b/ordvec-manifest/src/main.rs
@@ -103,7 +103,8 @@ fn parse_auxiliary_artifact_arg(value: &str) -> Result<AuxiliaryArtifactArg, Str
 
 #[cfg(test)]
 mod tests {
-    use super::parse_auxiliary_artifact_arg;
+    use super::{parse_auxiliary_artifact_arg, Cli, Commands, LimitArgs};
+    use clap::Parser;
     use std::path::PathBuf;
 
     #[test]
@@ -112,6 +113,42 @@ mod tests {
         assert_eq!(parsed.name, "app.ids");
         assert_eq!(parsed.path, PathBuf::from("ids.bin"));
     }
+
+    #[test]
+    fn limit_args_wire_index_artifact_ceiling() {
+        let args = LimitArgs {
+            max_index_artifact_bytes: Some(42),
+            ..LimitArgs::default()
+        };
+        assert_eq!(args.resource_limits().max_index_artifact_bytes, 42);
+        // Unset flag leaves the library default (unbounded) untouched.
+        assert_eq!(
+            LimitArgs::default()
+                .resource_limits()
+                .max_index_artifact_bytes,
+            ordvec_manifest::ResourceLimits::default().max_index_artifact_bytes
+        );
+    }
+
+    #[test]
+    fn verify_accepts_max_index_artifact_bytes_flag() {
+        let cli = Cli::try_parse_from([
+            "ordvec-manifest",
+            "verify",
+            "--manifest",
+            "manifest.json",
+            "--max-index-artifact-bytes",
+            "8",
+        ])
+        .expect("flag must parse");
+        match cli.command {
+            Commands::Verify { limits, .. } => {
+                assert_eq!(limits.max_index_artifact_bytes, Some(8));
+                assert_eq!(limits.resource_limits().max_index_artifact_bytes, 8);
+            }
+            _ => panic!("expected verify command"),
+        }
+    }
 }
 
 #[cfg(feature = "sqlite")]
@@ -174,6 +211,8 @@ struct LimitArgs {
     #[arg(long)]
     max_auxiliary_artifact_bytes: Option<u64>,
     #[arg(long)]
+    max_index_artifact_bytes: Option<u64>,
+    #[arg(long)]
     max_calibration_profile_bytes: Option<u64>,
     #[arg(long)]
     max_encoder_distortion_profile_bytes: Option<u64>,
@@ -204,6 +243,9 @@ impl LimitArgs {
         if let Some(value) = self.max_auxiliary_artifact_bytes {
             limits.max_auxiliary_artifact_bytes = value;
         }
+        if let Some(value) = self.max_index_artifact_bytes {
+            limits.max_index_artifact_bytes = value;
+        }
         if let Some(value) = self.max_calibration_profile_bytes {
             limits.max_calibration_profile_bytes = value;
         }
diff --git a/ordvec-manifest/src/sqlite.rs b/ordvec-manifest/src/sqlite.rs
index 6368f9f3..6606c10a 100644
--- a/ordvec-manifest/src/sqlite.rs
+++ b/ordvec-manifest/src/sqlite.rs
@@ -1,8 +1,7 @@
 use crate::{
-    resolve_existing_path, sha256_file, sha256_file_bounded, validate_jsonl_rows,
-    verify_auxiliary_artifacts, verify_manifest, AuxiliaryArtifactState, ManifestDocument,
-    ManifestError, ReportIssue, ResourceLimits, RowIdentity, VerificationPathCapture,
-    VerificationReport, VerifyOptions,
+    resolve_existing_path, sha256_file_bounded, validate_jsonl_rows, verify_auxiliary_artifacts,
+    verify_manifest, AuxiliaryArtifactState, ManifestDocument, ManifestError, ReportIssue,
+    ResourceLimits, RowIdentity, VerificationPathCapture, VerificationReport, VerifyOptions,
 };
 use chrono::{SecondsFormat, Utc};
 use rusqlite::{params, Connection, OptionalExtension};
@@ -399,7 +398,18 @@ fn current_cache_key(
     ) else {
         return Ok(None);
     };
-    let artifact_sha256 = match sha256_file(&artifact.canonical_path) {
+    // Bound the cache-key hash exactly like the verify path: declared size
+    // with the opt-in ceiling. A bound violation just misses the cache.
+    let artifact_sha256 = match sha256_file_bounded(
+        &artifact.canonical_path,
+        document
+            .manifest
+            .artifact
+            .file_size_bytes
+            .min(options.limits.max_index_artifact_bytes),
+        "artifact_file_too_large",
+        "index artifact",
+    ) {
         Ok(hash) => hash.sha256,
         Err(_) => return Ok(None),
     };
@@ -618,7 +628,9 @@ fn current_calibration_profile_sha256(
     };
     match sha256_file_bounded(
         &resolved.canonical_path,
-        options.limits.max_calibration_profile_bytes,
+        profile
+            .file_size_bytes
+            .min(options.limits.max_calibration_profile_bytes),
         "calibration_profile_too_large",
         "calibration profile",
     ) {
@@ -652,7 +664,9 @@ fn current_encoder_distortion_profile_sha256(
     };
     match sha256_file_bounded(
         &resolved.canonical_path,
-        options.limits.max_encoder_distortion_profile_bytes,
+        profile
+            .file_size_bytes
+            .min(options.limits.max_encoder_distortion_profile_bytes),
         "encoder_distortion_profile_too_large",
         "encoder distortion profile",
     ) {
diff --git a/ordvec-manifest/tests/derived_limits.rs b/ordvec-manifest/tests/derived_limits.rs
new file mode 100644
index 00000000..c08aa18c
--- /dev/null
+++ b/ordvec-manifest/tests/derived_limits.rs
@@ -0,0 +1,248 @@
+//! Derived artifact size bounds: create bounds reads by the artifact's own
+//! observed size, verify bounds reads by the manifest-declared size. The flat
+//! `ResourceLimits` byte caps remain enforceable as explicit opt-in ceilings
+//! but no longer reject large legitimate artifacts by default.
+
+use ordvec::RankQuant;
+use ordvec_manifest::{
+    create_manifest_for_index, create_manifest_for_index_with_options, verify_manifest_with_base,
+    CreateAuxiliaryArtifact, CreateManifestOptions, CreateRowIdentity, VerificationReport,
+    VerifyOptions,
+};
+use std::fs;
+use std::fs::OpenOptions;
+use std::io::Write;
+use std::path::{Path, PathBuf};
+
+const LEGACY_AUX_CAP: u64 = 64 * 1024 * 1024;
+
+fn write_index(dir: &Path) -> PathBuf {
+    let path = dir.join("index.ovrq");
+    let mut index = RankQuant::new(16, 2);
+    let docs: Vec<f32> = (0..32).map(|i| i as f32 - 12.0).collect();
+    index.add(&docs);
+    index.write(&path).unwrap();
+    path
+}
+
+fn error_codes(report: &VerificationReport) -> Vec<&str> {
+    report
+        .errors
+        .iter()
+        .map(|issue| issue.code.as_str())
+        .collect()
+}
+
+fn create_with_aux(dir: &Path, aux_path: &Path) -> (ordvec_manifest::IndexManifest, PathBuf) {
+    let index = write_index(dir);
+    let manifest_path = dir.join("manifest.json");
+    let manifest = create_manifest_for_index_with_options(
+        &index,
+        CreateRowIdentity::RowIdIdentity,
+        "test-embedding",
+        &manifest_path,
+        CreateManifestOptions {
+            auxiliary_artifacts: vec![CreateAuxiliaryArtifact {
+                name: "sidecar".to_string(),
+                path: aux_path.to_path_buf(),
+                required: true,
+            }],
+            ..CreateManifestOptions::default()
+        },
+    )
+    .unwrap();
+    (manifest, manifest_path)
+}
+
+/// Default options must accept auxiliary artifacts larger than the legacy
+/// 64 MiB flat cap, end to end: create records the artifact, verify passes.
+/// (A 1.26M-row dim=1024 sign sidecar is ~161 MB; the legacy default cap made
+/// such bundles impossible to create.)
+#[test]
+fn default_limits_accept_aux_artifact_above_legacy_cap() {
+    let temp = tempfile::tempdir().unwrap();
+    let aux_path = temp.path().join("sidecar.bin");
+    let aux_len = LEGACY_AUX_CAP + 4096;
+    let file = fs::File::create(&aux_path).unwrap();
+    file.set_len(aux_len).unwrap();
+    drop(file);
+
+    let (manifest, _) = create_with_aux(temp.path(), &aux_path);
+    assert_eq!(manifest.auxiliary_artifacts.len(), 1);
+    assert_eq!(manifest.auxiliary_artifacts[0].file_size_bytes, aux_len);
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert_eq!(
+        error_codes(&report),
+        Vec::<&str>::new(),
+        "expected clean verification for a {aux_len}-byte auxiliary artifact under defaults",
+    );
+}
+
+/// An auxiliary artifact that grew after manifest creation must be rejected
+/// by the declared-size read bound (fail-fast, without hashing the excess),
+/// keeping the established `auxiliary_artifact_file_too_large` reason code.
+#[test]
+fn verify_bounds_aux_read_by_declared_size_when_grown() {
+    let temp = tempfile::tempdir().unwrap();
+    let aux_path = temp.path().join("sidecar.bin");
+    fs::write(&aux_path, vec![7u8; 8192]).unwrap();
+
+    let (manifest, _) = create_with_aux(temp.path(), &aux_path);
+
+    let mut file = OpenOptions::new().append(true).open(&aux_path).unwrap();
+    file.write_all(&[7u8; 4096]).unwrap();
+    drop(file);
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert!(
+        error_codes(&report).contains(&"auxiliary_artifact_file_too_large"),
+        "grown artifact must fail the declared-size bound, got {:?}",
+        error_codes(&report),
+    );
+    assert_eq!(
+        report.auxiliary_artifacts[0].reason_code.as_deref(),
+        Some("auxiliary_artifact_file_too_large"),
+    );
+}
+
+/// Regression guard: a truncated auxiliary artifact still fails verification
+/// (size mismatch below the declared bound; the bound itself must not
+/// misclassify a smaller-than-declared file).
+#[test]
+fn verify_rejects_truncated_aux_artifact() {
+    let temp = tempfile::tempdir().unwrap();
+    let aux_path = temp.path().join("sidecar.bin");
+    fs::write(&aux_path, vec![7u8; 8192]).unwrap();
+
+    let (manifest, _) = create_with_aux(temp.path(), &aux_path);
+    let file = OpenOptions::new().write(true).open(&aux_path).unwrap();
+    file.set_len(4096).unwrap();
+    drop(file);
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert!(
+        error_codes(&report).contains(&"auxiliary_artifact_file_size_mismatch"),
+        "truncated artifact must fail size equality, got {:?}",
+        error_codes(&report),
+    );
+}
+
+/// Regression guard: a manifest whose declared auxiliary size was inflated
+/// (bytes on disk unchanged) still fails the size-equality check even though
+/// the SHA-256 matches.
+#[test]
+fn verify_rejects_inflated_declared_aux_size() {
+    let temp = tempfile::tempdir().unwrap();
+    let aux_path = temp.path().join("sidecar.bin");
+    fs::write(&aux_path, vec![7u8; 8192]).unwrap();
+
+    let (mut manifest, _) = create_with_aux(temp.path(), &aux_path);
+    manifest.auxiliary_artifacts[0].file_size_bytes = 1 << 30;
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert!(
+        error_codes(&report).contains(&"auxiliary_artifact_file_size_mismatch"),
+        "inflated declaration must fail size equality, got {:?}",
+        error_codes(&report),
+    );
+}
+
+/// An explicitly configured flat cap remains an enforceable ceiling on
+/// verify even when the artifact matches its declared size.
+#[test]
+fn explicit_flat_cap_still_enforced_on_verify() {
+    let temp = tempfile::tempdir().unwrap();
+    let aux_path = temp.path().join("sidecar.bin");
+    fs::write(&aux_path, vec![7u8; 8192]).unwrap();
+
+    let (manifest, _) = create_with_aux(temp.path(), &aux_path);
+    let mut options = VerifyOptions::default();
+    options.limits.max_auxiliary_artifact_bytes = 4096;
+
+    let report = verify_manifest_with_base(manifest, temp.path(), options);
+    assert!(
+        error_codes(&report).contains(&"auxiliary_artifact_file_too_large"),
+        "explicit tight cap must still reject, got {:?}",
+        error_codes(&report),
+    );
+}
+
+/// The primary index artifact gains a declared-size read bound: a primary
+/// artifact that grew after manifest creation fails fast with a dedicated
+/// reason code instead of being hashed in full.
+#[test]
+fn verify_bounds_primary_read_by_declared_size_when_grown() {
+    let temp = tempfile::tempdir().unwrap();
+    let index = write_index(temp.path());
+    let manifest_path = temp.path().join("manifest.json");
+    let manifest = create_manifest_for_index(
+        &index,
+        CreateRowIdentity::RowIdIdentity,
+        "test-embedding",
+        &manifest_path,
+    )
+    .unwrap();
+
+    let mut file = OpenOptions::new().append(true).open(&index).unwrap();
+    file.write_all(&[0u8; 4096]).unwrap();
+    drop(file);
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert!(
+        error_codes(&report).contains(&"artifact_file_too_large"),
+        "grown primary artifact must fail the declared-size bound, got {:?}",
+        error_codes(&report),
+    );
+}
+
+/// The primary index artifact honors an explicitly configured opt-in
+/// ceiling, mirroring the auxiliary/profile artifact classes (CIPHER-02).
+#[test]
+fn explicit_index_ceiling_enforced_on_primary() {
+    let temp = tempfile::tempdir().unwrap();
+    let index = write_index(temp.path());
+    let manifest_path = temp.path().join("manifest.json");
+    let manifest = create_manifest_for_index(
+        &index,
+        CreateRowIdentity::RowIdIdentity,
+        "test-embedding",
+        &manifest_path,
+    )
+    .unwrap();
+
+    let mut options = VerifyOptions::default();
+    options.limits.max_index_artifact_bytes = 8;
+
+    let report = verify_manifest_with_base(manifest, temp.path(), options);
+    assert!(
+        error_codes(&report).contains(&"artifact_file_too_large"),
+        "explicit index ceiling must reject, got {:?}",
+        error_codes(&report),
+    );
+}
+
+/// Non-regular files must be refused before opening: a read-only open of a
+/// FIFO blocks until a writer connects (CIPHER-001).
+#[cfg(unix)]
+#[test]
+fn verify_refuses_non_regular_artifact_files() {
+    let temp = tempfile::tempdir().unwrap();
+    let aux_path = temp.path().join("sidecar.bin");
+    fs::write(&aux_path, vec![7u8; 512]).unwrap();
+    let (manifest, _) = create_with_aux(temp.path(), &aux_path);
+
+    fs::remove_file(&aux_path).unwrap();
+    let status = std::process::Command::new("mkfifo")
+        .arg(&aux_path)
+        .status()
+        .unwrap();
+    assert!(status.success());
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert!(
+        error_codes(&report).contains(&"auxiliary_artifact_file_too_large"),
+        "FIFO artifact must be refused, got {:?}",
+        error_codes(&report),
+    );
+}
diff --git a/ordvec-manifest/tests/manifest.rs b/ordvec-manifest/tests/manifest.rs
index 3a583e27..71f9b57a 100644
--- a/ordvec-manifest/tests/manifest.rs
+++ b/ordvec-manifest/tests/manifest.rs
@@ -2272,19 +2272,17 @@ fn verify_for_load_fails_closed_with_report_for_corrupted_artifact() {
         serde_json::to_string_pretty(&manifest).unwrap(),
     )
     .unwrap();
-    fs::OpenOptions::new()
-        .append(true)
-        .open(&index)
-        .unwrap()
-        .write_all(b"\0")
-        .unwrap();
+    // Corrupt in place (same size): the declared-size read bound is
+    // satisfied, so verification proceeds to the digest and fails there.
+    let mut bytes = fs::read(&index).unwrap();
+    bytes[0] ^= 0xFF;
+    fs::write(&index, &bytes).unwrap();
 
     let err = verify_for_load(&manifest_path, VerifyOptions::default()).unwrap_err();
     let VerifiedLoadPlanError::VerificationFailed(report) = err else {
         panic!("expected verification failure");
     };
     assert!(error_codes(&report).contains(&"artifact_sha256_mismatch"));
-    assert!(error_codes(&report).contains(&"artifact_file_size_mismatch"));
 }
 
 #[test]
@@ -2328,8 +2326,9 @@ fn verify_for_load_plan_is_not_a_byte_pin() {
     let VerifiedLoadPlanError::VerificationFailed(report) = err else {
         panic!("expected verification failure");
     };
-    assert!(error_codes(&report).contains(&"artifact_sha256_mismatch"));
-    assert!(error_codes(&report).contains(&"artifact_file_size_mismatch"));
+    // The artifact grew past its declared size, so re-verification fails
+    // fast at the declared-size read bound.
+    assert!(error_codes(&report).contains(&"artifact_file_too_large"));
 }
 
 #[test]
@@ -2640,6 +2639,49 @@ fn auxiliary_artifacts_fail_closed_on_tamper_missing_and_path_escape() {
         .ends_with("missing.bin"));
 }
 
+#[test]
+fn manifest_shape_rejects_zero_declared_file_sizes_for_required_artifacts() {
+    let root = tempfile::tempdir().unwrap();
+    let (temp, mut manifest, _manifest_path) = identity_manifest(root.path());
+    fs::write(temp.path().join("extra.bin"), b"extra").unwrap();
+    let extra_hash = sha256_file(temp.path().join("extra.bin")).unwrap();
+
+    manifest.artifact.file_size_bytes = 0;
+    manifest.auxiliary_artifacts = vec![AuxiliaryArtifact {
+        name: "extra".to_string(),
+        path: "extra.bin".to_string(),
+        sha256: extra_hash.sha256,
+        file_size_bytes: 0,
+        required: true,
+    }];
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert!(!report.ok);
+    let codes = error_codes(&report);
+    assert!(codes.contains(&"artifact_file_size_zero"), "{codes:?}");
+    assert!(
+        codes.contains(&"auxiliary_artifact_file_size_zero"),
+        "{codes:?}"
+    );
+}
+
+#[test]
+fn optional_absent_zero_size_placeholder_is_not_flagged_zero_size() {
+    let root = tempfile::tempdir().unwrap();
+    let (temp, mut manifest, _manifest_path) = identity_manifest(root.path());
+    manifest.auxiliary_artifacts = vec![AuxiliaryArtifact {
+        name: "optional-model".to_string(),
+        path: "missing-model.json".to_string(),
+        sha256: "0".repeat(64),
+        file_size_bytes: 0,
+        required: false,
+    }];
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert!(report.ok, "{:?}", report.errors);
+    assert!(!error_codes(&report).contains(&"auxiliary_artifact_file_size_zero"));
+}
+
 #[test]
 fn auxiliary_artifact_schema_rejects_unknown_fields_and_duplicate_names() {
     let root = tempfile::tempdir().unwrap();
@@ -4255,3 +4297,70 @@ fn sqlite_cache_key_includes_limits_and_bounds_cached_report_size() {
     .unwrap_err();
     assert_eq!(err.code(), Some("sqlite_cached_report_too_large"));
 }
+
+#[test]
+fn grown_profiles_fail_fast_at_declared_size_under_default_limits() {
+    // Derived-limits regression coverage for the two profile call sites:
+    // a profile grown past its manifest-declared size must fail fast with
+    // the *_too_large code at DEFAULT options (bound = declared size), not
+    // be hashed in full and reported as a digest mismatch.
+    let temp = tempfile::tempdir().unwrap();
+    let case = tempfile::tempdir_in(temp.path()).unwrap();
+    let profile_dir = case.path().join("profiles");
+    fs::create_dir(&profile_dir).unwrap();
+    let index = write_index_kind(case.path(), FixtureKind::Bitmap);
+    let manifest_path = case.path().join("manifest.json");
+    let mut manifest = create_manifest_for_index(
+        &index,
+        CreateRowIdentity::RowIdIdentity,
+        "test-embedding",
+        &manifest_path,
+    )
+    .unwrap();
+
+    let calibration_path = profile_dir.join("profile.f64");
+    let calibration_hash = write_profile(
+        &calibration_path,
+        manifest.artifact.dim * std::mem::size_of::<f64>(),
+    );
+    manifest.calibration = Some(weighted_calibration(
+        &manifest,
+        "profiles/profile.f64",
+        calibration_hash,
+        CalibrationOrdinalization::TopK {
+            dim: manifest.artifact.dim,
+            k: 16,
+        },
+        ProfileParameterization::MarginalTopKFrequency,
+        vec![manifest.artifact.dim],
+    ));
+
+    let distortion_path = profile_dir.join("distortion.json");
+    let distortion_hash = write_profile(&distortion_path, 128);
+    manifest.encoder_distortion = Some(distortion_profile(
+        &manifest,
+        Some("profiles/distortion.json".to_string()),
+        Some(distortion_hash),
+        DistortionEvidenceKind::EmpiricalSample,
+    ));
+
+    let report = verify_manifest_with_base(manifest.clone(), case.path(), VerifyOptions::default());
+    assert!(report.ok, "{:?}", report.errors);
+
+    // Grow both profile files past their declarations.
+    for path in [&calibration_path, &distortion_path] {
+        let mut file = fs::OpenOptions::new().append(true).open(path).unwrap();
+        file.write_all(&[0u8; 64]).unwrap();
+    }
+
+    let report = verify_manifest_with_base(manifest, case.path(), VerifyOptions::default());
+    let codes = error_codes(&report);
+    assert!(
+        codes.contains(&"calibration_profile_too_large"),
+        "grown calibration profile must fail the declared-size bound, got {codes:?}",
+    );
+    assert!(
+        codes.contains(&"encoder_distortion_profile_too_large"),
+        "grown encoder distortion profile must fail the declared-size bound, got {codes:?}",
+    );
+}
diff --git a/ordvec-python/Cargo.toml b/ordvec-python/Cargo.toml
index 174fe13f..fb0a3cd4 100644
--- a/ordvec-python/Cargo.toml
+++ b/ordvec-python/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "ordvec-python"
-version = "0.5.0"
+version = "0.6.0"
 edition = "2021"
 rust-version = "1.89" # inherits ordvec's AVX-512 MSRV floor
 description = "Python bindings for ordvec — training-free ordinal & sign vector quantization"
diff --git a/ordvec-python/pyproject.toml b/ordvec-python/pyproject.toml
index d627aa12..79b26f75 100644
--- a/ordvec-python/pyproject.toml
+++ b/ordvec-python/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "maturin"
 
 [project]
 name = "ordvec"
-version = "0.5.0"
+version = "0.6.0"
 description = "Training-free ordinal & sign quantization for compressed vector retrieval"
 readme = "README.md"
 requires-python = ">=3.10"
diff --git a/ordvec-python/python/ordvec/__init__.py b/ordvec-python/python/ordvec/__init__.py
index 4726b895..5cfa2911 100644
--- a/ordvec-python/python/ordvec/__init__.py
+++ b/ordvec-python/python/ordvec/__init__.py
@@ -115,4 +115,4 @@
     "SignBitmapIndex",
 ]
 
-__version__ = "0.5.0"
+__version__ = "0.6.0"
diff --git a/src/quant.rs b/src/quant.rs
index 1c8ce627..321f748b 100644
--- a/src/quant.rs
+++ b/src/quant.rs
@@ -33,7 +33,7 @@ use crate::quant_kernels::{
     scan_b2_asym_avx2, scan_b2_asym_avx512, scan_b4_asym_avx2, scan_b4_asym_avx512,
 };
 use crate::rank::{
-    bucket_centre, bucket_ranks, pack_buckets, rank_to_bucket, rank_transform,
+    bucket_centre, bucket_ranks, pack_buckets, rank_to_bucket, rank_transform, rank_transform_into,
     rankquant_bytes_per_vec, rankquant_norm,
 };
 use crate::sign_bitmap::SignBitmap;
@@ -601,12 +601,15 @@ impl RankQuant {
         self.packed[start..]
             .par_chunks_mut(bytes_per_vec)
             .zip(vectors.par_chunks(dim))
-            .for_each(|(out, v)| {
-                let ranks = rank_transform(v);
-                let buckets = bucket_ranks(&ranks, bits);
-                let packed = pack_buckets(&buckets, bits);
-                out.copy_from_slice(&packed);
-            });
+            .for_each_init(
+                || vec![0u16; dim],
+                |ranks, (out, v)| {
+                    rank_transform_into(v, ranks);
+                    let buckets = bucket_ranks(ranks, bits);
+                    let packed = pack_buckets(&buckets, bits);
+                    out.copy_from_slice(&packed);
+                },
+            );
         self.n_vectors = new_n;
     }
 
diff --git a/src/rank_io.rs b/src/rank_io.rs
index e05505c8..d0b29572 100644
--- a/src/rank_io.rs
+++ b/src/rank_io.rs
@@ -765,8 +765,39 @@ fn load_rankquant_from_stream<R: Read + Seek>(
     let expected_per_bucket = dim / n_buckets;
     let mask = (1u8 << bits) - 1;
     let bits_u = bits as usize;
-    for (row_idx, row) in packed.chunks_exact(bytes_per_row).enumerate() {
-        let mut hist = [0usize; 16]; // n_buckets <= 2^4 = 16
+    // Per-byte bucket-count LUT: byte value -> how many of its packed codes
+    // land in each bucket. Replaces the per-code shift/mask loop (dim ops
+    // per row) with bytes_per_row table lookups, and rows check in parallel
+    // (they are independent). `find_first` preserves the serial contract of
+    // reporting the lowest offending row.
+    let mut lut = [[0u8; 16]; 256];
+    for (byte, counts) in lut.iter_mut().enumerate() {
+        for slot in 0..codes_per_byte {
+            let shift = (codes_per_byte - 1 - slot) * bits_u;
+            counts[((byte as u8 >> shift) & mask) as usize] += 1;
+        }
+    }
+    let row_is_valid = |row: &[u8]| {
+        let mut hist = [0u16; 16];
+        for &byte in row {
+            let counts = &lut[byte as usize];
+            for bucket in 0..n_buckets {
+                hist[bucket] += u16::from(counts[bucket]);
+            }
+        }
+        hist[..n_buckets]
+            .iter()
+            .all(|&count| count as usize == expected_per_bucket)
+    };
+    use rayon::prelude::*;
+    let first_bad = (0..n_vectors).into_par_iter().find_first(|&row_idx| {
+        !row_is_valid(&packed[row_idx * bytes_per_row..(row_idx + 1) * bytes_per_row])
+    });
+    if let Some(row_idx) = first_bad {
+        // Rerun the scalar histogram on the offending row for the exact
+        // bucket/count in the error message.
+        let row = &packed[row_idx * bytes_per_row..(row_idx + 1) * bytes_per_row];
+        let mut hist = [0usize; 16];
         for &byte in row {
             for slot in 0..codes_per_byte {
                 let shift = (codes_per_byte - 1 - slot) * bits_u;
@@ -781,6 +812,7 @@ fn load_rankquant_from_stream<R: Read + Seek>(
                 )));
             }
         }
+        unreachable!("row {row_idx} failed the LUT check but passed the scalar recheck");
     }
     Ok((bits, dim, n_vectors, packed))
 }
diff --git a/src/sign_bitmap.rs b/src/sign_bitmap.rs
index 66f971ab..649443d0 100644
--- a/src/sign_bitmap.rs
+++ b/src/sign_bitmap.rs
@@ -39,6 +39,7 @@
 //! scalar path. See [`crate::avx512vpop_supported`].
 
 use rayon::prelude::*;
+use std::collections::BinaryHeap;
 
 use crate::OrdvecError;
 
@@ -220,6 +221,112 @@ impl SignBitmap {
     /// SIMD dispatch paths — same audit discipline as
     /// [`crate::Bitmap::top_m_candidates`].
     #[must_use = "this scans the corpus to generate candidates; dropping the result discards that work"]
+    /// Streamed exact top-m selection shared by [`Self::top_m_candidates`]
+    /// and [`Self::top_m_candidates_batched_serial_csr`]: the corpus is
+    /// scanned once per call in L2-sized doc blocks, each hot block is
+    /// scored against every query (in small query tiles), and per-query
+    /// bounded min-m collectors keyed by `(hamming, doc_id)` select exactly
+    /// the lexicographic top-m — bit-identical to a full sort, independent
+    /// of processing order. Serial by contract: no rayon.
+    fn top_m_candidates_streamed(&self, queries: &[f32], m_eff: usize) -> Vec<Vec<u32>> {
+        const TILE_QUERIES: usize = 32;
+        const BLOCK_BYTES: usize = 256 * 1024;
+
+        let dim = self.dim;
+        debug_assert!(
+            queries.len().is_multiple_of(dim),
+            "queries buffer must be a whole number of rows"
+        );
+        let nq = queries.len() / dim;
+        let qpv = self.qwords_per_vec;
+        let n = self.n_vectors;
+        debug_assert!(m_eff >= 1 && m_eff <= n);
+
+        // Build bitmaps in place: the entry points already validated the
+        // whole query buffer, and build_query_bitmap would allocate a fresh
+        // Vec (and re-validate) per query on this hot path.
+        let mut q_bitmaps = vec![0u64; nq * qpv];
+        for qi in 0..nq {
+            let q = &queries[qi * dim..(qi + 1) * dim];
+            let bm = &mut q_bitmaps[qi * qpv..(qi + 1) * qpv];
+            for (j, &value) in q.iter().enumerate() {
+                if value > 0.0 {
+                    bm[j / 64] |= 1u64 << (j % 64);
+                }
+            }
+        }
+
+        let block_docs = (BLOCK_BYTES / (qpv * 8)).max(64).min(n);
+        let tile = TILE_QUERIES.min(nq);
+        let mut block_scores = vec![0u32; tile * block_docs];
+        // Each collector is a max-heap, so the current worst kept key sits
+        // at the top and the retained set is always the m lexicographically
+        // smallest (hamming, doc_id) keys seen so far.
+        // Selection state is O(nq * m_eff) on top of the CSR output, so the
+        // product gets an explicit checked bound with a clear message
+        // (the multiplication can overflow on 32-bit/wasm32 targets), per
+        // the crate's checked-allocation discipline. Each heap reserves
+        // exactly m_eff + 1 slots on purpose: gradual growth would
+        // double-allocate to the next power of two (~2x m_eff peak per
+        // query). Callers with extreme nq * m_eff should tile the query
+        // batch (as OrdinalDB's chunk scheduler does).
+        let selection_cells = nq.checked_mul(m_eff).unwrap_or_else(|| {
+            panic!("selection state nq ({nq}) * m ({m_eff}) overflows usize; tile the query batch")
+        });
+        let _ = selection_cells;
+        let mut heaps: Vec<BinaryHeap<(u32, u32)>> = (0..nq)
+            .map(|_| BinaryHeap::with_capacity(m_eff + 1))
+            .collect();
+        // Cached copy of each full heap's worst kept hamming. Doc ids visit
+        // each heap strictly ascending (d ascends within a row, blocks
+        // ascend), so a candidate tying the worst hamming always loses the
+        // (hamming, doc_id) tie-break — once full, the boundary test
+        // reduces to one u32 compare against this register. u32::MAX while
+        // filling (hamming <= dim can never reach it).
+        let mut worst_bounds = vec![u32::MAX; nq];
+
+        let mut block_start = 0usize;
+        while block_start < n {
+            let bn = block_docs.min(n - block_start);
+            let block = &self.bitmaps[block_start * qpv..(block_start + bn) * qpv];
+            let mut tile_start = 0usize;
+            while tile_start < nq {
+                let tq = tile.min(nq - tile_start);
+                let qb_tile = &q_bitmaps[tile_start * qpv..(tile_start + tq) * qpv];
+                let scores = &mut block_scores[..tq * bn];
+                sign_scan_collect_batched(block, bn, qpv, qb_tile, tq, scores);
+                for ti in 0..tq {
+                    let heap = &mut heaps[tile_start + ti];
+                    let worst = &mut worst_bounds[tile_start + ti];
+                    let row = &scores[ti * bn..(ti + 1) * bn];
+                    for (d, &hamming) in row.iter().enumerate() {
+                        if hamming >= *worst {
+                            continue;
+                        }
+                        heap.push((hamming, (block_start + d) as u32));
+                        if heap.len() > m_eff {
+                            heap.pop();
+                        }
+                        if heap.len() == m_eff {
+                            *worst = heap.peek().expect("full collector").0;
+                        }
+                    }
+                }
+                tile_start += tq;
+            }
+            block_start += bn;
+        }
+
+        heaps
+            .into_iter()
+            .map(|heap| {
+                let mut kept = heap.into_vec();
+                kept.sort_unstable();
+                kept.into_iter().map(|(_, doc)| doc).collect()
+            })
+            .collect()
+    }
+
     pub fn top_m_candidates(&self, q: &[f32], m: usize) -> Vec<u32> {
         assert_eq!(q.len(), self.dim);
         crate::util::assert_all_finite(q);
@@ -227,6 +334,10 @@ impl SignBitmap {
         if m_eff == 0 {
             return Vec::new();
         }
+        // Single-query stays on the dense partition path: with one query
+        // there is no scan to share, and select_nth_unstable_by (O(n)
+        // average) measurably beats an O(n log m) bounded heap for m in the
+        // hundreds at small/medium n (audit: +50-90% regression otherwise).
         let qb = self.build_query_bitmap(q);
         let mut scores = vec![0u32; self.n_vectors]; // Hamming distance per doc
         sign_scan_collect(
@@ -313,10 +424,17 @@ impl SignBitmap {
     /// pool. (The existing [`Self::top_m_candidates_batched`] remains the
     /// internally-parallel standalone convenience.)
     ///
-    /// Track-1 implementation is intentionally naive — it loops the single-query
-    /// [`Self::top_m_candidates`] (which materialises a per-query `n` Hamming
-    /// row). A future release may replace the internals with streaming top-m
-    /// behind this frozen signature; the CSR output contract will not change.
+    /// The internals stream the corpus **once per call** in L2-sized doc
+    /// blocks, scoring every query of the call against each hot block and
+    /// selecting per-query top-m with bounded `(hamming, doc_id)` collectors.
+    /// Per-query corpus traffic drops by a factor of the call's query count
+    /// relative to the historical per-query rescan. The CSR output contract
+    /// is unchanged, and results match the previous implementation bit for bit.
+    ///
+    /// "Serial" scopes the scan and selection: no rayon is entered for the
+    /// candidate work, so callers own that parallelism. Input finite-
+    /// validation MAY briefly use the global rayon pool for large query
+    /// buffers (order-independent boolean reduction; deterministic).
     ///
     /// # Example
     /// ```no_run
@@ -344,10 +462,17 @@ impl SignBitmap {
         let m_eff = m.min(self.n_vectors);
         let mut offsets = Vec::with_capacity(nq + 1);
         offsets.push(0usize);
-        let mut candidates = Vec::with_capacity(nq.saturating_mul(m_eff));
-        for qi in 0..nq {
-            let q = &queries[qi * dim..(qi + 1) * dim];
-            let row = self.top_m_candidates(q, m);
+        let mut candidates = Vec::with_capacity(nq.checked_mul(m_eff).unwrap_or_else(|| {
+            panic!("CSR output nq ({nq}) * m ({m_eff}) overflows usize; tile the query batch")
+        }));
+        if nq == 0 || m_eff == 0 {
+            offsets.extend(std::iter::repeat_n(0usize, nq));
+            return CandidateBatch {
+                candidates,
+                offsets,
+            };
+        }
+        for row in self.top_m_candidates_streamed(queries, m_eff) {
             candidates.extend_from_slice(&row);
             offsets.push(candidates.len());
         }
@@ -662,6 +787,59 @@ fn sign_scan_collect_batched(
     }
 }
 
+/// Fold eight u64-lane accumulators into one vector holding their eight
+/// horizontal sums, in accumulator order: an unpack/permute/shuffle tree
+/// (25 vector ops) replacing eight serial `_mm512_reduce_add_epi64`
+/// expansions on the per-doc hot path.
+#[cfg(target_arch = "x86_64")]
+#[target_feature(enable = "avx512f")]
+unsafe fn hsum8_epi64_avx512(accs: &[std::arch::x86_64::__m512i; 8]) -> std::arch::x86_64::__m512i {
+    use std::arch::x86_64::*;
+    {
+        // L1: pairwise lane sums, interleaved per source:
+        // s01 = [a0p01, a1p01, a0p23, a1p23, a0p45, a1p45, a0p67, a1p67]
+        let s01 = _mm512_add_epi64(
+            _mm512_unpacklo_epi64(accs[0], accs[1]),
+            _mm512_unpackhi_epi64(accs[0], accs[1]),
+        );
+        let s23 = _mm512_add_epi64(
+            _mm512_unpacklo_epi64(accs[2], accs[3]),
+            _mm512_unpackhi_epi64(accs[2], accs[3]),
+        );
+        let s45 = _mm512_add_epi64(
+            _mm512_unpacklo_epi64(accs[4], accs[5]),
+            _mm512_unpackhi_epi64(accs[4], accs[5]),
+        );
+        let s67 = _mm512_add_epi64(
+            _mm512_unpacklo_epi64(accs[6], accs[7]),
+            _mm512_unpackhi_epi64(accs[6], accs[7]),
+        );
+        // L2: gather even/odd u64s across pair vectors:
+        // e02 = [a0p01, a0p23, a0p45, a0p67, a2p01, a2p23, a2p45, a2p67]
+        let even_idx = _mm512_setr_epi64(0, 2, 4, 6, 8, 10, 12, 14);
+        let odd_idx = _mm512_setr_epi64(1, 3, 5, 7, 9, 11, 13, 15);
+        let e02 = _mm512_permutex2var_epi64(s01, even_idx, s23);
+        let o13 = _mm512_permutex2var_epi64(s01, odd_idx, s23);
+        let e46 = _mm512_permutex2var_epi64(s45, even_idx, s67);
+        let o57 = _mm512_permutex2var_epi64(s45, odd_idx, s67);
+        // L3: pairwise again ->
+        // w1 = [a0p0123, a1p0123, a0p4567, a1p4567, a2p0123, a3p0123, a2p4567, a3p4567]
+        let w1 = _mm512_add_epi64(
+            _mm512_unpacklo_epi64(e02, o13),
+            _mm512_unpackhi_epi64(e02, o13),
+        );
+        let w2 = _mm512_add_epi64(
+            _mm512_unpacklo_epi64(e46, o57),
+            _mm512_unpackhi_epi64(e46, o57),
+        );
+        // L4: fold 128-bit blocks. w1 holds B0=[a0p0123,a1p0123],
+        // B1=[a0p4567,a1p4567], B2/B3 likewise for a2/a3; sum B0+B1, B2+B3.
+        let t = _mm512_shuffle_i64x2(w1, w2, 0b10_00_10_00);
+        let u = _mm512_shuffle_i64x2(w1, w2, 0b11_01_11_01);
+        _mm512_add_epi64(t, u)
+    }
+}
+
 #[cfg(target_arch = "x86_64")]
 #[target_feature(enable = "avx512f,avx512vpopcntdq")]
 unsafe fn sign_scan_collect_batched_avx512vpop(
@@ -734,9 +912,11 @@ unsafe fn sign_scan_collect_batched_avx512vpop(
                         accs[bi] = _mm512_add_epi64(accs[bi], _mm512_popcnt_epi64(xor_zmm));
                     }
                 }
+                let sums = hsum8_epi64_avx512(&accs);
+                let mut sums_arr = [0u64; CHUNK];
+                _mm512_storeu_si512(sums_arr.as_mut_ptr() as *mut __m512i, sums);
                 for bi in 0..CHUNK {
-                    let acc_sum: i64 = _mm512_reduce_add_epi64(accs[bi]);
-                    scores[(chunk_start + bi) * n + di] = acc_sum as u32;
+                    scores[(chunk_start + bi) * n + di] = sums_arr[bi] as u32;
                 }
             }
             chunk_start += CHUNK;
diff --git a/src/util.rs b/src/util.rs
index 5f9eb1dd..eecdb2ef 100644
--- a/src/util.rs
+++ b/src/util.rs
@@ -124,8 +124,18 @@ pub(crate) fn l2_normalise_into(out: &mut Vec<f32>, v: &[f32]) {
 /// validate separately; this is the Rust-side backstop.
 #[inline]
 pub(crate) fn assert_all_finite(v: &[f32]) {
+    // Large ingest batches pay a full serial pass here (measured ~0.1s per
+    // GiB); split the scan across the pool once it dwarfs the fork cost.
+    const PARALLEL_THRESHOLD: usize = 1 << 20;
+    let all_finite = if v.len() >= PARALLEL_THRESHOLD {
+        use rayon::prelude::*;
+        v.par_chunks(1 << 18)
+            .all(|c| c.iter().all(|x| x.is_finite()))
+    } else {
+        v.iter().all(|x| x.is_finite())
+    };
     assert!(
-        v.iter().all(|x| x.is_finite()),
+        all_finite,
         "ordvec: input contains non-finite (NaN or ±Inf) values; embeddings must be finite"
     );
 }
diff --git a/tests/tiled_candgen.rs b/tests/tiled_candgen.rs
new file mode 100644
index 00000000..33ac4144
--- /dev/null
+++ b/tests/tiled_candgen.rs
@@ -0,0 +1,175 @@
+//! Contract-pinning tests for sign candidate generation, written ahead of the
+//! tiled internals swap of `top_m_candidates` /
+//! `top_m_candidates_batched_serial_csr`. The oracle is independent of the
+//! implementation under test: `score_all` (dense agreement counts) plus a
+//! full lexicographic sort by `(hamming asc, doc_id asc)`. These tests pin
+//! today's behavior exactly — including tie handling at the m-th position —
+//! and must pass bit-identically before and after the swap.
+
+use ordvec::SignBitmap;
+
+/// Deterministic xorshift so corpora are reproducible without a rand dep.
+struct XorShift(u64);
+
+impl XorShift {
+    fn next_f32(&mut self) -> f32 {
+        self.0 ^= self.0 << 13;
+        self.0 ^= self.0 >> 7;
+        self.0 ^= self.0 << 17;
+        // Map to [-1, 1) with plenty of sign variety.
+        ((self.0 >> 40) as f32 / 8_388_608.0) - 1.0
+    }
+}
+
+fn random_corpus(dim: usize, n: usize, seed: u64) -> Vec<f32> {
+    let mut rng = XorShift(seed | 1);
+    (0..n * dim).map(|_| rng.next_f32()).collect()
+}
+
+/// Tie-heavy corpus: every coordinate is +/-1 drawn from a tiny pattern set,
+/// so hamming distances collide massively and the (hamming, doc_id)
+/// tie-break does real work at the selection boundary.
+fn tie_heavy_corpus(dim: usize, n: usize) -> Vec<f32> {
+    (0..n)
+        .flat_map(|doc| {
+            let pattern = doc % 4;
+            (0..dim).map(move |c| if (c + pattern) % 3 == 0 { -1.0 } else { 1.0 })
+        })
+        .collect()
+}
+
+fn oracle_top_m(sign: &SignBitmap, q: &[f32], m: usize) -> Vec<u32> {
+    let dim_u32 = u32::try_from(q.len()).unwrap();
+    // score_all returns agreement (dim - hamming), higher is better.
+    let agreements = sign.score_all(q);
+    let mut ids: Vec<u32> = (0..agreements.len() as u32).collect();
+    ids.sort_by_key(|&i| (dim_u32 - agreements[i as usize], i));
+    ids.truncate(m.min(agreements.len()));
+    ids
+}
+
+fn assert_contract(dim: usize, vectors: &[f32], queries: &[f32], m: usize, label: &str) {
+    let mut sign = SignBitmap::new(dim);
+    sign.add(vectors);
+    let nq = queries.len() / dim;
+
+    // Single-query path.
+    for qi in 0..nq {
+        let q = &queries[qi * dim..(qi + 1) * dim];
+        let got = sign.top_m_candidates(q, m);
+        let want = oracle_top_m(&sign, q, m);
+        assert_eq!(
+            got, want,
+            "{label}: single-query mismatch at query {qi}, m={m}"
+        );
+    }
+
+    // Batched serial CSR path: row qi must equal the single-query result.
+    let cb = sign.top_m_candidates_batched_serial_csr(queries, m);
+    assert_eq!(cb.offsets.len(), nq + 1, "{label}: CSR offsets length");
+    for qi in 0..nq {
+        let row = &cb.candidates[cb.offsets[qi]..cb.offsets[qi + 1]];
+        let want = oracle_top_m(&sign, &queries[qi * dim..(qi + 1) * dim], m);
+        assert_eq!(
+            row,
+            &want[..],
+            "{label}: CSR row mismatch at query {qi}, m={m}"
+        );
+    }
+}
+
+/// Random corpus large enough to span many doc blocks under any plausible
+/// tile size, at a SIMD-friendly dim.
+#[test]
+fn random_corpus_matches_oracle_across_block_boundaries() {
+    // dim=512 -> 8 qwords/vec -> 4096-doc blocks; n=10240 spans three
+    // blocks including a final partial one (audit: the previous dim=128
+    // shape fit in a single block, so the loop never crossed a boundary).
+    let dim = 512;
+    let n = 10_240;
+    let vectors = random_corpus(dim, n, 0xC0FFEE);
+    let queries = random_corpus(dim, 33, 0xBEEF);
+    for m in [1, 7, 256, 500] {
+        assert_contract(dim, &vectors, &queries, m, "random");
+    }
+}
+
+/// Massive hamming ties: selection at the boundary is decided purely by
+/// doc_id ascending. This is the case a streaming collector most easily gets
+/// subtly wrong.
+#[test]
+fn tie_heavy_corpus_selects_lowest_doc_ids_at_boundary() {
+    let dim = 64;
+    let n = 4_096;
+    let vectors = tie_heavy_corpus(dim, n);
+    let queries = random_corpus(dim, 9, 0xABCD);
+    for m in [1, 3, 100, 1_000] {
+        assert_contract(dim, &vectors, &queries, m, "tie-heavy");
+    }
+}
+
+/// Exact duplicate documents: every duplicate group is one giant tie run,
+/// longer than m, exercising equal-hamming runs that exceed the collector.
+#[test]
+fn duplicate_documents_tie_runs_longer_than_m() {
+    let dim = 64;
+    let base = random_corpus(dim, 8, 0x1234);
+    // 8 distinct vectors, each repeated 512 times => tie runs of 512.
+    let mut vectors = Vec::with_capacity(8 * 512 * dim);
+    for _ in 0..512 {
+        // Each pass appends all 8 base vectors, so copies of one vector sit 8 doc ids apart.
+        vectors.extend_from_slice(&base);
+    }
+    let queries = random_corpus(dim, 5, 0x9999);
+    for m in [10, 100, 513] {
+        assert_contract(dim, &vectors, &queries, m, "duplicates");
+    }
+}
+
+/// Edge geometry: m > n, m == n, single doc, single query, nq == 0.
+#[test]
+fn edge_geometries_match_oracle() {
+    let dim = 64;
+    let vectors = random_corpus(dim, 17, 0x42);
+    let queries = random_corpus(dim, 3, 0x43);
+    for m in [17, 25, 1] {
+        assert_contract(dim, &vectors, &queries, m, "edge");
+    }
+
+    let single_doc = random_corpus(dim, 1, 0x77);
+    assert_contract(dim, &single_doc, &queries, 4, "single-doc");
+
+    // Empty query batch: CSR must be a single zero offset and no candidates.
+    let mut sign = SignBitmap::new(dim);
+    sign.add(&vectors);
+    let cb = sign.top_m_candidates_batched_serial_csr(&[], 8);
+    assert_eq!(cb.offsets, vec![0]);
+    assert!(cb.candidates.is_empty());
+}
+
+/// Large-dim smoke at the shape the arXiv corpus uses (1024 dims), enough
+/// rows to cross several L2-sized doc blocks.
+#[test]
+fn dim_1024_shape_matches_oracle() {
+    let dim = 1024;
+    let n = 6_000;
+    let vectors = random_corpus(dim, n, 0xA5A5);
+    let queries = random_corpus(dim, 8, 0x5A5A);
+    for m in [256, 320] {
+        assert_contract(dim, &vectors, &queries, m, "dim1024");
+    }
+}
+
+/// AVX-512 tail residue (dim=768 -> qpv=12, rem=4) composed with
+/// multi-block crossing and a final partial block — the kernel-shape case
+/// the audit flagged as untested in the permanent suite.
+#[test]
+fn dim_768_tail_residue_crosses_blocks() {
+    let dim = 768;
+    let n = 3_200; // block_docs = 262144/96 = 2730 -> 2 blocks, partial tail
+    let vectors = random_corpus(dim, n, 0x7E57);
+    let queries = random_corpus(dim, 7, 0x7E58);
+    for m in [64, 320] {
+        assert_contract(dim, &vectors, &queries, m, "dim768-tail");
+    }
+}