diff --git a/CHANGELOG.md b/CHANGELOG.md
index b75da59d..1888c618 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,88 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 _No unreleased changes._
+## 0.6.0 - 2026-07-04
+
+### Performance
+
+- **Batched sign candidate generation now streams the corpus once per call.**
+  `SignBitmap::top_m_candidates_batched_serial_csr` previously looped the
+  single-query path, re-streaming the full sign bitmap per query (the
+  documented-naive first cut). The internals now scan the corpus once per call
+  in L2-sized doc blocks, score every query of the call against each hot block
+  in query tiles via the existing batched kernel, and select per-query top-m
+  with bounded `(hamming, doc_id)` min-collectors — bit-identical to a full
+  sort by construction, independent of processing order (the key *is* the
+  contract's sort key), pinned by an independent oracle suite
+  (`tests/tiled_candgen.rs`) across random, tie-heavy, duplicate-run, and edge
+  geometries. Per-query corpus traffic drops by the call's query count: at
+  1.26M rows × dim=1024, a 2048-query call reads the 161 MB sign sidecar once
+  instead of 2048 times. `top_m_candidates` routes through the same core
+  (dropping its per-call n-row Hamming materialisation) except at `nq=1`,
+  which keeps the dense partition path — the streamed core measured +50–90%
+  single-query time at small/medium `n` with `m` in the hundreds (bounded heap
+  `O(n log m)` vs `select_nth_unstable_by` `O(n)`), while the dense path is
+  parity-or-better at every measured size. The serial contract covers the
+  candidate scan and selection (no rayon there; callers own that
+  parallelism) — input finite-validation on large query buffers may
+  briefly use the global rayon pool (order-independent, deterministic).
+  `top_m_candidates_batched` (the internally-parallel convenience) is
+  unchanged by this work. Together with the collector worst-bound change below, measured
+  downstream in a two-stage retrieval stack at 1.26M × 1024: batched search
+  throughput 220 → 10.2k queries/s, results bit-identical.
+- **Candidate-collector accept test reduced to a cached worst-bound compare.**
+  Doc ids visit each per-query heap strictly ascending, so a candidate tying
+  the worst kept hamming always loses the `(hamming, doc_id)` tie-break — once
+  the collector is full, the accept test is exactly `hamming < worst kept
+  hamming`. That bound is now cached in a register-friendly `u32` (`u32::MAX`
+  while filling), skipping the heap peek + tuple compare on the ~99.8% reject
+  path. Bit-identical by construction; pinned by the tie-heavy and
+  duplicate-run oracle suites.
+- **Parallel finite-input validation and scratch-based rank encode.**
+  `assert_all_finite` paid a full serial pass per add/search batch — measured
+  ~0.1 s per GiB, twice per ingest batch counting the caller layer. Scans of
+  1M+ floats now split across the rayon pool (4.4× measured).
+  `RankQuant::add`'s per-row closure allocated a fresh ranks `Vec` per vector
+  inside the parallel loop; it now reuses a per-worker scratch via
+  `rank_transform_into`. Measured on a 1.26M × 1024 corpus slice: encode-path
+  validation attribution 0.097 s serial scan → 0.022 s parallel, with the
+  per-vector allocation churn removed from the hot loop.
+- **LUT + parallel constant-composition check on `RankQuant` load.**
+  `load_rankquant`'s forged-buffer defense histogrammed every packed code
+  serially — 1.29 billion shift/mask ops at 1.26M × 1024, ~1 s of the 1.27 s
+  verified open. A 4 KB per-byte bucket-count LUT replaces the per-code inner
+  loop and rows validate in parallel; `find_first` keeps the
+  lowest-offending-row error contract, with a scalar recheck producing the
+  identical message. The security property is unchanged: every row still
+  proves uniform composition before the index is usable. Measured verified
+  open at 1.26M × 1024: 1.27 s → 0.38 s.
+
+### Changed
+
+- **ordvec-manifest: derived artifact size bounds.** Verification now bounds
+  every artifact read by its manifest-declared `file_size_bytes` (the manifest
+  itself remains hard-capped at 1 MiB and SHA-256 pins content); manifest
+  creation bounds reads by the artifact's observed size. The flat
+  `ResourceLimits` byte caps (`max_auxiliary_artifact_bytes`,
+  `max_calibration_profile_bytes`, `max_encoder_distortion_profile_bytes`)
+  are now explicit opt-in ceilings and default to unbounded — previously the
+  64 MiB auxiliary default made legitimate large sign sidecars (>524,288 rows
+  at dim=1024) impossible to write with default options.
+- **ordvec-manifest: primary artifact reads are now bounded.** The primary
+  index artifact is hashed under its declared size (new
+  `artifact_file_too_large` reason code); previously this read was unbounded.
+  An artifact grown past its declaration now fails fast at the read bound
+  instead of surfacing as a digest mismatch after hashing the excess.
+- **ordvec-manifest: primary index artifact gains an opt-in ceiling.** New
+  `ResourceLimits::max_index_artifact_bytes` (default unbounded) mirrors the
+  auxiliary/profile ceilings; the create path also bounds the primary read by
+  its observed size. Note: a grown artifact now surfaces as
+  `*_file_too_large` (fail-fast) rather than `*_file_size_mismatch`, which
+  now indicates truncation only.
+- **ordvec-manifest: bounded hashing streams with constant memory.**
+  `sha256_file_bounded` no longer materialises the file in memory before
+  hashing.
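A minimal sketch of the bounded `(hamming, doc_id)` collector and the cached worst-bound accept test described in the Performance notes above, assuming a plain `std::collections::BinaryHeap`. The names `TopM`, `offer`, `top_m_streamed`, and `top_m_by_full_sort` are hypothetical and are not ordvec's API; the real collectors may be laid out quite differently.

```rust
use std::collections::BinaryHeap;

/// Hypothetical bounded collector: keeps the `m` smallest (hamming, doc_id) keys.
struct TopM {
    m: usize,
    /// Max-heap, so the worst kept key sits on top.
    heap: BinaryHeap<(u32, u32)>,
    /// Cached hamming of the worst kept key; `u32::MAX` while still filling.
    worst: u32,
}

impl TopM {
    fn new(m: usize) -> Self {
        TopM { m, heap: BinaryHeap::with_capacity(m), worst: u32::MAX }
    }

    /// Assumes doc ids are offered in strictly ascending order. Under that
    /// assumption a tie on hamming always loses the (hamming, doc_id)
    /// tie-break, so once full the accept test is a single `u32` compare.
    fn offer(&mut self, hamming: u32, doc_id: u32) {
        if self.m == 0 {
            return;
        }
        if self.heap.len() < self.m {
            self.heap.push((hamming, doc_id));
            if self.heap.len() == self.m {
                self.worst = self.heap.peek().map(|&(h, _)| h).unwrap_or(u32::MAX);
            }
            return;
        }
        // Reject path: no heap peek, no tuple compare.
        if hamming >= self.worst {
            return;
        }
        self.heap.pop();
        self.heap.push((hamming, doc_id));
        self.worst = self.heap.peek().map(|&(h, _)| h).unwrap_or(u32::MAX);
    }

    /// Kept keys in ascending (hamming, doc_id) order.
    fn into_sorted(self) -> Vec<(u32, u32)> {
        self.heap.into_sorted_vec()
    }
}

/// Reference oracle: full sort by (hamming, doc_id), then truncate to `m`.
fn top_m_by_full_sort(hammings: &[u32], m: usize) -> Vec<(u32, u32)> {
    let mut all: Vec<(u32, u32)> = hammings
        .iter()
        .enumerate()
        .map(|(doc_id, &h)| (h, doc_id as u32))
        .collect();
    all.sort_unstable();
    all.truncate(m);
    all
}

/// Streamed selection: one pass over the docs in ascending doc-id order.
fn top_m_streamed(hammings: &[u32], m: usize) -> Vec<(u32, u32)> {
    let mut collector = TopM::new(m);
    for (doc_id, &h) in hammings.iter().enumerate() {
        collector.offer(h, doc_id as u32);
    }
    collector.into_sorted()
}
```

Comparing `top_m_streamed` against `top_m_by_full_sort` on tie-heavy input is a small-scale version of the oracle check the entry mentions; the sketch keeps a separate filling branch so that the cached bound is only consulted once the heap is full.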
+
 ## 0.5.0 - 2026-06-19
 ### Security
diff --git a/Cargo.lock b/Cargo.lock
index ddec2dc5..0fe3b9f2 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -844,7 +844,7 @@ checksum = "384b8ab6d37215f3c5301a95a4accb5d64aa607f1fcb26a11b5303878451b4fe"
 [[package]]
 name = "ordvec"
-version = "0.5.0"
+version = "0.6.0"
 dependencies = [
  "rand 0.10.1",
  "rand_chacha 0.10.0",
@@ -854,14 +854,14 @@ dependencies = [
 [[package]]
 name = "ordvec-ffi"
-version = "0.5.0"
+version = "0.6.0"
 dependencies = [
  "ordvec",
 ]
 [[package]]
 name = "ordvec-manifest"
-version = "0.5.0"
+version = "0.6.0"
 dependencies = [
  "chrono",
  "clap",
@@ -877,7 +877,7 @@ dependencies = [
 [[package]]
 name = "ordvec-manifest-python"
-version = "0.5.0"
+version = "0.6.0"
 dependencies = [
  "ordvec-manifest",
  "pyo3",
@@ -887,7 +887,7 @@ dependencies = [
 [[package]]
 name = "ordvec-python"
-version = "0.5.0"
+version = "0.6.0"
 dependencies = [
  "numpy",
  "ordvec",
diff --git a/Cargo.toml b/Cargo.toml
index 065d8e2c..4c3ee3f9 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "ordvec"
-version = "0.5.0"
+version = "0.6.0"
 edition = "2021"
 rust-version = "1.89" # AVX-512 intrinsics stabilized in 1.89.0; also clears the 1.87 floor from u64::is_multiple_of
 description = "Training-free ordinal & sign quantization for vector retrieval"
diff --git a/README.md b/README.md
index 76910e5d..0442cd96 100644
--- a/README.md
+++ b/README.md
@@ -38,9 +38,10 @@ append-friendly, and graph-optional.
 > trec-covid run below; the harness also supports nfcorpus and fiqa. ordvec wins
 > single-query latency against exact `flat` on the committed 171K-doc run and on
 > operability (no build, no tuning, append-only); in the committed default-method
-> threaded view, HNSW still wins highly-parallel batched serving. Larger-corpus
-> and alternate-encoder studies are active research, not public release claims
-> until their artifacts land in this repository.**
+> threaded view, HNSW still leads highly-parallel batched serving, though 0.6.0's
+> once-per-call corpus streaming narrowed that margin (see the threaded view
+> below). Larger-corpus and alternate-encoder studies are active research, not
+> public release claims until their artifacts land in this repository.**
 **Public evidence snapshot.** The load-bearing result in this README is narrower than the research backlog: Harrier-Q8 embeddings on public BEIR data, scored
@@ -60,8 +61,8 @@ and the gap widens over the committed subsampling sweep:
 ![ordvec speedup over exact search grows with corpus size](https://raw.githubusercontent.com/Project-Navi/ordvec/main/benchmarks/beir/figures/scaling_curve.png)
 - **~100× faster than exact `flat`, single query, at 171K docs.** Single-query
-  latency: exact `flat` 56 ms vs ordvec `Sign→rq2` **0.53 ms** — the gap over `flat`
-  grows with the corpus (it is ~5× at 1K docs).
+  latency: exact `flat` 52.4 ms vs ordvec `Sign→rq2` **0.52 ms (≈101×)** — the gap
+  over `flat` grows with the corpus (it is ~4.4× at 1K docs).
 - **8–16× smaller for the reported qrel rows.** The b=2 rank code is 256 B/vector
   (16× smaller than 4096 B floats), b=4 is 512 B (8×), and the reported two-stage
   `sign→rq2` row accounts for both stage-1 sign codes and the RankQuant reranker
@@ -70,8 +71,10 @@ and the gap widens over the committed subsampling sweep:
   followed by RankQuant b=2 rerank. At **nDCG@10 within bootstrap noise of exact** (on trec-covid the ordinal rows even edge ahead; see [Benchmarks](#benchmarks)).
 - **vs HNSW (the honest public scale story).** On the committed trec-covid run,
-  ordvec wins single-query latency while HNSW wins the highly-parallel threaded
-  view. That is the public comparison here. At larger corpora, graph or shard
+  ordvec wins single-query latency (≈3× at batch 1) while HNSW leads the
+  highly-parallel threaded view — by 1.6× over `sign→rq2` and 1.2× over
+  `bitmap→rq2` after 0.6.0's batched candidate generation (previously ≈2.3×).
+  That is the public comparison here. At larger corpora, graph or shard
   layers are the right comparison target; this README does not claim public million-scale HNSW crossover or GPU bandwidth numbers until the underlying run artifacts are committed.
@@ -209,7 +212,7 @@ Details in [`docs/RANK_MODES.md`](docs/RANK_MODES.md).
 ```toml
 [dependencies]
-ordvec = "0.5"
+ordvec = "0.6"
 # Or, to track unreleased `main`, use a git dependency instead:
 # ordvec = { git = "https://github.com/Project-Navi/ordvec" }
@@ -384,13 +387,13 @@ run; regenerate your own with `make benchmark-beir`.
 | Dataset | Method | Bytes/vec | nDCG@10 | Δ vs flat (95% CI) |
 |---|---|--:|--:|---|
-| scifact (5,183) | `flat` (exact) | 4096 | 0.7551 | (baseline) |
-| | `hnsw` M=32 | 4096 + graph | 0.7554 | +0.0003 * |
-| | **ordvec rq4** | **512** | **0.7549** | −0.0003 * |
-| | ordvec rq2 | 256 | 0.7471 | −0.0080 * |
-| | ordvec sign→rq2 | 384 | 0.7471 | −0.0080 * |
+| scifact (5,183) | `flat` (exact) | 4096 | 0.7559 | (baseline) |
+| | `hnsw` M=32 | 4096 + graph | 0.7573 | +0.0014 * |
+| | **ordvec rq4** | **512** | **0.7580** | +0.0021 * |
+| | ordvec rq2 | 256 | 0.7484 | −0.0075 * |
+| | ordvec sign→rq2 | 384 | 0.7484 | −0.0075 * |
 | trec-covid (171,332) | `flat` (exact) | 4096 | 0.7574 | (baseline) |
-| | `hnsw` M=32 | 4096 + graph | 0.7555 | −0.0019 * |
+| | `hnsw` M=32 | 4096 + graph | 0.7600 | +0.0026 * |
 | | ordvec rq2 | 256 | 0.7632 | +0.0057 * |
 | | **ordvec rq4** | **512** | **0.7636** | +0.0062 * |
 | | ordvec sign→rq2 | 384 | 0.7638 | +0.0064 * |
@@ -411,34 +414,38 @@ views (trec-covid, 171,332 docs, 1024-d):
 ![single-query latency bars](https://raw.githubusercontent.com/Project-Navi/ordvec/main/benchmarks/beir/figures/bars_single_thread.png)
-`flat` 56 ms → ordvec `sign→rq2` **0.53 ms (≈106×)**, `bitmap→rq2` 0.62 ms (≈91×),
-`hnsw` 1.5 ms (37×). The scaling curve [above](#benchmark-at-a-glance) is this
+`flat` 52.4 ms → ordvec `sign→rq2` **0.52 ms (≈101×)**, `bitmap→rq2` 0.58 ms (≈90×),
+`hnsw` 1.5 ms (≈34×). The scaling curve [above](#benchmark-at-a-glance) is this
 view swept over the committed subsamples — the speedup over `flat` grows across that public sweep.
 **2. Batched throughput (batch = 32, 1 thread)** — when many queries arrive at
-once, `flat`'s GEMM amortizes the corpus stream across the batch (56→4 ms),
-narrowing the gap: ordvec `sign→rq2`/`bitmap→rq2` stay ≈8–9.5× ahead.
+once, `flat`'s GEMM amortizes the corpus stream across the batch (52→3.8 ms).
+Since 0.6.0, ordvec's batched candidate generation amortizes the same way — the
+serial CSR path streams the corpus **once per call** instead of once per query
+(1.69× on the committed synthetic two-stage bench) — so `sign→rq2` 0.33 ms /
+`bitmap→rq2` 0.38 ms stay **≈10–12× ahead** of batched `flat`.
 **3. Many cores (batch = 32, 32 threads)** — everything parallelizes and the field compresses; `hnsw` threads best:
 ![threaded throughput bars](https://raw.githubusercontent.com/Project-Navi/ordvec/main/benchmarks/beir/figures/bars_threaded.png)
-`hnsw` 4.8× vs `flat`, ordvec `bitmap→rq2` 3.7×, `rq2` 2.5×, `sign→rq2` 2.1×.
+`hnsw` 4.9× vs `flat`, ordvec `bitmap→rq2` 4.0×, `sign→rq2` 3.1×, `rq2` 2.2×.
 This committed chart uses the default `sign-rq2` row, not the newer within-query-threaded `sign-rq2-threaded` probe row; regenerate public figures before using that probe for release claims. In this default-method view,
-**HNSW wins this regime** — by a hair on threaded throughput. The honest
+**HNSW still leads this regime** — 1.6× over `sign→rq2` (≈2.3× before 0.6.0's
+once-per-call corpus streaming) and 1.2× over `bitmap→rq2`. The honest
 ordvec-vs-HNSW tradeoff, all from this same run (trec-covid, 171,332 docs):
 | | HNSW M=32 | ordvec `sign→rq2` |
 |---|---|---|
-| threaded latency (32 threads, batch 32) | **0.23 ms** ✅ | 0.52 ms |
-| single-query latency (batch 1) | 1.52 ms | **0.53 ms** ✅ (~3×) |
+| threaded latency (32 threads, batch 32) | **0.20 ms** ✅ | 0.32 ms |
+| single-query latency (batch 1) | 1.52 ms | **0.52 ms** ✅ (~3×) |
 | index size / vector | 4096 B + graph | **256–384 B** ✅ (8–16× less) |
-| build time, 171K docs | **51.4 s** | **0.26 s** ✅ (training-free) |
-| nDCG@10 (trec-covid) | 0.7555 | **0.7638** ✅ |
+| build time, 171K docs | **47.1 s** | **0.21 s** ✅ (training-free) |
+| nDCG@10 (trec-covid) | 0.7600 | **0.7638** ✅ |
 So even where HNSW edges ahead on threaded latency, ordvec gets there with **no graph to build** (instant, training-free, and rebuilt for free when the corpus
diff --git a/RELEASING.md b/RELEASING.md
index 6cac74e3..f56df20c 100644
--- a/RELEASING.md
+++ b/RELEASING.md
@@ -173,6 +173,11 @@ the OIDC exchange (no risk of a bad publish; just a failed run).
    lockstep versions, MSRV/docs drift, registry metadata parity, Python classifier/URL parity, docs.rs feature policy, package contents, and release workflow invariants.
+   - **Downstream un-patch (one-time, 0.6.0):** OrdinalDB's workspace
+     `Cargo.toml` carries a `[patch.crates-io]` block pointing `ordvec` and
+     `ordvec-manifest` at this repo's `integration/full-stack` git branch.
+     When 0.6.0 publishes, that block must be removed so OrdinalDB consumes
+     the published crates.io releases instead of the pre-release git branch.
 4. Confirm CI is **green for current `main` HEAD**.
`require-ci-green` checks `main` HEAD's SHA — which needs a **completed, successful** (not `cancelled`, not in-progress) run of `ci.yml`, `python.yml`, `fuzz.yml`,
diff --git a/THREAT_MODEL.md b/THREAT_MODEL.md
index 9fa2344b..a096f292 100644
--- a/THREAT_MODEL.md
+++ b/THREAT_MODEL.md
@@ -1,6 +1,6 @@
 # Threat Model — `ordvec`
-> **Status:** v0.5.0 (pre-1.0), 2026-06-15. This is the maintained threat model
+> **Status:** v0.6.0 (pre-1.0), 2026-06-15. This is the maintained threat model
 > for the `ordvec` Rust crate, C ABI, Go wrapper, PyO3/maturin Python bindings,
 > and the `ordvec-manifest` sidecar verifier. It is reviewed when the
 > attack surface changes (new persistence formats, new `unsafe` kernels, new
@@ -397,6 +397,19 @@ enforce service-level quotas — by design (it is a library, not a server).
 batch size, `k`, request rate, and corpus size; a configurable `max_nq` / `max_k` at the binding level is a possible future convenience.
+**THREAT-QUERY-003 (P2): Artifact read bounds are derived, not flat.**
+Verification bounds every artifact read by its manifest-declared
+`file_size_bytes` (the manifest itself is hard-capped at 1 MiB before JSON
+parsing, and SHA-256 pins artifact content); manifest creation bounds reads
+by the artifact's observed size. Bounded hashing streams with constant
+memory, so a hostile manifest cannot cause unbounded memory growth — but it
+CAN still cause I/O and CPU proportional to the byte size it declares and
+actually supplies on disk. The flat `ResourceLimits` byte caps are opt-in
+ceilings (unbounded by default) for deployments that must bound worst-case
+verification time on attacker-supplied bundles. A `VerifiedLoadPlan` remains
+a verification snapshot, not a byte pin: bytes can change between
+verification and use by a local actor with write access (see scope).
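A rough sketch of the derived read bound described in THREAT-QUERY-003, using only the standard library. `stream_bounded`, `BoundedRead`, and the `update` callback are hypothetical names, not the `ordvec-manifest` API; the callback stands in for a streaming SHA-256 update so that the example needs no external crate.

```rust
use std::io::{self, ErrorKind, Read};

/// Hypothetical outcome of a read bounded by a declared size.
#[derive(Debug, PartialEq)]
enum BoundedRead {
    /// Exactly `declared` bytes were streamed to the digest callback.
    Complete(u64),
    /// The source held more bytes than declared; reading stopped early.
    TooLarge,
    /// The source ended after this many bytes, short of the declaration.
    Truncated(u64),
}

/// Streams `src` in fixed-size chunks, never accepting more than `declared`
/// bytes. Memory use stays constant (one 8 KiB buffer) regardless of size.
fn stream_bounded<R: Read>(
    mut src: R,
    declared: u64,
    mut update: impl FnMut(&[u8]),
) -> io::Result<BoundedRead> {
    let mut buf = [0u8; 8192];
    let mut seen: u64 = 0;
    loop {
        let n = match src.read(&mut buf) {
            Ok(0) => break,
            Ok(n) => n as u64,
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        // `seen <= declared` always holds here, so the subtraction cannot underflow.
        if n > declared - seen {
            // Fail fast at the bound instead of hashing the excess.
            return Ok(BoundedRead::TooLarge);
        }
        update(&buf[..n as usize]);
        seen += n;
    }
    Ok(if seen == declared {
        BoundedRead::Complete(seen)
    } else {
        BoundedRead::Truncated(seen)
    })
}
```

In this sketch the I/O and CPU cost is still proportional to the declared size, which matches the residual risk the entry describes; a deployment that needs a hard ceiling would compare the declared size against its own limit before starting the read.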
+
 **THREAT-QUERY-002 (P3): Panic on contract violation in Rust server contexts.**
 Rust APIs fail fast on invalid contract input (non-finite floats, dimension / shape violations) via `assert!` / `expect`. In a Rust-native server an
diff --git a/benchmarks/beir/figures/bars_single_thread.png b/benchmarks/beir/figures/bars_single_thread.png
index 5fb4b371..f989c60c 100644
Binary files a/benchmarks/beir/figures/bars_single_thread.png and b/benchmarks/beir/figures/bars_single_thread.png differ
diff --git a/benchmarks/beir/figures/bars_threaded.png b/benchmarks/beir/figures/bars_threaded.png
index 8f14a291..0b1c2bcc 100644
Binary files a/benchmarks/beir/figures/bars_threaded.png and b/benchmarks/beir/figures/bars_threaded.png differ
diff --git a/benchmarks/beir/figures/scaling_curve.png b/benchmarks/beir/figures/scaling_curve.png
index b771a452..cae50363 100644
Binary files a/benchmarks/beir/figures/scaling_curve.png and b/benchmarks/beir/figures/scaling_curve.png differ
diff --git a/benchmarks/rank_modes_results.txt b/benchmarks/rank_modes_results.txt
index da96d2d5..8afe66e9 100644
--- a/benchmarks/rank_modes_results.txt
+++ b/benchmarks/rank_modes_results.txt
@@ -12,7 +12,7 @@
 # Corpus: SYNTHETIC low-rank clustered corpus, seed = 1 (CORPUS_SEED), in-process.
 # Config: dim=256 n=30000 queries=200 k=10 (the self-contained default).
 # Hardware class: x86_64 desktop, AMD Ryzen 9 9950X (AVX-512), 32 rayon threads.
-# Toolchain: rustc 1.95.0, release profile (opt-level 3 + LTO, codegen-units 1).
+# Toolchain: rustc 1.95.0 (59807616e 2026-04-14), release profile (opt-level 3 + LTO, codegen-units 1).
 #
 # DETERMINISM: the QUALITY columns are seeded and bit-identical run-to-run on
 # the same machine — verified by two back-to-back runs (R@10, CR, bytes/vec,
@@ -51,34 +51,38 @@
 # --corpus-npy /path/to/corpus.npy --queries-npy /path/to/queries.npy
 # ===========================================================================
+# Refresh note (0.6.0 batch work): the per-query latency rows in this table
+# measure SINGLE-QUERY single-thread scans — paths intentionally unchanged
+# by the 0.6.0 batched candidate-generation rework (verified: identical
+# within noise between the pre- and post-rework code on the same toolchain
+# and machine). The encode columns improved (parallel finite validation +
+# scratch-based rank encode). Batched candidate generation improvements are
+# measured by examples/two_stage_bench (see
+# two_stage_caller_owned_dim1024.txt: stage-1 1.69x on the committed
+# workload) — they do not appear in this single-query table by design.
+
 target arch x86_64 / opt-level 3 + lto (release profile)
-x86_64 features detected: sse4.2, avx2, fma, avx512f, avx512bw, avx512vl
-rayon threads = 32 (encode + brute-force GT are parallelised; per-query latency rows measure single-thread scan)
-generating low-rank clustered corpus (clusters=200, latent=64) ...
-  done in 0.17s (seed=1, self-contained)
-bench_rank: dim=256 n=30000 queries=200 k=10
-FP32 brute-force ground truth ...
-  done in 0.03s
+
 mode bytes/vec total MiB encode v/s p50 ms p99 ms GiB/s ns/dim Mdocs/s scan R@10
 ------------------------------------------------------------------------------------------------------------------------------------
-RankIndex sym 512 14.6 4559550 3.959 4.379 3.61 0.515 7.58 0.7825
-RankIndex asym 512 14.6 4559550 3.712 4.012 3.85 0.483 8.08 0.8450
-RankQuant b=2 sym 64 1.8 5251083 2.534 2.761 0.71 0.330 11.84 0.4660
-RankQuant b=2 asym 64 1.8 5251083 0.238 0.245 7.51 0.031 125.94 0.5715
-RankQuant b=2 asym byte-LUT 64 1.8 5095754 0.754 0.764 2.37 0.098 39.78 0.5715
-RankQuant b=2 fastscan 128 3.7 283630 0.090 0.093 39.69 0.012 332.93 0.5700
-RankQuant b=4 sym 128 3.7 5205223 2.634 2.885 1.36 0.343 11.39 0.7475
-RankQuant b=4 asym 128 3.7 5205223 0.313 0.317 11.42 0.041 95.79 0.8055
-RankQuant b=4 asym byte-LUT 128 3.7 5324938 1.644 1.662 2.18 0.214 18.25 0.8055
-RankQuant b=1 sym 32 0.9 5523695 2.467 2.745 0.36 0.321 12.16 0.2785
-RankQuant b=1 asym 32 0.9 5523695 2.446 2.478 0.37 0.318 12.26 0.3470
-Bitmap n_top=64 32 0.9 5576810 0.081 0.084 11.02 0.011 369.67 0.2480
-SignBitmap probe 32 0.9 19641040 0.091 0.099 9.81 0.012 329.12 0.2880
-TwoStage b=2 M=100 CR=0.976 96 2.7 2689552 0.098 0.107 27.45 0.013 306.99 0.5700
-TwoStage b=2 M=500 CR=1.000 96 2.7 2669862 0.109 0.122 24.62 0.014 275.39 0.5715
-TwoStage b=2 M=1000 CR=1.000 96 2.7 2742585 0.122 0.135 21.90 0.016 244.94 0.5715
-TwoStage b=2 M=5000 CR=1.000 96 2.7 2674849 0.240 0.253 11.19 0.031 125.10 0.5715
-SignTwoStage b=2 M=500 CR=1.000 96 2.7 4038493 0.106 0.114 25.37 0.014 283.74 0.5715
+Rank sym 512 14.6 4175858 3.727 4.330 3.84 0.485 8.05 0.7805
+Rank asym 512 14.6 4175858 3.537 4.008 4.04 0.461 8.48 0.8330
+RankQuant b=2 sym 64 1.8 4246863 2.535 3.063 0.71 0.330 11.84 0.4555
+RankQuant b=2 asym 64 1.8 4246863 0.297 0.313 6.01 0.039 100.85 0.5785
+RankQuant b=2 asym byte-LUT 64 1.8 4399340 0.630 0.785 2.84 0.082 47.65 0.5785
+RankQuant b=2 fastscan 128 3.7 249604 0.109 0.112 32.90 0.014 275.98 0.5845
+RankQuant b=4 sym 128 3.7 4221293 2.602 3.182 1.37 0.339 11.53 0.7425
+RankQuant b=4 asym 128 3.7 4221293 0.373 0.508 9.59 0.049 80.48 0.8095
+RankQuant b=4 asym byte-LUT 128 3.7 4299550 1.217 1.693 2.94 0.158 24.65 0.8095
+RankQuant b=1 sym 32 0.9 4435777 2.400 2.815 0.37 0.313 12.50 0.2890
+RankQuant b=1 asym 32 0.9 4435777 2.394 2.801 0.37 0.312 12.53 0.3790
+Bitmap n_top=64 32 0.9 4260526 0.064 0.066 14.02 0.008 470.51 0.2495
+SignBitmap probe 32 0.9 14021070 0.045 0.053 19.81 0.006 664.67 0.2745
+TwoStage b=2 M=100 CR=0.978 96 2.7 2356573 0.053 0.059 51.04 0.007 570.90 0.5795
+TwoStage b=2 M=500 CR=1.000 96 2.7 2462261 0.067 0.081 39.87 0.009 445.98 0.5785
+TwoStage b=2 M=1000 CR=1.000 96 2.7 2256172 0.082 0.091 32.78 0.011 366.64 0.5785
+TwoStage b=2 M=5000 CR=1.000 96 2.7 2339041 0.186 0.198 14.39 0.024 160.97 0.5785
+SignTwoStage b=2 M=500 CR=1.000 96 2.7 3605670 0.064 0.074 41.71 0.008 466.56 0.5785
-{"dim":256,"n":30000,"queries":200,"k":10,"rows":[{"name":"RankIndex sym","bytes_per_vec":512,"total_mib":14.648,"encode_vps":4559549.8,"p50_ms":3.9589,"p99_ms":4.3786,"gib_per_sec":3.613,"ns_per_dim":0.5155,"docs_per_sec":7577944.8,"recall_at_10_vs_fp32":0.7825},{"name":"RankIndex asym","bytes_per_vec":512,"total_mib":14.648,"encode_vps":4559549.8,"p50_ms":3.7118,"p99_ms":4.0118,"gib_per_sec":3.854,"ns_per_dim":0.4833,"docs_per_sec":8082284.1,"recall_at_10_vs_fp32":0.8450},{"name":"RankQuant b=2 sym","bytes_per_vec":64,"total_mib":1.831,"encode_vps":5251083.2,"p50_ms":2.5342,"p99_ms":2.7609,"gib_per_sec":0.706,"ns_per_dim":0.3300,"docs_per_sec":11837877.9,"recall_at_10_vs_fp32":0.4660},{"name":"RankQuant b=2 asym","bytes_per_vec":64,"total_mib":1.831,"encode_vps":5251083.2,"p50_ms":0.2382,"p99_ms":0.2448,"gib_per_sec":7.507,"ns_per_dim":0.0310,"docs_per_sec":125940354.6,"recall_at_10_vs_fp32":0.5715},{"name":"RankQuant b=2 asym
byte-LUT","bytes_per_vec":64,"total_mib":1.831,"encode_vps":5095754.3,"p50_ms":0.7542,"p99_ms":0.7642,"gib_per_sec":2.371,"ns_per_dim":0.0982,"docs_per_sec":39777827.6,"recall_at_10_vs_fp32":0.5715},{"name":"RankQuant b=2 fastscan","bytes_per_vec":128,"total_mib":3.664,"encode_vps":283630.2,"p50_ms":0.0901,"p99_ms":0.0926,"gib_per_sec":39.688,"ns_per_dim":0.0117,"docs_per_sec":332930118.0,"recall_at_10_vs_fp32":0.5700},{"name":"RankQuant b=4 sym","bytes_per_vec":128,"total_mib":3.662,"encode_vps":5205222.9,"p50_ms":2.6344,"p99_ms":2.8850,"gib_per_sec":1.358,"ns_per_dim":0.3430,"docs_per_sec":11387896.0,"recall_at_10_vs_fp32":0.7475},{"name":"RankQuant b=4 asym","bytes_per_vec":128,"total_mib":3.662,"encode_vps":5205222.9,"p50_ms":0.3132,"p99_ms":0.3165,"gib_per_sec":11.419,"ns_per_dim":0.0408,"docs_per_sec":95788804.8,"recall_at_10_vs_fp32":0.8055},{"name":"RankQuant b=4 asym byte-LUT","bytes_per_vec":128,"total_mib":3.662,"encode_vps":5324938.4,"p50_ms":1.6437,"p99_ms":1.6621,"gib_per_sec":2.176,"ns_per_dim":0.2140,"docs_per_sec":18251816.7,"recall_at_10_vs_fp32":0.8055},{"name":"RankQuant b=1 sym","bytes_per_vec":32,"total_mib":0.916,"encode_vps":5523695.1,"p50_ms":2.4667,"p99_ms":2.7455,"gib_per_sec":0.362,"ns_per_dim":0.3212,"docs_per_sec":12161849.9,"recall_at_10_vs_fp32":0.2785},{"name":"RankQuant b=1 asym","bytes_per_vec":32,"total_mib":0.916,"encode_vps":5523695.1,"p50_ms":2.4461,"p99_ms":2.4776,"gib_per_sec":0.366,"ns_per_dim":0.3185,"docs_per_sec":12264561.3,"recall_at_10_vs_fp32":0.3470},{"name":"Bitmap n_top=64","bytes_per_vec":32,"total_mib":0.916,"encode_vps":5576810.4,"p50_ms":0.0812,"p99_ms":0.0838,"gib_per_sec":11.017,"ns_per_dim":0.0106,"docs_per_sec":369672100.8,"recall_at_10_vs_fp32":0.2480},{"name":"SignBitmap probe","bytes_per_vec":32,"total_mib":0.916,"encode_vps":19641040.3,"p50_ms":0.0912,"p99_ms":0.0985,"gib_per_sec":9.809,"ns_per_dim":0.0119,"docs_per_sec":329124200.5,"recall_at_10_vs_fp32":0.2880},{"name":"TwoStage b=2 M=100 
CR=0.976","bytes_per_vec":96,"total_mib":2.747,"encode_vps":2689552.2,"p50_ms":0.0977,"p99_ms":0.1074,"gib_per_sec":27.447,"ns_per_dim":0.0127,"docs_per_sec":306987024.7,"recall_at_10_vs_fp32":0.5700},{"name":"TwoStage b=2 M=500 CR=1.000","bytes_per_vec":96,"total_mib":2.747,"encode_vps":2669861.7,"p50_ms":0.1089,"p99_ms":0.1216,"gib_per_sec":24.622,"ns_per_dim":0.0142,"docs_per_sec":275393583.3,"recall_at_10_vs_fp32":0.5715},{"name":"TwoStage b=2 M=1000 CR=1.000","bytes_per_vec":96,"total_mib":2.747,"encode_vps":2742584.6,"p50_ms":0.1225,"p99_ms":0.1347,"gib_per_sec":21.899,"ns_per_dim":0.0159,"docs_per_sec":244937949.1,"recall_at_10_vs_fp32":0.5715},{"name":"TwoStage b=2 M=5000 CR=1.000","bytes_per_vec":96,"total_mib":2.747,"encode_vps":2674848.6,"p50_ms":0.2398,"p99_ms":0.2534,"gib_per_sec":11.185,"ns_per_dim":0.0312,"docs_per_sec":125103731.8,"recall_at_10_vs_fp32":0.5715},{"name":"SignTwoStage b=2 M=500 CR=1.000","bytes_per_vec":96,"total_mib":2.747,"encode_vps":4038492.8,"p50_ms":0.1057,"p99_ms":0.1143,"gib_per_sec":25.369,"ns_per_dim":0.0138,"docs_per_sec":283744289.6,"recall_at_10_vs_fp32":0.5715}]}
+JSON:
diff --git a/benchmarks/two_stage_caller_owned_dim1024.txt b/benchmarks/two_stage_caller_owned_dim1024.txt
index 765a3a65..f4399ebd 100644
--- a/benchmarks/two_stage_caller_owned_dim1024.txt
+++ b/benchmarks/two_stage_caller_owned_dim1024.txt
@@ -2,20 +2,26 @@ Caller-owned serial two-stage decomposition — Harrier-1024 shape (SYNTHETIC co
 Reproduce: cargo run --release --example two_stage_bench -- --dim 1024 --n 50000 --queries 200 --m 256 --k 10 --reps 15
 Host: AMD Ryzen 9 9950X (Zen5), AVX-512 VPOPCNTDQ, single core (taskset -c 12), single-thread.
+Toolchain: rustc 1.95.0 (59807616e 2026-04-14), release profile.
 dim=1024 n=50000 queries=200 m=256 k=10 bits=2 out_k=10 candidates=51200 reps=15
-  1. stage-1 candidate gen (CSR) 31.920 ms 6265.59 q/s 159.60 us/query
-  2. single-query rerank loop 2.086 ms 95858.02 q/s 10.43 us/query
-  3. batched rerank _into 2.031 ms 98463.67 q/s 10.16 us/query
-  4. full two-stage (1+3) 34.485 ms 5799.70 q/s 172.42 us/query
-  rerank speedup (batched _into vs single-query loop): 1.03x
+  (dim % 64 == 0: AVX-512 tier eligible when supported)
+  1. stage-1 candidate gen (CSR) 18.920 ms 10570.68 q/s 94.60 us/query
+  2. single-query rerank loop 1.807 ms 110656.07 q/s 9.04 us/query
+  3. batched rerank _into 1.780 ms 112367.69 q/s 8.90 us/query
+  4. full two-stage (1+3) 20.750 ms 9638.68 q/s 103.75 us/query
+  rerank speedup (batched _into vs single-query loop): 1.02x
-Interpretation (no-fiction): at dim=1024 the rerank stage is a small slice
-(~10 us/query) of an already-stage-1-dominated two-stage cost (~160 us/query);
-the batched _into form is on par with the single-query loop SINGLE-THREADED
-(~1.03x). The caller-owned serial primitives are NOT a single-thread speedup —
-their value is (a) allocation-free steady state (tests/alloc_free.rs proves 0
-heap allocations on a warmed _into call) and (b) caller-owned parallelism: no
-internal rayon, so a DB/runtime can drive the _into form across its own bounded
-pool (GIL released) one query-range per worker. This dim=1024 result is its own
-mechanism; it is NOT explained by the SignBitmap AVX-tail dim=768 result.
+Interpretation (no-fiction): stage-1 candidate generation now streams the
+corpus ONCE per call in L2-sized doc blocks with bounded (hamming, doc_id)
+collectors — 94.60 us/query vs 159.60 us/query for the same command on the
+same host and pinning before the 0.6.0 batch work (1.69x; full two-stage
+1.66x). Output is bit-identical (oracle-pinned in tests/tiled_candgen.rs).
+The rerank stage is unchanged in design and remains a small slice
+(~9 us/query). The caller-owned serial primitives still do NOT enter rayon
+for scan/selection — a DB/runtime drives the _into form across its own pool
+(input finite-validation of large query buffers may briefly use the global
+pool; order-independent and deterministic). Their value remains (a)
+allocation-free steady state and (b) caller-owned parallelism; at dim=1024
+the call-level scan sharing is now the dominant win and grows with the
+batch size per call.
diff --git a/fuzz/Cargo.lock b/fuzz/Cargo.lock
index 46d6639e..2f9e2cde 100644
--- a/fuzz/Cargo.lock
+++ b/fuzz/Cargo.lock
@@ -231,7 +231,7 @@ checksum = "9f7c3e4beb33f85d45ae3e3a1792185706c8e16d043238c593331cc7cd313b50"
 [[package]]
 name = "ordvec"
-version = "0.5.0"
+version = "0.6.0"
 dependencies = [
  "rayon",
 ]
diff --git a/ordvec-ffi/Cargo.toml b/ordvec-ffi/Cargo.toml
index 5dc9b78c..177a92ba 100644
--- a/ordvec-ffi/Cargo.toml
+++ b/ordvec-ffi/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "ordvec-ffi"
-version = "0.5.0"
+version = "0.6.0"
 edition = "2021"
 rust-version = "1.89"
 publish = false
diff --git a/ordvec-manifest-python/Cargo.toml b/ordvec-manifest-python/Cargo.toml
index 6b70394c..490ef708 100644
--- a/ordvec-manifest-python/Cargo.toml
+++ b/ordvec-manifest-python/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "ordvec-manifest-python"
-version = "0.5.0"
+version = "0.6.0"
 edition = "2021"
 rust-version = "1.89"
 description = "Python bindings for ordvec-manifest index provenance verification"
diff --git a/ordvec-manifest-python/pyproject.toml b/ordvec-manifest-python/pyproject.toml
index 090b4b75..1e2f0998 100644
--- a/ordvec-manifest-python/pyproject.toml
+++ b/ordvec-manifest-python/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "maturin"
 [project]
 name = "ordvec-manifest"
-version = "0.5.0"
+version = "0.6.0"
 description = "Python bindings for ordvec index manifest verification"
 readme = "README.md"
 requires-python = ">=3.10"
diff --git a/ordvec-manifest-python/python/ordvec_manifest/__init__.py b/ordvec-manifest-python/python/ordvec_manifest/__init__.py
index 20e77dd9..4d9790c5 100644
--- a/ordvec-manifest-python/python/ordvec_manifest/__init__.py
+++ b/ordvec-manifest-python/python/ordvec_manifest/__init__.py
@@ -50,4 +50,4 @@
     "create_manifest",
 ]
-__version__ = "0.5.0"
+__version__ = "0.6.0"
diff --git a/ordvec-manifest/Cargo.toml b/ordvec-manifest/Cargo.toml
index b00029a7..88c6de0b 100644
--- a/ordvec-manifest/Cargo.toml
+++ b/ordvec-manifest/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "ordvec-manifest"
-version = "0.5.0"
+version = "0.6.0"
 edition = "2021"
 rust-version = "1.89"
 license = "MIT OR Apache-2.0"
@@ -29,7 +29,7 @@ required-features = ["cli"]
 chrono = { version = "0.4.44", default-features = false, features = ["clock", "std"] }
 clap = { version = "4.6.1", features = ["derive"], optional = true }
 hex = "0.4.3"
-ordvec = { version = "0.5.0", path = ".." }
+ordvec = { version = "0.6.0", path = ".." }
 rusqlite = { version = "0.40.0", optional = true }
 serde = { version = "1.0", features = ["derive"] }
 serde_json = "1.0"
diff --git a/ordvec-manifest/README.md b/ordvec-manifest/README.md
index 1e3952bb..6a58c54e 100644
--- a/ordvec-manifest/README.md
+++ b/ordvec-manifest/README.md
@@ -154,11 +154,18 @@ Stable limit codes are part of the contract:
   (`row_identity_duplicate_tracking_limit_exceeded`);
 - auxiliary artifact declarations: 1,024
   (`auxiliary_artifact_count_limit_exceeded`);
-- auxiliary artifact bytes per declared file: 64 MiB
+- auxiliary artifact bytes per declared file: bounded by the
+  manifest-declared `file_size_bytes` on verify and by the observed file
+  size on create; the flat cap is an opt-in ceiling, unbounded by default
   (`auxiliary_artifact_file_too_large`);
-- calibration profile artifact bytes: 64 MiB
+- primary index artifact bytes: bounded by the manifest-declared
+  `file_size_bytes` on verify; the flat cap is an opt-in ceiling, unbounded
+  by default (`artifact_file_too_large`);
+- calibration profile artifact bytes: bounded by the declared
+  `file_size_bytes`; flat cap opt-in, unbounded by default
   (`calibration_profile_too_large`);
-- encoder distortion profile artifact bytes: 64 MiB
+- encoder distortion profile artifact bytes: bounded by the declared
+  `file_size_bytes`; flat cap opt-in, unbounded by default
   (`encoder_distortion_profile_too_large`);
 - collected report issues: 1,024, after which a `verification_report_issue_limit_exceeded` issue is emitted;
@@ -168,7 +175,7 @@ The CLI exposes matching override flags on `inspect`, `verify`, `create`,
 `sqlite verify`, and `sqlite activate`: `--max-manifest-bytes`, `--max-row-map-line-bytes`, `--max-row-map-rows`, `--max-row-map-tracked-id-bytes`, `--max-auxiliary-artifacts`,
-`--max-auxiliary-artifact-bytes`,
+`--max-auxiliary-artifact-bytes`, `--max-index-artifact-bytes`,
 `--max-calibration-profile-bytes`, `--max-encoder-distortion-profile-bytes`, `--max-report-issues`, and `--max-cached-report-bytes`. Library callers can override the same ceilings
@@ -184,6 +191,7 @@ Stable limit codes:
 | row-identity duplicate-tracking `db_id` bytes | `row_identity_duplicate_tracking_limit_exceeded` | `row_identity_duplicate_tracking_limit_exceeded` |
 | auxiliary artifact declarations | `auxiliary_artifact_count_limit_exceeded` | n/a |
 | auxiliary artifact bytes per declared file | `auxiliary_artifact_file_too_large` | n/a |
+| primary index artifact bytes | `artifact_file_too_large` | n/a |
 | calibration profile artifact bytes | `calibration_profile_too_large` | n/a |
 | encoder distortion profile artifact bytes | `encoder_distortion_profile_too_large` | n/a |
 | collected verification report issues | `verification_report_issue_limit_exceeded` | n/a |
diff --git a/ordvec-manifest/src/lib.rs b/ordvec-manifest/src/lib.rs
index be25a5f1..5a5ec4de 100644
--- a/ordvec-manifest/src/lib.rs
+++ b/ordvec-manifest/src/lib.rs
@@ -36,9 +36,14 @@ pub const DEFAULT_MAX_ROW_IDENTITY_JSONL_LINE_BYTES: usize = 64 * 1024;
 pub const DEFAULT_MAX_ROW_IDENTITY_ROWS: usize = 10_000_000;
 pub const DEFAULT_MAX_ROW_IDENTITY_TRACKED_DB_ID_BYTES: usize = 64 * 1024 * 1024;
 pub const DEFAULT_MAX_AUXILIARY_ARTIFACTS: usize = 1024;
-pub const DEFAULT_MAX_AUXILIARY_ARTIFACT_BYTES: u64 = 64 * 1024 * 1024;
-pub const DEFAULT_MAX_CALIBRATION_PROFILE_BYTES: u64 = 64 * 1024 * 1024;
-pub const DEFAULT_MAX_ENCODER_DISTORTION_PROFILE_BYTES: u64 = 64 * 1024 * 1024;
+/// Artifact-file reads are bounded by the manifest-declared size on verify
+/// and by the observed file size on create; these flat caps are opt-in
+/// ceilings and default to unbounded. Streaming hashing keeps memory
+/// constant regardless of artifact size.
+pub const DEFAULT_MAX_AUXILIARY_ARTIFACT_BYTES: u64 = u64::MAX;
+pub const DEFAULT_MAX_INDEX_ARTIFACT_BYTES: u64 = u64::MAX;
+pub const DEFAULT_MAX_CALIBRATION_PROFILE_BYTES: u64 = u64::MAX;
+pub const DEFAULT_MAX_ENCODER_DISTORTION_PROFILE_BYTES: u64 = u64::MAX;
 pub const DEFAULT_MAX_REPORT_ISSUES: usize = 1024;
 pub const DEFAULT_MAX_CACHED_REPORT_BYTES: u64 = 4 * 1024 * 1024;
@@ -253,7 +258,19 @@ fn verify_manifest_with_path_capture(
     ) {
         paths.artifact_path = Some(resolved.canonical_path.clone());
         report.artifact.canonical_path = Some(path_to_display(&resolved.canonical_path));
-        match sha256_file(&resolved.canonical_path) {
+        // Bound the read by the manifest-declared size: a primary artifact
+        // larger than its declaration fails fast instead of being hashed in
+        // full (the read was previously unbounded).
+            match sha256_file_bounded(
+                &resolved.canonical_path,
+                document
+                    .manifest
+                    .artifact
+                    .file_size_bytes
+                    .min(options.limits.max_index_artifact_bytes),
+                "artifact_file_too_large",
+                "index artifact",
+            ) {
                 Ok(hash) => {
                     report.artifact.sha256 = Some(hash.sha256.clone());
                     report.artifact.size_bytes = Some(hash.size_bytes);
@@ -276,6 +293,7 @@ fn verify_manifest_with_path_capture(
                         );
                     }
                 }
+                Err(ManifestError::LimitExceeded { code, message }) => report.error(code, message),
                 Err(err) => report.error(
                     "artifact_hash_failed",
                     format!("failed to hash artifact: {err}"),
@@ -345,6 +363,12 @@ fn validate_manifest_shape(
             "artifact.sha256 must be a lowercase 64-character hex SHA-256 digest",
         );
     }
+    if manifest.artifact.file_size_bytes == 0 {
+        report.error(
+            "artifact_file_size_zero",
+            "artifact.file_size_bytes must be greater than zero",
+        );
+    }
     if manifest.artifact.bytes_per_vec == 0 {
         report.error(
             "artifact_bytes_per_vec_zero",
@@ -547,6 +571,17 @@ fn validate_auxiliary_artifact_shape(
                 ),
             );
         }
+        // Optional artifacts may legitimately be declared absent with a
+        // zero-size placeholder (see `AuxiliaryArtifactState::OptionalAbsent`);
+        // only required declarations must carry a real size.
+        if artifact.required && artifact.file_size_bytes == 0 {
+            report.error(
+                "auxiliary_artifact_file_size_zero",
+                format!(
+                    "required auxiliary artifact {name:?} file_size_bytes must be greater than zero"
+                ),
+            );
+        }
     }
 }
 
@@ -1223,7 +1258,9 @@ fn validate_encoder_distortion_profile_artifact(
         Some(path_to_display(&resolved.canonical_path));
     match sha256_file_bounded(
         &resolved.canonical_path,
-        options.limits.max_encoder_distortion_profile_bytes,
+        profile
+            .file_size_bytes
+            .min(options.limits.max_encoder_distortion_profile_bytes),
         "encoder_distortion_profile_too_large",
         "encoder distortion profile",
     ) {
@@ -1669,7 +1706,9 @@ fn validate_calibration_profile(
         Some(path_to_display(&resolved.canonical_path));
     match sha256_file_bounded(
         &resolved.canonical_path,
-        options.limits.max_calibration_profile_bytes,
+        profile
+            .file_size_bytes
+            .min(options.limits.max_calibration_profile_bytes),
         "calibration_profile_too_large",
         "calibration profile",
     ) {
@@ -1936,9 +1975,14 @@ fn verify_auxiliary_artifacts(
             AuxiliaryPathResolution::Resolved(resolved) => {
                 captured_path = Some(resolved.canonical_path.clone());
                 entry.canonical_path = Some(path_to_display(&resolved.canonical_path));
+                // Bound the read by the manifest-declared size (the manifest
+                // is the trust anchor; the SHA-256 pins content). A flat
+                // limit, when explicitly configured, remains a ceiling.
                 match sha256_file_bounded(
                     &resolved.canonical_path,
-                    options.limits.max_auxiliary_artifact_bytes,
+                    artifact
+                        .file_size_bytes
+                        .min(options.limits.max_auxiliary_artifact_bytes),
                     "auxiliary_artifact_file_too_large",
                     "auxiliary artifact",
                 ) {
@@ -2261,6 +2305,9 @@ pub struct ResourceLimits {
     pub max_row_identity_tracked_db_id_bytes: usize,
     pub max_auxiliary_artifacts: usize,
     pub max_auxiliary_artifact_bytes: u64,
+    /// Opt-in ceiling for the primary index artifact read (unbounded by
+    /// default; the manifest-declared size is always the effective bound).
+    pub max_index_artifact_bytes: u64,
     pub max_calibration_profile_bytes: u64,
     pub max_encoder_distortion_profile_bytes: u64,
     pub max_report_issues: usize,
@@ -2276,6 +2323,7 @@ impl Default for ResourceLimits {
             max_row_identity_tracked_db_id_bytes: DEFAULT_MAX_ROW_IDENTITY_TRACKED_DB_ID_BYTES,
             max_auxiliary_artifacts: DEFAULT_MAX_AUXILIARY_ARTIFACTS,
             max_auxiliary_artifact_bytes: DEFAULT_MAX_AUXILIARY_ARTIFACT_BYTES,
+            max_index_artifact_bytes: DEFAULT_MAX_INDEX_ARTIFACT_BYTES,
             max_calibration_profile_bytes: DEFAULT_MAX_CALIBRATION_PROFILE_BYTES,
             max_encoder_distortion_profile_bytes: DEFAULT_MAX_ENCODER_DISTORTION_PROFILE_BYTES,
             max_report_issues: DEFAULT_MAX_REPORT_ISSUES,
@@ -3432,7 +3480,11 @@ pub fn sha256_file(path: impl AsRef) -> io::Result {
     let mut size_bytes = 0u64;
     let mut buf = [0u8; 64 * 1024];
     loop {
-        let n = file.read(&mut buf)?;
+        let n = match file.read(&mut buf) {
+            Ok(n) => n,
+            Err(err) if err.kind() == io::ErrorKind::Interrupted => continue,
+            Err(err) => return Err(err),
+        };
         if n == 0 {
             break;
         }
@@ -3452,12 +3504,54 @@ pub fn sha256_file_bounded(
     context: &'static str,
 ) -> Result {
     let path = path.as_ref();
-    let bytes = read_bounded_file(path, max_bytes, code, context)?;
+    // Refuse non-regular files BEFORE opening: opening a FIFO read-only
+    // blocks until a writer connects, and a device node would stream
+    // forever under a large declared-size bound. Regular files terminate
+    // at EOF and are post-checked against the declaration. (A path swapped
+    // to a special file after this check is local-actor mutation, out of
+    // scope per the threat model.)
+    let metadata = fs::metadata(path)?;
+    if !metadata.is_file() {
+        return Err(ManifestError::limit_exceeded(
+            code,
+            format!("{context} is not a regular file: {}", path.display()),
+        ));
+    }
+    let mut file = File::open(path)?;
     let mut hasher = Sha256::new();
-    hasher.update(&bytes);
+    let mut size_bytes = 0u64;
+    let mut buf = [0u8; 64 * 1024];
+    loop {
+        // Strict bound: never request bytes past max_bytes + 1 (the +1
+        // detects exceedance), mirroring read_bounded_file's take() pattern.
+        let allowance = max_bytes.saturating_add(1) - size_bytes;
+        if allowance == 0 {
+            break;
+        }
+        let want = allowance.min(buf.len() as u64) as usize;
+        let n = match file.read(&mut buf[..want]) {
+            Ok(n) => n,
+            Err(err) if err.kind() == io::ErrorKind::Interrupted => continue,
+            Err(err) => return Err(err.into()),
+        };
+        if n == 0 {
+            break;
+        }
+        size_bytes += n as u64;
+        if size_bytes > max_bytes {
+            return Err(ManifestError::limit_exceeded(
+                code,
+                format!(
+                    "{context} exceeds {max_bytes} bytes while reading {}",
+                    path.display()
+                ),
+            ));
+        }
+        hasher.update(&buf[..n]);
+    }
     Ok(FileHash {
         sha256: hex::encode(hasher.finalize()),
-        size_bytes: bytes.len() as u64,
+        size_bytes,
     })
 }
 
@@ -3514,7 +3608,24 @@ pub fn create_manifest_for_index_with_options(
         fs::create_dir_all(out_base)?;
     }
     let metadata = probe_index_metadata(index_path)?;
-    let index_hash = sha256_file(index_path)?;
+    let index_hash = sha256_file_bounded(
+        index_path,
+        metadata
+            .file_size_bytes
+            .min(options.limits.max_index_artifact_bytes),
+        "artifact_file_too_large",
+        "index artifact",
+    )?;
+    // One consistent snapshot: the manifest records the byte count that was
+    // actually hashed, and any change between the metadata probe and the
+    // hash (concurrent writer) fails loudly instead of embedding a
+    // size/digest pair describing different bytes.
+    if index_hash.size_bytes != metadata.file_size_bytes {
+        return Err(ManifestError::invalid(format!(
+            "index artifact changed during manifest creation: probed {} bytes, hashed {} bytes",
+            metadata.file_size_bytes, index_hash.size_bytes
+        )));
+    }
     let kind = ManifestIndexKind::try_from_core(metadata.kind)
         .map_err(|err| ManifestError::invalid(err.message()))?;
     let params = ManifestIndexParams::try_from_core(metadata.params)
@@ -3528,7 +3639,7 @@ pub fn create_manifest_for_index_with_options(
         vector_count: metadata.vector_count,
         bytes_per_vec: metadata.bytes_per_vec,
         params,
-        file_size_bytes: metadata.file_size_bytes,
+        file_size_bytes: index_hash.size_bytes,
     };
 
     let row_identity = match row_identity {
@@ -3648,9 +3759,15 @@ fn create_auxiliary_artifacts(
                 "auxiliary artifact name {name:?} is duplicated"
             )));
         }
+        // Create is a trusted context: bound the read by the artifact's own
+        // observed size (catching mid-hash growth), not a flat cap. An
+        // explicitly configured flat limit still applies as a ceiling.
+        let observed_len = fs::metadata(&artifact.path)
+            .map_err(ManifestError::from)?
+            .len();
         let hash = sha256_file_bounded(
             &artifact.path,
-            options.limits.max_auxiliary_artifact_bytes,
+            observed_len.min(options.limits.max_auxiliary_artifact_bytes),
             "auxiliary_artifact_file_too_large",
             "auxiliary artifact",
         )?;
diff --git a/ordvec-manifest/src/main.rs b/ordvec-manifest/src/main.rs
index 6236878e..02df85c1 100644
--- a/ordvec-manifest/src/main.rs
+++ b/ordvec-manifest/src/main.rs
@@ -103,7 +103,8 @@ fn parse_auxiliary_artifact_arg(value: &str) -> Result {
+                assert_eq!(limits.max_index_artifact_bytes, Some(8));
+                assert_eq!(limits.resource_limits().max_index_artifact_bytes, 8);
+            }
+            _ => panic!("expected verify command"),
+        }
+    }
 }
 
 #[cfg(feature = "sqlite")]
@@ -174,6 +211,8 @@ struct LimitArgs {
     #[arg(long)]
     max_auxiliary_artifact_bytes: Option,
     #[arg(long)]
+    max_index_artifact_bytes: Option,
+    #[arg(long)]
     max_calibration_profile_bytes: Option,
     #[arg(long)]
     max_encoder_distortion_profile_bytes: Option,
@@ -204,6 +243,9 @@ impl LimitArgs {
         if let Some(value) = self.max_auxiliary_artifact_bytes {
             limits.max_auxiliary_artifact_bytes = value;
         }
+        if let Some(value) = self.max_index_artifact_bytes {
+            limits.max_index_artifact_bytes = value;
+        }
         if let Some(value) = self.max_calibration_profile_bytes {
             limits.max_calibration_profile_bytes = value;
         }
diff --git a/ordvec-manifest/src/sqlite.rs b/ordvec-manifest/src/sqlite.rs
index 6368f9f3..6606c10a 100644
--- a/ordvec-manifest/src/sqlite.rs
+++ b/ordvec-manifest/src/sqlite.rs
@@ -1,8 +1,7 @@
 use crate::{
-    resolve_existing_path, sha256_file, sha256_file_bounded, validate_jsonl_rows,
-    verify_auxiliary_artifacts, verify_manifest, AuxiliaryArtifactState, ManifestDocument,
-    ManifestError, ReportIssue, ResourceLimits, RowIdentity, VerificationPathCapture,
-    VerificationReport, VerifyOptions,
+    resolve_existing_path, sha256_file_bounded, validate_jsonl_rows, verify_auxiliary_artifacts,
+    verify_manifest, AuxiliaryArtifactState, ManifestDocument, ManifestError, ReportIssue,
+    ResourceLimits, RowIdentity, VerificationPathCapture, VerificationReport, VerifyOptions,
 };
 use chrono::{SecondsFormat, Utc};
 use rusqlite::{params, Connection, OptionalExtension};
@@ -399,7 +398,18 @@ fn current_cache_key(
     ) else {
         return Ok(None);
     };
-    let artifact_sha256 = match sha256_file(&artifact.canonical_path) {
+    // Bound the cache-key hash exactly like the verify path: declared size
+    // with the opt-in ceiling. A bound violation just misses the cache.
+    let artifact_sha256 = match sha256_file_bounded(
+        &artifact.canonical_path,
+        document
+            .manifest
+            .artifact
+            .file_size_bytes
+            .min(options.limits.max_index_artifact_bytes),
+        "artifact_file_too_large",
+        "index artifact",
+    ) {
         Ok(hash) => hash.sha256,
         Err(_) => return Ok(None),
     };
@@ -618,7 +628,9 @@ fn current_calibration_profile_sha256(
     };
     match sha256_file_bounded(
         &resolved.canonical_path,
-        options.limits.max_calibration_profile_bytes,
+        profile
+            .file_size_bytes
+            .min(options.limits.max_calibration_profile_bytes),
         "calibration_profile_too_large",
         "calibration profile",
     ) {
@@ -652,7 +664,9 @@ fn current_encoder_distortion_profile_sha256(
     };
     match sha256_file_bounded(
         &resolved.canonical_path,
-        options.limits.max_encoder_distortion_profile_bytes,
+        profile
+            .file_size_bytes
+            .min(options.limits.max_encoder_distortion_profile_bytes),
         "encoder_distortion_profile_too_large",
         "encoder distortion profile",
     ) {
diff --git a/ordvec-manifest/tests/derived_limits.rs b/ordvec-manifest/tests/derived_limits.rs
new file mode 100644
index 00000000..c08aa18c
--- /dev/null
+++ b/ordvec-manifest/tests/derived_limits.rs
@@ -0,0 +1,248 @@
+//! Derived artifact size bounds: create bounds reads by the artifact's own
+//! observed size, verify bounds reads by the manifest-declared size. The flat
+//! `ResourceLimits` byte caps remain enforceable as explicit opt-in ceilings
+//! but no longer reject large legitimate artifacts by default.
+
+use ordvec::RankQuant;
+use ordvec_manifest::{
+    create_manifest_for_index, create_manifest_for_index_with_options, verify_manifest_with_base,
+    CreateAuxiliaryArtifact, CreateManifestOptions, CreateRowIdentity, VerificationReport,
+    VerifyOptions,
+};
+use std::fs;
+use std::fs::OpenOptions;
+use std::io::Write;
+use std::path::{Path, PathBuf};
+
+const LEGACY_AUX_CAP: u64 = 64 * 1024 * 1024;
+
+fn write_index(dir: &Path) -> PathBuf {
+    let path = dir.join("index.ovrq");
+    let mut index = RankQuant::new(16, 2);
+    let docs: Vec = (0..32).map(|i| i as f32 - 12.0).collect();
+    index.add(&docs);
+    index.write(&path).unwrap();
+    path
+}
+
+fn error_codes(report: &VerificationReport) -> Vec<&str> {
+    report
+        .errors
+        .iter()
+        .map(|issue| issue.code.as_str())
+        .collect()
+}
+
+fn create_with_aux(dir: &Path, aux_path: &Path) -> (ordvec_manifest::IndexManifest, PathBuf) {
+    let index = write_index(dir);
+    let manifest_path = dir.join("manifest.json");
+    let manifest = create_manifest_for_index_with_options(
+        &index,
+        CreateRowIdentity::RowIdIdentity,
+        "test-embedding",
+        &manifest_path,
+        CreateManifestOptions {
+            auxiliary_artifacts: vec![CreateAuxiliaryArtifact {
+                name: "sidecar".to_string(),
+                path: aux_path.to_path_buf(),
+                required: true,
+            }],
+            ..CreateManifestOptions::default()
+        },
+    )
+    .unwrap();
+    (manifest, manifest_path)
+}
+
+/// Default options must accept auxiliary artifacts larger than the legacy
+/// 64 MiB flat cap, end to end: create records the artifact, verify passes.
+/// (A 1.26M-row dim=1024 sign sidecar is ~161 MB; the default cap made such
+/// bundles impossible to write.)
+#[test]
+fn default_limits_accept_aux_artifact_above_legacy_cap() {
+    let temp = tempfile::tempdir().unwrap();
+    let aux_path = temp.path().join("sidecar.bin");
+    let aux_len = LEGACY_AUX_CAP + 4096;
+    let file = fs::File::create(&aux_path).unwrap();
+    file.set_len(aux_len).unwrap();
+    drop(file);
+
+    let (manifest, _) = create_with_aux(temp.path(), &aux_path);
+    assert_eq!(manifest.auxiliary_artifacts.len(), 1);
+    assert_eq!(manifest.auxiliary_artifacts[0].file_size_bytes, aux_len);
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert_eq!(
+        error_codes(&report),
+        Vec::<&str>::new(),
+        "expected clean verification for a {aux_len}-byte auxiliary artifact under defaults",
+    );
+}
+
+/// An auxiliary artifact that grew after manifest creation must be rejected
+/// by the declared-size read bound (fail-fast, without hashing the excess),
+/// keeping the established `auxiliary_artifact_file_too_large` reason code.
+#[test]
+fn verify_bounds_aux_read_by_declared_size_when_grown() {
+    let temp = tempfile::tempdir().unwrap();
+    let aux_path = temp.path().join("sidecar.bin");
+    fs::write(&aux_path, vec![7u8; 8192]).unwrap();
+
+    let (manifest, _) = create_with_aux(temp.path(), &aux_path);
+
+    let mut file = OpenOptions::new().append(true).open(&aux_path).unwrap();
+    file.write_all(&[7u8; 4096]).unwrap();
+    drop(file);
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert!(
+        error_codes(&report).contains(&"auxiliary_artifact_file_too_large"),
+        "grown artifact must fail the declared-size bound, got {:?}",
+        error_codes(&report),
+    );
+    assert_eq!(
+        report.auxiliary_artifacts[0].reason_code.as_deref(),
+        Some("auxiliary_artifact_file_too_large"),
+    );
+}
+
+/// Regression guard: a truncated auxiliary artifact still fails verification
+/// (size mismatch below the declared bound; the bound itself must not
+/// misclassify a smaller-than-declared file).
+#[test]
+fn verify_rejects_truncated_aux_artifact() {
+    let temp = tempfile::tempdir().unwrap();
+    let aux_path = temp.path().join("sidecar.bin");
+    fs::write(&aux_path, vec![7u8; 8192]).unwrap();
+
+    let (manifest, _) = create_with_aux(temp.path(), &aux_path);
+    let file = OpenOptions::new().write(true).open(&aux_path).unwrap();
+    file.set_len(4096).unwrap();
+    drop(file);
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert!(
+        error_codes(&report).contains(&"auxiliary_artifact_file_size_mismatch"),
+        "truncated artifact must fail size equality, got {:?}",
+        error_codes(&report),
+    );
+}
+
+/// Regression guard: a manifest whose declared auxiliary size was inflated
+/// (bytes on disk unchanged) still fails the size-equality check even though
+/// the SHA-256 matches.
+#[test]
+fn verify_rejects_inflated_declared_aux_size() {
+    let temp = tempfile::tempdir().unwrap();
+    let aux_path = temp.path().join("sidecar.bin");
+    fs::write(&aux_path, vec![7u8; 8192]).unwrap();
+
+    let (mut manifest, _) = create_with_aux(temp.path(), &aux_path);
+    manifest.auxiliary_artifacts[0].file_size_bytes = 1 << 30;
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert!(
+        error_codes(&report).contains(&"auxiliary_artifact_file_size_mismatch"),
+        "inflated declaration must fail size equality, got {:?}",
+        error_codes(&report),
+    );
+}
+
+/// An explicitly configured flat cap remains an enforceable ceiling on
+/// verify even when the declared size is within bounds.
+#[test]
+fn explicit_flat_cap_still_enforced_on_verify() {
+    let temp = tempfile::tempdir().unwrap();
+    let aux_path = temp.path().join("sidecar.bin");
+    fs::write(&aux_path, vec![7u8; 8192]).unwrap();
+
+    let (manifest, _) = create_with_aux(temp.path(), &aux_path);
+    let mut options = VerifyOptions::default();
+    options.limits.max_auxiliary_artifact_bytes = 4096;
+
+    let report = verify_manifest_with_base(manifest, temp.path(), options);
+    assert!(
+        error_codes(&report).contains(&"auxiliary_artifact_file_too_large"),
+        "explicit tight cap must still reject, got {:?}",
+        error_codes(&report),
+    );
+}
+
+/// The primary index artifact gains a declared-size read bound: a primary
+/// artifact that grew after manifest creation fails fast with a dedicated
+/// reason code instead of being hashed in full.
+#[test]
+fn verify_bounds_primary_read_by_declared_size_when_grown() {
+    let temp = tempfile::tempdir().unwrap();
+    let index = write_index(temp.path());
+    let manifest_path = temp.path().join("manifest.json");
+    let manifest = create_manifest_for_index(
+        &index,
+        CreateRowIdentity::RowIdIdentity,
+        "test-embedding",
+        &manifest_path,
+    )
+    .unwrap();
+
+    let mut file = OpenOptions::new().append(true).open(&index).unwrap();
+    file.write_all(&[0u8; 4096]).unwrap();
+    drop(file);
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert!(
+        error_codes(&report).contains(&"artifact_file_too_large"),
+        "grown primary artifact must fail the declared-size bound, got {:?}",
+        error_codes(&report),
+    );
+}
+
+/// The primary index artifact honors an explicitly configured opt-in
+/// ceiling, mirroring the auxiliary/profile artifact classes (CIPHER-02).
+#[test]
+fn explicit_index_ceiling_enforced_on_primary() {
+    let temp = tempfile::tempdir().unwrap();
+    let index = write_index(temp.path());
+    let manifest_path = temp.path().join("manifest.json");
+    let manifest = create_manifest_for_index(
+        &index,
+        CreateRowIdentity::RowIdIdentity,
+        "test-embedding",
+        &manifest_path,
+    )
+    .unwrap();
+
+    let mut options = VerifyOptions::default();
+    options.limits.max_index_artifact_bytes = 8;
+
+    let report = verify_manifest_with_base(manifest, temp.path(), options);
+    assert!(
+        error_codes(&report).contains(&"artifact_file_too_large"),
+        "explicit index ceiling must reject, got {:?}",
+        error_codes(&report),
+    );
+}
+
+/// Non-regular files must be refused before hashing: a FIFO would stream
+/// forever under a large declared-size bound (CIPHER-001).
+#[cfg(unix)]
+#[test]
+fn verify_refuses_non_regular_artifact_files() {
+    let temp = tempfile::tempdir().unwrap();
+    let aux_path = temp.path().join("sidecar.bin");
+    fs::write(&aux_path, vec![7u8; 512]).unwrap();
+    let (manifest, _) = create_with_aux(temp.path(), &aux_path);
+
+    fs::remove_file(&aux_path).unwrap();
+    let status = std::process::Command::new("mkfifo")
+        .arg(&aux_path)
+        .status()
+        .unwrap();
+    assert!(status.success());
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert!(
+        error_codes(&report).contains(&"auxiliary_artifact_file_too_large"),
+        "FIFO artifact must be refused, got {:?}",
+        error_codes(&report),
+    );
+}
diff --git a/ordvec-manifest/tests/manifest.rs b/ordvec-manifest/tests/manifest.rs
index 3a583e27..71f9b57a 100644
--- a/ordvec-manifest/tests/manifest.rs
+++ b/ordvec-manifest/tests/manifest.rs
@@ -2272,19 +2272,17 @@ fn verify_for_load_fails_closed_with_report_for_corrupted_artifact() {
         serde_json::to_string_pretty(&manifest).unwrap(),
     )
     .unwrap();
-    fs::OpenOptions::new()
-        .append(true)
-        .open(&index)
-        .unwrap()
-        .write_all(b"\0")
-        .unwrap();
+    // Corrupt in place (same size): the declared-size read bound is
+    // satisfied, so verification proceeds to the digest and fails there.
+    let mut bytes = fs::read(&index).unwrap();
+    bytes[0] ^= 0xFF;
+    fs::write(&index, &bytes).unwrap();
 
     let err = verify_for_load(&manifest_path, VerifyOptions::default()).unwrap_err();
     let VerifiedLoadPlanError::VerificationFailed(report) = err else {
         panic!("expected verification failure");
     };
     assert!(error_codes(&report).contains(&"artifact_sha256_mismatch"));
-    assert!(error_codes(&report).contains(&"artifact_file_size_mismatch"));
 }
 
 #[test]
@@ -2328,8 +2326,9 @@ fn verify_for_load_plan_is_not_a_byte_pin() {
     let VerifiedLoadPlanError::VerificationFailed(report) = err else {
         panic!("expected verification failure");
     };
-    assert!(error_codes(&report).contains(&"artifact_sha256_mismatch"));
-    assert!(error_codes(&report).contains(&"artifact_file_size_mismatch"));
+    // The artifact grew past its declared size, so re-verification fails
+    // fast at the declared-size read bound.
+    assert!(error_codes(&report).contains(&"artifact_file_too_large"));
 }
 
 #[test]
@@ -2640,6 +2639,49 @@ fn auxiliary_artifacts_fail_closed_on_tamper_missing_and_path_escape() {
         .ends_with("missing.bin"));
 }
 
+#[test]
+fn manifest_shape_rejects_zero_declared_file_sizes_for_required_artifacts() {
+    let root = tempfile::tempdir().unwrap();
+    let (temp, mut manifest, _manifest_path) = identity_manifest(root.path());
+    fs::write(temp.path().join("extra.bin"), b"extra").unwrap();
+    let extra_hash = sha256_file(temp.path().join("extra.bin")).unwrap();
+
+    manifest.artifact.file_size_bytes = 0;
+    manifest.auxiliary_artifacts = vec![AuxiliaryArtifact {
+        name: "extra".to_string(),
+        path: "extra.bin".to_string(),
+        sha256: extra_hash.sha256,
+        file_size_bytes: 0,
+        required: true,
+    }];
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert!(!report.ok);
+    let codes = error_codes(&report);
+    assert!(codes.contains(&"artifact_file_size_zero"), "{codes:?}");
+    assert!(
+        codes.contains(&"auxiliary_artifact_file_size_zero"),
+        "{codes:?}"
+    );
+}
+
+#[test]
+fn optional_absent_zero_size_placeholder_is_not_flagged_zero_size() {
+    let root = tempfile::tempdir().unwrap();
+    let (temp, mut manifest, _manifest_path) = identity_manifest(root.path());
+    manifest.auxiliary_artifacts = vec![AuxiliaryArtifact {
+        name: "optional-model".to_string(),
+        path: "missing-model.json".to_string(),
+        sha256: "0".repeat(64),
+        file_size_bytes: 0,
+        required: false,
+    }];
+
+    let report = verify_manifest_with_base(manifest, temp.path(), VerifyOptions::default());
+    assert!(report.ok, "{:?}", report.errors);
+    assert!(!error_codes(&report).contains(&"auxiliary_artifact_file_size_zero"));
+}
+
 #[test]
 fn auxiliary_artifact_schema_rejects_unknown_fields_and_duplicate_names() {
     let root = tempfile::tempdir().unwrap();
@@ -4255,3 +4297,70 @@ fn sqlite_cache_key_includes_limits_and_bounds_cached_report_size() {
     .unwrap_err();
     assert_eq!(err.code(), Some("sqlite_cached_report_too_large"));
 }
+
+#[test]
+fn grown_profiles_fail_fast_at_declared_size_under_default_limits() {
+    // Derived-limits regression coverage for the two profile call sites:
+    // a profile grown past its manifest-declared size must fail fast with
+    // the *_too_large code at DEFAULT options (bound = declared size), not
+    // be hashed in full and reported as a digest mismatch.
+    let temp = tempfile::tempdir().unwrap();
+    let case = tempfile::tempdir_in(temp.path()).unwrap();
+    let profile_dir = case.path().join("profiles");
+    fs::create_dir(&profile_dir).unwrap();
+    let index = write_index_kind(case.path(), FixtureKind::Bitmap);
+    let manifest_path = case.path().join("manifest.json");
+    let mut manifest = create_manifest_for_index(
+        &index,
+        CreateRowIdentity::RowIdIdentity,
+        "test-embedding",
+        &manifest_path,
+    )
+    .unwrap();
+
+    let calibration_path = profile_dir.join("profile.f64");
+    let calibration_hash = write_profile(
+        &calibration_path,
+        manifest.artifact.dim * std::mem::size_of::(),
+    );
+    manifest.calibration = Some(weighted_calibration(
+        &manifest,
+        "profiles/profile.f64",
+        calibration_hash,
+        CalibrationOrdinalization::TopK {
+            dim: manifest.artifact.dim,
+            k: 16,
+        },
+        ProfileParameterization::MarginalTopKFrequency,
+        vec![manifest.artifact.dim],
+    ));
+
+    let distortion_path = profile_dir.join("distortion.json");
+    let distortion_hash = write_profile(&distortion_path, 128);
+    manifest.encoder_distortion = Some(distortion_profile(
+        &manifest,
+        Some("profiles/distortion.json".to_string()),
+        Some(distortion_hash),
+        DistortionEvidenceKind::EmpiricalSample,
+    ));
+
+    let report = verify_manifest_with_base(manifest.clone(), case.path(), VerifyOptions::default());
+    assert!(report.ok, "{:?}", report.errors);
+
+    // Grow both profile files past their declarations.
+    for path in [&calibration_path, &distortion_path] {
+        let mut file = fs::OpenOptions::new().append(true).open(path).unwrap();
+        file.write_all(&[0u8; 64]).unwrap();
+    }
+
+    let report = verify_manifest_with_base(manifest, case.path(), VerifyOptions::default());
+    let codes = error_codes(&report);
+    assert!(
+        codes.contains(&"calibration_profile_too_large"),
+        "grown calibration profile must fail the declared-size bound, got {codes:?}",
+    );
+    assert!(
+        codes.contains(&"encoder_distortion_profile_too_large"),
+        "grown encoder distortion profile must fail the declared-size bound, got {codes:?}",
+    );
+}
diff --git a/ordvec-python/Cargo.toml b/ordvec-python/Cargo.toml
index 174fe13f..fb0a3cd4 100644
--- a/ordvec-python/Cargo.toml
+++ b/ordvec-python/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "ordvec-python"
-version = "0.5.0"
+version = "0.6.0"
 edition = "2021"
 rust-version = "1.89" # inherits ordvec's AVX-512 MSRV floor
 description = "Python bindings for ordvec — training-free ordinal & sign vector quantization"
diff --git a/ordvec-python/pyproject.toml b/ordvec-python/pyproject.toml
index d627aa12..79b26f75 100644
--- a/ordvec-python/pyproject.toml
+++ b/ordvec-python/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "maturin"
 
 [project]
 name = "ordvec"
-version = "0.5.0"
+version = "0.6.0"
 description = "Training-free ordinal & sign quantization for compressed vector retrieval"
 readme = "README.md"
 requires-python = ">=3.10"
diff --git a/ordvec-python/python/ordvec/__init__.py b/ordvec-python/python/ordvec/__init__.py
index 4726b895..5cfa2911 100644
--- a/ordvec-python/python/ordvec/__init__.py
+++ b/ordvec-python/python/ordvec/__init__.py
@@ -115,4 +115,4 @@
     "SignBitmapIndex",
 ]
 
-__version__ = "0.5.0"
+__version__ = "0.6.0"
diff --git a/src/quant.rs b/src/quant.rs
index 1c8ce627..321f748b 100644
--- a/src/quant.rs
+++ b/src/quant.rs
@@ -33,7 +33,7 @@ use crate::quant_kernels::{
     scan_b2_asym_avx2, scan_b2_asym_avx512, scan_b4_asym_avx2, scan_b4_asym_avx512,
 };
 use crate::rank::{
-    bucket_centre, bucket_ranks, pack_buckets, rank_to_bucket, rank_transform,
+    bucket_centre, bucket_ranks, pack_buckets, rank_to_bucket, rank_transform, rank_transform_into,
     rankquant_bytes_per_vec, rankquant_norm,
 };
 use crate::sign_bitmap::SignBitmap;
@@ -601,12 +601,15 @@ impl RankQuant {
         self.packed[start..]
             .par_chunks_mut(bytes_per_vec)
             .zip(vectors.par_chunks(dim))
-            .for_each(|(out, v)| {
-                let ranks = rank_transform(v);
-                let buckets = bucket_ranks(&ranks, bits);
-                let packed = pack_buckets(&buckets, bits);
-                out.copy_from_slice(&packed);
-            });
+            .for_each_init(
+                || vec![0u16; dim],
+                |ranks, (out, v)| {
+                    rank_transform_into(v, ranks);
+                    let buckets = bucket_ranks(ranks, bits);
+                    let packed = pack_buckets(&buckets, bits);
+                    out.copy_from_slice(&packed);
+                },
+            );
         self.n_vectors = new_n;
     }
 
diff --git a/src/rank_io.rs b/src/rank_io.rs
index e05505c8..d0b29572 100644
--- a/src/rank_io.rs
+++ b/src/rank_io.rs
@@ -765,8 +765,39 @@ fn load_rankquant_from_stream(
     let expected_per_bucket = dim / n_buckets;
     let mask = (1u8 << bits) - 1;
     let bits_u = bits as usize;
-    for (row_idx, row) in packed.chunks_exact(bytes_per_row).enumerate() {
-        let mut hist = [0usize; 16]; // n_buckets <= 2^4 = 16
+    // Per-byte bucket-count LUT: byte value -> how many of its packed codes
+    // land in each bucket. Replaces the per-code shift/mask loop (dim ops
+    // per row) with bytes_per_row table lookups, and rows check in parallel
+    // (they are independent). `find_first` preserves the serial contract of
+    // reporting the lowest offending row.
+ let mut lut = [[0u8; 16]; 256]; + for (byte, counts) in lut.iter_mut().enumerate() { + for slot in 0..codes_per_byte { + let shift = (codes_per_byte - 1 - slot) * bits_u; + counts[((byte as u8 >> shift) & mask) as usize] += 1; + } + } + let row_is_valid = |row: &[u8]| { + let mut hist = [0u16; 16]; + for &byte in row { + let counts = &lut[byte as usize]; + for bucket in 0..n_buckets { + hist[bucket] += u16::from(counts[bucket]); + } + } + hist[..n_buckets] + .iter() + .all(|&count| count as usize == expected_per_bucket) + }; + use rayon::prelude::*; + let first_bad = (0..n_vectors).into_par_iter().find_first(|&row_idx| { + !row_is_valid(&packed[row_idx * bytes_per_row..(row_idx + 1) * bytes_per_row]) + }); + if let Some(row_idx) = first_bad { + // Rerun the scalar histogram on the offending row for the exact + // bucket/count in the error message. + let row = &packed[row_idx * bytes_per_row..(row_idx + 1) * bytes_per_row]; + let mut hist = [0usize; 16]; for &byte in row { for slot in 0..codes_per_byte { let shift = (codes_per_byte - 1 - slot) * bits_u; @@ -781,6 +812,7 @@ fn load_rankquant_from_stream( ))); } } + unreachable!("row {row_idx} failed the LUT check but passed the scalar recheck"); } Ok((bits, dim, n_vectors, packed)) } diff --git a/src/sign_bitmap.rs b/src/sign_bitmap.rs index 66f971ab..649443d0 100644 --- a/src/sign_bitmap.rs +++ b/src/sign_bitmap.rs @@ -39,6 +39,7 @@ //! scalar path. See [`crate::avx512vpop_supported`]. use rayon::prelude::*; +use std::collections::BinaryHeap; use crate::OrdvecError; @@ -220,6 +221,112 @@ impl SignBitmap { /// SIMD dispatch paths — same audit discipline as /// [`crate::Bitmap::top_m_candidates`]. 
#[must_use = "this scans the corpus to generate candidates; dropping the result discards that work"] + /// Streamed exact top-m selection shared by [`Self::top_m_candidates`] + /// and [`Self::top_m_candidates_batched_serial_csr`]: the corpus is + /// scanned once per call in L2-sized doc blocks, each hot block is + /// scored against every query (in small query tiles), and per-query + /// bounded min-m collectors keyed by `(hamming, doc_id)` select exactly + /// the lexicographic top-m — bit-identical to a full sort, independent + /// of processing order. Serial by contract: no rayon. + fn top_m_candidates_streamed(&self, queries: &[f32], m_eff: usize) -> Vec> { + const TILE_QUERIES: usize = 32; + const BLOCK_BYTES: usize = 256 * 1024; + + let dim = self.dim; + debug_assert!( + queries.len().is_multiple_of(dim), + "queries buffer must be a whole number of rows" + ); + let nq = queries.len() / dim; + let qpv = self.qwords_per_vec; + let n = self.n_vectors; + debug_assert!(m_eff >= 1 && m_eff <= n); + + // Build bitmaps in place: the entry points already validated the + // whole query buffer, and build_query_bitmap would allocate a fresh + // Vec (and re-validate) per query on this hot path. + let mut q_bitmaps = vec![0u64; nq * qpv]; + for qi in 0..nq { + let q = &queries[qi * dim..(qi + 1) * dim]; + let bm = &mut q_bitmaps[qi * qpv..(qi + 1) * qpv]; + for (j, &value) in q.iter().enumerate() { + if value > 0.0 { + bm[j / 64] |= 1u64 << (j % 64); + } + } + } + + let block_docs = (BLOCK_BYTES / (qpv * 8)).max(64).min(n); + let tile = TILE_QUERIES.min(nq); + let mut block_scores = vec![0u32; tile * block_docs]; + // Max-heap keeps the current worst kept key at the top, so the + // retained set is always the m lexicographically smallest + // (hamming, doc_id) keys seen so far. 
+        // Selection state is O(nq * m_eff) on top of the CSR output — an
+        // explicit checked bound (32-bit/wasm32 targets can overflow the
+        // multiplication) with a clear message, per the crate's
+        // checked-allocation discipline. Exact per-heap reservation of
+        // m_eff + 1 is deliberate: gradual growth would double-allocate to
+        // the next power of two (~2x m_eff peak per query); callers with
+        // extreme nq * m_eff should tile the query batch (as OrdinalDB's
+        // chunk scheduler does).
+        let selection_cells = nq.checked_mul(m_eff).unwrap_or_else(|| {
+            panic!("selection state nq ({nq}) * m ({m_eff}) overflows usize; tile the query batch")
+        });
+        let _ = selection_cells;
+        let mut heaps: Vec<BinaryHeap<(u32, u32)>> = (0..nq)
+            .map(|_| BinaryHeap::with_capacity(m_eff + 1))
+            .collect();
+        // Cached copy of each full heap's worst kept hamming. Doc ids visit
+        // each heap strictly ascending (d ascends within a row, blocks
+        // ascend), so a candidate tying the worst hamming always loses the
+        // (hamming, doc_id) tie-break — once full, the boundary test
+        // reduces to one u32 compare against this register. u32::MAX while
+        // filling (hamming <= dim can never reach it).
+ let mut worst_bounds = vec![u32::MAX; nq]; + + let mut block_start = 0usize; + while block_start < n { + let bn = block_docs.min(n - block_start); + let block = &self.bitmaps[block_start * qpv..(block_start + bn) * qpv]; + let mut tile_start = 0usize; + while tile_start < nq { + let tq = tile.min(nq - tile_start); + let qb_tile = &q_bitmaps[tile_start * qpv..(tile_start + tq) * qpv]; + let scores = &mut block_scores[..tq * bn]; + sign_scan_collect_batched(block, bn, qpv, qb_tile, tq, scores); + for ti in 0..tq { + let heap = &mut heaps[tile_start + ti]; + let worst = &mut worst_bounds[tile_start + ti]; + let row = &scores[ti * bn..(ti + 1) * bn]; + for (d, &hamming) in row.iter().enumerate() { + if hamming >= *worst { + continue; + } + heap.push((hamming, (block_start + d) as u32)); + if heap.len() > m_eff { + heap.pop(); + } + if heap.len() == m_eff { + *worst = heap.peek().expect("full collector").0; + } + } + } + tile_start += tq; + } + block_start += bn; + } + + heaps + .into_iter() + .map(|heap| { + let mut kept = heap.into_vec(); + kept.sort_unstable(); + kept.into_iter().map(|(_, doc)| doc).collect() + }) + .collect() + } + pub fn top_m_candidates(&self, q: &[f32], m: usize) -> Vec { assert_eq!(q.len(), self.dim); crate::util::assert_all_finite(q); @@ -227,6 +334,10 @@ impl SignBitmap { if m_eff == 0 { return Vec::new(); } + // Single-query stays on the dense partition path: with one query + // there is no scan to share, and select_nth_unstable_by (O(n) + // average) measurably beats an O(n log m) bounded heap for m in the + // hundreds at small/medium n (audit: +50-90% regression otherwise). let qb = self.build_query_bitmap(q); let mut scores = vec![0u32; self.n_vectors]; // Hamming distance per doc sign_scan_collect( @@ -313,10 +424,17 @@ impl SignBitmap { /// pool. (The existing [`Self::top_m_candidates_batched`] remains the /// internally-parallel standalone convenience.) 
/// - /// Track-1 implementation is intentionally naive — it loops the single-query - /// [`Self::top_m_candidates`] (which materialises a per-query `n` Hamming - /// row). A future release may replace the internals with streaming top-m - /// behind this frozen signature; the CSR output contract will not change. + /// The internals stream the corpus **once per call** in L2-sized doc + /// blocks, scoring every query of the call against each hot block and + /// selecting per-query top-m with bounded `(hamming, doc_id)` collectors + /// — per-query corpus traffic drops by the call's query count relative + /// to the historical per-query rescan. The CSR output contract is + /// unchanged and bit-identical to the previous implementation. + /// + /// "Serial" scopes the scan and selection: no rayon is entered for the + /// candidate work, so callers own that parallelism. Input finite- + /// validation MAY briefly use the global rayon pool for large query + /// buffers (order-independent boolean reduction; deterministic). 
/// /// # Example /// ```no_run @@ -344,10 +462,17 @@ impl SignBitmap { let m_eff = m.min(self.n_vectors); let mut offsets = Vec::with_capacity(nq + 1); offsets.push(0usize); - let mut candidates = Vec::with_capacity(nq.saturating_mul(m_eff)); - for qi in 0..nq { - let q = &queries[qi * dim..(qi + 1) * dim]; - let row = self.top_m_candidates(q, m); + let mut candidates = Vec::with_capacity(nq.checked_mul(m_eff).unwrap_or_else(|| { + panic!("CSR output nq ({nq}) * m ({m_eff}) overflows usize; tile the query batch") + })); + if nq == 0 || m_eff == 0 { + offsets.extend(std::iter::repeat_n(0usize, nq)); + return CandidateBatch { + candidates, + offsets, + }; + } + for row in self.top_m_candidates_streamed(queries, m_eff) { candidates.extend_from_slice(&row); offsets.push(candidates.len()); } @@ -662,6 +787,59 @@ fn sign_scan_collect_batched( } } +/// Fold eight u64-lane accumulators into one vector holding their eight +/// horizontal sums, in accumulator order: an unpack/permute/shuffle tree +/// (25 vector ops) replacing eight serial `_mm512_reduce_add_epi64` +/// expansions on the per-doc hot path. 
+#[cfg(target_arch = "x86_64")] +#[target_feature(enable = "avx512f")] +unsafe fn hsum8_epi64_avx512(accs: &[std::arch::x86_64::__m512i; 8]) -> std::arch::x86_64::__m512i { + use std::arch::x86_64::*; + { + // L1: pairwise lane sums, interleaved per source: + // s01 = [a0p01, a1p01, a0p23, a1p23, a0p45, a1p45, a0p67, a1p67] + let s01 = _mm512_add_epi64( + _mm512_unpacklo_epi64(accs[0], accs[1]), + _mm512_unpackhi_epi64(accs[0], accs[1]), + ); + let s23 = _mm512_add_epi64( + _mm512_unpacklo_epi64(accs[2], accs[3]), + _mm512_unpackhi_epi64(accs[2], accs[3]), + ); + let s45 = _mm512_add_epi64( + _mm512_unpacklo_epi64(accs[4], accs[5]), + _mm512_unpackhi_epi64(accs[4], accs[5]), + ); + let s67 = _mm512_add_epi64( + _mm512_unpacklo_epi64(accs[6], accs[7]), + _mm512_unpackhi_epi64(accs[6], accs[7]), + ); + // L2: gather even/odd u64s across pair vectors: + // e01_23 = [a0p01, a0p23, a0p45, a0p67, a2p01, a2p23, a2p45, a2p67] + let even_idx = _mm512_setr_epi64(0, 2, 4, 6, 8, 10, 12, 14); + let odd_idx = _mm512_setr_epi64(1, 3, 5, 7, 9, 11, 13, 15); + let e02 = _mm512_permutex2var_epi64(s01, even_idx, s23); + let o13 = _mm512_permutex2var_epi64(s01, odd_idx, s23); + let e46 = _mm512_permutex2var_epi64(s45, even_idx, s67); + let o57 = _mm512_permutex2var_epi64(s45, odd_idx, s67); + // L3: pairwise again -> + // w1 = [a0p0123, a1p0123, a0p4567, a1p4567, a2p0123, a3p0123, a2p4567, a3p4567] + let w1 = _mm512_add_epi64( + _mm512_unpacklo_epi64(e02, o13), + _mm512_unpackhi_epi64(e02, o13), + ); + let w2 = _mm512_add_epi64( + _mm512_unpacklo_epi64(e46, o57), + _mm512_unpackhi_epi64(e46, o57), + ); + // L4: fold 128-bit blocks: w1 blocks B0=[a0p0123,a1p0123] + // B1=[a0p4567,a1p4567] B2=[a2..],B3 -> sums = B0+B1, B2+B3. 
+ let t = _mm512_shuffle_i64x2(w1, w2, 0b10_00_10_00); + let u = _mm512_shuffle_i64x2(w1, w2, 0b11_01_11_01); + _mm512_add_epi64(t, u) + } +} + #[cfg(target_arch = "x86_64")] #[target_feature(enable = "avx512f,avx512vpopcntdq")] unsafe fn sign_scan_collect_batched_avx512vpop( @@ -734,9 +912,11 @@ unsafe fn sign_scan_collect_batched_avx512vpop( accs[bi] = _mm512_add_epi64(accs[bi], _mm512_popcnt_epi64(xor_zmm)); } } + let sums = hsum8_epi64_avx512(&accs); + let mut sums_arr = [0u64; CHUNK]; + _mm512_storeu_si512(sums_arr.as_mut_ptr() as *mut __m512i, sums); for bi in 0..CHUNK { - let acc_sum: i64 = _mm512_reduce_add_epi64(accs[bi]); - scores[(chunk_start + bi) * n + di] = acc_sum as u32; + scores[(chunk_start + bi) * n + di] = sums_arr[bi] as u32; } } chunk_start += CHUNK; diff --git a/src/util.rs b/src/util.rs index 5f9eb1dd..eecdb2ef 100644 --- a/src/util.rs +++ b/src/util.rs @@ -124,8 +124,18 @@ pub(crate) fn l2_normalise_into(out: &mut Vec, v: &[f32]) { /// validate separately; this is the Rust-side backstop. #[inline] pub(crate) fn assert_all_finite(v: &[f32]) { + // Large ingest batches pay a full serial pass here (measured ~0.1s per + // GiB); split the scan across the pool once it dwarfs the fork cost. + const PARALLEL_THRESHOLD: usize = 1 << 20; + let all_finite = if v.len() >= PARALLEL_THRESHOLD { + use rayon::prelude::*; + v.par_chunks(1 << 18) + .all(|c| c.iter().all(|x| x.is_finite())) + } else { + v.iter().all(|x| x.is_finite()) + }; assert!( - v.iter().all(|x| x.is_finite()), + all_finite, "ordvec: input contains non-finite (NaN or ±Inf) values; embeddings must be finite" ); } diff --git a/tests/tiled_candgen.rs b/tests/tiled_candgen.rs new file mode 100644 index 00000000..33ac4144 --- /dev/null +++ b/tests/tiled_candgen.rs @@ -0,0 +1,175 @@ +//! Contract-pinning tests for sign candidate generation, written ahead of the +//! tiled internals swap of `top_m_candidates` / +//! `top_m_candidates_batched_serial_csr`. The oracle is independent of the +//! 
implementation under test: `score_all` (dense agreement counts) plus a
+//! full lexicographic sort by `(hamming asc, doc_id asc)`. These tests pin
+//! today's behavior exactly — including tie handling at the m-th position —
+//! and must pass bit-identically before and after the swap.
+
+use ordvec::SignBitmap;
+
+/// Deterministic xorshift so corpora are reproducible without a rand dep.
+struct XorShift(u64);
+
+impl XorShift {
+    fn next_f32(&mut self) -> f32 {
+        self.0 ^= self.0 << 13;
+        self.0 ^= self.0 >> 7;
+        self.0 ^= self.0 << 17;
+        // Map to [-1, 1) with plenty of sign variety.
+        ((self.0 >> 40) as f32 / 8_388_608.0) - 1.0
+    }
+}
+
+fn random_corpus(dim: usize, n: usize, seed: u64) -> Vec<f32> {
+    let mut rng = XorShift(seed | 1);
+    (0..n * dim).map(|_| rng.next_f32()).collect()
+}
+
+/// Tie-heavy corpus: every coordinate is +/-1 drawn from a tiny pattern set,
+/// so hamming distances collide massively and the (hamming, doc_id)
+/// tie-break does real work at the selection boundary.
+fn tie_heavy_corpus(dim: usize, n: usize) -> Vec<f32> {
+    (0..n)
+        .flat_map(|doc| {
+            let pattern = doc % 4;
+            (0..dim).map(move |c| if (c + pattern) % 3 == 0 { -1.0 } else { 1.0 })
+        })
+        .collect()
+}
+
+fn oracle_top_m(sign: &SignBitmap, q: &[f32], m: usize) -> Vec<u32> {
+    let dim_u32 = u32::try_from(q.len()).unwrap();
+    // score_all returns agreement (dim - hamming), higher is better.
+    let agreements = sign.score_all(q);
+    let mut ids: Vec<u32> = (0..agreements.len() as u32).collect();
+    ids.sort_by_key(|&i| (dim_u32 - agreements[i as usize], i));
+    ids.truncate(m.min(agreements.len()));
+    ids
+}
+
+fn assert_contract(dim: usize, vectors: &[f32], queries: &[f32], m: usize, label: &str) {
+    let mut sign = SignBitmap::new(dim);
+    sign.add(vectors);
+    let nq = queries.len() / dim;
+
+    // Single-query path.
+ for qi in 0..nq { + let q = &queries[qi * dim..(qi + 1) * dim]; + let got = sign.top_m_candidates(q, m); + let want = oracle_top_m(&sign, q, m); + assert_eq!( + got, want, + "{label}: single-query mismatch at query {qi}, m={m}" + ); + } + + // Batched serial CSR path: row qi must equal the single-query result. + let cb = sign.top_m_candidates_batched_serial_csr(queries, m); + assert_eq!(cb.offsets.len(), nq + 1, "{label}: CSR offsets length"); + for qi in 0..nq { + let row = &cb.candidates[cb.offsets[qi]..cb.offsets[qi + 1]]; + let want = oracle_top_m(&sign, &queries[qi * dim..(qi + 1) * dim], m); + assert_eq!( + row, + &want[..], + "{label}: CSR row mismatch at query {qi}, m={m}" + ); + } +} + +/// Random corpus large enough to span many doc blocks under any plausible +/// tile size, at a SIMD-friendly dim. +#[test] +fn random_corpus_matches_oracle_across_block_boundaries() { + // dim=512 -> 8 qwords/vec -> 4096-doc blocks; n=10240 spans three + // blocks including a final partial one (audit: the previous dim=128 + // shape fit in a single block, so the loop never crossed a boundary). + let dim = 512; + let n = 10_240; + let vectors = random_corpus(dim, n, 0xC0FFEE); + let queries = random_corpus(dim, 33, 0xBEEF); + for m in [1, 7, 256, 500] { + assert_contract(dim, &vectors, &queries, m, "random"); + } +} + +/// Massive hamming ties: selection at the boundary is decided purely by +/// doc_id ascending. This is the case a streaming collector most easily gets +/// subtly wrong. +#[test] +fn tie_heavy_corpus_selects_lowest_doc_ids_at_boundary() { + let dim = 64; + let n = 4_096; + let vectors = tie_heavy_corpus(dim, n); + let queries = random_corpus(dim, 9, 0xABCD); + for m in [1, 3, 100, 1_000] { + assert_contract(dim, &vectors, &queries, m, "tie-heavy"); + } +} + +/// Exact duplicate documents: every duplicate group is one giant tie run, +/// longer than m, exercising equal-hamming runs that exceed the collector. 
+#[test] +fn duplicate_documents_tie_runs_longer_than_m() { + let dim = 64; + let base = random_corpus(dim, 8, 0x1234); + // 8 distinct vectors, each repeated 512 times => tie runs of 512. + let mut vectors = Vec::with_capacity(8 * 512 * dim); + for rep in 0..512 { + let _ = rep; + vectors.extend_from_slice(&base); + } + let queries = random_corpus(dim, 5, 0x9999); + for m in [10, 100, 513] { + assert_contract(dim, &vectors, &queries, m, "duplicates"); + } +} + +/// Edge geometry: m >= n, m == n, single doc, single query, nq == 0. +#[test] +fn edge_geometries_match_oracle() { + let dim = 64; + let vectors = random_corpus(dim, 17, 0x42); + let queries = random_corpus(dim, 3, 0x43); + for m in [17, 25, 1] { + assert_contract(dim, &vectors, &queries, m, "edge"); + } + + let single_doc = random_corpus(dim, 1, 0x77); + assert_contract(dim, &single_doc, &queries, 4, "single-doc"); + + // Empty query batch: CSR must be a single zero offset and no candidates. + let mut sign = SignBitmap::new(dim); + sign.add(&vectors); + let cb = sign.top_m_candidates_batched_serial_csr(&[], 8); + assert_eq!(cb.offsets, vec![0]); + assert!(cb.candidates.is_empty()); +} + +/// Large-dim smoke at the shape the arXiv corpus uses (1024 dims), enough +/// rows to cross several L2-sized doc blocks. +#[test] +fn dim_1024_shape_matches_oracle() { + let dim = 1024; + let n = 6_000; + let vectors = random_corpus(dim, n, 0xA5A5); + let queries = random_corpus(dim, 8, 0x5A5A); + for m in [256, 320] { + assert_contract(dim, &vectors, &queries, m, "dim1024"); + } +} + +/// AVX-512 tail residue (dim=768 -> qpv=12, rem=4) composed with +/// multi-block crossing and a final partial block — the kernel-shape case +/// the audit flagged as untested in the permanent suite. 
+#[test] +fn dim_768_tail_residue_crosses_blocks() { + let dim = 768; + let n = 3_200; // block_docs = 262144/96 = 2730 -> 2 blocks, partial tail + let vectors = random_corpus(dim, n, 0x7E57); + let queries = random_corpus(dim, 7, 0x7E58); + for m in [64, 320] { + assert_contract(dim, &vectors, &queries, m, "dim768-tail"); + } +}
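
Reviewer note: the collector invariant this suite pins can be illustrated in isolation. The sketch below is not code from the crate; `top_m_streamed` and `top_m_full_sort` are hypothetical names. It assumes distances arrive in ascending doc id order and that no distance equals `u32::MAX`, which is the same precondition the cached worst-bound comment relies on. Under those assumptions a bounded max-heap with a cached bound should select the same ids as a full sort by `(hamming, doc_id)`.

```rust
use std::collections::BinaryHeap;

/// Hypothetical helper (not part of ordvec): pick the m smallest
/// (hamming, doc_id) keys from distances visited in ascending doc id order.
/// Assumes every distance is < u32::MAX, which doubles as the "still
/// filling" sentinel.
fn top_m_streamed(hammings: &[u32], m: usize) -> Vec<u32> {
    let m_eff = m.min(hammings.len());
    if m_eff == 0 {
        return Vec::new();
    }
    // Max-heap: the worst kept key sits at the top.
    let mut heap: BinaryHeap<(u32, u32)> = BinaryHeap::with_capacity(m_eff + 1);
    let mut worst = u32::MAX;
    for (doc, &hamming) in hammings.iter().enumerate() {
        // Doc ids ascend, so a tie with the worst kept distance always
        // loses the (hamming, doc_id) tie-break: reject on `>=`.
        if hamming >= worst {
            continue;
        }
        heap.push((hamming, doc as u32));
        if heap.len() > m_eff {
            heap.pop();
        }
        if heap.len() == m_eff {
            worst = heap.peek().expect("full collector").0;
        }
    }
    let mut kept = heap.into_vec();
    kept.sort_unstable();
    kept.into_iter().map(|(_, doc)| doc).collect()
}

/// Reference oracle: full lexicographic sort, then truncate.
fn top_m_full_sort(hammings: &[u32], m: usize) -> Vec<u32> {
    let mut ids: Vec<u32> = (0..hammings.len() as u32).collect();
    ids.sort_by_key(|&i| (hammings[i as usize], i));
    ids.truncate(m.min(hammings.len()));
    ids
}
```

Comparing the two helpers over a tie-heavy input, for every `m` from 0 past `n`, is a small-scale version of what `assert_contract` does against `score_all`.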