fix: derive artifact read bounds from declared sizes in ordvec-manifest#277
fix: derive artifact read bounds from declared sizes in ordvec-manifest#277Nelson Spence (Fieldnote-Echo) wants to merge 12 commits into
Conversation
Verification now bounds every artifact read by its manifest-declared file_size_bytes (manifest hard-capped at 1 MiB; SHA-256 pins content); creation bounds reads by the observed file size. Flat ResourceLimits byte caps become opt-in ceilings (default unbounded). The primary artifact read, previously unbounded, gains a declared-size bound and the artifact_file_too_large reason code. sha256_file_bounded now streams with constant memory instead of materialising files. Fixes the undocumented 64 MiB auxiliary cap that made sign-sidecar bundles impossible to write past 524,288 rows at dim=1024 (measured on a 1,258,135-row corpus).
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Codex (@codex) review |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
PR Summary by QodoDerive artifact read bounds from manifest-declared sizes in ordvec-manifest
AI Description
Diagram
High-Level Assessment
Files changed (6)
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Code Review by Qodo
1.
|
Security-audit remediation (CIPHER-02): the primary artifact read was bounded only by the attacker-declared size with no configurable ceiling, unlike the auxiliary/profile classes — the documented THREAT-QUERY-003 mitigation was silently incomplete for the largest artifact in a bundle. Adds ResourceLimits::max_index_artifact_bytes (default unbounded) applied as a min() with the declared size, and bounds the create-path primary read by its observed size so the code matches the CHANGELOG claim. CLI flag parity follows separately.
Code-review remediation: the sqlite-feature cache-key path duplicated the hashing logic and missed the derived-bound change — the primary artifact hash was fully unbounded, and the calibration/encoder profile hashes were bounded only by the flat caps, which the default flip turned into effectively unbounded reads. All three now use the same declared-size .min(opt-in ceiling) derivation as the verify path; a bound violation is a cache miss. Adds default-limits grown-profile coverage for the calibration and encoder-distortion call sites, closing the per-site test gap that let this slip.
Adversarial audit triage (cipher security agent + code-reviewer agent)Fixed in
Fixed in
Explained / accepted:
Deferred to follow-up issues:
|
Security-review remediation (fleet CIPHER-001): with derived read bounds, a FIFO inside the bundle directory would block verification forever — File::open on a reader-less FIFO blocks, and a device node would stream without EOF under a large declared size. Stat the path before opening and refuse anything that is not a regular file, at every hashing call site (create, verify, sqlite cache). Regular files terminate at EOF and remain post-checked against the declaration. Unix regression test uses a real FIFO.
Bot-review remediation (Qodo, PR #277): the streaming loop surfaced ErrorKind::Interrupted as a hash failure, a reliability regression vs the previous read_to_end which retried EINTR internally.
|
Qodo triage: (1) EINTR retry in the streaming hash — fixed in |
That triage looks correct.
So there’s nothing further to change in this PR snapshot. |
CI caught both halves: the retry from the Qodo remediation had landed on sha256_file (where the io::Error conversion is a useless-conversion lint under current clippy) instead of sha256_file_bounded, the function actually flagged — the retry now lives on both, with the conversion only where the error type changes. And the release-publish invariant correctly refuses [Unreleased] changelog content at an already-released version: this stack is the 0.6.0 work, so ordvec and ordvec-manifest now say 0.6.0 (minor: limit-semantics change, additive APIs, behavioral perf changes).
The release-publish SBOM invariant requires member package versions in lockstep with the root; ordvec-ffi and both python bindings follow the 0.6.0 bump.
Closes the remaining release-publish invariant layers, verified by running tests/release_publish_invariants.py locally to a clean exit: pyproject + __init__ versions in lockstep, the changelog cut as a dated 0.6.0 section (invariant convention: the current version always has a dated section; [Unreleased] stays empty), THREAT_MODEL status line at v0.6.0, and the README quickstart installing 0.6.
…imary shape check Bot-review remediation (Qodo, #283 inline): - create_manifest_for_index_with_options observed the index size twice (probe, then a separate stat for the hash bound) — a concurrent writer could produce a manifest whose size and digest describe different bytes. The hash is now bounded by the probe's size, the manifest records the byte count actually hashed, and any disagreement fails loudly. - sha256_file_bounded could read (not hash) up to one 64KiB chunk past the bound; reads now clamp to max_bytes + 1, mirroring read_bounded_file's take() pattern. - validate_manifest_shape gains artifact_file_size_zero for the primary artifact, matching the profile artifacts' explicit zero rejection instead of surfacing a confusing artifact_file_too_large.
Bot-review remediation (Qodo, #282): --max-index-artifact-bytes wired into ResourceLimits but the create path bounded the primary hash by the probed size alone — the opt-in ceiling was ineffective for create, unlike auxiliary artifacts. Create now mirrors verify: declared/observed size min explicit ceiling.
* fix: derive artifact read bounds from declared sizes in ordvec-manifest Verification now bounds every artifact read by its manifest-declared file_size_bytes (manifest hard-capped at 1 MiB; SHA-256 pins content); creation bounds reads by the observed file size. Flat ResourceLimits byte caps become opt-in ceilings (default unbounded). The primary artifact read, previously unbounded, gains a declared-size bound and the artifact_file_too_large reason code. sha256_file_bounded now streams with constant memory instead of materialising files. Fixes the undocumented 64 MiB auxiliary cap that made sign-sidecar bundles impossible to write past 524,288 rows at dim=1024 (measured on a 1,258,135-row corpus). * test: pin sign candidate-generation contract ahead of tiled internals Independent oracle (score_all + full lexicographic sort by hamming asc, doc_id asc) pins top_m_candidates and top_m_candidates_batched_serial_csr exactly: random corpora across block boundaries, massive-tie and duplicate-run corpora exercising boundary tie-breaks, edge geometries (m >= n, single doc, empty batch), and the dim=1024 shape. Must pass bit-identically before and after the tiling swap. * fix: give the primary index artifact an opt-in read ceiling Security-audit remediation (CIPHER-02): the primary artifact read was bounded only by the attacker-declared size with no configurable ceiling, unlike the auxiliary/profile classes — the documented THREAT-QUERY-003 mitigation was silently incomplete for the largest artifact in a bundle. Adds ResourceLimits::max_index_artifact_bytes (default unbounded) applied as a min() with the declared size, and bounds the create-path primary read by its observed size so the code matches the CHANGELOG claim. CLI flag parity follows separately. * fix: bound sqlite cache-key hashes by declared sizes Code-review remediation: the sqlite-feature cache-key path duplicated the hashing logic and missed the derived-bound change — the primary artifact hash was fully unbounded, and the calibration/encoder profile hashes were bounded only by the flat caps, which the default flip turned into effectively unbounded reads. All three now use the same declared-size .min(opt-in ceiling) derivation as the verify path; a bound violation is a cache miss. Adds default-limits grown-profile coverage for the calibration and encoder-distortion call sites, closing the per-site test gap that let this slip. * perf: stream the corpus once per call in sign candidate generation top_m_candidates_batched_serial_csr previously looped the single-query path, re-streaming the full sign bitmap per query (documented-naive Track-1). The internals now scan the corpus once per call in L2-sized doc blocks, score every query of the call against each hot block in query tiles via the existing batched kernel, and select per-query top-m with bounded (hamming, doc_id) min-collectors — bit-identical to a full sort by construction, independent of processing order (the key IS the contract's sort key). top_m_candidates routes through the same core, dropping its per-call n-row Hamming materialisation. Per-query corpus traffic drops by the call's query count: at 1.26M rows x 1024 dims, a 2048-query call reads the 161MB sidecar once instead of 2048 times. Serial contract preserved (no rayon); the oracle suite (tests/tiled_candgen.rs) pins bit-identical outputs across random, tie-heavy, duplicate-run, and edge geometries. * perf: keep the dense partition path for single-query candidates Audit remediation: routing top_m_candidates through the streamed core measured +50-90% at small/medium n with m in the hundreds (bounded heap O(n log m) vs select_nth_unstable_by O(n)); with one query there is no scan to share, so nq=1 stays on the dense path (parity-or-better at every measured size). Also per audit: the block-boundary oracle test now genuinely spans three blocks (the dim=128 shape fit one block), and adds the dim=768 AVX-512 tail-residue x multi-block case to the permanent suite. * perf: parallel finite validation and scratch-based rank encode assert_all_finite paid a full serial pass per add/search batch — measured ~0.1s per GiB, twice per ingest batch counting the caller layer. Scans of 1M+ floats now split across the rayon pool (4.4x measured). RankQuant::add's per-row closure allocated a fresh ranks Vec per vector inside the parallel loop; for_each_init now reuses a per-worker scratch via rank_transform_into. Measured on the 1.26M x 1024 corpus slice: encode-path attribution 0.097s serial scan -> 0.022s parallel; alloc churn removed from the hot loop. * perf: reduce collector boundary test to a cached worst-bound compare Doc ids visit each per-query heap strictly ascending, so a candidate tying the worst kept hamming always loses the (hamming, doc_id) tie-break — once the collector is full, the accept test is exactly 'hamming < worst kept hamming'. Cache that bound in a register-friendly u32 (u32::MAX while filling) and skip the heap peek + tuple compare on the ~99.8% reject path. Bit-identical by construction; pinned by the tie-heavy and duplicate-run oracle suites. * perf: LUT + parallel constant-composition check on RankQuant load load_rankquant's forged-buffer defense histogrammed every packed code serially — 1.29 billion shift/mask ops at 1.26M x 1024, ~1s of the 1.27s verified open. A 4KB per-byte bucket-count LUT replaces the per-code inner loop and rows validate in parallel; find_first keeps the lowest-offending-row error contract, with a scalar recheck producing the identical message. The security property is unchanged: every row still proves uniform composition before the index is usable. * docs: changelog perf entries and 0.6.0 downstream un-patch checklist CHANGELOG Unreleased gains the measured perf work merged to integration/full-stack: tiled streaming sign candidate generation + cached collector worst-bound (bit-identical internals swap; downstream batched search 220 -> 10.2k q/s at 1.26M x 1024), parallel finite validation + scratch rank encode (0.097s -> 0.022s attribution), and the LUT + parallel constant-composition load check (verified open 1.27s -> 0.38s). RELEASING gains a one-time pre-publish item: remove OrdinalDB's [patch.crates-io] block pointing at integration/full-stack when 0.6.0 publishes. * feat: index-ceiling CLI parity and zero-size shape checks (CIPHER-04) Expose --max-index-artifact-bytes on the ordvec-manifest CLI LimitArgs, wiring it to ResourceLimits::max_index_artifact_bytes so the opt-in primary-artifact read ceiling reaches feature parity with the existing --max-auxiliary-artifact-bytes flag. Close the deferred CIPHER-04 reason-code symmetry: validate_manifest_shape now rejects a zero manifest-declared artifact.file_size_bytes (artifact_file_size_zero) and validate_auxiliary_artifact_shape rejects zero-size declarations on required auxiliary artifacts (auxiliary_artifact_file_size_zero), mirroring the calibration and encoder-distortion *_file_size_zero checks. Optional artifacts keep the established zero-size absent-placeholder convention. * fix: refuse non-regular artifact files before hashing Security-review remediation (fleet CIPHER-001): with derived read bounds, a FIFO inside the bundle directory would block verification forever — File::open on a reader-less FIFO blocks, and a device node would stream without EOF under a large declared size. Stat the path before opening and refuse anything that is not a regular file, at every hashing call site (create, verify, sqlite cache). Regular files terminate at EOF and remain post-checked against the declaration. Unix regression test uses a real FIFO. * docs: scope the serial CSR contract to scan and selection Security-review note (fleet CIPHER-002): parallel finite validation introduced in the encode train transitively touches the global rayon pool from inside the 'serial' CSR primitive. The serial guarantee is about candidate scan/selection ownership, not input validation; say so explicitly. * fix: retry interrupted reads in bounded streaming hash Bot-review remediation (Qodo, PR #277): the streaming loop surfaced ErrorKind::Interrupted as a hash failure, a reliability regression vs the previous read_to_end which retried EINTR internally. * fix: assert whole-row query buffers in the streamed core Bot-review remediation (Qodo, PR #278): the shared core derived nq by integer division; a ragged buffer from a future caller would silently truncate. All current callers validate upstream — this is the cheap in-core invariant. * perf: transpose-tree horizontal reduction in the batched sign kernel The AVX-512 batched scan paid eight serial _mm512_reduce_add_epi64 expansions per doc-chunk — roughly a third of per-doc cycles at dim=1024 (2 lanes) going to reduction rather than XOR+POPCNT work. An unpack/permute/shuffle tree folds all eight accumulators into one vector of sums (~25 ops replacing ~50), stored via one stack spill. Tail path (batch % 8) keeps the per-accumulator reduce. Bit-identical: pinned by the AVX-512-vs-scalar parity tests and the oracle suites. * docs: scope the serial-contract claim in the tiled candgen entry External-audit remediation: the entry claimed 'no rayon' unqualified; finite validation on large buffers may use the global pool (documented on the method), and top_m_candidates_batched is explicitly out of scope of the internals swap. * bench: regenerate committed synthetic results at the 0.6.0 heads two_stage_caller_owned_dim1024: stage-1 candidate generation 159.60 -> 94.60 us/query (1.69x), full two-stage 172.42 -> 103.75 us/query (1.66x) — same command, host, core pinning, and toolchain family; verified code-only by an A/B against main on the same day/machine (main reproduced the old numbers within 3%). rank_modes: single-query latency rows are intentionally unchanged by the batch rework (verified identical-within-noise main vs heads) and carry a refresh note saying so; encode columns reflect the parallel validation + scratch encode work. Quality columns bit-identical throughout. * docs: refresh README benchmarks at the 0.6.0 heads All figures and numbers regenerated by the committed make benchmark-beir pipeline on the same host class (9950X). Quality: nDCG within bootstrap noise of exact on both datasets, sign-rq2 trec-covid 0.7638 unchanged (deterministic selection held bit-identical through the perf train). Single-query hero effectively unchanged (52.4 ms flat vs 0.52 ms sign-rq2, ~101x) — that lane was intentionally untouched. Batched 1-thread view improves to ~10-12x over batched flat (once-per-call corpus streaming); threaded view: HNSW still leads, margin narrowed from ~2.3x to 1.6x over sign-rq2 (1.2x over bitmap-rq2). Build 47.1s vs 0.21s. No larger-corpus claims added. * docs: transcribe the refreshed hnsw nDCG in the tradeoff table * fix: land the EINTR retry on the bounded hash and bump to 0.6.0 CI caught both halves: the retry from the Qodo remediation had landed on sha256_file (where the io::Error conversion is a useless-conversion lint under current clippy) instead of sha256_file_bounded, the function actually flagged — the retry now lives on both, with the conversion only where the error type changes. And the release-publish invariant correctly refuses [Unreleased] changelog content at an already-released version: this stack is the 0.6.0 work, so ordvec and ordvec-manifest now say 0.6.0 (minor: limit-semantics change, additive APIs, behavioral perf changes). * chore: track 0.6.0 in the fuzz workspace lock * chore: lockstep all workspace member versions at 0.6.0 The release-publish SBOM invariant requires member package versions in lockstep with the root; ordvec-ffi and both python bindings follow the 0.6.0 bump. * chore: complete the 0.6.0 release shape Closes the remaining release-publish invariant layers, verified by running tests/release_publish_invariants.py locally to a clean exit: pyproject + __init__ versions in lockstep, the changelog cut as a dated 0.6.0 section (invariant convention: the current version always has a dated section; [Unreleased] stays empty), THREAT_MODEL status line at v0.6.0, and the README quickstart installing 0.6. * fix: checked selection-state bounds in the streamed candidate core Bot-review remediation (Qodo, PR #278): nq * m_eff can overflow usize on 32-bit/wasm32 targets, and the CSR wrapper's saturating_mul would attempt a usize::MAX allocation. Both sites now use checked multiplication with a clear tile-the-batch message, matching the crate's checked-allocation discipline. The exact m_eff + 1 heap reservation is kept deliberately: gradual growth double-allocates to the next power of two (~2x peak per query) — the reservation is the memory-optimal choice, now documented. * fix: single-snapshot create hashing, strict read bounds, zero-size primary shape check Bot-review remediation (Qodo, #283 inline): - create_manifest_for_index_with_options observed the index size twice (probe, then a separate stat for the hash bound) — a concurrent writer could produce a manifest whose size and digest describe different bytes. The hash is now bounded by the probe's size, the manifest records the byte count actually hashed, and any disagreement fails loudly. - sha256_file_bounded could read (not hash) up to one 64KiB chunk past the bound; reads now clamp to max_bytes + 1, mirroring read_bounded_file's take() pattern. - validate_manifest_shape gains artifact_file_size_zero for the primary artifact, matching the profile artifacts' explicit zero rejection instead of surfacing a confusing artifact_file_too_large. * perf: build query bitmaps in place in the streamed core Bot-review remediation (Qodo, #283 inline): build_query_bitmap allocated a fresh Vec and re-validated finiteness per query; the entry points already validate the whole buffer and the destination is preallocated. Oracle suites pin bit-identical output. * fix: apply the index ceiling on the create path Bot-review remediation (Qodo, #282): --max-index-artifact-bytes wired into ResourceLimits but the create path bounded the primary hash by the probed size alone — the opt-in ceiling was ineffective for create, unlike auxiliary artifacts. Create now mirrors verify: declared/observed size min explicit ceiling.
…#282) Expose --max-index-artifact-bytes on the ordvec-manifest CLI LimitArgs, wiring it to ResourceLimits::max_index_artifact_bytes so the opt-in primary-artifact read ceiling reaches feature parity with the existing --max-auxiliary-artifact-bytes flag. Close the deferred CIPHER-04 reason-code symmetry: validate_manifest_shape now rejects a zero manifest-declared artifact.file_size_bytes (artifact_file_size_zero) and validate_auxiliary_artifact_shape rejects zero-size declarations on required auxiliary artifacts (auxiliary_artifact_file_size_zero), mirroring the calibration and encoder-distortion *_file_size_zero checks. Optional artifacts keep the established zero-size absent-placeholder convention.
|
Merged to main via #283's squash ( |
Summary
file_size_bytes; the manifest itself stays hard-capped at 1 MiB and SHA-256 pins content. Create path bounds reads by the artifact's observed size.ResourceLimitsbyte caps (max_auxiliary_artifact_bytes,max_calibration_profile_bytes,max_encoder_distortion_profile_bytes) become opt-in ceilings, default unbounded. Explicitly configured caps behave exactly as before.artifact_file_too_largereason code (fail-fast on grown artifacts instead of digest-mismatch after hashing the excess).sha256_file_boundedstreams with a 64 KiB buffer — constant memory at any artifact size (previously materialised the whole file).Why
The 64 MiB auxiliary default made legitimate sign sidecars impossible to persist past 524,288 rows at dim=1024 (
sign.ovsb= rows × dim/8). Measured on a 1,258,135-row × 1024-dim corpus:write_verified_bundlefailed with default options. A security bound meant for hostile foreign input was applied to self-written artifacts.Hostile-input posture (what precisely changed)
Memory safety and primary-artifact bounding improved; default verification time/IO on hostile bundles is now deployment-bounded via the opt-in
ResourceLimitsceilings (the old flat 64 MiB default also bounded attacker-supplied I/O — that bound is now a knob, documented in THREAT-QUERY-003). Note:file_size_bytesis a required manifest field — v1 manifests without it fail deserialization; there is no fallback path.auxiliary_artifact_file_size_mismatch; grown artifact → fail-fast*_file_too_large; truncation → size mismatch. All covered by new tests.VerifiedLoadPlanis a snapshot, not a byte pin.Tests
tests/derived_limits.rs: >64 MiB roundtrip at pure defaults, grown/truncated/inflated-declaration cases, explicit-cap back-compat, primary-artifact bound.artifact_file_too_large); corruption coverage preserved by switching that test to in-place corruption.Part of the 1M-row release train (Track A1 of the locked master plan).