perf: LUT + parallel constant-composition check on RankQuant load #281

Open
Nelson Spence (Fieldnote-Echo) wants to merge 1 commit into main from perf/parallel-load-validation
@Fieldnote-Echo

Summary

load_rankquant's forged-buffer defense (the constant-composition histogram) ran serially over every packed code: 1.29 billion shift/mask ops at 1.26M × 1024, accounting for ~1.0s of the 1.27s verified open. Attribution: verify_for_load alone is 0.215s, so SHA-NI is fine; the loader was the cost.

  • 4KB per-byte bucket-count LUT replaces the per-code inner loop (dim ops/row → bytes_per_row lookups)
  • Rows validate in parallel; find_first preserves the lowest-offending-row contract, and a scalar recheck of that row produces the byte-identical error message
  • Security property unchanged: every row still proves uniform composition before the index is usable
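
To make the first bullet concrete, here is a minimal sketch of a per-byte bucket-count LUT. It is not the code in rank_io.rs. It assumes codes are 1, 2, or 4 bits wide and packed low bits first within each byte, and that "uniform composition" means every bucket holds exactly dim / 2^bits codes. The names `Lut`, `build_lut`, `row_histogram`, and `row_is_uniform` are hypothetical.

```rust
/// Buckets reserved per byte (enough for 4-bit codes). 256 x 16 u8 entries = 4 KiB.
const MAX_BUCKETS: usize = 16;
type Lut = [[u8; MAX_BUCKETS]; 256];

/// Build the table: lut[byte][bucket] = how many codes packed in `byte` equal `bucket`.
/// Assumption: `bits` is 1, 2, or 4, so a code never straddles a byte boundary.
fn build_lut(bits: u32) -> Lut {
    assert!(matches!(bits, 1 | 2 | 4), "sketch only handles 1, 2, or 4 bits per code");
    let mask = (1u32 << bits) - 1;
    let codes_per_byte = 8 / bits;
    let mut lut = [[0u8; MAX_BUCKETS]; 256];
    for byte in 0..256u32 {
        for slot in 0..codes_per_byte {
            let code = (byte >> (slot * bits)) & mask;
            lut[byte as usize][code as usize] += 1;
        }
    }
    lut
}

/// Histogram one packed row with one table lookup per byte
/// instead of one shift/mask per code.
fn row_histogram(row: &[u8], lut: &Lut) -> [u32; MAX_BUCKETS] {
    let mut hist = [0u32; MAX_BUCKETS];
    for &byte in row {
        for (h, &c) in hist.iter_mut().zip(lut[byte as usize].iter()) {
            *h += u32::from(c);
        }
    }
    hist
}

/// Assumed definition of constant composition: each of the 2^bits buckets
/// holds exactly dim / 2^bits codes (dim is taken to be divisible by 2^bits).
fn row_is_uniform(row: &[u8], bits: u32, lut: &Lut) -> bool {
    let n_buckets = 1usize << bits;
    let dim = row.len() * (8 / bits as usize);
    let expected = (dim / n_buckets) as u32;
    row_histogram(row, lut)[..n_buckets].iter().all(|&c| c == expected)
}
```

With 2-bit codes, a row of `[0b11_10_01_00, 0b00_01_10_11]` passes under these assumptions, while `[0x00, 0b11_10_01_00]` fails because bucket 0 is over-represented.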

Expected verified-open: 1.27s → ~0.3s at 1.26M×1024 (measured number follows in the integration rerun).
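
The op counts behind these figures can be reproduced with a few lines of arithmetic. This is a sketch: the PR gives the row count only as "1.26M" and does not state the code width, so the exact row count and the `bits` parameter here are assumptions, and both function names are hypothetical.

```rust
/// Shift/mask operations in the original per-code loop: one per packed code.
fn per_code_ops(rows: u64, dim: u64) -> u64 {
    rows * dim
}

/// Table lookups with a per-byte LUT: one per packed byte.
/// `bits` (bits per code) is an assumption; the PR does not state it.
fn per_byte_lookups(rows: u64, dim: u64, bits: u64) -> u64 {
    let bytes_per_row = (dim * bits + 7) / 8; // round up to whole bytes
    rows * bytes_per_row
}
```

Taking exactly 1,260,000 rows at dim 1024, `per_code_ops` gives 1,290,240,000, which matches the quoted 1.29 billion. With 2-bit codes, `per_byte_lookups` gives 322,560,000. Each lookup still adds a small vector of bucket counts, so the lookup count is a lower bound on the remaining work rather than a direct speedup ratio.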

Independent branch off main (touches only rank_io.rs). 250 tests green; clippy -D warnings and fmt clean.

load_rankquant's forged-buffer defense histogrammed every packed code
serially — 1.29 billion shift/mask ops at 1.26M x 1024, ~1s of the
1.27s verified open. A 4KB per-byte bucket-count LUT replaces the
per-code inner loop and rows validate in parallel; find_first keeps
the lowest-offending-row error contract, with a scalar recheck
producing the identical message. The security property is unchanged:
every row still proves uniform composition before the index is
usable.
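
The lowest-offending-row contract plus scalar recheck can be sketched as follows. This is not the PR's implementation: it swaps rayon's find_first for std::thread::scope over contiguous row chunks so the example stays dependency-free. It also assumes 2-bit codes (4 per byte), and the error text is invented for illustration. All function names are hypothetical.

```rust
use std::thread;

/// Scalar per-code histogram, assuming 2-bit codes packed 4 per byte.
fn scalar_histogram(row: &[u8]) -> [u32; 4] {
    let mut hist = [0u32; 4];
    for &byte in row {
        for slot in 0..4 {
            hist[((byte >> (slot * 2)) & 0b11) as usize] += 1;
        }
    }
    hist
}

/// 4 codes per byte over 4 buckets means each bucket should hold row.len() codes.
fn row_ok(row: &[u8]) -> bool {
    let expected = row.len() as u32;
    scalar_histogram(row).iter().all(|&c| c == expected)
}

/// Validate rows on `n_threads` workers and report the lowest failing row.
fn validate_rows(packed: &[u8], bytes_per_row: usize, n_threads: usize) -> Result<(), String> {
    assert!(bytes_per_row > 0, "bytes_per_row must be non-zero");
    let n_threads = n_threads.max(1);
    let n_rows = packed.len() / bytes_per_row;
    let rows_per_chunk = ((n_rows + n_threads - 1) / n_threads).max(1);

    let first_bad = thread::scope(|s| {
        let handles: Vec<_> = packed
            .chunks(rows_per_chunk * bytes_per_row)
            .enumerate()
            .map(|(chunk_idx, chunk)| {
                s.spawn(move || {
                    // First failing row inside this chunk, as a global row index.
                    chunk
                        .chunks_exact(bytes_per_row)
                        .position(|row| !row_ok(row))
                        .map(|local| chunk_idx * rows_per_chunk + local)
                })
            })
            .collect();
        // The minimum over per-chunk results is the lowest offending row overall.
        handles
            .into_iter()
            .filter_map(|h| h.join().expect("validation worker panicked"))
            .min()
    });

    match first_bad {
        None => Ok(()),
        Some(row_idx) => {
            // Scalar recheck of just that row, so the message is built the same
            // way a fully serial pass would build it.
            let start = row_idx * bytes_per_row;
            let hist = scalar_histogram(&packed[start..start + bytes_per_row]);
            let expected = bytes_per_row as u32;
            let (bucket, count) = hist
                .iter()
                .enumerate()
                .find(|&(_, &c)| c != expected)
                .map(|(b, &c)| (b, c))
                .expect("row was flagged as non-uniform");
            Err(format!("row {row_idx}: bucket {bucket} has {count} codes, expected {expected}"))
        }
    }
}
```

Unlike a short-circuiting parallel search, this sketch lets every chunk run to its own first failure before taking the minimum. That is simpler, but it does more work when an early row is bad.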
@qodo-code-review

PR Summary by Qodo

Speed up RankQuant load validation with LUT + parallel row checks

✨ Enhancement 🕐 20-40 Minutes

AI Description

• Replace per-code histogramming with a per-byte bucket-count LUT during RankQuant load validation.
• Validate packed rows in parallel while preserving “lowest offending row” error semantics.
• Recompute the first failing row scalar to keep the exact error message contract.
Diagram

graph TD
  A["load_rankquant_from_stream"] --> B["Build byte→bucket LUT"] --> C["Parallel row validate (rayon)"] --> D{"All rows valid?"}
  D -->|"Yes"| E["Return packed RankQuant"]
  D -->|"No"| F["Scalar recheck bad row"] --> G["Return exact error"]
High-Level Assessment

The following are alternative approaches to this PR:

1. Precompute LUT as const/static per bits
  • ➕ Avoids rebuilding the LUT on each load
  • ➕ Keeps the same algorithmic improvement without adding runtime setup work
  • ➖ More code complexity (const-fn/table generation, per-bits selection)
  • ➖ Less flexible if packing rules evolve
2. SIMD-accelerated scalar histogram (no rayon)
  • ➕ No parallel scheduling overhead; simpler determinism story
  • ➕ Potentially strong speedups with careful vectorization
  • ➖ Higher implementation complexity and portability concerns
  • ➖ Harder to preserve identical error behavior without a second pass anyway
3. Chunked parallelism over byte-slices (reduce per-row overhead)
  • ➕ Could reduce rayon iterator overhead by batching multiple rows per task
  • ➕ May improve scaling on large corpora
  • ➖ More complex work partitioning
  • ➖ Makes preserving ‘first failing row’ semantics trickier

Recommendation: The current approach is a good balance: the LUT removes the dominant inner-loop cost, rayon parallelizes an embarrassingly-parallel validation step, and the scalar recheck preserves the exact error contract. If further tuning is needed, consider making the LUT static per bits/codes_per_byte, but only if profiling shows LUT build time is material.
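
For reference, the "static per bits" option mentioned in the recommendation could look roughly like the sketch below. It is not part of this PR. It assumes code widths of 1, 2, or 4 bits packed low bits first, and `build_lut`, the `LUT_*` statics, and `lut_for` are hypothetical names.

```rust
/// Compile-time construction of a 256 x 16 per-byte bucket-count table.
/// `for` loops are not available in const fn, so plain `while` loops are used.
const fn build_lut(bits: u32) -> [[u8; 16]; 256] {
    let mask = (1u32 << bits) - 1;
    let codes_per_byte = 8 / bits;
    let mut lut = [[0u8; 16]; 256];
    let mut byte = 0usize;
    while byte < 256 {
        let mut slot = 0u32;
        while slot < codes_per_byte {
            let code = ((byte as u32) >> (slot * bits)) & mask;
            lut[byte][code as usize] += 1;
            slot += 1;
        }
        byte += 1;
    }
    lut
}

// One table per supported width, evaluated at compile time (4 KiB each).
static LUT_1BIT: [[u8; 16]; 256] = build_lut(1);
static LUT_2BIT: [[u8; 16]; 256] = build_lut(2);
static LUT_4BIT: [[u8; 16]; 256] = build_lut(4);

/// Select the table for a code width; None for widths this sketch does not cover.
fn lut_for(bits: u32) -> Option<&'static [[u8; 16]; 256]> {
    match bits {
        1 => Some(&LUT_1BIT),
        2 => Some(&LUT_2BIT),
        4 => Some(&LUT_4BIT),
        _ => None,
    }
}
```

The trade-off is the one listed above: no per-load setup, in exchange for extra table-selection code and 12 KiB of static data for three widths.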

Files changed (1) +34 / -2

Enhancement (1) +34 / -2

src/rank_io.rs: Optimize constant-composition validation with LUT + rayon find_first (+34/-2)

• Replaces per-code shift/mask histogramming with a 256×16 per-byte lookup table and validates each packed row in parallel. Uses rayon's find_first to preserve the lowest-offending-row contract, and reruns a scalar histogram on that row to keep the exact bucket/count error message unchanged.

@codecov

codecov Bot commented Jul 3, 2026

Codecov Report

❌ Patch coverage is 95.83333% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/rank_io.rs | 95.83% | 1 Missing ⚠️ |

@qodo-code-review

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)

Great, no issues found!

Qodo reviewed your code and found no material issues that require review
