[WIP][tantivy] Fix unnecessary .store file generation in full-text index by chenghuichen · Pull Request #7670 · apache/paimon

chenghuichen · 2026-04-19T10:54:54Z

Purpose

Tantivy is used purely as an inverted index in Paimon, so .store files (raw field values) are never needed or read. The original implementation mistakenly marked row_id as stored (stored=True), which wastes 30% or more of the archive size per index file.

This PR removes .set_stored() from the schema, filters out .store files when packing the archive, and updates the Python reader accordingly.
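The archive-side filtering can be illustrated with a minimal sketch. It is not the PR's actual packing code: the function name `files_to_pack` and the directory-based layout are hypothetical, and it only assumes that Tantivy's doc-store segment files carry a `.store` suffix, as the description above implies.

```python
from pathlib import Path
from typing import List

# Assumption: doc-store segment files are identified by the ".store" suffix.
SKIPPED_SUFFIXES = (".store",)


def files_to_pack(index_dir: Path) -> List[Path]:
    """Return the index files to archive, sorted by path, omitting .store files.

    Hypothetical helper for illustration only; the real packer may live in a
    different language and use a different archive layout.
    """
    return sorted(
        p
        for p in index_dir.iterdir()
        if p.is_file() and p.suffix not in SKIPPED_SUFFIXES
    )
```

Filtering at pack time is a safety net on top of the schema change: even if a segment still produces a `.store` file, it never reaches the archive.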

Tests

Re-using JavaPyTantivyE2ETest#testTantivyFullTextIndexWrite + test_read_tantivy_full_text_inde

jerry-024 · 2026-04-21T03:30:09Z

+            (addr.segment_ord, addr.doc): score
+            for score, addr in scored_results.hits
+        }
+


This fallback looks like a potentially large performance regression for broad queries. We only need the row_id for the top-limit hits in scored_results, but this second search asks tantivy-py to collect up to searcher.num_docs matches ordered by row_id. For a common term on a large shard, a limit=10 lookup can now degenerate into scanning/materializing almost the full match set just to recover 10 ids. Could we keep row_id stored until batch fast-field access is available in the shipped tantivy-py version, or add a direct fast-field read path instead?
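The cost difference described above can be sketched with a small pure-Python model. It does not call tantivy-py. The dictionaries stand in for the match set and for a per-document row_id column, and both function names are hypothetical. The `(segment_ord, doc)` address shape follows the diff excerpt.

```python
import heapq
from typing import Dict, List, Tuple

DocAddr = Tuple[int, int]  # (segment_ord, doc)


def fallback_lookup(
    matches: Dict[DocAddr, float], row_ids: Dict[DocAddr, int], limit: int
) -> Tuple[List[Tuple[int, float]], int]:
    """Model of the fallback: walk every match, ordered by row_id."""
    top = heapq.nlargest(limit, matches.items(), key=lambda kv: kv[1])
    wanted = dict(top)
    visited = 0
    result = []
    # The second pass touches the full match set just to recover `limit` ids.
    for addr in sorted(matches, key=row_ids.__getitem__):
        visited += 1
        if addr in wanted:
            result.append((row_ids[addr], wanted[addr]))
    return result, visited


def fast_field_lookup(
    matches: Dict[DocAddr, float], row_ids: Dict[DocAddr, int], limit: int
) -> Tuple[List[Tuple[int, float]], int]:
    """Model of a direct column read: one row_id lookup per top hit."""
    top = heapq.nlargest(limit, matches.items(), key=lambda kv: kv[1])
    result = [(row_ids[addr], score) for addr, score in top]
    return result, len(top)
```

In this model both paths return the same (row_id, score) pairs. The fallback visits every matching document, while the direct read visits only `limit` of them.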

chenghuichen · Apr 21, 2026 (edited)

Thanks for the review! Based on feedback from the tantivy-py community, v0.26 with fast-field access should be released in a few weeks (see quickwit-oss/tantivy-py#641). I think it makes sense to hold this PR for now. I'll track the community progress and ping you once the code is updated.

chenghuichen added 3 commits April 19, 2026 18:52

Fix unnecessary .store file generation in full-text index

bdb43c8

Fix unnecessary .store file generation in full-text index

86025b5

Fix unnecessary .store file generation in full-text index

da0833c

jerry-024 reviewed Apr 21, 2026


chenghuichen changed the title ~~[tantivy] Fix unnecessary .store file generation in full-text index~~ [WIP][tantivy] Fix unnecessary .store file generation in full-text index Apr 21, 2026

chenghuichen wants to merge 3 commits into apache:master from chenghuichen:tantivy-fix
