[WIP][tantivy] Fix unnecessary .store file generation in full-text index#7670

Open
chenghuichen wants to merge 3 commits into apache:master from chenghuichen:tantivy-fix

@chenghuichen chenghuichen commented Apr 19, 2026

Purpose

Tantivy is used purely as an inverted index in Paimon, so .store files (raw field values) are never needed or read. The original implementation mistakenly set row_id as stored=True, which wastes 30% or more of the archive size of each index file.

This PR removes .set_stored() from the schema, filters out .store files when packing the archive, and updates the Python reader accordingly.
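
As a rough illustration of the packing-side filter, here is a minimal sketch that assumes the index directory is packed into a plain tar archive. The function name `pack_index_dir`, the tar format, and the file names are hypothetical and are not taken from the Paimon code.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_index_dir(index_dir: Path, archive_path: Path) -> list[str]:
    """Pack index files into a tar archive, skipping .store files.

    Hypothetical helper: the real archive format used by Paimon may differ.
    """
    packed = []
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(index_dir.iterdir()):
            # Skip directories and the doc-store files that are never read.
            if not path.is_file() or path.suffix == ".store":
                continue
            tar.add(path, arcname=path.name)
            packed.append(path.name)
    return packed


# Usage: build a fake index directory and pack it.
with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp) / "index"
    index_dir.mkdir()
    for name in ("meta.json", "seg0.idx", "seg0.term", "seg0.store"):
        (index_dir / name).write_bytes(b"x")
    packed = pack_index_dir(index_dir, Path(tmp) / "index.tar")
    print(packed)
```

Filtering at pack time is only safe once no reader asks for stored fields, which is why the reader change has to ship together with it.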

Tests

Re-using JavaPyTantivyE2ETest#testTantivyFullTextIndexWrite + test_read_tantivy_full_text_inde

(addr.segment_ord, addr.doc): score
for score, addr in scored_results.hits
}

This fallback looks like a potentially large performance regression for broad queries. We only need the row_id for the top-limit hits in scored_results, but this second search asks tantivy-py to collect up to searcher.num_docs matches ordered by row_id. For a common term on a large shard, a limit=10 lookup can now degenerate into scanning/materializing almost the full match set just to recover 10 ids. Could we keep row_id stored until batch fast-field access is available in the shipped tantivy-py version, or add a direct fast-field read path instead?
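
To make the cost concern concrete, here is a toy model in plain Python. It does not use the tantivy-py API: the list `row_id_column` stands in for a fast field, and both function names are hypothetical. It only counts how many documents each strategy has to touch to recover the row ids of the top hits.

```python
import heapq


def row_ids_via_fast_field(hits, row_id_column):
    """Direct path: read row_id only for the top hits (O(limit) reads)."""
    mapping = {doc: row_id_column[doc] for _, doc in hits}
    return mapping, len(hits)


def row_ids_via_second_search(matching_docs, row_id_column, hits):
    """Fallback path: collect every match ordered by row_id, then pick the hits."""
    collected = sorted(matching_docs, key=lambda d: row_id_column[d])
    wanted = {doc for _, doc in hits}
    mapping = {d: row_id_column[d] for d in collected if d in wanted}
    return mapping, len(collected)


num_docs = 100_000
limit = 10
row_id_column = [doc + 1_000_000 for doc in range(num_docs)]  # fake fast field
matching_docs = range(0, num_docs, 2)  # a common term matching half the shard


def score(doc):
    # Deterministic stand-in for a relevance score.
    return (doc * 7919) % 10007 / 10007


hits = heapq.nlargest(limit, ((score(d), d) for d in matching_docs))

direct, direct_reads = row_ids_via_fast_field(hits, row_id_column)
fallback, fallback_reads = row_ids_via_second_search(matching_docs, row_id_column, hits)
print(direct_reads, fallback_reads)
```

Both paths return the same mapping, but in this model the fallback touches every matching document while the direct path touches only `limit` of them.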

@chenghuichen chenghuichen Apr 21, 2026

Thanks for the review! Based on feedback from the tantivy-py community, v0.26 with fast-field access should be released in a few weeks (see quickwit-oss/tantivy-py#641). I think it makes sense to hold this PR for now — I'll track the community progress and ping you once the code is updated.

@chenghuichen chenghuichen changed the title [tantivy] Fix unnecessary .store file generation in full-text index [WIP][tantivy] Fix unnecessary .store file generation in full-text index Apr 21, 2026