[WIP][tantivy] Fix unnecessary .store file generation in full-text index #7670
chenghuichen wants to merge 3 commits into apache:master
Conversation
    (addr.segment_ord, addr.doc): score
    for score, addr in scored_results.hits
}
This fallback looks like a potentially large performance regression for broad queries. We only need the row_id for the top-limit hits in scored_results, but this second search asks tantivy-py to collect up to searcher.num_docs matches ordered by row_id. For a common term on a large shard, a limit=10 lookup can now degenerate into scanning/materializing almost the full match set just to recover 10 ids. Could we keep row_id stored until batch fast-field access is available in the shipped tantivy-py version, or add a direct fast-field read path instead?
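To make the cost concern concrete, here is a minimal pure-Python model of the two lookup strategies. It does not call tantivy-py; `row_id_column`, `matching_addrs`, and both helper functions are hypothetical stand-ins, and the "touched" counts are only a rough proxy for the work each path does.

```python
def row_ids_via_full_scan(matching_addrs, row_id_column, top_hits):
    """Fallback-style lookup: materialize every match, then pick the top hits."""
    # Collect (row_id, addr) for the whole match set, ordered by row_id,
    # mimicking a second search whose limit is the number of docs.
    collected = sorted((row_id_column[addr], addr) for addr in matching_addrs)
    addr_to_row_id = {addr: row_id for row_id, addr in collected}
    return [addr_to_row_id[addr] for addr in top_hits], len(collected)


def row_ids_via_direct_read(row_id_column, top_hits):
    """Direct lookup: read the column only at the addresses of the top hits."""
    return [row_id_column[addr] for addr in top_hits], len(top_hits)


# Toy shard (assumption): one segment, 100_000 docs, all matching a common term.
num_docs = 100_000
row_id_column = {(0, doc): 1_000_000 + doc for doc in range(num_docs)}
matching_addrs = list(row_id_column)
top_hits = [(0, doc) for doc in range(0, 50, 5)]  # stand-in for the limit=10 hits

scan_ids, scan_touched = row_ids_via_full_scan(matching_addrs, row_id_column, top_hits)
direct_ids, direct_touched = row_ids_via_direct_read(row_id_column, top_hits)
```

Both paths return the same ten row ids, but in this model the fallback touches every matching document while the direct read touches only the ten hits.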
Thanks for the review! Based on feedback from the tantivy-py community, v0.26 with fast-field access should be released in a few weeks (see quickwit-oss/tantivy-py#641). I think it makes sense to hold this PR for now — I'll track the community progress and ping you once the code is updated.
Purpose
Tantivy is used purely as an inverted index in Paimon, so .store files (raw field values) are never needed or read. The original implementation mistakenly set row_id as stored=True, wasting 30% or more of the archive size per index file.
This PR removes .set_stored() from the schema, filters out .store files when packing the archive, and updates the Python reader accordingly.
Tests
Re-using JavaPyTantivyE2ETest#testTantivyFullTextIndexWrite + test_read_tantivy_full_text_inde
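As a rough illustration of the archive-packing change described under Purpose, the sketch below selects which files of an index directory would be packed, skipping anything ending in .store. The helper `files_to_pack`, the `EXCLUDED_SUFFIXES` set, and the `seg0.*` file names are hypothetical; the actual packing code in this PR may be structured quite differently.

```python
import tempfile
from pathlib import Path

# Assumption: only the doc store is excluded; other segment files are kept.
EXCLUDED_SUFFIXES = {".store"}


def files_to_pack(index_dir):
    """Return the index files to include in the archive, sorted by path."""
    return sorted(
        p
        for p in Path(index_dir).iterdir()
        if p.is_file() and p.suffix not in EXCLUDED_SUFFIXES
    )


# Build a throwaway directory that mimics a tiny index layout.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("meta.json", "seg0.term", "seg0.idx", "seg0.fast", "seg0.store"):
        (Path(tmp) / name).write_bytes(b"")
    packed_names = [p.name for p in files_to_pack(tmp)]
```

Filtering by suffix keeps the rule in one place, so the reader side only has to stop expecting a .store entry in the archive.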