[fix](nereids) Do not push offset-only access path for CHAR columns#62854
[fix](nereids) Do not push offset-only access path for CHAR columns#62854BiteTheDDDDt wants to merge 2 commits intoapache:masterfrom
Conversation
### What problem does this PR solve? Issue Number: close #xxx Related PR: apache#62205 Problem Summary: PR apache#62205 optimizes `length(str_col)` by pushing an OFFSET access path to BE so the chars buffer can be skipped. The optimization predicate is `isStringLikeType()`, which incorrectly includes CHAR(N). CHAR is stored padded to its declared length on BE (see OlapColumnDataConvertorChar::clone_and_padding). The per-row length recorded in dict word info / page headers is therefore always the padded length (e.g. 25 for CHAR(25)), never the logical length expected by length(). Recovering the logical length requires scanning the chars buffer with strnlen() (shrink_padding_chars), which OFFSET_ONLY by definition skips. In addition, when SegmentIterator runs shrink_char_type_column_suffix_zero on the CHAR column, it reads the (only-resized, never-initialized) chars buffer with strnlen and rewrites the offsets, corrupting them with non-deterministic values. As a result, `select length(c) from t` on a CHAR column returns garbage (mix of 0 / random small numbers / padded length 25), depending on what was previously in heap memory. Fix: 1. FE: in AccessPathExpressionCollector.visitLength, skip the OFFSET suffix when the resolved leaf type is CHAR. Because the check is on `arg.getDataType()` (the type after struct field / map subscript / array index resolution), a single guard covers top-level CHAR as well as nested CHAR (struct.char_field, map<X,char>['k'], array<char>[i]). 2. BE: in StringFileColumnIterator::set_access_paths, hard-fail with Status::InternalError when an OFFSET access path is received for a CHAR column. This catches any future planner regression that breaks the contract above. A small protected accessor get_reader() is added to FileColumnIterator so the derived string iterator can inspect the storage field type. VARCHAR / STRING are not padded, so OFFSET_ONLY remains correct for them. Other types (VARBINARY, JSONB, HLL, BITMAP, Variant) are not `isStringLikeType()`, so FE never emits OFFSET for them. Container length / cardinality (length(arr) / cardinality(map)) use their own offset stream as canonical metadata and are unaffected. ### Release note Fix wrong / non-deterministic result of length() on CHAR columns when no other column from the same row is projected (e.g. `select length(c) from t` or `select length(c), count(*) from t group by length(c)`). ### Check List (For Author) - Test: Manual test (mysql client, repro from a 200-row CHAR(25) table). No regression test added in this PR; will follow up. - Behavior changed: Yes (length(CHAR) now returns the correct logical length; previously it could return 0, the padded length, or other garbage). - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
Pull request overview
Fixes a correctness bug introduced by the length(str_col) OFFSET-only pruning optimization by ensuring CHAR columns never use OFFSET-only access paths (since CHAR storage is padded and logical length cannot be recovered from offsets alone).
Changes:
- FE: Update
AccessPathExpressionCollector.visitLengthto avoid emitting theOFFSETsuffix when the resolved leaf type isCHAR. - BE: Add a protected
FileColumnIterator::get_reader()accessor and makeStringFileColumnIterator::set_access_pathshard-fail if it receivesOFFSET_ONLYfor a CHAR column.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/AccessPathExpressionCollector.java | Excludes CHAR from the length() OFFSET-only access path emission to preserve correctness. |
| be/src/storage/segment/column_reader.h | Exposes ColumnReader to derived iterators via a protected accessor for metadata checks. |
| be/src/storage/segment/column_reader.cpp | Enforces FE/BE contract by rejecting OFFSET_ONLY on CHAR at runtime (InternalError). |
Comments suppressed due to low confidence (1)
fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/AccessPathExpressionCollector.java:176
- Please add/extend FE unit coverage to lock in this CHAR-specific behavior (i.e., ensure
length(char_col)does NOT emit an OFFSET access path, whilelength(varchar/string)still does). There are existing nested column pruning tests underfe/fe-core/src/test/java/org/apache/doris/nereids/rules/rewrite/PruneNestedColumnTest.java, but none currently assert this CHAR exclusion, so this could regress silently.
if (arg.getDataType().isStringLikeType() && !arg.getDataType().isCharType()
&& context.accessPathBuilder.isEmpty()) {
CollectorContext offsetContext =
new CollectorContext(context.statementContext, context.bottomFilter);
offsetContext.accessPathBuilder.addSuffix(AccessPathInfo.ACCESS_STRING_OFFSET);
return arg.accept(this, offsetContext);
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (read_offset_only() && get_reader() != nullptr && | ||
| get_reader()->get_meta_type() == FieldType::OLAP_FIELD_TYPE_CHAR) { | ||
| return Status::InternalError( | ||
| "OFFSET_ONLY access path is not supported on CHAR column '{}': CHAR is stored " | ||
| "padded so the per-row length information available without reading the chars " | ||
| "buffer is always the padded length, not the logical length. The FE planner " | ||
| "must not emit an OFFSET access path for CHAR columns.", | ||
| _column_name); |
There was a problem hiding this comment.
Please add a BE unit test to cover this new contract enforcement (OFFSET_ONLY + CHAR => InternalError). There are already iterator access-path tests in be/test/storage/segment/column_reader_test.cpp, so adding a regression case here would prevent accidental reintroduction of OFFSET paths for CHAR (whether via planner regressions or future refactors).
There was a problem hiding this comment.
Found 2 blocking issues.
- Mixed-version FE/BE compatibility is broken by the new BE-side hard error: an old FE still emits [col, OFFSET] for length(char_col), so upgrading BE first turns this from a wrong-result bug into a query failure.
- The fix ships without automated coverage: there is no FE unit test for the CHAR access-path change and no regression test for the SQL reproduction in the PR description.
Critical checkpoints:
- Goal of current task: prevent wrong or non-deterministic length(CHAR) results when only the length is projected. The FE-side change addresses the fully upgraded path, but the BE-side hard failure makes mixed-version execution fail instead. No automated test currently proves the fix.
- Change scope: small and focused.
- Concurrency: no new concurrency or locking concerns in the touched code.
- Lifecycle or static init: no special lifecycle issues introduced.
- Configuration: no new configs.
- Compatibility: not satisfied. This PR changes the FE/BE contract for OFFSET access paths on CHAR without a smooth-upgrade or backward-compatible path.
- Parallel code paths: the updated visitLength() check appears to cover nested CHAR leaves as intended; I did not find another new planner path in this PR that still purposely sends OFFSET for CHAR.
- Special conditionals: the CHAR exclusion itself is well justified in comments.
- Test coverage: insufficient. Manual verification only; missing FE unit coverage and regression coverage for the SQL repro.
- Test outputs: no new or updated test outputs.
- Observability: the BE error text is clear if triggered.
- Transaction, persistence, data writes, and FE-BE variable passing: not applicable in the touched code.
- Performance: for CHAR, reading the chars buffer is necessary for correctness; I did not find a separate performance regression beyond that.
- Other issues: none beyond the two findings above.
User focus:
- No additional user-provided review focus was supplied, so there were no extra focus-specific findings beyond the full review.
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
What problem does this PR solve?
Related PR: #62205
Problem Summary
PR #62205 optimizes
length(str_col)by pushing an OFFSET access path to BE so the chars buffer can be skipped. The optimization predicate isisStringLikeType(), which incorrectly includes CHAR(N).CHAR is stored padded to its declared length on BE (see
OlapColumnDataConvertorChar::clone_and_padding). The per-row length recorded in dict word info / page headers is therefore always the padded length (e.g. 25 for CHAR(25)), never the logical length expected bylength(). Recovering the logical length requires scanning the chars buffer withstrnlen()(shrink_padding_chars), which OFFSET_ONLY by definition skips.In addition, when
SegmentIteratorrunsshrink_char_type_column_suffix_zeroon the CHAR column, it reads the (only-resized, never-initialized) chars buffer withstrnlenand rewrites the offsets, corrupting them with non-deterministic values. As a result,select length(c) from ton a CHAR column returns garbage (mix of0/ random small numbers / padded length25), depending on what was previously in heap memory.Reproduction
Fix
AccessPathExpressionCollector.visitLength, skip the OFFSET suffix when the resolved leaf type is CHAR. Because the check is onarg.getDataType()(the type after struct field / map subscript / array index resolution), a single guard covers top-level CHAR as well as nested CHAR (struct.char_field,map<X,char>['k'],array<char>[i]).StringFileColumnIterator::set_access_paths, hard-fail withStatus::InternalErrorwhen an OFFSET access path is received for a CHAR column. This catches any future planner regression that breaks the contract above. A small protected accessorget_reader()is added toFileColumnIteratorso the derived string iterator can inspect the storage field type.VARCHAR / STRING are not padded, so OFFSET_ONLY remains correct for them. Other types (VARBINARY, JSONB, HLL, BITMAP, Variant) are not
isStringLikeType(), so FE never emits OFFSET for them. Container length / cardinality (length(arr)/cardinality(map)) use their own offset stream as canonical metadata and are unaffected.Release note
Fix wrong / non-deterministic result of
length()on CHAR columns when no other column from the same row is projected (e.g.select length(c) from torselect length(c), count(*) from t group by length(c)).Check List (For Author)
length(CHAR)now returns the correct logical length; previously it could return0, the padded length, or other garbage.