[fix](nereids) Do not push offset-only access path for CHAR columns by BiteTheDDDDt · Pull Request #62854 · apache/doris

BiteTheDDDDt · 2026-04-27T04:09:11Z

What problem does this PR solve?

Related PR: #62205

Problem Summary

PR #62205 optimizes length(str_col) by pushing an OFFSET access path to BE so the chars buffer can be skipped. The optimization predicate is isStringLikeType(), which incorrectly includes CHAR(N).

CHAR is stored padded to its declared length on BE (see OlapColumnDataConvertorChar::clone_and_padding). The per-row length recorded in dict word info / page headers is therefore always the padded length (e.g. 25 for CHAR(25)), never the logical length expected by length(). Recovering the logical length requires scanning the chars buffer with strnlen() (shrink_padding_chars), which OFFSET_ONLY by definition skips.

In addition, when SegmentIterator runs shrink_char_type_column_suffix_zero on the CHAR column, it reads the (only-resized, never-initialized) chars buffer with strnlen and rewrites the offsets, corrupting them with non-deterministic values. As a result, select length(c) from t on a CHAR column returns garbage (mix of 0 / random small numbers / padded length 25), depending on what was previously in heap memory.

Reproduction

CREATE TABLE t (
    pk INT,
    c CHAR(25) NOT NULL
) DUPLICATE KEY(pk) DISTRIBUTED BY HASH(pk) BUCKETS 1
PROPERTIES('replication_num'='1');

INSERT INTO t VALUES (1,'a'), (2,'bb'), (3,'ccc');

-- Wrong (mix of 0 / 25 / random):
SELECT length(c) FROM t;
-- Correct (1, 2, 3):
SELECT c, length(c) FROM t;

Fix

FE: in AccessPathExpressionCollector.visitLength, skip the OFFSET suffix when the resolved leaf type is CHAR. Because the check is on arg.getDataType() (the type after struct field / map subscript / array index resolution), a single guard covers top-level CHAR as well as nested CHAR (struct.char_field, map<X,char>['k'], array<char>[i]).
BE: in StringFileColumnIterator::set_access_paths, hard-fail with Status::InternalError when an OFFSET access path is received for a CHAR column. This catches any future planner regression that breaks the contract above. A small protected accessor get_reader() is added to FileColumnIterator so the derived string iterator can inspect the storage field type.

VARCHAR / STRING are not padded, so OFFSET_ONLY remains correct for them. Other types (VARBINARY, JSONB, HLL, BITMAP, Variant) are not isStringLikeType(), so FE never emits OFFSET for them. Container length / cardinality (length(arr) / cardinality(map)) use their own offset stream as canonical metadata and are unaffected.

Release note

Fix wrong / non-deterministic result of length() on CHAR columns when no other column from the same row is projected (e.g. select length(c) from t or select length(c), count(*) from t group by length(c)).

Check List (For Author)

Test:
- Manual test (mysql client, repro from a 200-row CHAR(25) table)
- Regression Test — to be added in a follow-up
Behavior changed: Yes — length(CHAR) now returns the correct logical length; previously it could return 0, the padded length, or other garbage.
Does this need documentation: No

### What problem does this PR solve? Issue Number: close #xxx Related PR: apache#62205 Problem Summary: PR apache#62205 optimizes `length(str_col)` by pushing an OFFSET access path to BE so the chars buffer can be skipped. The optimization predicate is `isStringLikeType()`, which incorrectly includes CHAR(N). CHAR is stored padded to its declared length on BE (see OlapColumnDataConvertorChar::clone_and_padding). The per-row length recorded in dict word info / page headers is therefore always the padded length (e.g. 25 for CHAR(25)), never the logical length expected by length(). Recovering the logical length requires scanning the chars buffer with strnlen() (shrink_padding_chars), which OFFSET_ONLY by definition skips. In addition, when SegmentIterator runs shrink_char_type_column_suffix_zero on the CHAR column, it reads the (only-resized, never-initialized) chars buffer with strnlen and rewrites the offsets, corrupting them with non-deterministic values. As a result, `select length(c) from t` on a CHAR column returns garbage (mix of 0 / random small numbers / padded length 25), depending on what was previously in heap memory. Fix: 1. FE: in AccessPathExpressionCollector.visitLength, skip the OFFSET suffix when the resolved leaf type is CHAR. Because the check is on `arg.getDataType()` (the type after struct field / map subscript / array index resolution), a single guard covers top-level CHAR as well as nested CHAR (struct.char_field, map<X,char>['k'], array<char>[i]). 2. BE: in StringFileColumnIterator::set_access_paths, hard-fail with Status::InternalError when an OFFSET access path is received for a CHAR column. This catches any future planner regression that breaks the contract above. A small protected accessor get_reader() is added to FileColumnIterator so the derived string iterator can inspect the storage field type. VARCHAR / STRING are not padded, so OFFSET_ONLY remains correct for them. Other types (VARBINARY, JSONB, HLL, BITMAP, Variant) are not `isStringLikeType()`, so FE never emits OFFSET for them. Container length / cardinality (length(arr) / cardinality(map)) use their own offset stream as canonical metadata and are unaffected. ### Release note Fix wrong / non-deterministic result of length() on CHAR columns when no other column from the same row is projected (e.g. `select length(c) from t` or `select length(c), count(*) from t group by length(c)`). ### Check List (For Author) - Test: Manual test (mysql client, repro from a 200-row CHAR(25) table). No regression test added in this PR; will follow up. - Behavior changed: Yes (length(CHAR) now returns the correct logical length; previously it could return 0, the padded length, or other garbage). - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Thearas · 2026-04-27T04:09:17Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

BiteTheDDDDt · 2026-04-27T04:10:29Z

/review

Copilot

Pull request overview

Fixes a correctness bug introduced by the length(str_col) OFFSET-only pruning optimization by ensuring CHAR columns never use OFFSET-only access paths (since CHAR storage is padded and logical length cannot be recovered from offsets alone).

Changes:

FE: Update AccessPathExpressionCollector.visitLength to avoid emitting the OFFSET suffix when the resolved leaf type is CHAR.
BE: Add a protected FileColumnIterator::get_reader() accessor and make StringFileColumnIterator::set_access_paths hard-fail if it receives OFFSET_ONLY for a CHAR column.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/AccessPathExpressionCollector.java	Excludes CHAR from the length() OFFSET-only access path emission to preserve correctness.
be/src/storage/segment/column_reader.h	Exposes `ColumnReader` to derived iterators via a protected accessor for metadata checks.
be/src/storage/segment/column_reader.cpp	Enforces FE/BE contract by rejecting OFFSET_ONLY on CHAR at runtime (InternalError).

Comments suppressed due to low confidence (1)

fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/AccessPathExpressionCollector.java:176

Please add/extend FE unit coverage to lock in this CHAR-specific behavior (i.e., ensure length(char_col) does NOT emit an OFFSET access path, while length(varchar/string) still does). There are existing nested column pruning tests under fe/fe-core/src/test/java/org/apache/doris/nereids/rules/rewrite/PruneNestedColumnTest.java, but none currently assert this CHAR exclusion, so this could regress silently.

        if (arg.getDataType().isStringLikeType() && !arg.getDataType().isCharType()
                && context.accessPathBuilder.isEmpty()) {
            CollectorContext offsetContext =
                    new CollectorContext(context.statementContext, context.bottomFilter);
            offsetContext.accessPathBuilder.addSuffix(AccessPathInfo.ACCESS_STRING_OFFSET);
            return arg.accept(this, offsetContext);

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-27T04:13:41Z

+    if (read_offset_only() && get_reader() != nullptr &&
+        get_reader()->get_meta_type() == FieldType::OLAP_FIELD_TYPE_CHAR) {
+        return Status::InternalError(
+                "OFFSET_ONLY access path is not supported on CHAR column '{}': CHAR is stored "
+                "padded so the per-row length information available without reading the chars "
+                "buffer is always the padded length, not the logical length. The FE planner "
+                "must not emit an OFFSET access path for CHAR columns.",
+                _column_name);


Please add a BE unit test to cover this new contract enforcement (OFFSET_ONLY + CHAR => InternalError). There are already iterator access-path tests in be/test/storage/segment/column_reader_test.cpp, so adding a regression case here would prevent accidental reintroduction of OFFSET paths for CHAR (whether via planner regressions or future refactors).

github-actions

Found 2 blocking issues.

Mixed-version FE/BE compatibility is broken by the new BE-side hard error: an old FE still emits [col, OFFSET] for length(char_col), so upgrading BE first turns this from a wrong-result bug into a query failure.
The fix ships without automated coverage: there is no FE unit test for the CHAR access-path change and no regression test for the SQL reproduction in the PR description.

Critical checkpoints:

Goal of current task: prevent wrong or non-deterministic length(CHAR) results when only the length is projected. The FE-side change addresses the fully upgraded path, but the BE-side hard failure makes mixed-version execution fail instead. No automated test currently proves the fix.
Change scope: small and focused.
Concurrency: no new concurrency or locking concerns in the touched code.
Lifecycle or static init: no special lifecycle issues introduced.
Configuration: no new configs.
Compatibility: not satisfied. This PR changes the FE/BE contract for OFFSET access paths on CHAR without a smooth-upgrade or backward-compatible path.
Parallel code paths: the updated visitLength() check appears to cover nested CHAR leaves as intended; I did not find another new planner path in this PR that still purposely sends OFFSET for CHAR.
Special conditionals: the CHAR exclusion itself is well justified in comments.
Test coverage: insufficient. Manual verification only; missing FE unit coverage and regression coverage for the SQL repro.
Test outputs: no new or updated test outputs.
Observability: the BE error text is clear if triggered.
Transaction, persistence, data writes, and FE-BE variable passing: not applicable in the touched code.
Performance: for CHAR, reading the chars buffer is necessary for correctness; I did not find a separate performance regression beyond that.
Other issues: none beyond the two findings above.

User focus:

No additional user-provided review focus was supplied, so there were no extra focus-specific findings beyond the full review.

BiteTheDDDDt · 2026-04-27T05:19:57Z

run buildall

hello-stephen · 2026-04-27T06:51:26Z

FE UT Coverage Report

Increment line coverage 0.00% (0/2) 🎉
Increment coverage report
Complete coverage report

hello-stephen · 2026-04-27T08:41:36Z

BE Regression && UT Coverage Report

Increment line coverage 30.00% (3/10) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.86% (27694/37497)
Line Coverage	57.57% (299508/520243)
Region Coverage	54.66% (248741/455086)
Branch Coverage	56.29% (107799/191502)

hello-stephen · 2026-04-27T08:53:20Z

FE Regression Coverage Report

Increment line coverage 0.00% (0/46) 🎉
Increment coverage report
Complete coverage report

Copilot AI review requested due to automatic review settings April 27, 2026 04:09

Copilot started reviewing on behalf of BiteTheDDDDt April 27, 2026 04:09 View session

Copilot AI reviewed Apr 27, 2026

View reviewed changes

github-actions Bot requested changes Apr 27, 2026

View reviewed changes

Comment thread be/src/storage/segment/column_reader.cpp

Comment thread ...core/src/main/java/org/apache/doris/nereids/rules/rewrite/AccessPathExpressionCollector.java

update case

8e8be80

BiteTheDDDDt added dev/4.0.x dev/4.1.x labels Apr 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix](nereids) Do not push offset-only access path for CHAR columns#62854

[fix](nereids) Do not push offset-only access path for CHAR columns#62854
BiteTheDDDDt wants to merge 2 commits intoapache:masterfrom
BiteTheDDDDt:fix-length-of-char-offset-only

BiteTheDDDDt commented Apr 27, 2026

Uh oh!

Thearas commented Apr 27, 2026

Uh oh!

BiteTheDDDDt commented Apr 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

Uh oh!

BiteTheDDDDt commented Apr 27, 2026

Uh oh!

hello-stephen commented Apr 27, 2026

Uh oh!

hello-stephen commented Apr 27, 2026

Uh oh!

hello-stephen commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

BiteTheDDDDt commented Apr 27, 2026

What problem does this PR solve?

Problem Summary

Reproduction

Fix

Release note

Check List (For Author)

Uh oh!

Thearas commented Apr 27, 2026

Uh oh!

BiteTheDDDDt commented Apr 27, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

BiteTheDDDDt commented Apr 27, 2026

Uh oh!

hello-stephen commented Apr 27, 2026

FE UT Coverage Report

Uh oh!

hello-stephen commented Apr 27, 2026

BE Regression && UT Coverage Report

Uh oh!

hello-stephen commented Apr 27, 2026

FE Regression Coverage Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants