Skip to content

[fix](nereids) Do not push offset-only access path for CHAR columns#62854

Open
BiteTheDDDDt wants to merge 2 commits intoapache:masterfrom
BiteTheDDDDt:fix-length-of-char-offset-only
Open

[fix](nereids) Do not push offset-only access path for CHAR columns#62854
BiteTheDDDDt wants to merge 2 commits intoapache:masterfrom
BiteTheDDDDt:fix-length-of-char-offset-only

Conversation

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Related PR: #62205

Problem Summary

PR #62205 optimizes length(str_col) by pushing an OFFSET access path to BE so the chars buffer can be skipped. The optimization predicate is isStringLikeType(), which incorrectly includes CHAR(N).

CHAR is stored padded to its declared length on BE (see OlapColumnDataConvertorChar::clone_and_padding). The per-row length recorded in dict word info / page headers is therefore always the padded length (e.g. 25 for CHAR(25)), never the logical length expected by length(). Recovering the logical length requires scanning the chars buffer with strnlen() (shrink_padding_chars), which OFFSET_ONLY by definition skips.

In addition, when SegmentIterator runs shrink_char_type_column_suffix_zero on the CHAR column, it reads the (only-resized, never-initialized) chars buffer with strnlen and rewrites the offsets, corrupting them with non-deterministic values. As a result, select length(c) from t on a CHAR column returns garbage (mix of 0 / random small numbers / padded length 25), depending on what was previously in heap memory.

Reproduction

CREATE TABLE t (
    pk INT,
    c CHAR(25) NOT NULL
) DUPLICATE KEY(pk) DISTRIBUTED BY HASH(pk) BUCKETS 1
PROPERTIES('replication_num'='1');

INSERT INTO t VALUES (1,'a'), (2,'bb'), (3,'ccc');

-- Wrong (mix of 0 / 25 / random):
SELECT length(c) FROM t;
-- Correct (1, 2, 3):
SELECT c, length(c) FROM t;

Fix

  1. FE: in AccessPathExpressionCollector.visitLength, skip the OFFSET suffix when the resolved leaf type is CHAR. Because the check is on arg.getDataType() (the type after struct field / map subscript / array index resolution), a single guard covers top-level CHAR as well as nested CHAR (struct.char_field, map<X,char>['k'], array<char>[i]).
  2. BE: in StringFileColumnIterator::set_access_paths, hard-fail with Status::InternalError when an OFFSET access path is received for a CHAR column. This catches any future planner regression that breaks the contract above. A small protected accessor get_reader() is added to FileColumnIterator so the derived string iterator can inspect the storage field type.

VARCHAR / STRING are not padded, so OFFSET_ONLY remains correct for them. Other types (VARBINARY, JSONB, HLL, BITMAP, Variant) are not isStringLikeType(), so FE never emits OFFSET for them. Container length / cardinality (length(arr) / cardinality(map)) use their own offset stream as canonical metadata and are unaffected.

Release note

Fix wrong / non-deterministic result of length() on CHAR columns when no other column from the same row is projected (e.g. select length(c) from t or select length(c), count(*) from t group by length(c)).

Check List (For Author)

  • Test:
    • Manual test (mysql client, repro from a 200-row CHAR(25) table)
    • Regression Test — to be added in a follow-up
  • Behavior changed: Yes — length(CHAR) now returns the correct logical length; previously it could return 0, the padded length, or other garbage.
  • Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#62205

Problem Summary:
PR apache#62205 optimizes `length(str_col)` by pushing an OFFSET access path
to BE so the chars buffer can be skipped. The optimization predicate is
`isStringLikeType()`, which incorrectly includes CHAR(N).

CHAR is stored padded to its declared length on BE (see
OlapColumnDataConvertorChar::clone_and_padding). The per-row length
recorded in dict word info / page headers is therefore always the
padded length (e.g. 25 for CHAR(25)), never the logical length expected
by length(). Recovering the logical length requires scanning the chars
buffer with strnlen() (shrink_padding_chars), which OFFSET_ONLY by
definition skips.

In addition, when SegmentIterator runs shrink_char_type_column_suffix_zero
on the CHAR column, it reads the (only-resized, never-initialized) chars
buffer with strnlen and rewrites the offsets, corrupting them with
non-deterministic values. As a result, `select length(c) from t` on a
CHAR column returns garbage (mix of 0 / random small numbers / padded
length 25), depending on what was previously in heap memory.

Fix:
1. FE: in AccessPathExpressionCollector.visitLength, skip the OFFSET
   suffix when the resolved leaf type is CHAR. Because the check is on
   `arg.getDataType()` (the type after struct field / map subscript /
   array index resolution), a single guard covers top-level CHAR as
   well as nested CHAR (struct.char_field, map<X,char>['k'],
   array<char>[i]).
2. BE: in StringFileColumnIterator::set_access_paths, hard-fail with
   Status::InternalError when an OFFSET access path is received for a
   CHAR column. This catches any future planner regression that breaks
   the contract above. A small protected accessor get_reader() is added
   to FileColumnIterator so the derived string iterator can inspect the
   storage field type.

VARCHAR / STRING are not padded, so OFFSET_ONLY remains correct for
them. Other types (VARBINARY, JSONB, HLL, BITMAP, Variant) are not
`isStringLikeType()`, so FE never emits OFFSET for them. Container
length / cardinality (length(arr) / cardinality(map)) use their own
offset stream as canonical metadata and are unaffected.

### Release note

Fix wrong / non-deterministic result of length() on CHAR columns when
no other column from the same row is projected (e.g. `select length(c)
from t` or `select length(c), count(*) from t group by length(c)`).

### Check List (For Author)

- Test: Manual test (mysql client, repro from a 200-row CHAR(25)
  table). No regression test added in this PR; will follow up.
- Behavior changed: Yes (length(CHAR) now returns the correct logical
  length; previously it could return 0, the padded length, or other
  garbage).
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 27, 2026 04:09
@Thearas
Copy link
Copy Markdown
Contributor

Thearas commented Apr 27, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a correctness bug introduced by the length(str_col) OFFSET-only pruning optimization by ensuring CHAR columns never use OFFSET-only access paths (since CHAR storage is padded and logical length cannot be recovered from offsets alone).

Changes:

  • FE: Update AccessPathExpressionCollector.visitLength to avoid emitting the OFFSET suffix when the resolved leaf type is CHAR.
  • BE: Add a protected FileColumnIterator::get_reader() accessor and make StringFileColumnIterator::set_access_paths hard-fail if it receives OFFSET_ONLY for a CHAR column.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/AccessPathExpressionCollector.java Excludes CHAR from the length() OFFSET-only access path emission to preserve correctness.
be/src/storage/segment/column_reader.h Exposes ColumnReader to derived iterators via a protected accessor for metadata checks.
be/src/storage/segment/column_reader.cpp Enforces FE/BE contract by rejecting OFFSET_ONLY on CHAR at runtime (InternalError).
Comments suppressed due to low confidence (1)

fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/AccessPathExpressionCollector.java:176

  • Please add/extend FE unit coverage to lock in this CHAR-specific behavior (i.e., ensure length(char_col) does NOT emit an OFFSET access path, while length(varchar/string) still does). There are existing nested column pruning tests under fe/fe-core/src/test/java/org/apache/doris/nereids/rules/rewrite/PruneNestedColumnTest.java, but none currently assert this CHAR exclusion, so this could regress silently.
        if (arg.getDataType().isStringLikeType() && !arg.getDataType().isCharType()
                && context.accessPathBuilder.isEmpty()) {
            CollectorContext offsetContext =
                    new CollectorContext(context.statementContext, context.bottomFilter);
            offsetContext.accessPathBuilder.addSuffix(AccessPathInfo.ACCESS_STRING_OFFSET);
            return arg.accept(this, offsetContext);

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1971 to +1978
if (read_offset_only() && get_reader() != nullptr &&
get_reader()->get_meta_type() == FieldType::OLAP_FIELD_TYPE_CHAR) {
return Status::InternalError(
"OFFSET_ONLY access path is not supported on CHAR column '{}': CHAR is stored "
"padded so the per-row length information available without reading the chars "
"buffer is always the padded length, not the logical length. The FE planner "
"must not emit an OFFSET access path for CHAR columns.",
_column_name);
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a BE unit test to cover this new contract enforcement (OFFSET_ONLY + CHAR => InternalError). There are already iterator access-path tests in be/test/storage/segment/column_reader_test.cpp, so adding a regression case here would prevent accidental reintroduction of OFFSET paths for CHAR (whether via planner regressions or future refactors).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 2 blocking issues.

  1. Mixed-version FE/BE compatibility is broken by the new BE-side hard error: an old FE still emits [col, OFFSET] for length(char_col), so upgrading BE first turns this from a wrong-result bug into a query failure.
  2. The fix ships without automated coverage: there is no FE unit test for the CHAR access-path change and no regression test for the SQL reproduction in the PR description.

Critical checkpoints:

  • Goal of current task: prevent wrong or non-deterministic length(CHAR) results when only the length is projected. The FE-side change addresses the fully upgraded path, but the BE-side hard failure makes mixed-version execution fail instead. No automated test currently proves the fix.
  • Change scope: small and focused.
  • Concurrency: no new concurrency or locking concerns in the touched code.
  • Lifecycle or static init: no special lifecycle issues introduced.
  • Configuration: no new configs.
  • Compatibility: not satisfied. This PR changes the FE/BE contract for OFFSET access paths on CHAR without a smooth-upgrade or backward-compatible path.
  • Parallel code paths: the updated visitLength() check appears to cover nested CHAR leaves as intended; I did not find another new planner path in this PR that still purposely sends OFFSET for CHAR.
  • Special conditionals: the CHAR exclusion itself is well justified in comments.
  • Test coverage: insufficient. Manual verification only; missing FE unit coverage and regression coverage for the SQL repro.
  • Test outputs: no new or updated test outputs.
  • Observability: the BE error text is clear if triggered.
  • Transaction, persistence, data writes, and FE-BE variable passing: not applicable in the touched code.
  • Performance: for CHAR, reading the chars buffer is necessary for correctness; I did not find a separate performance regression beyond that.
  • Other issues: none beyond the two findings above.

User focus:

  • No additional user-provided review focus was supplied, so there were no extra focus-specific findings beyond the full review.

Comment thread be/src/storage/segment/column_reader.cpp
@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/2) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 30.00% (3/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.86% (27694/37497)
Line Coverage 57.57% (299508/520243)
Region Coverage 54.66% (248741/455086)
Branch Coverage 56.29% (107799/191502)

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/46) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants