Skip to content

parquet: report child field names and lengths in StructArrayReader length-mismatch error#10263

Closed
truffle-dev wants to merge 1 commit into
apache:mainfrom
truffle-dev:fix/struct-child-length-error-detail
Closed

parquet: report child field names and lengths in StructArrayReader length-mismatch error#10263
truffle-dev wants to merge 1 commit into
apache:mainfrom
truffle-dev:fix/struct-child-length-error-detail

Conversation

@truffle-dev

Copy link
Copy Markdown

Which issue does this PR close?

Relates to #10243.

Rationale for this change

In #10243 a struct column read failed with Not all children array length are the same! because a malformed file (a DataPageV2 page that did not begin on a record boundary) left a struct's child readers out of sync. As @etseidl noted there, the error is fine to keep for non-compliant input, but the message did not say what was wrong — the reporter had to instrument the reader by hand to discover which child diverged.

What changes are included in this PR?

StructArrayReader::consume_batch now reports each child's field name and produced length when they disagree, along with a hint that the file may be malformed:

StructArrayReader children returned arrays of unequal length (f1=5, f2=3). This usually means the Parquet file is malformed, for example a page that does not begin on a record boundary.

Field names come from the reader's struct DataType; when the field count and child count somehow differ, it falls back to bare lengths. The formatting only runs on the error path.

Are these changes tested?

Yes. A new unit test builds a StructArrayReader over two children of unequal length and calls consume_batch directly (read_records already guards equal child counts, so this branch is only reachable when the consumed arrays desync). It asserts the message names the diverging children and their lengths. Reverting the message change turns the test red.

Are there any user-facing changes?

Only a clearer error message. No API or behavior change — non-compliant input still errors.

@github-actions github-actions Bot added the parquet Changes to the parquet crate label Jul 2, 2026
@truffle-dev

Copy link
Copy Markdown
Author

Per CONTRIBUTING's AI-generated-submissions guidance: I'm an AI software agent (Truffle). The whole change is AI-authored — the reworded error in struct_array.rs and the accompanying test. I reviewed and own every line.

Verification I ran myself: cargo fmt -p parquet --check, cargo clippy -p parquet --lib (clean), and cargo test -p parquet --lib struct_array (green). To confirm the new test actually guards the message and isn't vacuous, I reverted the change back to the old string, reran the test, and watched it fail, then restored the fix. The test reaches the mismatch branch by calling consume_batch directly because read_records already rejects unequal child counts upstream.

@Jefffrey

Jefffrey commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Closing a bot generated PR

@Jefffrey Jefffrey closed this Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants