fix: [Spark 4.1.1] preserve parent struct nullness when all requested fields missing in Parquet #4190
Open
andygrove wants to merge 5 commits into apache:main
Conversation
…g in Parquet

Closes apache#4136.

Spark 4.1 (SPARK-53535) added `LEGACY_PARQUET_RETURN_NULL_STRUCT_IF_ALL_FIELDS_MISSING`. With the new default (`false`), Spark's vectorized reader appends a "marker" leaf field to the Parquet read schema so it can distinguish a null parent row from a non-null parent whose requested fields are all missing in the file, then truncates the marker out of the `ColumnarBatch`. Comet's `parquet_convert_struct_to_struct` unconditionally collapsed a non-overlapping struct to a fully-null array (`NullBuffer::new_null`), conflating the two cases.

This change adds `return_null_struct_if_all_fields_missing` to `SparkParquetOptions` (default `true` for backwards compat) and plumbs it through both scan paths:

- native_datafusion: read in `CometNativeScan.convert` from the Spark conf, pass via the new `NativeScanCommon` proto field, threaded through `init_datasource_exec`.
- native_iceberg_compat: read in `CometParquetFileFormat`, pass through the `NativeBatchReader` constructor and a new `initRecordBatchReader` JNI parameter.

For both paths the JVM defaults to `"true"` on Spark < 4.1 (legacy hardcoded behavior) and `"false"` on Spark 4.1+ (matches Spark's registered conf default), with explicit user settings honored. When the flag is `false` and no requested fields overlap with the file's struct, `parquet_convert_struct_to_struct` now preserves `array.nulls()` so the parent's nullness from the file is propagated.

Reproducer: new test in `CometNativeReaderSuite` parameterized over both scan impls. Removes the `IgnoreComet` annotations from the 5 SPARK-53535 / SPARK-54220 `ParquetIOSuite` tests in `dev/diffs/4.1.1.diff`.
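The two behaviors this commit distinguishes (collapse to all-null vs. preserve the parent's nullness) can be sketched with a plain `Vec<bool>` validity mask standing in for arrow's `NullBuffer`. The helper names and types below are illustrative, not Comet's actual signatures.

```rust
/// Validity mask for the parent struct column: `true` = row is non-null.
/// (Hypothetical stand-in for arrow's NullBuffer.)
type Validity = Vec<bool>;

/// Legacy behavior: when no requested field exists in the file, the whole
/// struct column was collapsed to all-null (NullBuffer::new_null), losing
/// the distinction between a null parent and a non-null parent whose
/// requested children are simply absent.
fn legacy_nulls(num_rows: usize) -> Validity {
    vec![false; num_rows] // every parent row forced to null
}

/// New behavior (flag = false): keep the parent's own validity from the
/// file, so non-null parents stay non-null and only their children are null.
fn preserve_parent_nulls(parent_validity: &Validity) -> Validity {
    parent_validity.clone()
}

fn convert_missing_struct(parent_validity: &Validity, return_null_struct: bool) -> Validity {
    if return_null_struct {
        legacy_nulls(parent_validity.len())
    } else {
        preserve_parent_nulls(parent_validity)
    }
}

fn main() {
    // File has rows: [non-null parent, null parent, non-null parent]
    let from_file = vec![true, false, true];
    // Legacy (flag true): every row becomes Row(null)
    assert_eq!(convert_missing_struct(&from_file, true), vec![false, false, false]);
    // Spark 4.1 default (flag false): non-null parents come back as Row(Row(null, null))
    assert_eq!(convert_missing_struct(&from_file, false), from_file);
}
```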
…e IgnoreComet.scala

CI on Spark 4.1 surfaced two follow-ups to the issue apache#4136 fix:

1. The pre-existing `CometExpressionSuite` test `vectorized reader: missing all struct fields` expected the legacy answer `Row(null) :: Row(null) :: Row(null)` regardless of Spark version, but did not pin `LEGACY_PARQUET_RETURN_NULL_STRUCT_IF_ALL_FIELDS_MISSING`. With the fix in place Comet correctly mirrors the new Spark 4.1 default (`Row(Row(null,null)) :: Row(Row(null,null)) :: Row(null)`), so the test broke. Pin the conf to `"true"` to keep the test asserting legacy semantics on every Spark version. The non-legacy semantics are covered by the `issue apache#4136` test in `CometNativeReaderSuite`, which now also iterates over `COLUMN_VECTOR_OFFHEAP_ENABLED` and the nested-vectorized-reader flag to mirror the upstream Spark `vectorized reader: missing all struct fields` test in `ParquetIOSuite`.

2. The previous regeneration of `dev/diffs/4.1.1.diff` dropped the `IgnoreComet.scala` source file (it's untracked in the Spark working tree, so `git diff` skips it without an explicit `git add`). Spark SQL CI then failed with `object IgnoreCometSuite is not a member of package org.apache.spark.sql`. Re-add the file via `git add` and regenerate so the diff includes its 45-line definition again.
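The version-dependent defaulting described in these commits (legacy `true` below Spark 4.1, `false` on 4.1+, explicit user settings always winning) can be sketched as follows. The function and the version encoding are hypothetical, not Comet's actual API.

```rust
/// Pick the effective value of return_null_struct_if_all_fields_missing.
/// `spark_version` is a hypothetical (major, minor) pair; `user_setting`
/// is Some(_) when the user set the conf explicitly.
fn default_return_null_struct(spark_version: (u32, u32), user_setting: Option<bool>) -> bool {
    if let Some(v) = user_setting {
        return v; // explicit user setting always wins
    }
    // Spark < 4.1: legacy hardcoded behavior (true);
    // Spark 4.1+: match Spark's registered conf default (false).
    spark_version < (4, 1)
}

fn main() {
    assert!(default_return_null_struct((4, 0), None)); // legacy default on 4.0
    assert!(!default_return_null_struct((4, 1), None)); // new default on 4.1
    assert!(default_return_null_struct((4, 1), Some(true))); // user pin honored
}
```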
The Spark SQL test suite for 4.1 exercises `SPARK-54220: vectorized reader:
missing all struct fields, struct with NullType only`, which writes NullType
columns as `BOOLEAN + LogicalType::Unknown`. parquet-rs only accepts
`INT32 + Unknown` (parquet-57.2.0/src/schema/types.rs:401) and rejects the
Spark-produced file at decode time with:
Parquet error: Cannot annotate Unknown from BOOLEAN for field '_1'
This is an upstream Spark↔parquet-rs compatibility gap, orthogonal to the
missing-all-fields fix this PR lands. Re-add the `IgnoreComet` annotation
on just that one test (pointing at the new follow-up apache#4199) and mark the
local reproducer in `CometNativeReaderSuite` as canceled with the same
reference. The four SPARK-53535 tests remain un-ignored and pass.
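The constraint behind that decode error can be illustrated with a simplified stand-in for parquet-rs's logical-type check (the real check lives at `parquet-57.2.0/src/schema/types.rs:401`). The enum and function below are illustrative only, not parquet-rs's API.

```rust
/// Simplified stand-in for Parquet physical types.
#[derive(Debug)]
enum PhysicalType {
    Boolean,
    Int32,
}

/// parquet-rs only accepts the Unknown logical type on INT32 primitives;
/// Spark 4.1 writes NullType columns as BOOLEAN + Unknown, which fails here.
fn annotate_unknown(physical: PhysicalType) -> Result<(), String> {
    match physical {
        PhysicalType::Int32 => Ok(()),
        other => Err(format!("Cannot annotate Unknown from {:?}", other)),
    }
}

fn main() {
    assert!(annotate_unknown(PhysicalType::Int32).is_ok());
    // Spark's NullType encoding trips this check at decode time:
    assert!(annotate_unknown(PhysicalType::Boolean).is_err());
}
```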
The `native reader - select struct field with user defined schema` test in `CometNativeReaderSuite` is the same missing-all-fields-from-file scenario that this PR already fixes: its third subcase reads `named_struct('a', 0, 'b', 'xyz')` with a read schema of `c0: struct<y:int, x:string>`, where both requested leaves are missing from the file. Drop the `assume(!isSpark41Plus, "apache#4098")` guard so CI exercises it on 4.1; the test passes locally on both `native_datafusion` and `native_iceberg_compat` with the `parquet_convert_struct_to_struct` fix in place. Also closes apache#4192.
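The "no requested fields overlap with the file" condition that subcase exercises can be sketched as a set intersection. The helper below is hypothetical; the field names come from the test (file struct has `a`, `b`; read schema asks for `y`, `x`).

```rust
use std::collections::HashSet;

/// Return the requested field names that actually exist in the file's struct.
/// (Hypothetical helper, not Comet's actual schema-matching code.)
fn overlapping_fields<'a>(file: &[&'a str], requested: &[&'a str]) -> Vec<&'a str> {
    let file_set: HashSet<&str> = file.iter().copied().collect();
    requested
        .iter()
        .copied()
        .filter(|f| file_set.contains(f))
        .collect()
}

fn main() {
    // named_struct('a', 0, 'b', 'xyz') read with schema struct<y:int, x:string>
    let overlap = overlapping_fields(&["a", "b"], &["y", "x"]);
    // All requested leaves are missing from the file, so the fixed
    // null-preservation path in parquet_convert_struct_to_struct is taken.
    assert!(overlap.is_empty());
}
```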
Which issues does this PR close?
Closes #4136 and #4192 (the `native reader - select struct field with user defined schema` test is the same missing-all-fields scenario).

Rationale for this change

Spark 4.1 (SPARK-53535) added `LEGACY_PARQUET_RETURN_NULL_STRUCT_IF_ALL_FIELDS_MISSING`. With the new default (`false`), Spark's vectorized reader appends a "marker" leaf field to the Parquet read schema so it can distinguish a null parent row from a non-null parent whose requested fields are all missing in the file, then truncates the marker out of the `ColumnarBatch`.

Comet's `parquet_convert_struct_to_struct` (in `native/core/src/parquet/parquet_support.rs`) unconditionally collapsed a non-overlapping struct to a fully-null array via `NullBuffer::new_null(array.len())`, conflating the two cases. With the new Spark 4.1 default this caused 5 `ParquetIOSuite` tests to fail because non-null parent rows came back as `Row(null)` instead of `Row(Row(null, null))`, and also caused the `native reader - select struct field with user defined schema` test's third subcase to fail.

What changes are included in this PR?
Adds `return_null_struct_if_all_fields_missing` to `SparkParquetOptions` (default `true` for backwards compat) and plumbs it through both scan paths:

- native_datafusion: read in `CometNativeScan.convert` from the Spark conf, sent via a new `NativeScanCommon` proto field, threaded through `init_datasource_exec`.
- native_iceberg_compat: read in `CometParquetFileFormat`, passed through the `NativeBatchReader` constructor and a new `initRecordBatchReader` JNI parameter.

For both paths the JVM defaults to `"true"` on Spark < 4.1 (legacy hardcoded behavior) and `"false"` on Spark 4.1+ (matches Spark's registered conf default), with explicit user settings honored. When the flag is `false` and no requested fields overlap with the file's struct, `parquet_convert_struct_to_struct` now preserves `array.nulls()` so the parent struct's nullness from the file is propagated.

Four of the five `IgnoreComet` annotations for the SPARK-53535 tests are removed from `dev/diffs/4.1.1.diff`. The fifth, `SPARK-54220: vectorized reader: missing all struct fields, struct with NullType only`, remains ignored because it crashes earlier (at parquet decode) on an unrelated parquet-rs incompatibility with Spark's `NullType` encoding; that is tracked separately as #4199.

The `assume(!isSpark41Plus)` guard on the pre-existing `native reader - select struct field with user defined schema` test is also removed, closing #4192.

The pre-existing `CometExpressionSuite#vectorized reader: missing all struct fields` test is pinned to `spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing=true` so it keeps asserting the legacy answer across all Spark versions; the non-legacy semantics are covered by the new `issue #4136` test in `CometNativeReaderSuite`.

How are these changes tested?

- `issue #4136: struct with all requested fields missing in file` in `CometNativeReaderSuite`, parameterized over both `native_datafusion` and `native_iceberg_compat` and over offheap and legacy toggles. Both impls fail before this change, both pass after.
- `native reader - select struct field with user defined schema` in the same suite now runs on Spark 4.1 and passes.
- `vectorized reader: missing all struct fields` in `CometExpressionSuite`, pinned to legacy mode, continues to pass on both Spark 4.0 and 4.1.
- The `ParquetIOSuite` SPARK-53535 tests will run in the Spark SQL CI workflow.
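The parameterization above amounts to a small cartesian product of toggles; this sketch (with descriptive names, not the suite's actual API) just enumerates the resulting runs.

```rust
/// Enumerate every (scan impl, offheap, legacy) combination the new test
/// iterates over. Names are illustrative, not CometNativeReaderSuite's API.
fn combos() -> Vec<(&'static str, bool, bool)> {
    let scan_impls = ["native_datafusion", "native_iceberg_compat"];
    let toggles = [false, true];
    let mut out = Vec::new();
    for scan in scan_impls {
        for &offheap in &toggles {
            for &legacy in &toggles {
                out.push((scan, offheap, legacy));
            }
        }
    }
    out
}

fn main() {
    // 2 scan impls x offheap on/off x legacy on/off = 8 parameterized runs
    assert_eq!(combos().len(), 8);
}
```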