
fix: [Spark 4.1.1] preserve parent struct nullness when all requested fields missing in Parquet #4190

Open

andygrove wants to merge 5 commits into apache:main from andygrove:issue-4136

Conversation

@andygrove (Member) commented May 3, 2026

Which issues does this PR close?

Closes #4136 and #4192 (the `native reader - select struct field with user defined schema` test is the same missing-all-fields scenario).

Rationale for this change

Spark 4.1 (SPARK-53535) added LEGACY_PARQUET_RETURN_NULL_STRUCT_IF_ALL_FIELDS_MISSING. With the new default (false), Spark's vectorized reader appends a "marker" leaf field to the Parquet read schema so it can distinguish a null parent row from a non-null parent whose requested fields are all missing in the file, then truncates the marker out of the ColumnarBatch.

Comet's `parquet_convert_struct_to_struct` (in `native/core/src/parquet/parquet_support.rs`) unconditionally collapsed a non-overlapping struct to a fully-null array via `NullBuffer::new_null(array.len())`, conflating the two cases. With the new Spark 4.1 default this caused 5 ParquetIOSuite tests to fail because non-null parent rows came back as `Row(null)` instead of `Row(Row(null, null))`, and also caused the third subcase of the `native reader - select struct field with user defined schema` test to fail.

What changes are included in this PR?

Adds return_null_struct_if_all_fields_missing to SparkParquetOptions (default true for backwards compat) and plumbs it through both scan paths:

  • native_datafusion: read in CometNativeScan.convert from the Spark conf, sent via a new NativeScanCommon proto field, threaded through init_datasource_exec.
  • native_iceberg_compat: read in CometParquetFileFormat, passed through the NativeBatchReader constructor and a new initRecordBatchReader JNI parameter.

For both paths the JVM defaults to "true" on Spark < 4.1 (legacy hardcoded behavior) and "false" on Spark 4.1+ (matches Spark's registered conf default), with explicit user settings honored. When the flag is false and no requested fields overlap with the file's struct, parquet_convert_struct_to_struct now preserves array.nulls() so the parent struct's nullness from the file is propagated.

Four of the five IgnoreComet annotations for the SPARK-53535 tests are removed from dev/diffs/4.1.1.diff. The fifth — SPARK-54220: vectorized reader: missing all struct fields, struct with NullType only — remains ignored because it crashes earlier (at parquet decode) on an unrelated parquet-rs incompatibility with Spark's NullType encoding; that is tracked separately as #4199.

The assume(!isSpark41Plus) guard on the pre-existing native reader - select struct field with user defined schema test is also removed, closing #4192.

The pre-existing CometExpressionSuite#vectorized reader: missing all struct fields test is pinned to spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing=true so it keeps asserting the legacy answer across all Spark versions; the non-legacy semantics are covered by the new issue #4136 test in CometNativeReaderSuite.

How are these changes tested?

  • New test issue #4136: struct with all requested fields missing in file in CometNativeReaderSuite, parameterized over both native_datafusion and native_iceberg_compat and over offheap × legacy toggles. Both impls fail before this change, both pass after.
  • Pre-existing native reader - select struct field with user defined schema in the same suite now runs on Spark 4.1 and passes.
  • Pre-existing vectorized reader: missing all struct fields in CometExpressionSuite pinned to legacy mode continues to pass on both Spark 4.0 and 4.1.
  • The four un-ignored ParquetIOSuite SPARK-53535 tests will run in the Spark SQL CI workflow.

andygrove added 2 commits May 3, 2026 11:09
…g in Parquet

Closes apache#4136.

Spark 4.1 (SPARK-53535) added LEGACY_PARQUET_RETURN_NULL_STRUCT_IF_ALL_FIELDS_MISSING.
With the new default (false), Spark's vectorized reader appends a "marker" leaf field
to the Parquet read schema so it can distinguish a null parent row from a non-null
parent whose requested fields are all missing in the file, then truncates the marker
out of the ColumnarBatch.

Comet's `parquet_convert_struct_to_struct` unconditionally collapsed a non-overlapping
struct to a fully-null array (`NullBuffer::new_null`), conflating the two cases.

This change adds `return_null_struct_if_all_fields_missing` to `SparkParquetOptions`
(default `true` for backwards compat) and plumbs it through both scan paths:

- native_datafusion: read in CometNativeScan.convert from the Spark conf, pass via
  the new NativeScanCommon proto field, threaded through init_datasource_exec.
- native_iceberg_compat: read in CometParquetFileFormat, pass through the
  NativeBatchReader constructor and a new initRecordBatchReader JNI parameter.

For both paths the JVM defaults to `"true"` on Spark < 4.1 (legacy hardcoded behavior)
and `"false"` on Spark 4.1+ (matches Spark's registered conf default), with explicit
user settings honored. When the flag is false and no requested fields overlap with the
file's struct, parquet_convert_struct_to_struct now preserves `array.nulls()` so the
parent's nullness from the file is propagated.

Reproducer: new test in CometNativeReaderSuite parameterized over both scan impls.
Removes the IgnoreComet annotations from the 5 SPARK-53535 / SPARK-54220 ParquetIOSuite
tests in dev/diffs/4.1.1.diff.
@andygrove changed the title from `fix: preserve parent struct nullness when all requested fields missing in Parquet` to `fix: [Spark 4.1.1] preserve parent struct nullness when all requested fields missing in Parquet` May 3, 2026
@andygrove andygrove marked this pull request as draft May 3, 2026 17:59
…e IgnoreComet.scala

CI on Spark 4.1 surfaced two follow-ups to the issue apache#4136 fix:

1. The pre-existing CometExpressionSuite test
   `vectorized reader: missing all struct fields` expected the legacy answer
   `Row(null) :: Row(null) :: Row(null)` regardless of Spark version, but did
   not pin `LEGACY_PARQUET_RETURN_NULL_STRUCT_IF_ALL_FIELDS_MISSING`. With
   the fix in place Comet correctly mirrors the new Spark 4.1 default
   (`Row(Row(null,null)) :: Row(Row(null,null)) :: Row(null)`), so the test
   broke. Pin the conf to "true" to keep the test asserting legacy semantics
   on every Spark version. The non-legacy semantics are covered by the
   `issue apache#4136` test in CometNativeReaderSuite, which now also iterates
   over `COLUMN_VECTOR_OFFHEAP_ENABLED` and the nested-vectorized-reader
   flag to mirror the upstream Spark `vectorized reader: missing all struct
   fields` test in `ParquetIOSuite`.

2. The previous regeneration of `dev/diffs/4.1.1.diff` dropped the
   `IgnoreComet.scala` source file (it's untracked in the Spark working
   tree, so `git diff` skips it without an explicit `git add`). Spark SQL
   CI then failed with `object IgnoreCometSuite is not a member of package
   org.apache.spark.sql`. Re-add the file via `git add` and regenerate so
   the diff includes its 45-line definition again.

The Spark SQL test suite for 4.1 exercises `SPARK-54220: vectorized reader:
missing all struct fields, struct with NullType only`, which writes NullType
columns as `BOOLEAN + LogicalType::Unknown`. parquet-rs only accepts
`INT32 + Unknown` (parquet-57.2.0/src/schema/types.rs:401) and rejects the
Spark-produced file at decode time with:

    Parquet error: Cannot annotate Unknown from BOOLEAN for field '_1'

This is an upstream Spark↔parquet-rs compatibility gap, orthogonal to the
missing-all-fields fix this PR lands. Re-add the `IgnoreComet` annotation
on just that one test (pointing at the new follow-up apache#4199) and mark the
local reproducer in `CometNativeReaderSuite` as canceled with the same
reference. The four SPARK-53535 tests remain un-ignored and pass.
@andygrove andygrove marked this pull request as ready for review May 4, 2026 00:28

The `native reader - select struct field with user defined schema` test in
`CometNativeReaderSuite` is the same missing-all-fields-from-file scenario
that this PR already fixes: its third subcase reads `named_struct('a',
0, 'b', 'xyz')` with a read schema of `c0: struct<y:int, x:string>`, where
both requested leaves are missing from the file. Drop the
`assume(!isSpark41Plus, "apache#4098")` guard so CI exercises it on 4.1; the test
passes locally on both `native_datafusion` and `native_iceberg_compat` with
the `parquet_convert_struct_to_struct` fix in place.

Also closes apache#4192.

Successfully merging this pull request may close these issues.

Comet native scan returns wrong schema for missing struct fields in Parquet
