fix: [Spark 4.1.1] preserve parent struct nullness when all requested fields missing in Parquet #4190
Open
andygrove wants to merge 5 commits into apache:main
Conversation
…g in Parquet

Closes apache#4136.

Spark 4.1 (SPARK-53535) added `LEGACY_PARQUET_RETURN_NULL_STRUCT_IF_ALL_FIELDS_MISSING`. With the new default (`false`), Spark's vectorized reader appends a "marker" leaf field to the Parquet read schema so it can distinguish a null parent row from a non-null parent whose requested fields are all missing in the file, then truncates the marker out of the `ColumnarBatch`. Comet's `parquet_convert_struct_to_struct` unconditionally collapsed a non-overlapping struct to a fully-null array (`NullBuffer::new_null`), conflating the two cases.

This change adds `return_null_struct_if_all_fields_missing` to `SparkParquetOptions` (default `true` for backwards compat) and plumbs it through both scan paths:

- native_datafusion: read in `CometNativeScan.convert` from the Spark conf, pass via the new `NativeScanCommon` proto field, threaded through `init_datasource_exec`.
- native_iceberg_compat: read in `CometParquetFileFormat`, pass through the `NativeBatchReader` constructor and a new `initRecordBatchReader` JNI parameter.

For both paths the JVM defaults to `"true"` on Spark < 4.1 (legacy hardcoded behavior) and `"false"` on Spark 4.1+ (matches Spark's registered conf default), with explicit user settings honored. When the flag is `false` and no requested fields overlap with the file's struct, `parquet_convert_struct_to_struct` now preserves `array.nulls()` so the parent's nullness from the file is propagated.

Reproducer: new test in `CometNativeReaderSuite` parameterized over both scan impls. Removes the `IgnoreComet` annotations from the 5 SPARK-53535 / SPARK-54220 `ParquetIOSuite` tests in `dev/diffs/4.1.1.diff`.
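The two behaviors this commit distinguishes (collapse to all-null vs. preserve the parent's nullness) can be sketched with a plain `Vec<bool>` validity mask standing in for arrow's `NullBuffer`. The helper names and types below are illustrative, not Comet's actual signatures.

```rust
/// Validity mask for the parent struct column: `true` = row is non-null.
/// (Hypothetical stand-in for arrow's NullBuffer.)
type Validity = Vec<bool>;

/// Legacy behavior: when no requested field exists in the file, the whole
/// struct column was collapsed to all-null (NullBuffer::new_null), losing
/// the distinction between a null parent and a non-null parent whose
/// requested children are simply absent.
fn legacy_nulls(num_rows: usize) -> Validity {
    vec![false; num_rows] // every parent row forced to null
}

/// New behavior (flag = false): keep the parent's own validity from the
/// file, so non-null parents stay non-null and only their children are null.
fn preserve_parent_nulls(parent_validity: &Validity) -> Validity {
    parent_validity.clone()
}

fn convert_missing_struct(parent_validity: &Validity, return_null_struct: bool) -> Validity {
    if return_null_struct {
        legacy_nulls(parent_validity.len())
    } else {
        preserve_parent_nulls(parent_validity)
    }
}

fn main() {
    // File has rows: [non-null parent, null parent, non-null parent]
    let from_file = vec![true, false, true];
    // Legacy (flag true): every row becomes Row(null)
    assert_eq!(convert_missing_struct(&from_file, true), vec![false, false, false]);
    // Spark 4.1 default (flag false): non-null parents come back as Row(Row(null, null))
    assert_eq!(convert_missing_struct(&from_file, false), from_file);
}
```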
…e IgnoreComet.scala

CI on Spark 4.1 surfaced two follow-ups to the issue apache#4136 fix:

1. The pre-existing `CometExpressionSuite` test `vectorized reader: missing all struct fields` expected the legacy answer `Row(null) :: Row(null) :: Row(null)` regardless of Spark version, but did not pin `LEGACY_PARQUET_RETURN_NULL_STRUCT_IF_ALL_FIELDS_MISSING`. With the fix in place Comet correctly mirrors the new Spark 4.1 default (`Row(Row(null,null)) :: Row(Row(null,null)) :: Row(null)`), so the test broke. Pin the conf to `"true"` to keep the test asserting legacy semantics on every Spark version. The non-legacy semantics are covered by the `issue apache#4136` test in `CometNativeReaderSuite`, which now also iterates over `COLUMN_VECTOR_OFFHEAP_ENABLED` and the nested-vectorized-reader flag to mirror the upstream Spark `vectorized reader: missing all struct fields` test in `ParquetIOSuite`.

2. The previous regeneration of `dev/diffs/4.1.1.diff` dropped the `IgnoreComet.scala` source file (it's untracked in the Spark working tree, so `git diff` skips it without an explicit `git add`). Spark SQL CI then failed with `object IgnoreCometSuite is not a member of package org.apache.spark.sql`. Re-add the file via `git add` and regenerate so the diff includes its 45-line definition again.
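The version-dependent defaulting described in these commits (legacy `true` below Spark 4.1, `false` on 4.1+, explicit user settings always winning) can be sketched as follows. The function and the version encoding are hypothetical, not Comet's actual API.

```rust
/// Pick the effective value of return_null_struct_if_all_fields_missing.
/// `spark_version` is a hypothetical (major, minor) pair; `user_setting`
/// is Some(_) when the user set the conf explicitly.
fn default_return_null_struct(spark_version: (u32, u32), user_setting: Option<bool>) -> bool {
    if let Some(v) = user_setting {
        return v; // explicit user setting always wins
    }
    // Spark < 4.1: legacy hardcoded behavior (true);
    // Spark 4.1+: match Spark's registered conf default (false).
    spark_version < (4, 1)
}

fn main() {
    assert!(default_return_null_struct((4, 0), None)); // legacy default on 4.0
    assert!(!default_return_null_struct((4, 1), None)); // new default on 4.1
    assert!(default_return_null_struct((4, 1), Some(true))); // user pin honored
}
```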
The Spark SQL test suite for 4.1 exercises `SPARK-54220: vectorized reader:
missing all struct fields, struct with NullType only`, which writes NullType
columns as `BOOLEAN + LogicalType::Unknown`. parquet-rs only accepts
`INT32 + Unknown` (parquet-57.2.0/src/schema/types.rs:401) and rejects the
Spark-produced file at decode time with:
Parquet error: Cannot annotate Unknown from BOOLEAN for field '_1'
This is an upstream Spark↔parquet-rs compatibility gap, orthogonal to the
missing-all-fields fix this PR lands. Re-add the `IgnoreComet` annotation
on just that one test (pointing at the new follow-up apache#4199) and mark the
local reproducer in `CometNativeReaderSuite` as canceled with the same
reference. The four SPARK-53535 tests remain un-ignored and pass.
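The constraint behind that decode error can be illustrated with a simplified stand-in for parquet-rs's logical-type check (the real check lives at `parquet-57.2.0/src/schema/types.rs:401`). The enum and function below are illustrative only, not parquet-rs's API.

```rust
/// Simplified stand-in for Parquet physical types.
#[derive(Debug)]
enum PhysicalType {
    Boolean,
    Int32,
}

/// parquet-rs only accepts the Unknown logical type on INT32 primitives;
/// Spark 4.1 writes NullType columns as BOOLEAN + Unknown, which fails here.
fn annotate_unknown(physical: PhysicalType) -> Result<(), String> {
    match physical {
        PhysicalType::Int32 => Ok(()),
        other => Err(format!("Cannot annotate Unknown from {:?}", other)),
    }
}

fn main() {
    assert!(annotate_unknown(PhysicalType::Int32).is_ok());
    // Spark's NullType encoding trips this check at decode time:
    assert!(annotate_unknown(PhysicalType::Boolean).is_err());
}
```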
The `native reader - select struct field with user defined schema` test in `CometNativeReaderSuite` is the same missing-all-fields-from-file scenario that this PR already fixes: its third subcase reads `named_struct('a', 0, 'b', 'xyz')` with a read schema of `c0: struct<y:int, x:string>`, where both requested leaves are missing from the file. Drop the `assume(!isSpark41Plus, "apache#4098")` guard so CI exercises it on 4.1; the test passes locally on both `native_datafusion` and `native_iceberg_compat` with the `parquet_convert_struct_to_struct` fix in place. Also closes apache#4192.
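The "no requested fields overlap with the file" condition that subcase exercises can be sketched as a set intersection. The helper below is hypothetical; the field names come from the test (file struct has `a`, `b`; read schema asks for `y`, `x`).

```rust
use std::collections::HashSet;

/// Return the requested field names that actually exist in the file's struct.
/// (Hypothetical helper, not Comet's actual schema-matching code.)
fn overlapping_fields<'a>(file: &[&'a str], requested: &[&'a str]) -> Vec<&'a str> {
    let file_set: HashSet<&str> = file.iter().copied().collect();
    requested
        .iter()
        .copied()
        .filter(|f| file_set.contains(f))
        .collect()
}

fn main() {
    // named_struct('a', 0, 'b', 'xyz') read with schema struct<y:int, x:string>
    let overlap = overlapping_fields(&["a", "b"], &["y", "x"]);
    // All requested leaves are missing from the file, so the fixed
    // null-preservation path in parquet_convert_struct_to_struct is taken.
    assert!(overlap.is_empty());
}
```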
Which issues does this PR close?
Closes #4136 and #4192 (the `native reader - select struct field with user defined schema` test is the same missing-all-fields scenario).

Rationale for this change

Spark 4.1 (SPARK-53535) added `LEGACY_PARQUET_RETURN_NULL_STRUCT_IF_ALL_FIELDS_MISSING`. With the new default (`false`), Spark's vectorized reader appends a "marker" leaf field to the Parquet read schema so it can distinguish a null parent row from a non-null parent whose requested fields are all missing in the file, then truncates the marker out of the `ColumnarBatch`.

Comet's `parquet_convert_struct_to_struct` (in `native/core/src/parquet/parquet_support.rs`) unconditionally collapsed a non-overlapping struct to a fully-null array via `NullBuffer::new_null(array.len())`, conflating the two cases. With the new Spark 4.1 default this caused 5 `ParquetIOSuite` tests to fail because non-null parent rows came back as `Row(null)` instead of `Row(Row(null, null))`, and also caused the `native reader - select struct field with user defined schema` test's third subcase to fail.

What changes are included in this PR?
Adds `return_null_struct_if_all_fields_missing` to `SparkParquetOptions` (default `true` for backwards compat) and plumbs it through both scan paths:

- native_datafusion: read in `CometNativeScan.convert` from the Spark conf, sent via a new `NativeScanCommon` proto field, threaded through `init_datasource_exec`.
- native_iceberg_compat: read in `CometParquetFileFormat`, passed through the `NativeBatchReader` constructor and a new `initRecordBatchReader` JNI parameter.

For both paths the JVM defaults to `"true"` on Spark < 4.1 (legacy hardcoded behavior) and `"false"` on Spark 4.1+ (matches Spark's registered conf default), with explicit user settings honored. When the flag is `false` and no requested fields overlap with the file's struct, `parquet_convert_struct_to_struct` now preserves `array.nulls()` so the parent struct's nullness from the file is propagated.

Four of the five `IgnoreComet` annotations for the SPARK-53535 tests are removed from `dev/diffs/4.1.1.diff`. The fifth, `SPARK-54220: vectorized reader: missing all struct fields, struct with NullType only`, remains ignored because it crashes earlier (at parquet decode) on an unrelated parquet-rs incompatibility with Spark's `NullType` encoding; that is tracked separately as #4199.

The `assume(!isSpark41Plus)` guard on the pre-existing `native reader - select struct field with user defined schema` test is also removed, closing #4192.

The pre-existing `CometExpressionSuite#vectorized reader: missing all struct fields` test is pinned to `spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing=true` so it keeps asserting the legacy answer across all Spark versions; the non-legacy semantics are covered by the new `issue #4136` test in `CometNativeReaderSuite`.

How are these changes tested?

- `issue #4136: struct with all requested fields missing in file` in `CometNativeReaderSuite`, parameterized over both `native_datafusion` and `native_iceberg_compat` and over offheap and legacy toggles. Both impls fail before this change, both pass after.
- `native reader - select struct field with user defined schema` in the same suite now runs on Spark 4.1 and passes.
- `vectorized reader: missing all struct fields` in `CometExpressionSuite`, pinned to legacy mode, continues to pass on both Spark 4.0 and 4.1.
- The `ParquetIOSuite` SPARK-53535 tests will run in the Spark SQL CI workflow.
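The parameterization above amounts to a small cartesian product of toggles; this sketch (with descriptive names, not the suite's actual API) just enumerates the resulting runs.

```rust
/// Enumerate every (scan impl, offheap, legacy) combination the new test
/// iterates over. Names are illustrative, not CometNativeReaderSuite's API.
fn combos() -> Vec<(&'static str, bool, bool)> {
    let scan_impls = ["native_datafusion", "native_iceberg_compat"];
    let toggles = [false, true];
    let mut out = Vec::new();
    for scan in scan_impls {
        for &offheap in &toggles {
            for &legacy in &toggles {
                out.push((scan, offheap, legacy));
            }
        }
    }
    out
}

fn main() {
    // 2 scan impls x offheap on/off x legacy on/off = 8 parameterized runs
    assert_eq!(combos().len(), 8);
}
```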