test: add Scala test coverage for spark.sql.optimizer.nestedSchemaPruning.enabled #4185
Draft
andygrove wants to merge 3 commits into apache:main from
Conversation
Audit every Spark expression that reads spark.sql.legacy.timeParserPolicy (date_format, from_unixtime, unix_timestamp, to_unix_timestamp, to_timestamp, to_date, and Spark 4's try_to_timestamp) and add CometSqlFileTestSuite coverage. For each expression provide:

- a ConfigMatrix file exercising convergent inputs under LEGACY, CORRECTED, and EXCEPTION
- per-policy files locking in divergent behavior (lenient parsing under LEGACY, null returns under CORRECTED, INCONSISTENT_BEHAVIOR_CROSS_VERSION under EXCEPTION)

Also add docs/source/contributor-guide/spark_configs_support.md, modeled on the expression audit log, to track Spark configs that affect Comet behavior, with full audit notes for the timeParserPolicy entry. All 42 generated tests pass on Spark 3.4.3, 3.5.8, and 4.0.1.
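For context on what those per-policy files pin down, here is a minimal spark-shell-style sketch of the three behaviors; the inputs are illustrative and not the generated test data:

```scala
// Assumes a running SparkSession bound to `spark` (as in spark-shell).
// LEGACY: SimpleDateFormat-style lenient parsing accepts single-digit fields.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
spark.sql("SELECT to_timestamp('2020-1-1', 'yyyy-MM-dd')").show() // 2020-01-01 00:00:00

// CORRECTED: the DateTimeFormatter-based parser rejects the input and returns NULL.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
spark.sql("SELECT to_timestamp('2020-1-1', 'yyyy-MM-dd')").show() // NULL

// EXCEPTION: inputs where the two parsers disagree raise a SparkUpgradeException
// (INCONSISTENT_BEHAVIOR_CROSS_VERSION) instead of silently diverging.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "EXCEPTION")
// spark.sql("SELECT to_timestamp('2020-1-1', 'yyyy-MM-dd')").show()  // throws
```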
test: add Scala test coverage for spark.sql.optimizer.nestedSchemaPruning.enabled

Add CometNestedSchemaPruningSuite, a focused Scala suite that runs each scenario across both Comet scan implementations (native_datafusion, native_iceberg_compat) under the V1 Parquet path. For each scenario the suite walks the executed plan, extracts requiredSchema from the Comet scan exec, asserts the pruned (or unpruned) shape matches the expected catalogString, and then compares results against Spark. Plain Parquet V2 is excluded because Comet's V2 scan rule only covers CSV and Iceberg, leaving Parquet V2 as a plain BatchScanExec without a Comet scan to inspect.

Scenarios cover a top-level struct field, a field inside an array of structs, a field inside a map value, a doubly-nested struct field, projection plus filter on a nested field, and a null at an intermediate struct level. Each scenario exercises both pruning-enabled and pruning-disabled behavior.

Also append a second entry to docs/source/contributor-guide/spark_configs_support.md with the full audit notes for nestedSchemaPruning.enabled. All 12 generated test cases pass on Spark 3.4.3, 3.5.8, and 4.0.1.
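A minimal sketch of that plan walk, assuming a CometTestBase-style test; the helper name is hypothetical and the real suite may handle details such as AQE wrapping differently:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.comet.{CometNativeScanExec, CometScanExec}

// Collect the catalogString of every Comet scan's requiredSchema in the
// executed plan; an empty result means no Comet scan was planned at all.
def collectedScanSchemas(df: DataFrame): Seq[String] =
  df.queryExecution.executedPlan.collect {
    case scan: CometScanExec       => scan.requiredSchema.catalogString
    case scan: CometNativeScanExec => scan.requiredSchema.catalogString
  }

// A scenario then asserts the pruned shape before comparing rows, e.g.
// assert(collectedScanSchemas(df) == Seq("struct<person:struct<first:string>>"))
```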
Which issue does this PR close?
N/A. Audit-driven test coverage; no behavior change.
Stacked on #4183. This PR's branch is based on feat/legacy-time-parser-policy-tests, so until #4183 merges the diff here will include those commits as well.

Rationale for this change
spark.sql.optimizer.nestedSchemaPruning.enabled (default true) is the Catalyst-level switch that lets columnar readers fetch only the leaves of a nested column. Comet propagates the flag into the Hadoop conf via CometParquetFileFormat.populateConf and otherwise inherits Spark's already-pruned requiredSchema, but Comet's own test tree had no end-to-end coverage. Spark's own ParquetSchemaPruningSuite is patched in dev/diffs/*.diff to recognize Comet scans, but that only validates correctness when CI runs Spark tests, and it doesn't lock in plan-level expectations from inside Comet.

A SQL-file test cannot prove pruning happened: it only checks results, and pruned and unpruned reads usually return the same rows. Plan inspection is the only way to catch a regression, so this audit uses Scala tests, mirroring Spark's checkScanSchemata pattern.
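To make the "same rows either way" point concrete, here is a minimal sketch using only stock Spark, with an illustrative table and path rather than the suite's fixtures:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

// Write a Parquet table whose `person` column is a three-field struct.
Seq(("Ada", "Lovelace", 36)).toDF("first", "last", "age")
  .selectExpr("named_struct('first', first, 'last', last, 'age', age) AS person")
  .write.mode("overwrite").parquet("/tmp/nested_people")

spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
val df = spark.read.parquet("/tmp/nested_people").select("person.first")

// With pruning on, the scan's requiredSchema is roughly
// struct<person:struct<first:string>>; with the flag off the scan reads the
// full struct and projects afterwards. The returned rows are identical either
// way, which is why the suite inspects the plan rather than only the results.
df.explain()
```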
What changes are included in this PR?

A new suite, spark/src/test/scala/org/apache/comet/parquet/CometNestedSchemaPruningSuite.scala. Each scenario runs across SCAN_NATIVE_DATAFUSION and SCAN_NATIVE_ICEBERG_COMPAT under V1 Parquet. A small helper walks the executed plan, collects requiredSchema from any CometScanExec / CometNativeScanExec, and asserts it matches an expected catalog-string schema; results are then compared against Spark via checkSparkAnswer. Scenarios (each run with pruning enabled and disabled; see the sketch below):

- top-level struct field
- field inside an array of structs
- field inside a map value
- doubly-nested struct field
- projection plus filter on a nested field
- null at an intermediate struct level

Plain Parquet V2 is excluded because Comet's V2 scan rule only covers CSV and Iceberg, which leaves Parquet V2 as a plain BatchScanExec with no Comet scan to inspect. This is documented in the suite's class comment and the audit notes.

A second entry is appended to docs/source/contributor-guide/spark_configs_support.md with the full audit notes for nestedSchemaPruning.enabled: source semantics, current Comet status, test layout, and findings.

This PR was scaffolded with the project's audit-comet-expression workflow, extended to a config-level audit, plus the superpowers:brainstorming and superpowers:using-git-worktrees skills.
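For readers new to the suite, a hypothetical sketch of how one scenario's matrix might look, reusing the collectedScanSchemas sketch above; the conf constants, CometTestBase helpers (withSQLConf, sql, checkSparkAnswer), table name, and expected schemas are assumptions for illustration, not the suite's exact code:

```scala
import org.apache.comet.CometConf
import org.apache.spark.sql.internal.SQLConf

Seq(CometConf.SCAN_NATIVE_DATAFUSION, CometConf.SCAN_NATIVE_ICEBERG_COMPAT).foreach { impl =>
  Seq(true, false).foreach { pruning =>
    withSQLConf(
      CometConf.COMET_NATIVE_SCAN_IMPL.key -> impl,
      SQLConf.NESTED_SCHEMA_PRUNING_ENABLED.key -> pruning.toString) {
      val df = sql("SELECT person.first FROM nested_people")
      // Pruned shape when the flag is on; full struct when it is off.
      val expected =
        if (pruning) "struct<person:struct<first:string>>"
        else "struct<person:struct<first:string,last:string,age:int>>"
      assert(collectedScanSchemas(df).forall(_ == expected))
      checkSparkAnswer(df) // rows must match Spark in both cases
    }
  }
}
```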
How are these changes tested?

- ./mvnw test -Dsuites="org.apache.comet.parquet.CometNestedSchemaPruningSuite" -Dtest=none: 12/12 pass on Spark 3.5.8 (default).
- ./mvnw test -Pspark-3.4 -Dsuites="org.apache.comet.parquet.CometNestedSchemaPruningSuite" -Dtest=none: 12/12 pass.
- ./mvnw test -Pspark-4.0 -Dsuites="org.apache.comet.parquet.CometNestedSchemaPruningSuite" -Dtest=none: 12/12 pass.

No Comet bugs were uncovered by the audit.