
test: add Scala test coverage for spark.sql.optimizer.nestedSchemaPruning.enabled #4185

Draft
andygrove wants to merge 3 commits into apache:main from andygrove:feat/nested-schema-pruning-tests

Conversation

@andygrove
Member

Which issue does this PR close?

N/A. Audit-driven test coverage; no behavior change.

Stacked on #4183. This PR's branch is based on feat/legacy-time-parser-policy-tests, so until #4183 merges the diff here will include those commits as well.

Rationale for this change

spark.sql.optimizer.nestedSchemaPruning.enabled (default true) is the Catalyst-level switch that lets columnar readers fetch only the leaf fields of a nested column that a query actually references. Comet propagates the flag into the Hadoop conf via CometParquetFileFormat.populateConf and otherwise inherits Spark's already-pruned requiredSchema, but Comet's own test tree had no end-to-end coverage. Spark's ParquetSchemaPruningSuite is patched in dev/diffs/*.diff to recognize Comet scans, but that only validates correctness when CI runs the Spark tests, and it doesn't lock in plan-level expectations from inside Comet.
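
For context, here is a minimal sketch in plain vanilla Spark (not Comet, and not code from the new suite; the temp path and toy schema are made up) of what the flag controls:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.FileSourceScanExec

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

// Write a struct column with two leaves, then select only one of them.
Seq(("alice", 1), ("bob", 2))
  .toDF("name", "age")
  .selectExpr("named_struct('name', name, 'age', age) AS person")
  .write.mode("overwrite").parquet("/tmp/pruning-demo")

val df = spark.read.parquet("/tmp/pruning-demo").select($"person.name")
val scan = df.queryExecution.executedPlan
  .collectFirst { case s: FileSourceScanExec => s }
  .get

// With pruning enabled the scan only needs the accessed leaf:
//   struct<person:struct<name:string>>
// With the flag set to false it falls back to the full struct:
//   struct<person:struct<name:string,age:int>>
println(scan.requiredSchema.catalogString)
```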

A SQL-file test cannot prove pruning happened: it only checks results, and pruned-vs-unpruned reads usually return the same rows. Plan inspection is the only way to catch a regression, so this audit uses Scala tests, mirroring Spark's checkScanSchemata pattern.

What changes are included in this PR?

  • New suite: spark/src/test/scala/org/apache/comet/parquet/CometNestedSchemaPruningSuite.scala. Each scenario runs across SCAN_NATIVE_DATAFUSION and SCAN_NATIVE_ICEBERG_COMPAT under the V1 Parquet path. A small helper walks the executed plan, collects requiredSchema from any CometScanExec/CometNativeScanExec, and asserts it matches an expected catalog-string schema; results are then compared against Spark via checkSparkAnswer. (A sketch of this helper follows the list.) Scenarios:
    • top-level struct field
    • field inside array of struct
    • field inside map value
    • doubly-nested struct field
    • projection plus filter on a nested field
    • null at an intermediate struct level
  • Plain Parquet V2 is excluded from the matrix because Comet's V2 scan rule only covers CSV and Iceberg, so Parquet V2 stays as plain BatchScanExec and there's no Comet scan to inspect. Documented in the suite's class comment and the audit notes.
  • Appends a second entry to docs/source/contributor-guide/spark_configs_support.md with the full audit notes for nestedSchemaPruning.enabled: source semantics, current Comet status, test layout, and findings.
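
To make the helper concrete, a minimal sketch of the pattern, assuming the Comet scan nodes live under org.apache.spark.sql.comet and expose requiredSchema as described above; the helper name, table, and usage below are illustrative rather than copied from the suite:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.comet.{CometNativeScanExec, CometScanExec} // assumed package path

// Collect the pruned read schema from every Comet scan in the executed plan
// and compare it against the expected catalog-string form.
def checkCometScanSchemata(df: DataFrame, expectedCatalogStrings: String*): Unit = {
  val scanSchemata = df.queryExecution.executedPlan.collect {
    case scan: CometScanExec       => scan.requiredSchema.catalogString
    case scan: CometNativeScanExec => scan.requiredSchema.catalogString
  }
  assert(
    scanSchemata.nonEmpty,
    s"No Comet scan found in plan:\n${df.queryExecution.executedPlan}")
  assert(
    scanSchemata == expectedCatalogStrings.toSeq,
    s"Expected schemata ${expectedCatalogStrings.mkString(", ")} " +
      s"but found ${scanSchemata.mkString(", ")}")
}

// Example assertion for the "top-level struct field" scenario, against a
// hypothetical table `contacts` with a column `name: struct<first:string,last:string>`:
//   checkCometScanSchemata(
//     sql("SELECT name.first FROM contacts"),
//     "struct<name:struct<first:string>>")
```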

This PR was scaffolded with the project's audit-comet-expression workflow extended to a config-level audit, plus the superpowers:brainstorming and superpowers:using-git-worktrees skills.

How are these changes tested?

  • ./mvnw test -Dsuites="org.apache.comet.parquet.CometNestedSchemaPruningSuite" -Dtest=none: 12/12 pass on Spark 3.5.8 (default).
  • ./mvnw test -Pspark-3.4 -Dsuites="org.apache.comet.parquet.CometNestedSchemaPruningSuite" -Dtest=none: 12/12 pass.
  • ./mvnw test -Pspark-4.0 -Dsuites="org.apache.comet.parquet.CometNestedSchemaPruningSuite" -Dtest=none: 12/12 pass.

No Comet bugs were uncovered by the audit.

andygrove added 3 commits May 2, 2026 08:16
Audit every Spark expression that reads spark.sql.legacy.timeParserPolicy
(date_format, from_unixtime, unix_timestamp, to_unix_timestamp, to_timestamp,
to_date, and Spark 4's try_to_timestamp) and add CometSqlFileTestSuite
coverage. For each expression provide:

- a ConfigMatrix file exercising convergent inputs under LEGACY, CORRECTED,
  and EXCEPTION
- per-policy files locking in divergent behavior (lenient parsing under
  LEGACY, null returns under CORRECTED, INCONSISTENT_BEHAVIOR_CROSS_VERSION
  under EXCEPTION)

Also add docs/source/contributor-guide/spark_configs_support.md modeled on
the expression audit log to track Spark configs that affect Comet behavior,
with full audit notes for the timeParserPolicy entry.

All 42 generated tests pass on Spark 3.4.3, 3.5.8, and 4.0.1.
…ning.enabled

Add CometNestedSchemaPruningSuite, a focused Scala suite that runs each
scenario across both Comet scan implementations (native_datafusion,
native_iceberg_compat) under the V1 Parquet path. For each scenario the
suite walks the executed plan, extracts requiredSchema from the Comet scan
exec, and asserts the pruned (or unpruned) shape matches the expected
catalogString, then compares results against Spark.

Plain Parquet V2 is excluded because Comet's V2 scan rule only covers CSV
and Iceberg, leaving Parquet V2 as plain BatchScanExec without a Comet
scan to inspect.

Scenarios cover top-level struct field, field inside array of struct,
field inside map value, doubly-nested struct field, projection plus
filter on nested field, and null at an intermediate struct level. Each
scenario exercises both pruning-enabled and pruning-disabled behavior.

Also append a second entry to docs/source/contributor-guide/spark_configs_support.md
with the full audit notes for nestedSchemaPruning.enabled.

All 12 generated test cases pass on Spark 3.4.3, 3.5.8, and 4.0.1.
@andygrove andygrove marked this pull request as draft May 3, 2026 23:17