
test: add Scala test coverage for spark.sql.optimizer.nestedSchemaPruning.enabled #4185

Draft
andygrove wants to merge 3 commits into apache:main from andygrove:feat/nested-schema-pruning-tests

Conversation

@andygrove
Member

Which issue does this PR close?

N/A. Audit-driven test coverage; no behavior change.

Stacked on #4183. This PR's branch is based on feat/legacy-time-parser-policy-tests, so until #4183 merges the diff here will include those commits as well.

Rationale for this change

spark.sql.optimizer.nestedSchemaPruning.enabled (default true) is the Catalyst-level switch that lets columnar readers fetch only the leaf fields of a nested column that a query actually references. Comet propagates the flag into the Hadoop conf via CometParquetFileFormat.populateConf and otherwise inherits Spark's already-pruned requiredSchema, but Comet's own test tree had no end-to-end coverage. Spark's ParquetSchemaPruningSuite is patched in dev/diffs/*.diff to recognize Comet scans, but that only validates correctness when CI runs the Spark tests, and it doesn't lock in plan-level expectations from inside Comet.
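
For context, here is a minimal sketch in plain vanilla Spark (not Comet, and not code from the new suite; the temp path and toy schema are made up) of what the flag controls:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.FileSourceScanExec

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

// Write a struct column with two leaves, then select only one of them.
Seq(("alice", 1), ("bob", 2))
  .toDF("name", "age")
  .selectExpr("named_struct('name', name, 'age', age) AS person")
  .write.mode("overwrite").parquet("/tmp/pruning-demo")

val df = spark.read.parquet("/tmp/pruning-demo").select($"person.name")
val scan = df.queryExecution.executedPlan
  .collectFirst { case s: FileSourceScanExec => s }
  .get

// With pruning enabled the scan only needs the accessed leaf:
//   struct<person:struct<name:string>>
// With the flag set to false it falls back to the full struct:
//   struct<person:struct<name:string,age:int>>
println(scan.requiredSchema.catalogString)
```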

A SQL-file test cannot prove pruning happened: it only checks results, and pruned-vs-unpruned reads usually return the same rows. Plan inspection is the only way to catch a regression, so this audit uses Scala tests, mirroring Spark's checkScanSchemata pattern.

What changes are included in this PR?

  • New suite: spark/src/test/scala/org/apache/comet/parquet/CometNestedSchemaPruningSuite.scala. Each scenario runs across SCAN_NATIVE_DATAFUSION and SCAN_NATIVE_ICEBERG_COMPAT under the V1 Parquet path. A small helper walks the executed plan, collects requiredSchema from any CometScanExec/CometNativeScanExec, and asserts it matches an expected catalog-string schema; results are then compared against Spark via checkSparkAnswer. (A sketch of this helper follows the list.) Scenarios:
    • top-level struct field
    • field inside array of struct
    • field inside map value
    • doubly-nested struct field
    • projection plus filter on a nested field
    • null at an intermediate struct level
  • Plain Parquet V2 is excluded from the matrix because Comet's V2 scan rule only covers CSV and Iceberg, so Parquet V2 stays as plain BatchScanExec and there's no Comet scan to inspect. Documented in the suite's class comment and the audit notes.
  • Appends a second entry to docs/source/contributor-guide/spark_configs_support.md with the full audit notes for nestedSchemaPruning.enabled: source semantics, current Comet status, test layout, and findings.
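
To make the helper concrete, a minimal sketch of the pattern, assuming the Comet scan nodes live under org.apache.spark.sql.comet and expose requiredSchema as described above; the helper name, table, and usage below are illustrative rather than copied from the suite:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.comet.{CometNativeScanExec, CometScanExec} // assumed package path

// Collect the pruned read schema from every Comet scan in the executed plan
// and compare it against the expected catalog-string form.
def checkCometScanSchemata(df: DataFrame, expectedCatalogStrings: String*): Unit = {
  val scanSchemata = df.queryExecution.executedPlan.collect {
    case scan: CometScanExec       => scan.requiredSchema.catalogString
    case scan: CometNativeScanExec => scan.requiredSchema.catalogString
  }
  assert(
    scanSchemata.nonEmpty,
    s"No Comet scan found in plan:\n${df.queryExecution.executedPlan}")
  assert(
    scanSchemata == expectedCatalogStrings.toSeq,
    s"Expected schemata ${expectedCatalogStrings.mkString(", ")} " +
      s"but found ${scanSchemata.mkString(", ")}")
}

// Example assertion for the "top-level struct field" scenario, against a
// hypothetical table `contacts` with a column `name: struct<first:string,last:string>`:
//   checkCometScanSchemata(
//     sql("SELECT name.first FROM contacts"),
//     "struct<name:struct<first:string>>")
```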

This PR was scaffolded with the project's audit-comet-expression workflow extended to a config-level audit, plus the superpowers:brainstorming and superpowers:using-git-worktrees skills.

How are these changes tested?

  • ./mvnw test -Dsuites="org.apache.comet.parquet.CometNestedSchemaPruningSuite" -Dtest=none: 12/12 pass on Spark 3.5.8 (default).
  • ./mvnw test -Pspark-3.4 -Dsuites="org.apache.comet.parquet.CometNestedSchemaPruningSuite" -Dtest=none: 12/12 pass.
  • ./mvnw test -Pspark-4.0 -Dsuites="org.apache.comet.parquet.CometNestedSchemaPruningSuite" -Dtest=none: 12/12 pass.

No Comet bugs were uncovered by the audit.

andygrove added 3 commits May 2, 2026 08:16
Audit every Spark expression that reads spark.sql.legacy.timeParserPolicy
(date_format, from_unixtime, unix_timestamp, to_unix_timestamp, to_timestamp,
to_date, and Spark 4's try_to_timestamp) and add CometSqlFileTestSuite
coverage. For each expression provide:

- a ConfigMatrix file exercising convergent inputs under LEGACY, CORRECTED,
  and EXCEPTION
- per-policy files locking in divergent behavior (lenient parsing under
  LEGACY, null returns under CORRECTED, INCONSISTENT_BEHAVIOR_CROSS_VERSION
  under EXCEPTION)

Also add docs/source/contributor-guide/spark_configs_support.md modeled on
the expression audit log to track Spark configs that affect Comet behavior,
with full audit notes for the timeParserPolicy entry.

All 42 generated tests pass on Spark 3.4.3, 3.5.8, and 4.0.1.
…ning.enabled

Add CometNestedSchemaPruningSuite, a focused Scala suite that runs each
scenario across both Comet scan implementations (native_datafusion,
native_iceberg_compat) under the V1 Parquet path. For each scenario the
suite walks the executed plan, extracts requiredSchema from the Comet scan
exec, and asserts the pruned (or unpruned) shape matches the expected
catalogString, then compares results against Spark.

Plain Parquet V2 is excluded because Comet's V2 scan rule only covers CSV
and Iceberg, leaving Parquet V2 as plain BatchScanExec without a Comet
scan to inspect.

Scenarios cover top-level struct field, field inside array of struct,
field inside map value, doubly-nested struct field, projection plus
filter on nested field, and null at an intermediate struct level. Each
scenario exercises both pruning-enabled and pruning-disabled behavior.

Also append a second entry to docs/source/contributor-guide/spark_configs_support.md
with the full audit notes for nestedSchemaPruning.enabled.

All 12 generated test cases pass on Spark 3.4.3, 3.5.8, and 4.0.1.
@andygrove andygrove marked this pull request as draft May 3, 2026 23:17