
test: relax bytesRead ratio assertion for Spark 4.1+ #4197

Open
andygrove wants to merge 2 commits into apache:main from andygrove:issue-4194

Conversation

@andygrove
Member

Which issue does this PR close?

Closes #4194 (sub-issue of #4098).

Rationale for this change

Three tests in `CometTaskMetricsSuite` were ignored on Spark 4.1+ because Comet's reported `bytesRead` was 6-14× larger than Spark's, breaking the existing 0.7-1.3 ratio assertion.

Investigation: Spark 4.1 changed `ParquetFileFormat` to pre-open the `SeekableInputStream` and read the file footer outside the `FileScanRDD.compute()` thread. Spark's `inputMetrics.bytesRead` is updated from a Hadoop FileSystem thread-local byte counter (`SparkHadoopUtil.getFSBytesReadOnThreadCallback`) that only captures reads on the `compute()` thread — so reads serviced by the pre-opened stream's internal buffer go uncounted.
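The thread-locality problem can be shown in isolation. The following is a toy model, not Spark's actual implementation: a per-thread byte counter simply cannot see IO performed on a different thread, which is exactly what happens when the footer is read on a pre-open thread:

```scala
// Toy model of a per-thread byte counter, illustrating why reads performed
// on a different thread (e.g. Spark 4.1's footer pre-reading) go uncounted.
// This is a sketch; Spark's real mechanism goes through Hadoop FileSystem
// statistics, not a raw ThreadLocal.
object ThreadLocalCounterSketch extends App {
  val bytesRead = new ThreadLocal[Long] { override def initialValue: Long = 0L }

  def read(n: Long): Unit = bytesRead.set(bytesRead.get + n)

  // Footer read happens on a helper thread, like the pre-open path.
  val helper = new Thread(() => read(4096L))
  helper.start()
  helper.join()

  // Row-group read happens on the "compute" thread.
  read(1024L)

  // Only the compute-thread read is visible from this thread:
  // prints 1024, even though 5120 bytes were actually read.
  println(bytesRead.get)
}
```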

| File size | Spark 4.1 `bytesRead` vs actual | Comet/Spark ratio |
| --- | --- | --- |
| Tiny (whole file pre-buffered) | Near 0 | 10-15× |
| Small (MB) | Significant under-count | 2-5× |
| Medium / Large (10+ MB) | Subsequent row-group reads counted | Closer to 1× |

Comet (via DataFusion's `bytes_scanned`) reports actual file IO, which is the only truthful number to compare against. The existing tight ratio is unrecoverable on Spark 4.1+ for small files without changing what Comet reports — and changing Comet to under-report would be wrong.

What changes are included in this PR?

  • `CometTaskMetricsSuite`: extracted an `assertCometBytesReadInRange` helper and used it from all three affected tests. On Spark <= 4.0 it keeps the tight 0.7-1.3 ratio. On Spark 4.1+ it asserts only `cometBytes >= sparkBytes` and that both are positive. Records read is still checked exactly.
  • Removed the `assume(!isSpark41Plus, ...)` guards from the three previously-skipped tests.
  • `docs/source/user-guide/latest/metrics.md`: added a section explaining the discrepancy, the file-size dependence, and that this is purely observability — `inputMetrics.bytesRead` is not consumed by the planner, the optimizer, or AQE, so plan choices and correctness are unaffected.
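The version-aware check described above could look roughly like the following. This is a hypothetical reconstruction from the PR description, not the actual code in `CometTaskMetricsSuite`; the real helper's signature and version check (`isSpark41Plus`) may differ:

```scala
// Hypothetical sketch of the version-aware assertion described above;
// the real helper in CometTaskMetricsSuite may differ in detail.
object BytesReadAssertionSketch extends App {
  def assertCometBytesReadInRange(
      cometBytes: Long,
      sparkBytes: Long,
      isSpark41Plus: Boolean): Unit = {
    if (isSpark41Plus) {
      // Spark 4.1+ under-reports bytesRead, so only positivity and a
      // lower bound can be asserted.
      assert(sparkBytes > 0 && cometBytes > 0, "bytesRead must be positive")
      assert(cometBytes >= sparkBytes, "Comet must report at least Spark's bytes")
    } else {
      // Spark <= 4.0: keep the tight 0.7-1.3 ratio band.
      val ratio = cometBytes.toDouble / sparkBytes
      assert(ratio >= 0.7 && ratio <= 1.3, s"ratio $ratio outside [0.7, 1.3]")
    }
  }

  // Illustrative byte counts, not values from the actual suite.
  assertCometBytesReadInRange(98000L, 100000L, isSpark41Plus = false)
  assertCometBytesReadInRange(120000L, 12000L, isSpark41Plus = true)
  println("both assertions pass")
}
```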

How are these changes tested?

  • Full `CometTaskMetricsSuite` (5 tests) passes on Spark 4.1 locally (was 3 failures + 2 passes before).
  • Full `CometTaskMetricsSuite` (5 tests) passes on Spark 4.0 locally (V1 path unchanged).

andygrove added 2 commits May 3, 2026 13:21
Closes apache#4194.

Spark 4.1 changed `ParquetFileFormat` to pre-open the `SeekableInputStream`
and read the file footer outside the `FileScanRDD.compute()` thread. Spark's
`inputMetrics.bytesRead` is updated from a Hadoop FileSystem thread-local
byte counter (`SparkHadoopUtil.getFSBytesReadOnThreadCallback`) that only
captures reads on the compute() thread, so reads served by the pre-opened
stream's internal buffer go uncounted. Spark's reported `bytesRead` is now
5-15x smaller than the actual file IO for the small files used in
`CometTaskMetricsSuite` (the discrepancy shrinks for larger files where
row-group reads cross buffer boundaries).

Comet (via DataFusion's `bytes_scanned`) still reports actual file IO. The
existing 0.7-1.3 ratio assertion treats the truthful Comet value as broken;
on Spark 4.1+ it is the only meaningful number.

Replace the ratio assertion with a Spark-version-aware helper:

- Spark <= 4.0: keep the tight 0.7-1.3 band (semantics unchanged).
- Spark 4.1+: assert only `cometBytes >= sparkBytes` and `> 0`, since the
  Spark side under-reports.

`recordsRead` is unaffected and continues to be checked exactly. The
`assume(!isSpark41Plus, ...)` guards on the three affected tests are
removed.

Adds a doc section in `docs/source/user-guide/latest/metrics.md` describing
the difference, when it applies, and that it is purely observability (does
not affect Spark's planner, optimizer, or AQE).
@andygrove changed the title from "test: relax bytesRead ratio assertion for Spark 4.1+ + docs" to "test: relax bytesRead ratio assertion for Spark 4.1+" on May 3, 2026

Successfully merging this pull request may close these issues.

Spark 4.1: native_datafusion bytesRead task metric off by 6-14x vs Spark
