
test: relax bytesRead ratio assertion for Spark 4.1+ #4197

Open
andygrove wants to merge 2 commits into apache:main from andygrove:issue-4194

Conversation

@andygrove
Member

Which issue does this PR close?

Closes #4194 (sub-issue of #4098).

Rationale for this change

Three tests in `CometTaskMetricsSuite` were ignored on Spark 4.1+ because Comet's reported `bytesRead` was 6-14× larger than Spark's, breaking the existing 0.7-1.3 ratio assertion.

Investigation: Spark 4.1 changed `ParquetFileFormat` to pre-open the `SeekableInputStream` and read the file footer outside the `FileScanRDD.compute()` thread. Spark's `inputMetrics.bytesRead` is updated from a Hadoop FileSystem thread-local byte counter (`SparkHadoopUtil.getFSBytesReadOnThreadCallback`) that only captures reads on the `compute()` thread — so reads serviced by the pre-opened stream's internal buffer go uncounted.
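The thread-locality problem can be shown in isolation. The following is a toy model, not Spark's actual implementation: a per-thread byte counter simply cannot see IO performed on a different thread, which is exactly what happens when the footer is read on a pre-open thread:

```scala
// Toy model of a per-thread byte counter, illustrating why reads performed
// on a different thread (e.g. Spark 4.1's footer pre-reading) go uncounted.
// This is a sketch; Spark's real mechanism goes through Hadoop FileSystem
// statistics, not a raw ThreadLocal.
object ThreadLocalCounterSketch extends App {
  val bytesRead = new ThreadLocal[Long] { override def initialValue: Long = 0L }

  def read(n: Long): Unit = bytesRead.set(bytesRead.get + n)

  // Footer read happens on a helper thread, like the pre-open path.
  val helper = new Thread(() => read(4096L))
  helper.start()
  helper.join()

  // Row-group read happens on the "compute" thread.
  read(1024L)

  // Only the compute-thread read is visible from this thread:
  // prints 1024, even though 5120 bytes were actually read.
  println(bytesRead.get)
}
```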

| File size | Spark 4.1 `bytesRead` vs actual | Comet/Spark ratio |
| --- | --- | --- |
| Tiny (whole file pre-buffered) | Near 0 | 10-15× |
| Small (MB) | Significant under-count | 2-5× |
| Medium / Large (10+ MB) | Subsequent row-group reads counted | Closer to 1× |

Comet (via DataFusion's `bytes_scanned`) reports actual file IO, which is the only truthful number to compare against. The existing tight ratio is unrecoverable on Spark 4.1+ for small files without changing what Comet reports — and changing Comet to under-report would be wrong.

What changes are included in this PR?

  • `CometTaskMetricsSuite`: extracted an `assertCometBytesReadInRange` helper and used it from all three affected tests. On Spark <= 4.0 it keeps the tight 0.7-1.3 ratio. On Spark 4.1+ it asserts only `cometBytes >= sparkBytes` and that both are positive. Records read is still checked exactly.
  • Removed the `assume(!isSpark41Plus, ...)` guards from the three previously-skipped tests.
  • `docs/source/user-guide/latest/metrics.md`: added a section explaining the discrepancy, the file-size dependence, and that this is purely observability — `inputMetrics.bytesRead` is not consumed by the planner, the optimizer, or AQE, so plan choices and correctness are unaffected.
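The version-aware check described above could look roughly like the following. This is a hypothetical reconstruction from the PR description, not the actual code in `CometTaskMetricsSuite`; the real helper's signature and version check (`isSpark41Plus`) may differ:

```scala
// Hypothetical sketch of the version-aware assertion described above;
// the real helper in CometTaskMetricsSuite may differ in detail.
object BytesReadAssertionSketch extends App {
  def assertCometBytesReadInRange(
      cometBytes: Long,
      sparkBytes: Long,
      isSpark41Plus: Boolean): Unit = {
    if (isSpark41Plus) {
      // Spark 4.1+ under-reports bytesRead, so only positivity and a
      // lower bound can be asserted.
      assert(sparkBytes > 0 && cometBytes > 0, "bytesRead must be positive")
      assert(cometBytes >= sparkBytes, "Comet must report at least Spark's bytes")
    } else {
      // Spark <= 4.0: keep the tight 0.7-1.3 ratio band.
      val ratio = cometBytes.toDouble / sparkBytes
      assert(ratio >= 0.7 && ratio <= 1.3, s"ratio $ratio outside [0.7, 1.3]")
    }
  }

  // Illustrative byte counts, not values from the actual suite.
  assertCometBytesReadInRange(98000L, 100000L, isSpark41Plus = false)
  assertCometBytesReadInRange(120000L, 12000L, isSpark41Plus = true)
  println("both assertions pass")
}
```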

How are these changes tested?

  • Full `CometTaskMetricsSuite` (5 tests) passes on Spark 4.1 locally (was 3 failures + 2 passes before).
  • Full `CometTaskMetricsSuite` (5 tests) passes on Spark 4.0 locally (V1 path unchanged).

andygrove added 2 commits May 3, 2026 13:21
Closes apache#4194.

Spark 4.1 changed `ParquetFileFormat` to pre-open the `SeekableInputStream`
and read the file footer outside the `FileScanRDD.compute()` thread. Spark's
`inputMetrics.bytesRead` is updated from a Hadoop FileSystem thread-local
byte counter (`SparkHadoopUtil.getFSBytesReadOnThreadCallback`) that only
captures reads on the compute() thread, so reads served by the pre-opened
stream's internal buffer go uncounted. Spark's reported `bytesRead` is now
5-15x smaller than the actual file IO for the small files used in
`CometTaskMetricsSuite` (the discrepancy shrinks for larger files where
row-group reads cross buffer boundaries).

Comet (via DataFusion's `bytes_scanned`) still reports actual file IO. The
existing 0.7-1.3 ratio assertion treats the truthful Comet value as broken;
on Spark 4.1+ it is the only meaningful number.

Replace the ratio assertion with a Spark-version-aware helper:

- Spark <= 4.0: keep the tight 0.7-1.3 band (semantics unchanged).
- Spark 4.1+: assert only `cometBytes >= sparkBytes` and `> 0`, since the
  Spark side under-reports.

`recordsRead` is unaffected and continues to be checked exactly. The
`assume(!isSpark41Plus, ...)` guards on the three affected tests are
removed.

Adds a doc section in `docs/source/user-guide/latest/metrics.md` describing
the difference, when it applies, and that it is purely observability (does
not affect Spark's planner, optimizer, or AQE).
@andygrove changed the title from "test: relax bytesRead ratio assertion for Spark 4.1+ + docs" to "test: relax bytesRead ratio assertion for Spark 4.1+" on May 3, 2026

Successfully merging this pull request may close these issues.

Spark 4.1: native_datafusion bytesRead task metric off by 6-14x vs Spark
