test: relax bytesRead ratio assertion for Spark 4.1+ #4197
Open
andygrove wants to merge 2 commits into apache:main from
Conversation
Closes apache#4194.

Spark 4.1 changed `ParquetFileFormat` to pre-open the `SeekableInputStream` and read the file footer outside the `FileScanRDD.compute()` thread. Spark's `inputMetrics.bytesRead` is updated from a Hadoop FileSystem thread-local byte counter (`SparkHadoopUtil.getFSBytesReadOnThreadCallback`) that only captures reads on the `compute()` thread, so reads served by the pre-opened stream's internal buffer go uncounted. Spark's reported `bytesRead` is now 5-15x smaller than the actual file IO for the small files used in `CometTaskMetricsSuite` (the discrepancy shrinks for larger files, where row-group reads cross buffer boundaries). Comet (via DataFusion's `bytes_scanned`) still reports actual file IO. The existing 0.7-1.3 ratio assertion treats the truthful Comet value as broken; on Spark 4.1+ the Comet value is the only meaningful number.

Replace the ratio assertion with a Spark-version-aware helper:

- Spark <= 4.0: keep the tight 0.7-1.3 band (semantics unchanged).
- Spark 4.1+: assert only `cometBytes >= sparkBytes` and `> 0`, since the Spark side under-reports.

`recordsRead` is unaffected and continues to be checked exactly. The `assume(!isSpark41Plus, ...)` guards on the three affected tests are removed. Also adds a section to `docs/source/user-guide/latest/metrics.md` describing the difference, when it applies, and that it is purely observability (it does not affect Spark's planner, optimizer, or AQE).
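A minimal sketch of what such a version-aware helper could look like; the helper name `assertBytesReadConsistent` and the message text are illustrative assumptions, not the exact code in this PR (only `isSpark41Plus` and the 0.7-1.3 band come from the description above):

```scala
// Illustrative sketch only; names are assumptions, not the PR's actual code.
private def assertBytesReadConsistent(sparkBytes: Long, cometBytes: Long): Unit = {
  if (isSpark41Plus) {
    // Spark 4.1+ under-reports bytesRead for reads served outside compute(),
    // so only sanity-check the truthful Comet value against it.
    assert(cometBytes > 0, s"expected non-zero Comet bytesRead, got $cometBytes")
    assert(cometBytes >= sparkBytes,
      s"expected comet=$cometBytes >= spark=$sparkBytes")
  } else {
    // Spark <= 4.0: both sides measure the same IO, so keep the tight band.
    val ratio = cometBytes.toDouble / sparkBytes.toDouble
    assert(ratio >= 0.7 && ratio <= 1.3,
      s"bytesRead ratio $ratio outside [0.7, 1.3] (spark=$sparkBytes, comet=$cometBytes)")
  }
}
```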
Which issue does this PR close?
Closes #4194 (sub-issue of #4098).
Rationale for this change
Three tests in `CometTaskMetricsSuite` were ignored on Spark 4.1+ because Comet's reported `bytesRead` was 6-14× larger than Spark's, breaking the existing 0.7-1.3 ratio assertion.
Investigation: Spark 4.1 changed `ParquetFileFormat` to pre-open the `SeekableInputStream` and read the file footer outside the `FileScanRDD.compute()` thread. Spark's `inputMetrics.bytesRead` is updated from a Hadoop FileSystem thread-local byte counter (`SparkHadoopUtil.getFSBytesReadOnThreadCallback`) that only captures reads on the `compute()` thread — so reads serviced by the pre-opened stream's internal buffer go uncounted.
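A small self-contained illustration (not Spark code; all names here are made up) of why a per-thread counter misses IO performed on another thread, which is the mechanism described above:

```scala
// Demonstrates the thread-local counting gap: reads performed on a helper
// thread (analogous to Spark 4.1 pre-opening the stream and reading the
// footer outside FileScanRDD.compute()) never reach the counter that the
// compute thread reads from.
object ThreadLocalCounterDemo {
  // Per-thread byte counter, analogous to Hadoop FileSystem read statistics.
  private val bytesRead = new ThreadLocal[Long] {
    override def initialValue(): Long = 0L
  }

  private def read(n: Long): Unit = bytesRead.set(bytesRead.get() + n)

  def main(args: Array[String]): Unit = {
    // "Footer" read on a helper thread: bumps that thread's counter, not ours.
    val footerThread = new Thread(() => read(8 * 1024))
    footerThread.start()
    footerThread.join()

    // "Row-group" read on the current ("compute") thread is the only one counted.
    read(64 * 1024)
    println(s"bytesRead observed on compute thread: ${bytesRead.get()}") // 65536
  }
}
```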
Comet (via DataFusion's `bytes_scanned`) reports actual file IO, which is the only truthful number to compare against. The existing tight ratio is unrecoverable on Spark 4.1+ for small files without changing what Comet reports — and changing Comet to under-report would be wrong.
What changes are included in this PR?
- Replace the 0.7-1.3 `bytesRead` ratio assertion in `CometTaskMetricsSuite` with a Spark-version-aware helper: Spark <= 4.0 keeps the tight band, Spark 4.1+ asserts only `cometBytes >= sparkBytes` and `> 0`.
- Keep the exact `recordsRead` checks unchanged.
- Remove the `assume(!isSpark41Plus, ...)` guards on the three affected tests.
- Add a section to `docs/source/user-guide/latest/metrics.md` describing the `bytesRead` difference on Spark 4.1+, when it applies, and that it is observability-only.
How are these changes tested?
Existing tests in `CometTaskMetricsSuite`; the three previously guarded tests now also run on Spark 4.1+ with the relaxed assertion.