[SPARK-57711][SQL][TEST] Deflake SQLAppStatusListenerMemoryLeakSuite "no memory leak"#56790
[SPARK-57711][SQL][TEST] Deflake SQLAppStatusListenerMemoryLeakSuite "no memory leak"#56790HyukjinKwon wants to merge 2 commits into
Conversation
… "no memory leak" ### What changes were proposed in this pull request? Wrap the final `noLiveData()` assertion in `SQLAppStatusListenerMemoryLeakSuite` in an `eventually(...)` block. ### Why are the changes needed? The test fails intermittently on the SBT `sql - other tests` job (e.g. master runs 28000645341, 28004073554) with `noLiveData() was false`. The cleanup that removes entries from `liveExecutions`/`stageMetrics` is finalized by the metrics aggregation triggered on `SparkListenerSQLExecutionEnd`; asserting immediately after `waitUntilEmpty()` is racy. Polling with `eventually` removes the race without weakening the assertion. ### Does this PR introduce any user-facing change? No, test only. ### How was this patch tested? Existing test, repeated. Co-authored-by: Isaac
| assert(statusStore.planGraphCount() <= 50) | ||
| // No live data should be left behind after all executions end. | ||
| assert(statusStore.listener.get.noLiveData()) | ||
| // No live data should be left behind after all executions end. The cleanup of live |
There was a problem hiding this comment.
The added comment / PR body pin the flake on the kvstore.doAsync cleanup completing after the bus drains, but the test sets ASYNC_TRACKING_ENABLED=false, which makes doAsync run synchronously on the listener thread (sameThreadExecutor); so waitUntilEmpty() should already cover that path. The stated root cause likely doesn't hold under this config; the real residual (e.g. stageMetrics cleanup) isn't identified. The fix is still safe, but please clarify the actual race (or confirm the CI env doesn't apply the config) so this isn't masking an undiagnosed condition. Tighten the inline comment accordingly.
There was a problem hiding this comment.
Good catch, you're right. With ASYNC_TRACKING_ENABLED=false the kvstore.doAsync block in onExecutionEnd runs inline (ElementTrackingStore uses sameThreadExecutorService), so it's already covered by waitUntilEmpty() and isn't the cause.
The actual race is in end-event delivery. The test runs failing jobs (df.foreach { throw ... }), and for a failed job DAGScheduler.failJobAndIndependentStages notifies the job waiter -- unblocking the failing action on the test thread -- before it posts SparkListenerJobEnd:
cleanupStateForJobAndIndependentStages(job)
job.listener.jobFailed(error) // unblocks the action
listenerBus.post(SparkListenerJobEnd(job.jobId, ..., JobFailed(error))) // posted afterwardsSo the test thread can race ahead, exit the loop, and call waitUntilEmpty() before the trailing JobEnd for the last failed execution is enqueued. If it lands just after the bus drains, that execution never reaches the jobs.size + 1 cleanup threshold and lingers in liveExecutions, so noLiveData() is intermittently false. I've rewritten the inline comment and PR description accordingly.
| // executions/stage metrics is finalized when the metrics aggregation triggered by the | ||
| // SQLExecutionEnd event completes, so wait for the listener to drain rather than asserting | ||
| // immediately to avoid a timing race. | ||
| eventually(timeout(10.seconds), interval(10.milliseconds)) { |
There was a problem hiding this comment.
Note: stacking a 10s eventually on top of the existing 10s waitUntilEmpty() doubles worst-case hang to ~20s on a genuinely broken system.
There was a problem hiding this comment.
Reduced the eventually timeout to 5.seconds. The trailing end event lands within milliseconds in practice, so the worst-case additive hang is now ~15s rather than ~20s.
…meout Address review feedback: the doAsync metrics-aggregation explanation was wrong under ASYNC_TRACKING_ENABLED=false (it runs inline). The real race is a trailing SparkListenerJobEnd for a failed job posted after the job waiter is notified, so it can arrive after waitUntilEmpty() drains the bus. Reduce the eventually timeout from 10s to 5s to avoid doubling the worst-case hang. Co-authored-by: Isaac
What changes were proposed in this pull request?
Wrap the final
noLiveData()assertion inSQLAppStatusListenerMemoryLeakSuite("no memory leak") in an
eventually(...)block so it polls for the trailingcleanup instead of asserting once immediately.
Why are the changes needed?
The test is flaky on the SBT
sql - other testsjob, failing withnoLiveData() was false:SQLAppStatusListenerremoves an execution's entries fromliveExecutions/stageMetricsonly once its end-event count reachesjobs.size + 1, i.e. onceall of its
SparkListenerJobEndevents and itsSparkListenerSQLExecutionEndhave been processed.
The test runs 100 failing jobs (
df.foreach { throw ... }). For a failed job,DAGScheduler.failJobAndIndependentStagesnotifies the job waiter -- whichunblocks the failing action on the test thread -- before it posts
SparkListenerJobEndto the listener bus:So the trailing
JobEndfor the last failed execution can still be in flight onthe DAGScheduler event-loop thread when the test thread races ahead, exits the
loop, and calls
sc.listenerBus.waitUntilEmpty(). If thatJobEndis enqueuedjust after the bus is drained, the failed execution never reaches the cleanup
threshold and lingers in
liveExecutions, so assertingnoLiveData()immediately intermittently observes leftover live data.
Note this is unrelated to the
kvstore.doAsyncmetrics aggregation inonExecutionEnd: the test setsASYNC_TRACKING_ENABLED=false, so that callbackruns inline on the listener thread (
sameThreadExecutorService) and is alreadycovered by
waitUntilEmpty(). The residual race is purely in the delivery ofthe end events.
Polling with
eventuallylets the trailing event get delivered and the liveentries drain without weakening the assertion. The timeout is kept modest
(
5.seconds) since the trailing event lands within milliseconds in practice.Does this PR introduce any user-facing change?
No, test only.
How was this patch tested?
Existing test, run repeatedly. This reproduces under SBT, so the standard fork
"Build" (SBT) exercises it.
CI evidence
sql - other tests--noLiveData() was false)Was this patch authored or co-authored using generative AI tooling?
Yes, drafted with assistance from Isaac.