[SPARK-57650][YARN][TESTS][FOLLOW-UP] Reduce YarnClusterSuite flakiness from runner contention#56800
Closed
HyukjinKwon wants to merge 1 commit into
…s from runner contention
SPARK-57650 fixed the deterministic ACCEPTED-state hang in BaseYarnClusterSuite (maximum-am-resource-percent). The master Build/Java21 and Build/Java25 `yarn` lanes still go red ~50% of runs: YarnClusterSuite tests intermittently time out (`handle.getState().isFinal() was false`) because the AM/executor containers fail to connect to the driver's RPC server on localhost (Connection refused).
The in-JVM mini RM+NM, the driver subprocess and the AM/executor JVMs all contend for CPU on a single CI runner, so the driver's accept loop occasionally stalls; an executor that loses the race exits after the default 3 connection retries, and the application can then never finish.
Two test-only mitigations in BaseYarnClusterSuite:
- Give the mini NodeManager 8GB (and matching max-allocation) so executor allocation is never starved once the ~1.4GB AM is running.
- Raise the executor->driver connection retry budget (spark.rpc.io.maxRetries=10, retryWait=2s) so a transient accept stall does not permanently fail the executor. Individual tests can still override.
Co-authored-by: Isaac
What changes were proposed in this pull request?
Two test-only mitigations in `BaseYarnClusterSuite` to reduce `YarnClusterSuite` flakiness on busy CI runners:
- Give the mini `NodeManager` 8GB (`yarn.nodemanager.resource.memory-mb`) and a matching `yarn.scheduler.maximum-allocation-mb`, so executor allocation is never starved once the ~1.4GB AM is running.
- Raise the executor->driver connection retry budget (`spark.rpc.io.maxRetries=10`, `spark.rpc.io.retryWait=2s`) so a transient accept stall does not permanently fail the executor. These are defaults; individual tests can still override them.
Why are the changes needed?
This is a follow-up to SPARK-57650, which fixed the deterministic ACCEPTED-state hang (`maximum-am-resource-percent`). After that fix, the master `Build / Maven` `yarn` lanes still go red intermittently: `YarnClusterSuite` tests time out (`handle.getState().isFinal() was false`) because the AM/executor containers fail to connect to the driver's RPC server on localhost (Connection refused).
The in-JVM mini RM+NM, the driver subprocess, and the AM/executor JVMs all contend for CPU on a single CI runner, so the driver's accept loop occasionally stalls; an executor that loses the race exits after the default 3 connection retries, and the application can then never finish.
This is one of the two remaining unmerged fixes for the failures that keep the apache/spark master matrix builds red.
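As a rough illustration of the two mitigations, the sketch below models the settings as plain Scala maps. It is not the code from this PR: the object name, the `effectiveConf` and `retryWaitBudgetSeconds` helpers, and the use of `Map` in place of the real `YarnConfiguration` and `SparkConf` objects are assumptions made only for illustration. The configuration keys and values are the ones named in the description, and the AM size is the approximate ~1.4GB figure quoted there.

```scala
// Illustrative sketch only: plain maps stand in for the real
// YarnConfiguration / SparkConf objects used by BaseYarnClusterSuite.
object YarnSuiteDefaultsSketch {
  // Mini NodeManager memory and the matching scheduler cap, in MB (8GB).
  val nodeManagerMemoryMb: Int = 8 * 1024

  val yarnSettings: Map[String, String] = Map(
    "yarn.nodemanager.resource.memory-mb" -> nodeManagerMemoryMb.toString,
    "yarn.scheduler.maximum-allocation-mb" -> nodeManagerMemoryMb.toString
  )

  // Suite-wide defaults for executor->driver connection retries.
  val sparkDefaults: Map[String, String] = Map(
    "spark.rpc.io.maxRetries" -> "10",
    "spark.rpc.io.retryWait" -> "2s"
  )

  // Hypothetical helper: settings supplied by an individual test
  // take precedence over the suite-wide defaults.
  def effectiveConf(testOverrides: Map[String, String]): Map[String, String] =
    sparkDefaults ++ testOverrides

  // Hypothetical helper: total time spent waiting between retries, in
  // seconds. Only handles values written with an "s" suffix, e.g. "2s".
  def retryWaitBudgetSeconds(conf: Map[String, String]): Int = {
    val retries = conf("spark.rpc.io.maxRetries").toInt
    val waitSeconds = conf("spark.rpc.io.retryWait").stripSuffix("s").toInt
    retries * waitSeconds
  }

  // Approximate AM footprint taken from the description (~1.4GB).
  val approxAmMemoryMb: Int = 1434
  // Memory the NodeManager can still hand to executors once the AM runs.
  val memoryLeftForExecutorsMb: Int = nodeManagerMemoryMb - approxAmMemoryMb
}
```

Under these assumptions the suite defaults give an executor 10 retries spaced 2 seconds apart, about 20 seconds of waiting for the driver's accept loop to recover, and the NodeManager keeps roughly 6.6GB available for executors after the AM is placed. A test that passes its own `spark.rpc.io.maxRetries` still gets its own value, matching the "individual tests can still override" behavior described above.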
Does this PR introduce any user-facing change?
No. Test-only.
How was this patch tested?
`resource-managers/yarn` module tests on a fork, repeated multiple times to confirm the flakiness is gone (the formerly-failing `YarnClusterSuite` tests pass consistently).
- `Build / Maven (Scala 2.13, JDK 17)`: `resource-managers#yarn`, `YarnClusterSuite` times out (`handle.getState().isFinal() was false`, `BaseYarnClusterSuite.scala:220`): https://github.com/apache/spark/actions/runs/28174751831
- `resource-managers/yarn` module green on the fork: https://github.com/HyukjinKwon/spark/actions/runs/28204995015/job/83556094951
Was this patch authored or co-authored using generative AI tooling?
Yes.
This pull request and its description were written by Isaac.