[SPARK-57710][YARN][TESTS] Reduce YarnClusterSuite flakiness from CI runner contention #56785
HyukjinKwon wants to merge 2 commits
…s from runner contention

SPARK-57650 fixed the deterministic ACCEPTED-state hang in BaseYarnClusterSuite (maximum-am-resource-percent). The master Build/Java21 and Build/Java25 `yarn` lanes still go red ~50% of runs: YarnClusterSuite tests intermittently time out (`handle.getState().isFinal() was false`) because the AM/executor containers fail to connect to the driver's RPC server on localhost (Connection refused). The in-JVM mini RM+NM, the driver subprocess and the AM/executor JVMs all contend for CPU on a single CI runner, so the driver's accept loop occasionally stalls; an executor that loses the race exits after the default 3 connection retries, and the application can then never finish.

Two test-only mitigations in BaseYarnClusterSuite:

- Give the mini NodeManager 8GB (and matching max-allocation) so executor allocation is never starved once the ~1.4GB AM is running.
- Raise the executor->driver connection retry budget (spark.rpc.io.maxRetries=10, retryWait=2s) so a transient accept stall does not permanently fail the executor. Individual tests can still override.

Co-authored-by: Isaac
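To make the memory argument in the first mitigation concrete, here is a rough headroom calculation. It is only a sketch: the 8GB NodeManager size and the ~1.4GB AM come from the commit message, while the 1408 MB approximation of the AM, the executor container size, and the helper name `executorsThatFit` are assumptions for illustration. It also ignores any rounding the YARN scheduler applies to container requests.

```scala
object NodeManagerHeadroomSketch {
  // From the commit message: the mini NodeManager is given 8GB.
  val nodeManagerMemoryMb: Int = 8 * 1024
  // Assumption: "~1.4GB" is approximated here as 1408 MB.
  val amContainerMb: Int = 1408

  /** Number of executor containers of the given size that still fit beside the AM. */
  def executorsThatFit(nmMb: Int, amMb: Int, executorMb: Int): Int = {
    require(executorMb > 0, "executor container size must be positive")
    math.max(0, (nmMb - amMb) / executorMb)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical executor container size, for illustration only.
    val executorMb = 1408
    println(executorsThatFit(nodeManagerMemoryMb, amContainerMb, executorMb))
  }
}
```

Under these assumed sizes several executor containers fit alongside the AM, whereas a NodeManager smaller than the AM itself would leave room for none.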
// the application unable to finish and the suite times out. Give the executor->driver
// connection a larger retry budget so a transient stall does not permanently fail the app.
// These are defaults; individual tests can still override them via extraConf below.
props.setProperty("spark.rpc.io.maxRetries", "10")
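The effect of a larger retry budget can be illustrated with a generic bounded-retry loop. This is a sketch only, not Spark's actual transport code: `connectWithRetries`, `stalledServer`, and the convention that total attempts equal one initial try plus `maxRetries` retries are all assumptions made for the example.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object RetryBudgetSketch {
  /**
   * Calls `connect` until it succeeds or the retry budget is spent.
   * Assumption: total attempts = 1 initial try + `maxRetries` retries.
   */
  def connectWithRetries[A](maxRetries: Int, retryWaitMs: Long)(connect: () => A): Try[A] = {
    @tailrec
    def loop(retriesLeft: Int): Try[A] = Try(connect()) match {
      case s @ Success(_) => s
      case Failure(_) if retriesLeft > 0 =>
        // Wait before the next attempt, then spend one retry.
        Thread.sleep(retryWaitMs)
        loop(retriesLeft - 1)
      case f @ Failure(_) => f
    }
    loop(maxRetries)
  }

  /** Hypothetical server whose accept loop refuses the first `stalledAttempts` connections. */
  def stalledServer(stalledAttempts: Int): () => String = {
    var attempts = 0
    () => {
      attempts += 1
      if (attempts <= stalledAttempts) throw new java.io.IOException("Connection refused")
      "connected"
    }
  }
}
```

With a simulated stall that refuses the first five connections, a budget of 3 retries ends in a failure while a budget of 10 retries eventually connects, which is the behaviour the comment above is aiming for.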
Note: the new spark.rpc.io.* defaults are set via setProperty BEFORE the loop that copies spark.* JVM system properties, so a -Dspark.rpc.io.maxRetries flag would silently override them; the comment claims only extraConf can override. Moving the two setProperty calls to just before extraConf.foreach removes the ambiguity.
Done -- moved both setProperty calls to just after the sys.props copy loop and before extraConf.foreach, so an inherited -Dspark.rpc.io.* flag no longer silently overrides them and extraConf remains the sole override. Updated the comment to say so.
…rops

Address review feedback: previously the rpc retry defaults were set before the loop that copies inherited spark.* JVM system properties, so a -Dspark.rpc.io.* flag would silently override them, contradicting the comment that only extraConf overrides. Move the two setProperty calls to just after the JVM-property copy loop and just before extraConf.foreach, so the defaults win over inherited flags while extraConf remains the sole override.

Co-authored-by: Isaac
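The ordering described in this commit can be sketched as follows. This is a minimal illustration rather than the actual `BaseYarnClusterSuite` code: the object name, the `buildProps` helper, and passing the inherited JVM properties in as a `Map` (instead of reading `sys.props` directly) are assumptions made so the example is self-contained.

```scala
import java.util.Properties

object ConfPrecedenceSketch {
  /**
   * Builds the properties in the order described above: inherited spark.* JVM
   * flags first, then the suite defaults, then per-test extraConf.
   * Each later write overwrites an earlier one for the same key.
   */
  def buildProps(inherited: Map[String, String], extraConf: Map[String, String]): Properties = {
    val props = new Properties()
    // 1. Copy inherited spark.* JVM system properties (a Map stands in for sys.props here).
    inherited
      .filter { case (k, _) => k.startsWith("spark.") }
      .foreach { case (k, v) => props.setProperty(k, v) }
    // 2. Suite defaults are written after the copy, so they win over inherited flags.
    props.setProperty("spark.rpc.io.maxRetries", "10")
    props.setProperty("spark.rpc.io.retryWait", "2s")
    // 3. Per-test extraConf is applied last and is the only remaining override.
    extraConf.foreach { case (k, v) => props.setProperty(k, v) }
    props
  }
}
```

In this sketch, `buildProps(Map("spark.rpc.io.maxRetries" -> "1"), Map.empty)` still yields `10` for the retry count, while the same key supplied through `extraConf` replaces the default.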
dongjoon-hyun left a comment
BTW, @HyukjinKwon, I removed [FOLLOWUP] from the PR title because SPARK-57710 is a new JIRA issue.
What changes were proposed in this pull request?
Follow-up to SPARK-57650, which fixed the deterministic "AM stuck in ACCEPTED" hang in `BaseYarnClusterSuite`. Two further test-only changes to reduce the remaining flakiness of `YarnClusterSuite`:

- Give the mini `NodeManager` 8GB (`yarn.nodemanager.resource.memory-mb` + `yarn.scheduler.maximum-allocation-mb`) so executor allocation is never starved once the ~1.4GB AM is running.
- Raise the executor->driver connection retry budget (`spark.rpc.io.maxRetries=10`, `spark.rpc.io.retryWait=2s`) so a transient RPC-accept stall does not permanently fail an executor. These are defaults that individual tests can still override via `extraConf`.

Why are the changes needed?
Even after SPARK-57650, the scheduled `Build / Java21` and `Build / Java25` master lanes fail in the `yarn` module roughly 50% of runs (e.g. fork run `28151220075` PASS vs `28151247521` FAIL, same commit, 40s apart). All failures are the same six `YarnClusterSuite` tests timing out after 3 minutes (`The code passed to eventually never returned normally ... handle.getState().isFinal() was false`).

From the `yarn-app-log`/`unit-tests-log` artifacts, the AM/driver comes up, but the executor (and sometimes the AM) intermittently fails to connect back to the driver's RPC server on `localhost` (`java.io.IOException: Failed to connect to localhost/127.0.0.1:<port>`, connection refused). The in-JVM mini RM+NM, the driver subprocess and the AM/executor JVMs all contend for CPU on a single CI runner, so the driver's accept loop occasionally stalls; an executor that loses this race exits after the default 3 connection retries, and the application can then never reach a final state.

Does this PR introduce any user-facing change?
No. Test-only.
How was this patch tested?
`YarnClusterSuite` was previously failing ~50% of the time. With this change the `yarn` module job was run 6 times on the fork; all 6 passed, with `YarnClusterSuite` reporting `tests=30, failures=0, skipped=0` (the 6 formerly-failing tests now pass):

- `28148781009` (`Build / Java21`): 6 `YarnClusterSuite` timeouts.
- `28162182834`, `28162247111`, `28162249819`, `28175759262`, `28175762257`, `28175765871`: `yarn` job green in all six.

Was this patch authored or co-authored using generative AI tooling?
Yes, Generated-by: Claude Code.