[SPARK-57710][YARN][TESTS] Reduce YarnClusterSuite flakiness from CI runner contention #56785

Open
HyukjinKwon wants to merge 2 commits into apache:master from HyukjinKwon:ci-fix/agent8-yarn-cluster-flaky

@HyukjinKwon

Member

What changes were proposed in this pull request?

Follow-up to SPARK-57650, which fixed the deterministic "AM stuck in ACCEPTED" hang in BaseYarnClusterSuite. Two further test-only changes to reduce the remaining flakiness of YarnClusterSuite:

  1. Give the single mini NodeManager 8GB (yarn.nodemanager.resource.memory-mb + yarn.scheduler.maximum-allocation-mb) so executor allocation is never starved once the ~1.4GB AM is running.
  2. Raise the executor→driver connection retry budget for the launched apps (spark.rpc.io.maxRetries=10, spark.rpc.io.retryWait=2s) so a transient RPC-accept stall does not permanently fail an executor. These are defaults that individual tests can still override via extraConf.
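The two settings above can be sketched as follows. This is a minimal illustration, not the code in BaseYarnClusterSuite: the object and method names are hypothetical, a plain `Map` stands in for the mini cluster's YARN configuration, and 8GB is assumed to be expressed as 8192 MB.

```scala
import java.util.Properties

// Hypothetical sketch; names are illustrative and not taken from the PR.
object YarnSuiteDefaultsSketch {
  // Assumption: "8GB" is written as 8 * 1024 MB for both memory keys.
  val NodeManagerMemoryMb: Int = 8 * 1024

  // Plain Map stand-in for the YARN configuration of the single mini NodeManager.
  def yarnMemoryConf(): Map[String, String] = Map(
    "yarn.nodemanager.resource.memory-mb" -> NodeManagerMemoryMb.toString,
    "yarn.scheduler.maximum-allocation-mb" -> NodeManagerMemoryMb.toString
  )

  // Retry budget for the launched apps, using the values quoted in the description.
  def rpcRetryDefaults(): Properties = {
    val props = new Properties()
    props.setProperty("spark.rpc.io.maxRetries", "10")
    props.setProperty("spark.rpc.io.retryWait", "2s")
    props
  }
}
```

Setting the maximum allocation to the same value as the NodeManager memory keeps the two limits consistent, which matches the "matching max-allocation" wording in the commit message.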

Why are the changes needed?

Even after SPARK-57650, the scheduled Build / Java21 and Build / Java25 master lanes fail in the yarn module in roughly 50% of runs (e.g. fork run 28151220075 PASS vs 28151247521 FAIL: same commit, 40s apart). All failures are the same six YarnClusterSuite tests timing out after 3 minutes (The code passed to eventually never returned normally ... handle.getState().isFinal() was false).

From the yarn-app-log / unit-tests-log artifacts, the AM/driver comes up, but the executor (and sometimes the AM) intermittently fails to connect back to the driver's RPC server on localhost (java.io.IOException: Failed to connect to localhost/127.0.0.1:<port>, connection refused). The in-JVM mini RM+NM, the driver subprocess and the AM/executor JVMs all contend for CPU on a single CI runner, so the driver's accept loop occasionally stalls; an executor that loses this race exits after the default 3 connection retries, and the application can then never reach a final state.

Does this PR introduce any user-facing change?

No. Test-only.

How was this patch tested?

YarnClusterSuite was previously failing ~50% of the time. With this change the yarn module job was run 6 times on the fork; all 6 passed, with YarnClusterSuite reporting tests=30, failures=0, skipped=0 (the 6 formerly-failing tests now pass):

  • Before (master, failing): apache/spark run 28148781009 (Build / Java21) — 6 YarnClusterSuite timeouts.
  • After (this branch): HyukjinKwon/spark runs 28162182834, 28162247111, 28162249819, 28175759262, 28175762257, 28175765871; yarn job green in all six.

Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude Code.

…s from runner contention

SPARK-57650 fixed the deterministic ACCEPTED-state hang in BaseYarnClusterSuite
(maximum-am-resource-percent). The master Build/Java21 and Build/Java25 `yarn`
lanes still go red ~50% of runs: YarnClusterSuite tests intermittently time out
(`handle.getState().isFinal() was false`) because the AM/executor containers fail
to connect to the driver's RPC server on localhost (Connection refused). The
in-JVM mini RM+NM, the driver subprocess and the AM/executor JVMs all contend for
CPU on a single CI runner, so the driver's accept loop occasionally stalls; an
executor that loses the race exits after the default 3 connection retries, and the
application can then never finish.

Two test-only mitigations in BaseYarnClusterSuite:
- Give the mini NodeManager 8GB (and matching max-allocation) so executor
  allocation is never starved once the ~1.4GB AM is running.
- Raise the executor->driver connection retry budget
  (spark.rpc.io.maxRetries=10, retryWait=2s) so a transient accept stall does
  not permanently fail the executor. Individual tests can still override.

Co-authored-by: Isaac
@HyukjinKwon HyukjinKwon changed the title from "[SPARK-57650][YARN][TESTS][FOLLOWUP] Reduce YarnClusterSuite flakiness from CI runner contention" to "[DO-NOT-MERGE][YARN][TESTS][FOLLOWUP] Reduce YarnClusterSuite flakiness from CI runner contention" Jun 25, 2026
@HyukjinKwon HyukjinKwon marked this pull request as ready for review June 26, 2026 05:52
@HyukjinKwon HyukjinKwon changed the title from "[DO-NOT-MERGE][YARN][TESTS][FOLLOWUP] Reduce YarnClusterSuite flakiness from CI runner contention" to "[DO-NOT-MERGE][YARN][TESTS][FOLLOW-UP] Reduce YarnClusterSuite flakiness from CI runner contention" Jun 26, 2026
@HyukjinKwon HyukjinKwon marked this pull request as draft June 26, 2026 05:53
@HyukjinKwon HyukjinKwon changed the title from "[DO-NOT-MERGE][YARN][TESTS][FOLLOW-UP] Reduce YarnClusterSuite flakiness from CI runner contention" to "[SPARK-57710][YARN][TESTS][FOLLOW-UP] Reduce YarnClusterSuite flakiness from CI runner contention" Jun 26, 2026
@HyukjinKwon HyukjinKwon marked this pull request as ready for review June 26, 2026 05:57
// the application unable to finish and the suite times out. Give the executor->driver
// connection a larger retry budget so a transient stall does not permanently fail the app.
// These are defaults; individual tests can still override them via extraConf below.
props.setProperty("spark.rpc.io.maxRetries", "10")

Member

Note: the new spark.rpc.io.* defaults are set via setProperty BEFORE the loop that copies spark.* JVM system properties, so a -Dspark.rpc.io.maxRetries flag would silently override them; the comment claims only extraConf can override. Moving the two setProperty calls to just before extraConf.foreach removes the ambiguity.

Member Author

Done -- moved both setProperty calls to just after the sys.props copy loop and before extraConf.foreach, so an inherited -Dspark.rpc.io.* flag no longer silently overrides them and extraConf remains the sole override. Updated the comment to say so.
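The ordering described in this reply can be illustrated with a small sketch. It is an assumption-laden stand-in rather than the suite's code: `ConfPrecedenceSketch`, `buildProps`, and the `inherited` parameter (standing in for the JVM system properties) are hypothetical names. Because `Properties.setProperty` overwrites any earlier value for the same key, whichever step runs last wins.

```scala
import java.util.Properties

// Hypothetical sketch of the precedence after the fix:
// inherited spark.* properties < suite defaults < extraConf.
object ConfPrecedenceSketch {
  def buildProps(
      inherited: Map[String, String],
      extraConf: Map[String, String]): Properties = {
    val props = new Properties()

    // 1. Copy inherited spark.* properties (stand-in for the sys.props copy loop).
    val sparkProps = inherited.filter { case (k, _) => k.startsWith("spark.") }
    sparkProps.foreach { case (k, v) => props.setProperty(k, v) }

    // 2. Suite defaults, set after the copy so an inherited -D flag cannot override them.
    props.setProperty("spark.rpc.io.maxRetries", "10")
    props.setProperty("spark.rpc.io.retryWait", "2s")

    // 3. Per-test extraConf is applied last and remains the only override.
    extraConf.foreach { case (k, v) => props.setProperty(k, v) }
    props
  }
}
```

With this ordering, an inherited `spark.rpc.io.maxRetries` value is replaced by the suite default, while the same key passed through `extraConf` still takes effect.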

…rops

Address review feedback: previously the rpc retry defaults were set before the
loop that copies inherited spark.* JVM system properties, so a -Dspark.rpc.io.*
flag would silently override them, contradicting the comment that only extraConf
overrides. Move the two setProperty calls to just after the JVM-property copy
loop and just before extraConf.foreach, so the defaults win over inherited flags
while extraConf remains the sole override.

Co-authored-by: Isaac
@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-57710][YARN][TESTS][FOLLOW-UP] Reduce YarnClusterSuite flakiness from CI runner contention" to "[SPARK-57710][YARN][TESTS] Reduce YarnClusterSuite flakiness from CI runner contention" Jun 26, 2026

@dongjoon-hyun dongjoon-hyun left a comment

Member

BTW, @HyukjinKwon, I removed [FOLLOWUP] from the PR title because SPARK-57710 is a new JIRA issue.
