[SPARK-57650][YARN][TESTS][FOLLOW-UP] Reduce YarnClusterSuite flakiness from runner contention by HyukjinKwon · Pull Request #56800 · apache/spark

HyukjinKwon · 2026-06-26T02:53:48Z

What changes were proposed in this pull request?

Two test-only mitigations in BaseYarnClusterSuite to reduce YarnClusterSuite flakiness on busy CI runners:

Give the mini NodeManager 8GB (yarn.nodemanager.resource.memory-mb) and a matching yarn.scheduler.maximum-allocation-mb, so executor allocation is never starved once the ~1.4GB AM is running.
Raise the executor→driver connection retry budget (spark.rpc.io.maxRetries=10, spark.rpc.io.retryWait=2s) so a transient accept stall does not permanently fail the executor. These are defaults; individual tests can still override them.
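
As a rough illustration of how the two mitigations and the "tests can still override" rule fit together, the sketch below models the settings as plain maps. The object and method names (`YarnSuiteConfSketch`, `effectiveSparkConf`) are hypothetical and do not come from the patch; only the configuration keys and values are taken from the description above. How BaseYarnClusterSuite actually applies them is not shown here.

```scala
// Hypothetical sketch: names are illustrative, not taken from the patch.
object YarnSuiteConfSketch {
  // Mini-cluster (YARN-side) settings named in the description: 8GB for the
  // NodeManager and a matching maximum allocation.
  val yarnDefaults: Map[String, String] = Map(
    "yarn.nodemanager.resource.memory-mb" -> "8192",
    "yarn.scheduler.maximum-allocation-mb" -> "8192"
  )

  // Spark-side retry defaults named in the description.
  val sparkDefaults: Map[String, String] = Map(
    "spark.rpc.io.maxRetries" -> "10",
    "spark.rpc.io.retryWait" -> "2s"
  )

  // Per-test settings win over suite defaults: with ++, entries from the
  // right-hand map replace entries with the same key from the left-hand map.
  def effectiveSparkConf(testConf: Map[String, String]): Map[String, String] =
    sparkDefaults ++ testConf
}
```

A test that passes no retry settings would see the suite defaults, while a test that sets `spark.rpc.io.maxRetries` itself keeps its own value.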

Why are the changes needed?

This is a follow-up to SPARK-57650, which fixed the deterministic ACCEPTED-state hang (maximum-am-resource-percent). After that fix, the master Build / Maven yarn lanes still go red intermittently: YarnClusterSuite tests time out (handle.getState().isFinal() was false) because the AM/executor containers fail to connect to the driver's RPC server on localhost (Connection refused).

The in-JVM mini RM+NM, the driver subprocess, and the AM/executor JVMs all contend for CPU on a single CI runner, so the driver's accept loop occasionally stalls; an executor that loses the race exits after the default 3 connection retries, and the application can then never finish.
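
The failure mode described above can be pictured with a toy model: a driver that refuses a fixed number of connection attempts while its accept loop is stalled, and an executor with a bounded retry budget. This is not Spark's RPC client. `RetryBudgetSketch`, `connect`, and the choice to count one initial attempt plus `maxRetries` retries are all assumptions made for illustration.

```scala
// Toy model only: not Spark's RPC implementation.
object RetryBudgetSketch {
  /**
   * Simulates an executor connecting to a driver that refuses the first
   * `refusedAttempts` attempts (a stand-in for a stalled accept loop).
   * Assumes one initial attempt plus `maxRetries` retries.
   * Returns the 1-based attempt that succeeded, or None if the budget ran out.
   */
  def connect(maxRetries: Int, refusedAttempts: Int): Option[Int] = {
    val totalAttempts = maxRetries + 1
    (1 to totalAttempts).find(attempt => attempt > refusedAttempts)
  }
}
```

Under these assumptions, a stall that refuses five attempts exhausts a budget of 3 retries (`connect(3, 5)` yields `None`), whereas a budget of 10 retries connects on the sixth attempt (`connect(10, 5)` yields `Some(6)`).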

This is one of the two remaining unmerged fixes keeping the apache/spark master matrix builds red.

Does this PR introduce any user-facing change?

No. Test-only.

How was this patch tested?

resource-managers/yarn module tests on a fork, repeated multiple times to confirm the flakiness is gone (the formerly-failing YarnClusterSuite tests pass consistently).

Before (still flaky on apache/spark master, after the SPARK-57650 base fix merged): Build / Maven (Scala 2.13, JDK 17) — resource-managers#yarn, YarnClusterSuite times out (handle.getState().isFinal() was false, BaseYarnClusterSuite.scala:220): https://github.com/apache/spark/actions/runs/28174751831
After (passing on this branch): resource-managers/yarn module green on the fork — https://github.com/HyukjinKwon/spark/actions/runs/28204995015/job/83556094951

Was this patch authored or co-authored using generative AI tooling?

Yes.

This pull request and its description were written by Isaac.

…s from runner contention

SPARK-57650 fixed the deterministic ACCEPTED-state hang in BaseYarnClusterSuite (maximum-am-resource-percent). The master Build/Java21 and Build/Java25 `yarn` lanes still go red ~50% of runs: YarnClusterSuite tests intermittently time out (`handle.getState().isFinal() was false`) because the AM/executor containers fail to connect to the driver's RPC server on localhost (Connection refused).

The in-JVM mini RM+NM, the driver subprocess and the AM/executor JVMs all contend for CPU on a single CI runner, so the driver's accept loop occasionally stalls; an executor that loses the race exits after the default 3 connection retries, and the application can then never finish.

Two test-only mitigations in BaseYarnClusterSuite:
- Give the mini NodeManager 8GB (and matching max-allocation) so executor allocation is never starved once the ~1.4GB AM is running.
- Raise the executor->driver connection retry budget (spark.rpc.io.maxRetries=10, retryWait=2s) so a transient accept stall does not permanently fail the executor. Individual tests can still override.

Co-authored-by: Isaac

HyukjinKwon · 2026-06-26T05:46:44Z

Closing as a duplicate of #56785, which carries the same test-only YarnClusterSuite mitigations as a focused draft and is CI-validated (full yarn-module Maven on JDK17 + JDK21). This branch also picked up unrelated commits. Let's consolidate on #56785.

HyukjinKwon changed the title ~~[SPARK-57650][YARN][TESTS][FOLLOWUP] Reduce YarnClusterSuite flakiness from runner contention~~ [SPARK-57650][YARN][TESTS][FOLLOW-UP] Reduce YarnClusterSuite flakiness from runner contention Jun 26, 2026

HyukjinKwon closed this Jun 26, 2026

HyukjinKwon wants to merge 1 commit into apache:master from HyukjinKwon:ci-fix/agent5afa-pr-yarn-followup
