[SPARK-57650][YARN][TESTS][FOLLOW-UP] Reduce YarnClusterSuite flakiness from runner contention#56800

Closed
HyukjinKwon wants to merge 1 commit into apache:master from HyukjinKwon:ci-fix/agent5afa-pr-yarn-followup

@HyukjinKwon

Member

What changes were proposed in this pull request?

Two test-only mitigations in BaseYarnClusterSuite to reduce YarnClusterSuite flakiness on busy CI runners:

  • Give the mini NodeManager 8GB (yarn.nodemanager.resource.memory-mb) and a matching yarn.scheduler.maximum-allocation-mb, so executor allocation is never starved once the ~1.4GB AM is running.
  • Raise the executor→driver connection retry budget (spark.rpc.io.maxRetries=10, spark.rpc.io.retryWait=2s) so a transient accept stall does not permanently fail the executor. These are defaults; individual tests can still override them.
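
The two mitigations above amount to a handful of configuration keys. A minimal sketch of them as plain Scala maps follows. The property names and values are the ones quoted in this description; the object name `YarnTestDefaults`, the helper `withDefaults`, and the choice to express 8GB as 8192 MB are assumptions for illustration, not the actual code in BaseYarnClusterSuite.

```scala
// Hypothetical sketch: YarnTestDefaults and withDefaults are illustrative
// names and do not come from the Spark code base.
object YarnTestDefaults {
  // 8GB expressed in MB; the exact value written by the patch is an assumption.
  val nodeManagerMemoryMb: Int = 8 * 1024

  // Settings for the mini YARN cluster: NodeManager capacity and a matching
  // per-container allocation ceiling.
  val yarnSettings: Map[String, String] = Map(
    "yarn.nodemanager.resource.memory-mb" -> nodeManagerMemoryMb.toString,
    "yarn.scheduler.maximum-allocation-mb" -> nodeManagerMemoryMb.toString
  )

  // Suite-wide Spark defaults for the connection retry budget.
  val sparkRpcDefaults: Map[String, String] = Map(
    "spark.rpc.io.maxRetries" -> "10",
    "spark.rpc.io.retryWait" -> "2s"
  )

  // Defaults go first and the per-test settings second, so a value set by an
  // individual test replaces the suite-wide default for the same key.
  def withDefaults(testConf: Map[String, String]): Map[String, String] =
    sparkRpcDefaults ++ testConf
}
```

Merging with the defaults on the left is one way to keep the "individual tests can still override" property: with `++`, the right-hand map wins on duplicate keys, so a test that sets its own `spark.rpc.io.maxRetries` keeps that value.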

Why are the changes needed?

This is a follow-up to SPARK-57650, which fixed the deterministic ACCEPTED-state hang (maximum-am-resource-percent). After that fix, the master Build / Maven yarn lanes still go red intermittently: YarnClusterSuite tests time out (handle.getState().isFinal() was false) because the AM/executor containers fail to connect to the driver's RPC server on localhost (Connection refused).

The in-JVM mini RM+NM, the driver subprocess, and the AM/executor JVMs all contend for CPU on a single CI runner, so the driver's accept loop occasionally stalls; an executor that loses the race exits after the default 3 connection retries, and the application can then never finish.

This is one of the two remaining unmerged fixes keeping the apache/spark master matrix builds red.

Does this PR introduce any user-facing change?

No. Test-only.

How was this patch tested?

resource-managers/yarn module tests on a fork, repeated multiple times to confirm the flakiness is gone (the formerly-failing YarnClusterSuite tests pass consistently).

Was this patch authored or co-authored using generative AI tooling?

Yes.

This pull request and its description were written by Isaac.

…s from runner contention

SPARK-57650 fixed the deterministic ACCEPTED-state hang in BaseYarnClusterSuite
(maximum-am-resource-percent). The master Build/Java21 and Build/Java25 `yarn`
lanes still go red in ~50% of runs: YarnClusterSuite tests intermittently time out
(`handle.getState().isFinal() was false`) because the AM/executor containers fail
to connect to the driver's RPC server on localhost (Connection refused). The
in-JVM mini RM+NM, the driver subprocess and the AM/executor JVMs all contend for
CPU on a single CI runner, so the driver's accept loop occasionally stalls; an
executor that loses the race exits after the default 3 connection retries, and the
application can then never finish.

Two test-only mitigations in BaseYarnClusterSuite:
- Give the mini NodeManager 8GB (and matching max-allocation) so executor
  allocation is never starved once the ~1.4GB AM is running.
- Raise the executor->driver connection retry budget
  (spark.rpc.io.maxRetries=10, retryWait=2s) so a transient accept stall does
  not permanently fail the executor. Individual tests can still override.

Co-authored-by: Isaac
@HyukjinKwon changed the title from "[SPARK-57650][YARN][TESTS][FOLLOWUP] Reduce YarnClusterSuite flakiness from runner contention" to "[SPARK-57650][YARN][TESTS][FOLLOW-UP] Reduce YarnClusterSuite flakiness from runner contention" on Jun 26, 2026
@HyukjinKwon

Member Author

Closing as a duplicate of #56785, which carries the same test-only YarnClusterSuite mitigations as a focused draft and is CI-validated (full yarn-module Maven on JDK17 + JDK21). This branch also picked up unrelated commits. Let's consolidate on #56785.
