Skip to content

fix(client): send migrating_from after repeated rejoin failures#2287

Open
jdimovska wants to merge 4 commits into
mainfrom
migrate-during-reconnect
Open

fix(client): send migrating_from after repeated rejoin failures#2287
jdimovska wants to merge 4 commits into
mainfrom
migrate-during-reconnect

Conversation

@jdimovska

@jdimovska jdimovska commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

💡 Overview

This PR fixes an issue where a reconnecting client could get stuck rejoining the same unhealthy SFU, because the coordinator's /join can hand back the same SFU that failed. Failed rejoins are counted per SFU during a reconnection, and once an SFU fails enough times we set migrating_from / migrating_from_list so the coordinator gives us a different SFU (the hint applies only to that one request, so a following FAST reconnect isn't affected).

📝 Implementation notes

🎫 Ticket: https://linear.app/stream/issue/REACT-1013/enhance-reconnect-flow-with-migrating-from-option

📑 Docs: https://github.com/GetStream/docs-content/pull/

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved call reconnection reliability by refining retry behavior and tracking repeated SFU join failures to make smarter rejoin decisions.
    • Enhanced reconnection handling to reliably include migration hints during rejoin retries, improving recovery after SFU capacity issues while keeping call join parameters unchanged.
    • Strengthened reconnection test coverage to validate the updated rejoin and SFU exclusion behavior across multiple retries.

@changeset-bot

changeset-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

⚠️ No Changeset found

Latest commit: 730e758

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Call.reconnect() gains a per-invocation sfuRejoinFailures map that tracks REJOIN join failures by SFU edge name. On each REJOIN attempt, confirmed-bad SFUs (≥ 2 failures) are temporarily injected as migrating_from/migrating_from_list into joinCallData, then always removed in a finally block. Errors increment the per-SFU count, with SfuJoinError codes that should trigger switching forcing the count to at least 2. The source of migratingFromSfuId in doJoin is switched from data.migrating_from to previousSfuClient.edgeName. Tests cover repeated rejoin failures, SFU_FULL errors, and the two-tries-before-exclusion sequencing.

Changes

SFU Migration Hints on REJOIN Retry

Layer / File(s) Summary
Per-SFU failure tracking and migration hint injection
packages/client/src/Call.ts
Initializes sfuRejoinFailures map per reconnect() call; snapshots reconnectStrategy into attemptedStrategy/currentStrategy each loop iteration; temporarily sets migrating_from/migrating_from_list on joinCallData in the REJOIN branch for SFUs with ≥ 2 failures, clearing them in a finally block; increments the per-SFU failure count on error with SfuJoinError-aware threshold forcing; sources migratingFromSfuId from previousSfuClient.edgeName instead of data.migrating_from during doJoin reconnectDetails preparation.
Reconnect test helper update and migration hint tests
packages/client/src/rtc/__tests__/Call.reconnect.test.ts
Adds reportingEnabled option to makeCall and imports SfuJoinError; introduces three new tests asserting migrating_from/migrating_from_list propagation to doJoinRequest after repeated REJOIN failures, after SfuJoinError SFU_FULL, and across the two-tries-before-exclusion SFU sequencing, each confirming joinCallData stays { create: true }.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • oliverlaz

🐇 Hop hop, the SFU's full today,
Two strikes and you're out, bunny says nay!
migrating_from flies through the air,
finally cleans up without a care.
The tests all pass — a joyful display! 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: sending migrating_from hints after repeated rejoin failures to avoid getting stuck on unhealthy SFUs.
Description check ✅ Passed The description includes all required template sections with clear explanation of the problem, solution, and ticket reference.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch migrate-during-reconnect

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@jdimovska jdimovska marked this pull request as ready for review June 17, 2026 09:20
@jdimovska jdimovska requested a review from oliverlaz June 17, 2026 10:03

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
packages/client/src/Call.ts (2)

1870-1891: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the captured strategy when deciding whether a failed migration must rejoin.

doJoin() resets this.reconnectStrategy after a successful SFU join. If reconnectMigrate() fails after that point, wasMigrating can read UNSPECIFIED and skip the intended REJOIN fallback.

Proposed fix
-          const wasMigrating =
-            this.reconnectStrategy === WebsocketReconnectStrategy.MIGRATE;
+          const wasMigrating =
+            attemptedStrategy === WebsocketReconnectStrategy.MIGRATE;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/client/src/Call.ts` around lines 1870 - 1891, The wasMigrating check
in the reconnect logic reads the current this.reconnectStrategy, but doJoin()
can reset this value after a successful SFU join. If reconnectMigrate() fails
after doJoin() has already executed, wasMigrating will read the new strategy
instead of the original MIGRATE state, causing the intended REJOIN fallback to
be skipped. Capture the this.reconnectStrategy value at the start of the
reconnection attempt before doJoin() can modify it, and then use this captured
strategy value in the wasMigrating condition instead of reading
this.reconnectStrategy directly.

1826-1836: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat unrecoverable SfuJoinErrors as terminal during reconnect.

join() already aborts on SfuJoinError.unrecoverable, but this reconnect catch only stops on unrecoverable coordinator errors. An unrecoverable SFU join failure can now be counted as retryable and keep cycling through migration hints until retry budgets expire.

Proposed fix
-          if (error instanceof ErrorFromResponse && error.unrecoverable) {
+          if (
+            (error instanceof ErrorFromResponse && error.unrecoverable) ||
+            (error instanceof SfuJoinError && error.unrecoverable)
+          ) {
             this.logger.warn(
-              `[Reconnect] Can't reconnect due to coordinator unrecoverable error`,
+              `[Reconnect] Can't reconnect due to unrecoverable join error`,
               error,
             );
             await markAsReconnectingFailed();
             return;

Also applies to: 1848-1855

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/client/src/Call.ts` around lines 1826 - 1836, The reconnect logic is
not treating unrecoverable SfuJoinError exceptions as terminal failures during
REJOIN attempts. Modify the switchSfu condition assignment to additionally check
that the SfuJoinError is not unrecoverable using SfuJoinError.unrecoverable.
Ensure that unrecoverable SFU join failures are prevented from being counted as
retryable by adding a negation check alongside the existing
SfuJoinError.isJoinErrorCode condition, so these failures stop retry cycling
immediately rather than continuing to attempt SFU migration.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@packages/client/src/Call.ts`:
- Around line 1870-1891: The wasMigrating check in the reconnect logic reads the
current this.reconnectStrategy, but doJoin() can reset this value after a
successful SFU join. If reconnectMigrate() fails after doJoin() has already
executed, wasMigrating will read the new strategy instead of the original
MIGRATE state, causing the intended REJOIN fallback to be skipped. Capture the
this.reconnectStrategy value at the start of the reconnection attempt before
doJoin() can modify it, and then use this captured strategy value in the
wasMigrating condition instead of reading this.reconnectStrategy directly.
- Around line 1826-1836: The reconnect logic is not treating unrecoverable
SfuJoinError exceptions as terminal failures during REJOIN attempts. Modify the
switchSfu condition assignment to additionally check that the SfuJoinError is
not unrecoverable using SfuJoinError.unrecoverable. Ensure that unrecoverable
SFU join failures are prevented from being counted as retryable by adding a
negation check alongside the existing SfuJoinError.isJoinErrorCode condition, so
these failures stop retry cycling immediately rather than continuing to attempt
SFU migration.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aaf2a576-6ea5-4126-ab11-bb35191bdaad

📥 Commits

Reviewing files that changed from the base of the PR and between 67c49bb and 730e758.

📒 Files selected for processing (1)
  • packages/client/src/Call.ts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant