Skip to content

fix(client): harden SFU signal-close revival and subscriber ICE-restart recovery#2285

Open
oliverlaz wants to merge 6 commits into
mainfrom
feat/ice-candidate-pair-monitor
Open

fix(client): harden SFU signal-close revival and subscriber ICE-restart recovery#2285
oliverlaz wants to merge 6 commits into
mainfrom
feat/ice-candidate-pair-monitor

Conversation

@oliverlaz

@oliverlaz oliverlaz commented Jun 15, 2026

Copy link
Copy Markdown
Member

💡 Overview

Two reconnection-robustness fixes for the core client.

Note: the organic ICE path-migration detection this branch originally carried has been reverted; Publisher and BasePeerConnection are back to main.

Signal-close revival hardening (StreamSfuClient, Call). A wedged signal WebSocket can be detected as dead by two independent sources (the health watchdog and the transport close event), which on a stuck socket can arrive seconds apart. Revival is now driven by a one-shot latch so it fires at most once per client, triggers immediately on an unhealthy close instead of waiting for the (possibly long-delayed) close event, and ignores closes from a superseded client so a stale socket cannot drive a reconnect after a new client has swapped in.

Subscriber ICE-restart recovery (Subscriber). Subscriber.negotiate reset isIceRestarting as the last statement of its happy path, so a negotiation failure left the flag stuck at true and disabled all future automatic ICE-restart / reconnect handling on the subscriber. Negotiation is now wrapped in try/catch/finally so the flag always clears; on failure the remote description is rolled back when the peer connection is still awaiting an answer (have-remote-offer), and the failure is traced.

📝 Implementation notes

  • StreamSfuClient: handleWebSocketClose becomes notifySignalClose, guarded by a signalClosed one-shot latch; close() notifies on a non-clean close without waiting for the transport event.
  • Call.handleSfuSignalClose ignores closes from any client that is no longer the active sfuClient.
  • Subscriber.negotiate wrapped in try/catch/finally; rollback on have-remote-offer; a subscriber.negotiationFailed trace added.
  • Tests added/updated: StreamSfuClient revival, Call superseded-client guard, and Subscriber success / failure / rollback branches.

🎫 Ticket: https://linear.app/stream/issue/XYZ-123

📑 Docs: https://github.com/GetStream/docs-content/pull/

Add CandidatePairMonitor, which watches the publisher's BUNDLE-shared
ICE transport for selected-candidate-pair changes (e.g. WiFi to LTE)
and triggers an ICE restart so the SFU learns the new transport path.

A settle window, opened on startup and on every restartIce(), absorbs
the restart-induced candidate-pair churn so reconnect-driven restarts
(Call.restoreICE) do not loop back into another restart. Covered by
unit tests for the monitor and Publisher-level reconnect-interplay tests.
@changeset-bot

changeset-bot Bot commented Jun 15, 2026

Copy link
Copy Markdown

⚠️ No Changeset found

Latest commit: aed0593

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@oliverlaz, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 12 minutes and 7 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a12e6298-334e-4dd8-8d37-3567800b3032

📥 Commits

Reviewing files that changed from the base of the PR and between 1d435f1 and aed0593.

📒 Files selected for processing (5)
  • packages/client/src/rtc/Publisher.ts
  • packages/client/src/rtc/__tests__/Publisher.test.ts
  • packages/client/src/rtc/__tests__/mocks/webrtc.mocks.ts
  • packages/client/src/rtc/helpers/CandidatePairMonitor.ts
  • packages/client/src/rtc/helpers/__tests__/CandidatePairMonitor.test.ts
📝 Walkthrough

Walkthrough

Three independent improvements to reconnect resilience: (1) BasePeerConnection gains protected onIceDisconnected/shouldKeepScheduledIceRestart hooks; Publisher overrides them to snapshot the ICE candidate pair on disconnect and trigger a restart only when the transport path changed. (2) Subscriber.negotiate now wraps negotiation in try/catch/finally to perform SDP rollback on mid-offer failure and always clear the ICE restarting flag. (3) StreamSfuClient adds a one-shot signalClosed latch via notifySignalClose; Call.handleSfuSignalClose gains a guard that ignores events from superseded client instances.

Changes

ICE Candidate-Pair Path Migration Detection

Layer / File(s) Summary
BasePeerConnection protected ICE hooks
packages/client/src/rtc/BasePeerConnection.ts
Adds protected onIceDisconnected() and shouldKeepScheduledIceRestart() hooks. Wires the disconnect call into the disconnected state handler and updates connected/completed handling to clear the restart timeout and either call tryRestartIce() or log cancellation based on the hook's return value.
Publisher candidate-pair snapshot and migration check
packages/client/src/rtc/Publisher.ts
Adds pairBeforeDisconnect field. Overrides onIceDisconnected to capture the selected candidate pair from RTCIceTransport.getSelectedCandidatePair(). Overrides shouldKeepScheduledIceRestart to return true only when local or remote candidates differ from the snapshot, indicating a real path migration.
Publisher migration tests and ICE transport mock
packages/client/src/rtc/__tests__/mocks/webrtc.mocks.ts, packages/client/src/rtc/__tests__/Publisher.test.ts
Adds makeIceTransportMock with getSelectedCandidatePair and a __setSelectedCandidatePair test helper wired into RTCRtpTransceiverMock. Adds five tests covering: same-pair cancel, path-migration restart, disconnected-past-timer restart, no-API-surface cancel, and no-restart-loop after migration settling.

Subscriber Negotiation Resilience and Error Recovery

Layer / File(s) Summary
Subscriber negotiation error handling and rollback
packages/client/src/rtc/Subscriber.ts
Refactors negotiate to wrap negotiation in try/catch/finally. On error, performs SDP rollback via setRemoteDescription({ type: 'rollback' }) if signaling state is have-remote-offer. Always clears this.isIceRestarting = false in finally block. Emits tracer event subscriber.negotiationFailed with error message on failure.
Subscriber negotiation test suite
packages/client/src/rtc/__tests__/Subscriber.test.ts
Verifies isIceRestarting is cleared on both success and failure, that rollback occurs when negotiation fails during an applied-offer state, and that rollback is skipped when the offer was never applied. Sets up default sfuClient.sendAnswer mock.

SFU Signal-Close Deduplication and Superseded-Client Guard

Layer / File(s) Summary
StreamSfuClient one-shot notifySignalClose latch
packages/client/src/StreamSfuClient.ts
Adds a private signalClosed boolean latch and notifySignalClose(reason) helper that fires onSignalClose at most once, clears timers, and routes both WebSocket close and close() through it.
Call superseded-client guard and reconnect refactor
packages/client/src/Call.ts
Adds an early return in handleSfuSignalClose ignoring events from non-active client instances. Refactors reconnect's skip-states check to use a captured local callingState variable.
Signal-close and superseded-client tests
packages/client/src/__tests__/StreamSfuClient.test.ts, packages/client/src/rtc/__tests__/Call.reconnect.test.ts
Adds emit helper to CapturingWebSocket, wires onSignalClose into buildSfuClient, and adds three signal-close deduplication tests. Adds a Call test verifying superseded-client close events are ignored while the active client's event triggers reconnect.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • GetStream/stream-video-js#2257: Both PRs modify Call.ts reconnect entry-guard logic and update Call.reconnect.test.ts; that PR adds bail/deduping in reconnect while this PR adds a superseded-client guard in handleSfuSignalClose.
  • GetStream/stream-video-js#2261: Both PRs modify packages/client/src/rtc/BasePeerConnection.ts's ICE/peer-connection state-change handling including onIceConnectionStateChange to change what runs during those transitions.

Suggested reviewers

  • jdimovska

Poem

🐇 Hop, hop — the path has changed, a new trail found today,
No looping restarts, no confusion along the way.
The stale client's close? Ignored with flair!
One latch, one signal — the rabbit's aware.
ICE migration caught, the network path kept true,
This bunny knows exactly which route to pursue! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The PR title accurately summarizes the two main changes: SFU signal-close revival hardening and subscriber ICE-restart recovery, matching the core improvements described in the changeset.
Description check ✅ Passed The PR description comprehensively covers both main objectives with implementation notes and test coverage details, following the required template structure with Overview, Implementation notes, and ticket/docs references.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/ice-candidate-pair-monitor

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

…ignal-close revival

Replace the standalone CandidatePairMonitor with a reconnect-time check in
BasePeerConnection/Publisher: snapshot the selected candidate pair when ICE
drops to `disconnected`, and on recovery keep the scheduled ICE restart only
if the path actually migrated (e.g. Wi-Fi to LTE). A same-path recovery
cancels the restart, avoiding restart loops.

Also harden signal-close revival in StreamSfuClient and Call:
- add a one-shot latch so onSignalClose fires at most once per client, even
  when both the health watchdog and a delayed WebSocket close detect death
- drive revival immediately on an unhealthy close instead of waiting for the
  transport close event, which a wedged socket may delay for seconds
- ignore signal closes from a superseded client so a stale socket's late
  close cannot drive a reconnect after a new client has swapped in
@oliverlaz oliverlaz changed the title feat(client): detect organic ICE path migrations and restart ICE feat(client): restart ICE on organic path migration and harden signal-close revival Jun 16, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/client/src/rtc/BasePeerConnection.ts (1)

477-482: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear stale timeout state after the callback runs to prevent duplicate restarts.

At Line 477, the timeout handle is never reset when the callback fires. Later, the recovery path at Line 499 can still treat that stale handle as “scheduled” and call shouldKeepScheduledIceRestart()/tryRestartIce() again after a restart already ran.

Suggested fix
         clearTimeout(this.iceRestartTimeout);
         this.iceRestartTimeout = undefined;
         this.iceRestartTimeout = setTimeout(() => {
+          this.iceRestartTimeout = undefined;
           const currentState = this.pc.iceConnectionState;
           if (currentState === 'disconnected' || currentState === 'failed') {
             this.tryRestartIce();
           }
         }, this.iceRestartDelay);

Also applies to: 499-505

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/client/src/rtc/BasePeerConnection.ts` around lines 477 - 482, The
iceRestartTimeout handle is assigned at line 477 but never cleared after the
timeout callback executes, causing the stale handle to appear "scheduled" when
checked later at lines 499-505 in the recovery path. This can trigger duplicate
ICE restart attempts. Fix this by resetting this.iceRestartTimeout to null after
the callback logic completes inside the setTimeout body, ensuring that
subsequent calls to shouldKeepScheduledIceRestart() or tryRestartIce() recognize
that no timeout is currently active. This same cleanup should be applied
wherever iceRestartTimeout callbacks are used to prevent similar stale-state
issues.
🧹 Nitpick comments (1)
packages/client/src/rtc/__tests__/Publisher.test.ts (1)

699-742: ⚡ Quick win

Add one regression test for “timer already fired, then connected” behavior.

The new suite is strong, but it doesn’t cover the case where the scheduled restart has already fired and ICE later transitions to connected/completed. Adding that case would prevent duplicate-restart regressions in this reconnect path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/client/src/rtc/__tests__/Publisher.test.ts` around lines 699 - 742,
Add a new test case after the existing three tests in the Publisher.test.ts
suite that verifies the "timer already fired, then connected" behavior. The test
should use vi.useFakeTimers(), call publishVideo() to capture the ICE transport,
set up a spy on the restartIce method, transition ICE from connected to
disconnected (which schedules a restart), advance timers to allow the scheduled
restart to execute, and then transition ICE back to connected or completed.
Assert that the restartIce method was called exactly once (from the scheduled
timer) and not again when ICE transitions to connected, ensuring no
duplicate-restart regression occurs in the reconnect path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@packages/client/src/rtc/BasePeerConnection.ts`:
- Around line 477-482: The iceRestartTimeout handle is assigned at line 477 but
never cleared after the timeout callback executes, causing the stale handle to
appear "scheduled" when checked later at lines 499-505 in the recovery path.
This can trigger duplicate ICE restart attempts. Fix this by resetting
this.iceRestartTimeout to null after the callback logic completes inside the
setTimeout body, ensuring that subsequent calls to
shouldKeepScheduledIceRestart() or tryRestartIce() recognize that no timeout is
currently active. This same cleanup should be applied wherever iceRestartTimeout
callbacks are used to prevent similar stale-state issues.

---

Nitpick comments:
In `@packages/client/src/rtc/__tests__/Publisher.test.ts`:
- Around line 699-742: Add a new test case after the existing three tests in the
Publisher.test.ts suite that verifies the "timer already fired, then connected"
behavior. The test should use vi.useFakeTimers(), call publishVideo() to capture
the ICE transport, set up a spy on the restartIce method, transition ICE from
connected to disconnected (which schedules a restart), advance timers to allow
the scheduled restart to execute, and then transition ICE back to connected or
completed. Assert that the restartIce method was called exactly once (from the
scheduled timer) and not again when ICE transitions to connected, ensuring no
duplicate-restart regression occurs in the reconnect path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 91e76865-c64d-4dbb-88e3-72d1cb60aa47

📥 Commits

Reviewing files that changed from the base of the PR and between db5718e and fba9660.

📒 Files selected for processing (8)
  • packages/client/src/Call.ts
  • packages/client/src/StreamSfuClient.ts
  • packages/client/src/__tests__/StreamSfuClient.test.ts
  • packages/client/src/rtc/BasePeerConnection.ts
  • packages/client/src/rtc/Publisher.ts
  • packages/client/src/rtc/__tests__/Call.reconnect.test.ts
  • packages/client/src/rtc/__tests__/Publisher.test.ts
  • packages/client/src/rtc/__tests__/mocks/webrtc.mocks.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/client/src/rtc/tests/mocks/webrtc.mocks.ts

Wrap Subscriber.negotiate in try/catch/finally so isIceRestarting is
always cleared, even when negotiation throws. Previously the reset was
the last statement of the happy path and was skipped on failure,
leaving the flag stuck at true and disabling all future automatic ICE
restart/reconnect handling on the subscriber.

On failure, roll back the remote description when the peer connection
is still awaiting an answer (have-remote-offer), and trace the
negotiation failure. Add tests covering the success, failure, and
rollback branches.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
packages/client/src/rtc/Subscriber.ts (1)

204-211: 💤 Low value

Rollback failure could mask the original negotiation error.

If setRemoteDescription({ type: 'rollback' }) throws, the original err is never rethrown and its context is lost. Consider wrapping the rollback in a nested try/catch to preserve the original error for debugging:

Proposed fix to preserve original error
     } catch (err) {
       if (this.pc.signalingState === 'have-remote-offer') {
-        await this.pc.setRemoteDescription({ type: 'rollback' });
+        try {
+          await this.pc.setRemoteDescription({ type: 'rollback' });
+        } catch (rollbackErr) {
+          this.logger.warn('SDP rollback failed after negotiation error', rollbackErr);
+        }
       }
       throw err;
     } finally {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/client/src/rtc/Subscriber.ts` around lines 204 - 211, The catch
block in Subscriber.ts where setRemoteDescription with rollback is called can
mask the original negotiation error if the rollback itself throws an exception.
Wrap the setRemoteDescription({ type: 'rollback' }) call in a nested try/catch
block to handle any rollback failure separately, then ensure the original err is
still thrown after the finally block executes. This way, if the rollback fails,
its error can be logged or handled without losing the context of the original
negotiation error that triggered the rollback attempt.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@packages/client/src/rtc/Subscriber.ts`:
- Around line 204-211: The catch block in Subscriber.ts where
setRemoteDescription with rollback is called can mask the original negotiation
error if the rollback itself throws an exception. Wrap the
setRemoteDescription({ type: 'rollback' }) call in a nested try/catch block to
handle any rollback failure separately, then ensure the original err is still
thrown after the finally block executes. This way, if the rollback fails, its
error can be logged or handled without losing the context of the original
negotiation error that triggered the rollback attempt.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fb45dd31-7db5-4bc4-92ed-4ef3a77065b9

📥 Commits

Reviewing files that changed from the base of the PR and between fba9660 and 1d435f1.

📒 Files selected for processing (2)
  • packages/client/src/rtc/Subscriber.ts
  • packages/client/src/rtc/__tests__/Subscriber.test.ts

Reverts the organic ICE path migration detection added earlier on this
branch: the onIceDisconnected/shouldKeepScheduledIceRestart hooks in
BasePeerConnection and the candidate-pair snapshot/compare logic in
Publisher. When ICE returns to connected after a disconnect, the scheduled
restartICE is once again simply canceled, as on main.

Also removes the now-obsolete candidate-pair migration tests and the
ICE-transport mock helpers that only those tests used. BasePeerConnection,
Publisher, and webrtc.mocks are now identical to main.

The signal-close revival hardening in StreamSfuClient and Call is unrelated
and stays.
@oliverlaz oliverlaz changed the title feat(client): restart ICE on organic path migration and harden signal-close revival fix(client): harden SFU signal-close revival and subscriber ICE-restart recovery Jun 16, 2026
…ver bandwidth

Add CandidatePairMonitor, which watches the publisher's BUNDLE-shared ICE transport for selected-candidate-pair changes (e.g. WiFi to LTE) and restarts ICE so libwebrtc's network-route change re-seeds the send-side bandwidth estimate. Without this the estimate craters on a handover and ramps back over ~30s, starving the encoder and collapsing FPS. Validated via webrtc-internals: recovery in ~2-3s instead of ~30s.

A change restarts ICE only when it moves to a new interface (keyed by network-id) while ICE is healthy; same-interface churn, unhealthy-ICE changes, and returns to a recently-used path (flap) are ignored. On React Native this consumes the selectedcandidatepairchange event from the GetStream react-native-webrtc fork and is a safe no-op where the API is unavailable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant