
fix(events): SSE health-check watchdog to detect silently-dead streams #123

Open
avfirsov wants to merge 1 commit into grinev:main from avfirsov:fix/sse-watchdog

@avfirsov
Contributor

Problem

The SSE for-await loop cannot detect a stream that died without throwing — it sits idle waiting for events that never arrive, and the reconnect logic only fires when the stream explicitly ends or errors. A corporate proxy (e.g. cntlm) cutting off long-running connections, a broken pipe that never surfaces an error, or a tight reconnect loop with no successful event delivery can all leave the bot silent indefinitely.

Changes

src/opencode/events.ts — track stream health and expose getters:

  • lastSseEventTime is refreshed on every received event;
  • consecutiveReconnectAttempts is incremented in both reconnect paths (stream-ended and error) and reset whenever an event arrives;
  • exports getLastSseEventTime, getConsecutiveReconnectAttempts, isEventListening, getActiveEventDirectory.
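
A minimal sketch of the bookkeeping described above, not the actual diff. The getter names come from this PR; `recordSseEvent` and `recordReconnectAttempt` are hypothetical helpers standing in for the inline updates inside the for-await loop and the two reconnect paths.

```typescript
// Module-level health state (sketch; the real events.ts may differ).
let lastSseEventTime = Date.now();
let consecutiveReconnectAttempts = 0;

// Hypothetical helper: call for every event received from the stream.
// Any delivered event proves the stream is alive, so the counter resets.
function recordSseEvent(now: number = Date.now()): void {
  lastSseEventTime = now;
  consecutiveReconnectAttempts = 0;
}

// Hypothetical helper: call from both reconnect paths
// (stream ended normally, stream threw an error).
function recordReconnectAttempt(): void {
  consecutiveReconnectAttempts += 1;
}

// Getters named in the PR; exported in events.ts, left unexported here
// so the snippet runs standalone.
function getLastSseEventTime(): number {
  return lastSseEventTime;
}

function getConsecutiveReconnectAttempts(): number {
  return consecutiveReconnectAttempts;
}
```

Keeping the state behind getters lets the watchdog in src/bot/index.ts read it without being able to mutate it.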

src/bot/index.ts — add a 30-second sseWatchdogTimer:

  • if isEventListening() is true AND (no event has arrived for >30s OR consecutiveReconnectAttempts has reached ≥5) → stopEventListening() + ensureEventSubscription(directory);
  • cleared in both createBot() and cleanupBotRuntime() alongside the existing heartbeat timer.
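
The restart condition can be sketched as a pure predicate plus a thin timer wrapper. This is an illustration under assumptions, not the code in this PR: `SseHealth`, `shouldRestartSse`, `startSseWatchdog` and the two constants are hypothetical names, and `restart` stands in for the stopEventListening() + ensureEventSubscription(directory) pair.

```typescript
const SSE_STALE_MS = 30_000;        // no event for longer than this counts as stale
const MAX_RECONNECT_ATTEMPTS = 5;   // this many attempts without an event counts as flapping

// Hypothetical snapshot of what the events.ts getters would return.
interface SseHealth {
  listening: boolean;
  lastEventTime: number;
  reconnectAttempts: number;
}

// Pure decision: restart only while listening, and only if stale OR flapping.
function shouldRestartSse(health: SseHealth, now: number): boolean {
  if (!health.listening) return false;
  const stale = now - health.lastEventTime > SSE_STALE_MS;
  const flapping = health.reconnectAttempts >= MAX_RECONNECT_ATTEMPTS;
  return stale || flapping;
}

// Timer wrapper: checks health on every tick and returns a function that
// clears the timer, mirroring the cleanup done next to the heartbeat timer.
function startSseWatchdog(
  readHealth: () => SseHealth,
  restart: () => void,
  intervalMs: number = SSE_STALE_MS,
): () => void {
  const timer = setInterval(() => {
    if (shouldRestartSse(readHealth(), Date.now())) restart();
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Separating the predicate from the timer keeps the threshold logic testable without waiting on real time.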

Test plan

  • npm run build clean (verified locally)
  • Start bot, normal traffic → watchdog stays quiet (event timestamps keep refreshing)
  • Kill opencode serve while bot is connected → reconnects pile up → watchdog logs [SSE Watchdog] Restarting… → on opencode serve restart, subscription resumes without bot restart


Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>