fix(soniox): prevent stream hang on server-side errors #5677

Open
morix1500 wants to merge 2 commits into livekit:main from morix1500:fix/soniox-hang-on-server-error
morix1500 commented May 7, 2026

When Soniox sends a WebSocket error frame such as 503 - Cannot continue request (code 4), the recv loop only logged the error and never reconnected, leaving the agent process hung indefinitely.

Root cause: the plugin defined _reconnect_event but never called _reconnect_event.set() anywhere in the codebase (dead code since the plugin's first commit), so the reconnection mechanism was effectively disabled.

This PR removes _reconnect_event and aligns the plugin with the Deepgram pattern — surface failures as APIError so the base class SpeechStream._main_task retry/backoff policy handles recovery.
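To illustrate why raising matters, here is a minimal sketch of a retry loop in the spirit of the base-class policy. It is not the actual `SpeechStream._main_task` code: `main_task`, `max_retry`, `retry_interval`, and the stand-in `APIError` class are hypothetical names chosen for this example. The point is only that a failure which is logged but never raised gives a loop like this nothing to retry.

```python
import asyncio


class APIError(Exception):
    """Stand-in for the library's retryable error type (assumption)."""


async def main_task(run, max_retry: int = 3, retry_interval: float = 0.0) -> int:
    """Call run() until it returns; retry on APIError up to max_retry times.

    Returns the number of retries that were needed.
    """
    retries = 0
    while True:
        try:
            await run()
            return retries
        except APIError:
            if retries >= max_retry:
                raise  # give up and let the caller see the failure
            retries += 1
            await asyncio.sleep(retry_interval)
```

With this shape, a `_run` that raises on a 503 frame is simply called again after the interval, while a `_run` that swallows the error keeps waiting on a dead connection.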

Changes:

  • in-band error_code/error_message frames → APIStatusError
  • unexpected WS CLOSED/CLOSE/CLOSING → APIStatusError
  • mid-stream aiohttp.ClientError from gather → APIConnectionError
  • normal end of session (finished frame followed by clean close) → raise an internal _SessionFinished so the surrounding finally cancels sibling tasks and _run returns cleanly without triggering a retry (prevents the gather hang where _send_audio_task blocks on an empty queue)
  • tolerate Soniox control frames that omit the tokens field
  • decorate audio/keepalive tasks with @utils.log_exceptions for uniform task-level logging
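A rough sketch of the frame handling described in the first and fifth bullets, assuming JSON text frames. `parse_frame` is a hypothetical helper for illustration, not the plugin's function, and the `APIStatusError` below is a stand-in so the snippet runs without `livekit-agents` installed; the real class has a different constructor.

```python
import json


class APIError(Exception):
    """Stand-in for the library's base error (assumption)."""


class APIStatusError(APIError):
    """Stand-in carrying the status reported by the server (assumption)."""

    def __init__(self, message: str, status_code: int) -> None:
        super().__init__(message)
        self.status_code = status_code


def parse_frame(raw: str) -> list:
    """Return the tokens in a frame, or raise if the frame reports an error."""
    content = json.loads(raw)
    if content.get("error_code") or content.get("error_message"):
        # Raise instead of logging so the caller's retry policy can react.
        raise APIStatusError(
            content.get("error_message", "unknown error"),
            status_code=content.get("error_code", -1),
        )
    # Control frames may omit "tokens"; treat them as carrying none.
    return content.get("tokens", [])
```

Using `.get("tokens", [])` rather than indexing means a control frame without tokens is handled as an empty result instead of raising `KeyError`.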

When Soniox sent a WebSocket error frame (e.g. "Cannot continue request"
with status 503), the recv loop only logged it and never reconnected,
which caused the agent process to hang.

Remove the dormant `_reconnect_event` (dead code since the plugin's first
commit) and align with the Deepgram pattern: surface in-band error frames,
unexpected WS closes, and mid-stream transport failures as `APIError` so
the base class `SpeechStream._main_task` retry/backoff policy applies.

- error_code/error_message frames -> raise APIStatusError
- unexpected WS CLOSED/CLOSE/CLOSING -> raise APIStatusError
- mid-stream aiohttp.ClientError from gather -> raise APIConnectionError
- finished frame followed by normal close -> return cleanly (no retry)
- tolerate Soniox control frames without a `tokens` field
- decorate tasks with @utils.log_exceptions for uniform task-level logging

CLAassistant commented May 7, 2026

CLA assistant check
All committers have signed the CLA.

devin-ai-integration[bot]

This comment was marked as resolved.

After the server sends `finished` and cleanly closes the WS, the recv
task previously returned normally. `asyncio.gather` then waited forever
for `_send_audio_task`, which was blocked on an empty `audio_queue`,
hanging the entire stream and any consumer iterating over it.

Raise a private `_SessionFinished` from the recv task and catch it in
`_run` so the surrounding finally block cancels sibling tasks via
`gracefully_cancel`, letting `_main_task` finish and `_event_ch` close.
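
The hang and the fix can be reproduced in a few lines of plain `asyncio`. This is a simplified sketch, not the plugin code: `send_audio`, `recv_messages`, and `run` are illustrative names, and plain `Task.cancel()` stands in for the library's `gracefully_cancel` helper.

```python
import asyncio


class _SessionFinished(Exception):
    """Private signal: the server ended the session normally."""


async def send_audio(queue: asyncio.Queue) -> None:
    while True:
        await queue.get()  # blocks forever once no more audio arrives


async def recv_messages() -> None:
    await asyncio.sleep(0)  # pretend `finished` and a clean close arrived
    # Returning here would leave gather() waiting on send_audio forever.
    raise _SessionFinished()


async def run() -> str:
    queue: asyncio.Queue = asyncio.Queue()
    tasks = [
        asyncio.create_task(send_audio(queue)),
        asyncio.create_task(recv_messages()),
    ]
    try:
        await asyncio.gather(*tasks)
    except _SessionFinished:
        return "finished"  # normal end of session, no retry
    finally:
        # Cancel whatever is still running and wait for it to unwind.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    return "unreachable"
```

Because `gather` propagates the first exception to its awaiter, the sentinel breaks out of the wait, and the `finally` block then cleans up the sender that would otherwise stay blocked on the empty queue.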
2 participants