feat(stt): forward session VAD events to STT plugins #5644
sam-s10s wants to merge 1 commit into livekit:main from
Add an STT.on_vad_event() hook and have AudioRecognition forward VAD events to the active STT instance, enabling plugins to react to session-level VAD (e.g. finalize on END_OF_SPEECH for externally driven turn detection modes).
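As a rough illustration of the hook described above, here is a self-contained sketch. It does not import livekit: `VADEventType`, `VADEvent`, `BaseSTT`, `FinalizingSTT`, and `forward_vad_event` are simplified stand-ins made up for this example, and the real `STT.on_vad_event()` signature in the PR may differ.

```python
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Optional

logger = logging.getLogger(__name__)


class VADEventType(Enum):
    # Hypothetical stand-in for the session VAD event types.
    START_OF_SPEECH = "start_of_speech"
    END_OF_SPEECH = "end_of_speech"


@dataclass
class VADEvent:
    # Hypothetical stand-in; a real VAD event would carry more fields.
    type: VADEventType


class BaseSTT:
    """Stand-in for an STT base class exposing the proposed hook."""

    def on_vad_event(self, ev: VADEvent) -> None:
        # Default is a no-op, so plugins that do not care are unaffected.
        return None


class FinalizingSTT(BaseSTT):
    """Hypothetical plugin that finalizes when the session VAD reports end of speech."""

    def __init__(self) -> None:
        self.finalize_calls = 0

    def on_vad_event(self, ev: VADEvent) -> None:
        if ev.type is VADEventType.END_OF_SPEECH:
            # A real plugin would ask its provider to flush/finalize here.
            self.finalize_calls += 1


def forward_vad_event(stt_inst: Optional[BaseSTT], ev: VADEvent) -> None:
    # Mirrors the guarded call in the diff: a failing plugin hook is logged
    # and must not break the recognition loop.
    if stt_inst is None:
        return
    try:
        stt_inst.on_vad_event(ev)
    except Exception:
        logger.exception("error forwarding VAD event to STT")
```

With this shape, a plugin opts in simply by overriding `on_vad_event`; plugins that leave the default untouched keep their current behavior.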
    if (stt_inst := self._session.stt) is not None:
        try:
            stt_inst.on_vad_event(ev)
        except Exception:
            logger.exception("error forwarding VAD event to STT")
🟡 VAD events forwarded to session-level STT instead of the active STT instance
The code at audio_recognition.py:885 uses self._session.stt to forward VAD events, but the active STT (the one actually processing audio) is resolved by agent_activity.py:3629-3630 as self._agent.stt if is_given(self._agent.stt) else self._session.stt. When a user configures the agent with its own STT via Agent(stt=my_stt), the active STT is the agent's instance, not the session's. In this case, self._session.stt may return a different STT instance (or None if only the agent has an STT), so the on_vad_event call either reaches the wrong instance or doesn't happen at all. This means any plugin that overrides on_vad_event (the stated purpose of this PR) won't receive events when STT is set at the agent level.
Prompt for agents
The issue is in `_on_vad_event` in `audio_recognition.py`. The code forwards VAD events to `self._session.stt`, but the active STT may be the agent-level one (resolved via `agent_activity.stt` property at `agent_activity.py:3629-3630`).
The `AudioRecognition` class currently only holds a reference to the `AgentSession` (via `self._session`), not the `AgentActivity` or `Agent`. To fix this, you could either:
1. Store a reference to the active STT instance (the `stt.STT` object, not just the `io.STTNode` callable) in `AudioRecognition` and update it when `update_stt` is called. For example, add an optional `stt_instance: stt.STT | None` parameter.
2. Have `AudioRecognition.__init__` or a new setter accept the active STT instance, and have `AgentActivity` pass `self.stt` (which correctly resolves agent vs session STT).
3. Access the active STT through the session's current activity, though this would add coupling.
The goal is to ensure `on_vad_event` is called on the same STT instance that the default `stt_node` uses (i.e., `activity.stt`).
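Option 2 above could look roughly like the following sketch. It uses stand-in classes rather than the real livekit types: `FakeSTT`, `FakeAgent`, `FakeSession`, `Recognition`, `resolve_active_stt`, and `set_stt_instance` are all hypothetical names, and `None` stands in for the not-given sentinel checked by `is_given`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeSTT:
    # Hypothetical stand-in for an stt.STT instance; records forwarded events.
    name: str
    events: List[str] = field(default_factory=list)

    def on_vad_event(self, ev: str) -> None:
        self.events.append(ev)


@dataclass
class FakeAgent:
    stt: Optional[FakeSTT] = None  # None plays the role of "not given"


@dataclass
class FakeSession:
    stt: Optional[FakeSTT] = None


def resolve_active_stt(agent: FakeAgent, session: FakeSession) -> Optional[FakeSTT]:
    # Same precedence as the resolution quoted in the review:
    # the agent-level STT wins, otherwise fall back to the session's.
    return agent.stt if agent.stt is not None else session.stt


class Recognition:
    """Stand-in for AudioRecognition holding the resolved STT instance."""

    def __init__(self, stt_instance: Optional[FakeSTT] = None) -> None:
        self._stt_instance = stt_instance

    def set_stt_instance(self, stt_instance: Optional[FakeSTT]) -> None:
        # Hypothetical setter; the owner would call it whenever the STT changes.
        self._stt_instance = stt_instance

    def on_vad_event(self, ev: str) -> None:
        # Forward to the stored instance instead of reading the session's STT.
        if self._stt_instance is not None:
            self._stt_instance.on_vad_event(ev)


agent_stt = FakeSTT("agent")
session_stt = FakeSTT("session")
recognition = Recognition(
    resolve_active_stt(FakeAgent(stt=agent_stt), FakeSession(stt=session_stt))
)
recognition.on_vad_event("end_of_speech")
```

In this sketch the event reaches only the agent-level instance, which is the behavior the review asks for; whether the real change should extend `update_stt` or add a separate setter is left open.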
Any reason why you need that? Is it required for some STT?
Currently, STT providers do not receive any external VAD events (e.g. from Silero VAD).
The proposal is to add an STT.on_vad_event() hook and have AudioRecognition forward VAD events to the active STT instance, enabling plugins to react to session-level VAD (e.g. finalize on END_OF_SPEECH for externally driven turn detection modes).
Example code: