Voice Activity (VAD) events
When vad_events=true is set on the WebSocket connection, the Pulse STT server emits two additional JSON message types: speech_started and speech_ended. They are interleaved with the transcription stream on the same socket. Event boundaries are derived from the audio signal and are independent of transcript finalization.
VAD events are WebSocket only. The pre-recorded REST endpoint does not emit them.
Enabling
Set vad_events=true on the WebSocket URL. Default is false.
vad=true is accepted as an alias. When both are set, vad_events takes precedence and vad is ignored.
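As a minimal sketch, the query string can be assembled as follows. The base URL is a placeholder, not the real Pulse STT endpoint, and build_ws_url is a hypothetical helper; only the vad_events and sample_rate parameter names come from this page.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): substitute the real Pulse STT WebSocket URL.
BASE_URL = "wss://example.invalid/stt"


def build_ws_url(base_url: str, sample_rate: int, vad_events: bool = False) -> str:
    """Return the WebSocket URL with its query parameters appended."""
    params = {"sample_rate": sample_rate}
    if vad_events:
        # The server default is false, so the parameter is only sent when enabling.
        params["vad_events"] = "true"
    return f"{base_url}?{urlencode(params)}"


url = build_ws_url(BASE_URL, sample_rate=16000, vad_events=True)
print(url)
```

Sending only vad_events, rather than both vad_events and the vad alias, avoids relying on the precedence rule.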
Message sequence
Boundaries are acoustic, not lexical. They are computed from the audio signal independently of transcript finalization, so they do not coincide with is_final transcript turns.
Event payloads
The two message types are interleaved with transcription messages on the same socket. Discriminate on the type field.
speech_started
Emitted when speech onset is detected after silence.
speech_ended
Emitted when a voiced region ends and silence is detected.
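Because each event carries a timestamp, onset and end events can be paired into voiced segments. The sketch below assumes every event is a JSON object with type and timestamp fields, the only two fields this page names; the real payloads may carry more. SegmentTracker is a hypothetical helper, and the sample messages (including the transcript field) are made up for illustration.

```python
import json
from typing import List, Optional, Tuple


class SegmentTracker:
    """Pairs speech_started / speech_ended events into (start, end) segments."""

    def __init__(self) -> None:
        self.segments: List[Tuple[float, float]] = []
        self._open_start: Optional[float] = None

    def feed(self, raw: str) -> None:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "speech_started":
            # Remember the onset until the matching end event arrives.
            self._open_start = float(msg["timestamp"])
        elif kind == "speech_ended" and self._open_start is not None:
            self.segments.append((self._open_start, float(msg["timestamp"])))
            self._open_start = None
        # Any other message type (for example transcription) is ignored here.


tracker = SegmentTracker()
for raw in (
    '{"type": "speech_started", "timestamp": 0.5}',   # illustrative payload
    '{"type": "transcription", "transcript": "hello"}',  # illustrative payload
    '{"type": "speech_ended", "timestamp": 2.0}',     # illustrative payload
):
    tracker.feed(raw)
print(tracker.segments)
```

A segment left open when the connection closes stays unpaired in this sketch, which matches the case where no speech_ended is emitted for the final voiced region.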
Handling
Branch on the type field: handle speech_started and speech_ended explicitly, and let transcription messages fall through to the default branch unchanged.
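One way to structure that routing is a small dispatch table keyed on type. This is a sketch, not the reference client: make_dispatcher and the callback names are hypothetical, and the sample messages assume only the type and timestamp fields mentioned on this page.

```python
import json
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], None]


def make_dispatcher(on_started: Handler, on_ended: Handler, on_default: Handler) -> Callable[[str], None]:
    """Build a function that routes one raw JSON message by its type field."""
    handlers: Dict[str, Handler] = {
        "speech_started": on_started,
        "speech_ended": on_ended,
    }

    def dispatch(raw: str) -> None:
        msg = json.loads(raw)
        # Unknown or missing types (including transcription) take the default branch.
        handlers.get(msg.get("type"), on_default)(msg)

    return dispatch


log = []
dispatch = make_dispatcher(
    on_started=lambda m: log.append(("started", m["timestamp"])),
    on_ended=lambda m: log.append(("ended", m["timestamp"])),
    on_default=lambda m: log.append(("other", m.get("type"))),
)
dispatch('{"type": "speech_started", "timestamp": 1.25}')
dispatch('{"type": "transcription"}')
dispatch('{"type": "speech_ended", "timestamp": 3.0}')
print(log)
```

Routing everything else to a single default handler means existing transcription handling does not need to change when VAD events are turned on.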
Notes
- Opt-in. When vad_events is unset or false, the connection emits only transcription messages.
- Acoustic, not lexical. Boundaries are derived from the audio signal, independent of is_final. They do not coincide with word boundaries.
- speech_ended requires trailing silence. The event is emitted when silence is detected after a voiced region. If the connection closes while the audio still contains voiced energy, no speech_ended is emitted for the final voiced region. This happens, for example, when close_stream is sent immediately after the last word of an audio file. To force emission at the end of a finite audio file, append a short pad (about 1 second) of zero-valued PCM before sending close_stream.
- timestamp is measured in seconds from the first audio frame received on the connection.
- Sample rate. The sample_rate query parameter must equal the rate of the PCM frames sent on the connection. Mismatch causes timing drift on both transcripts and VAD timestamps.
- No tuning parameters. Sensitivity threshold and debounce use model-side defaults; the API exposes no per-connection tuning.
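The trailing-silence pad described above can be generated without any audio library. This sketch assumes 16-bit signed mono PCM, which this page does not specify; adjust bytes_per_sample and channels to match the actual stream. silence_pad is a hypothetical helper.

```python
def silence_pad(
    sample_rate: int,
    seconds: float = 1.0,
    bytes_per_sample: int = 2,  # assumption: 16-bit signed PCM
    channels: int = 1,          # assumption: mono
) -> bytes:
    """Return zero-valued PCM covering the requested duration."""
    n_frames = int(sample_rate * seconds)
    # bytes(n) yields n zero bytes, which is digital silence for signed PCM.
    return bytes(n_frames * bytes_per_sample * channels)


pad = silence_pad(16000)
print(len(pad))  # 32000 bytes: 16000 frames x 2 bytes x 1 channel
```

Send the pad as the last audio frames, then send close_stream. The sample_rate passed here should be the same value used in the sample_rate query parameter, so the pad lasts the intended duration.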

