Voice Activity (VAD) events | Smallest AI Docs

When vad_events=true is set on the WebSocket connection, the Pulse STT server emits two additional JSON message types: speech_started and speech_ended. They are interleaved with the transcription stream on the same socket. Event boundaries are derived from the audio signal and are independent of transcript finalization.

VAD events are WebSocket only. The pre-recorded REST endpoint does not emit them.

Enabling

Set vad_events=true on the WebSocket URL. Default is false.

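```javascript
// Node.js: the "ws" package accepts custom headers (the browser WebSocket API does not).
import WebSocket from "ws";

// Assumption: the API key is read from an environment variable of your choosing.
const API_KEY = process.env.SMALLEST_API_KEY;

const url = new URL("wss://api.smallest.ai/waves/v1/stt/live?model=pulse");
url.searchParams.append("language", "en");
url.searchParams.append("encoding", "linear16");
url.searchParams.append("sample_rate", "16000");
url.searchParams.append("vad_events", "true"); // opt in to speech_started / speech_ended

const ws = new WebSocket(url.toString(), {
  headers: { Authorization: `Bearer ${API_KEY}` },
});
```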

vad=true is accepted as an alias. When both are set, vad_events takes precedence and vad is ignored.
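For Python clients, the same connection URL can be assembled with the standard library. This is a minimal sketch: the helper name `build_live_url` and its default arguments are hypothetical, while the endpoint and query parameter names are copied from the example above.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.smallest.ai/waves/v1/stt/live"


def build_live_url(language="en", encoding="linear16",
                   sample_rate=16000, vad_events=True):
    """Build the live STT WebSocket URL with VAD events switched on or off."""
    params = {
        "model": "pulse",
        "language": language,
        "encoding": encoding,
        "sample_rate": str(sample_rate),
        # Query values are strings; the server default is "false".
        "vad_events": "true" if vad_events else "false",
    }
    return f"{BASE_URL}?{urlencode(params)}"
```

Setting only `vad_events` (and never the `vad` alias) avoids having to reason about precedence between the two.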

Message sequence

Boundaries are acoustic, not lexical. They are computed from the audio signal independently of transcript finalization, so they do not coincide with is_final transcript turns.

Event payloads

The two message types are interleaved with transcription messages on the same socket. Discriminate on the type field.

`speech_started`

Emitted when speech onset is detected after silence.

```json
{
  "type": "speech_started",
  "session_id": "a1b2c3d4",
  "timestamp": 1.84
}
```

`speech_ended`

Emitted when a voiced region ends and silence is detected.

```json
{
  "type": "speech_ended",
  "session_id": "a1b2c3d4",
  "timestamp": 4.52
}
```

Field	Type	Description
`type`	string	Message-type discriminator. One of `speech_started`, `speech_ended`.
`session_id`	string	Session identifier. Matches the `session_id` returned on `transcription` messages from the same connection.
`timestamp`	number	Seconds, measured from the first audio frame received on the connection.
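The three fields in the table map naturally onto a small typed record. The sketch below is one possible way to parse them in Python; `VadEvent` and `parse_vad_event` are hypothetical names, not part of any SDK, and only the fields documented above are assumed to be present.

```python
import json
from dataclasses import dataclass
from typing import Optional

VAD_TYPES = {"speech_started", "speech_ended"}


@dataclass(frozen=True)
class VadEvent:
    type: str          # "speech_started" or "speech_ended"
    session_id: str    # matches session_id on transcription messages
    timestamp: float   # seconds from the first audio frame received


def parse_vad_event(raw: str) -> Optional[VadEvent]:
    """Return a VadEvent for VAD messages, or None for any other message type."""
    msg = json.loads(raw)
    if msg.get("type") not in VAD_TYPES:
        return None
    return VadEvent(msg["type"], msg["session_id"], float(msg["timestamp"]))
```

Returning `None` for other message types lets the caller pass those messages to its existing transcript handling.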

Handling

Route each message by its type field. transcription messages fall through to the default branch, so existing transcript handling does not need to change.

```javascript
ws.onmessage = (e) => {
  const m = JSON.parse(e.data);
  switch (m.type) {
    case "speech_started":
      onSpeechStart(m.timestamp);
      break;
    case "speech_ended":
      onSpeechEnd(m.timestamp);
      break;
    case "transcription":
    default:
      handleTranscript(m);
      break;
  }
};
```

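```python
import json


async def consume(ws):
    """Route each incoming message by its "type" field."""
    async for raw in ws:
        m = json.loads(raw)
        t = m.get("type")
        if t == "speech_started":
            on_speech_start(m["timestamp"])
        elif t == "speech_ended":
            on_speech_end(m["timestamp"])
        else:
            # Transcription messages (and any other type) fall through here.
            handle_transcript(m)
```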
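One way to implement the `on_speech_start` and `on_speech_end` callbacks is to pair the two events into voiced segments. The class below is a hypothetical sketch, not part of the API. It also keeps track of a segment that has started but not yet ended, which is the state a client is left in if the connection closes before the final `speech_ended` arrives.

```python
from typing import List, Optional, Tuple


class SegmentTracker:
    """Pairs speech_started / speech_ended timestamps into (start, end) tuples."""

    def __init__(self) -> None:
        self.segments: List[Tuple[float, float]] = []
        self.open_start: Optional[float] = None  # start of an unfinished segment

    def on_speech_start(self, timestamp: float) -> None:
        self.open_start = timestamp

    def on_speech_end(self, timestamp: float) -> None:
        if self.open_start is None:
            return  # end without a matching start; ignore it
        self.segments.append((self.open_start, timestamp))
        self.open_start = None


tracker = SegmentTracker()
tracker.on_speech_start(1.84)
tracker.on_speech_end(4.52)
# tracker.segments now holds one pair: (1.84, 4.52)
```

If `open_start` is still set when the socket closes, the last voiced region never received its `speech_ended`; the client can decide whether to discard it or close it at the end of the audio.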

Notes

Opt-in. When vad_events is unset or false, the connection emits only transcription messages.
Acoustic, not lexical. Boundaries are derived from the audio signal, independent of is_final. They do not coincide with word boundaries.
speech_ended requires trailing silence. The event is emitted when silence is detected after a voiced region. If the connection closes while the audio still contains voiced energy, no speech_ended is emitted for the final voiced region. This happens, for example, when close_stream is sent immediately after the last word of an audio file. To force emission at the end of a finite audio file, append a short pad (about 1 second) of zero-valued PCM before sending close_stream.
timestamp is measured in seconds from the first audio frame received on the connection.
Sample rate. The sample_rate query parameter must equal the rate of the PCM frames sent on the connection. Mismatch causes timing drift on both transcripts and VAD timestamps.
No tuning parameters. Sensitivity threshold and debounce use model-side defaults; the API exposes no per-connection tuning.
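The trailing-silence pad described in the notes can be generated without any audio library. This is a minimal sketch assuming linear16 (16-bit signed PCM) audio; the helper name `silence_pad` is hypothetical, and how the pad and `close_stream` are sent depends on the WebSocket client in use.

```python
def silence_pad(sample_rate: int = 16000, seconds: float = 1.0,
                channels: int = 1) -> bytes:
    """Return zero-valued 16-bit PCM covering `seconds` of audio."""
    n_frames = int(sample_rate * seconds)
    # Each linear16 sample is two bytes; silence is all zeros.
    return b"\x00\x00" * (n_frames * channels)
```

Send the returned bytes as the last audio frame, using the same `sample_rate` as the connection, and only then send `close_stream`.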