Voice Activity Detection (VAD) events

When vad_events=true is set on the WebSocket connection, the Pulse STT server emits two additional JSON message types: speech_started and speech_ended. They are interleaved with the transcription stream on the same socket. Event boundaries are derived from the audio signal and are independent of transcript finalization.

VAD events are WebSocket only. The pre-recorded REST endpoint does not emit them.

Enabling

Set vad_events=true on the WebSocket URL. Default is false.

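// Assumes the Node.js "ws" package; the browser WebSocket API does not accept a headers option.
import WebSocket from "ws";

const url = new URL("wss://api.smallest.ai/waves/v1/stt/live?model=pulse");
url.searchParams.append("language", "en");
url.searchParams.append("encoding", "linear16");
url.searchParams.append("sample_rate", "16000");
url.searchParams.append("vad_events", "true"); // opt in to speech_started / speech_ended

const ws = new WebSocket(url.toString(), {
  headers: { Authorization: `Bearer ${API_KEY}` }, // API_KEY is your key, defined elsewhere
});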

vad=true is accepted as an alias. When both are set, vad_events takes precedence and vad is ignored.
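
For Python clients, the same connection URL can be assembled with the standard library. This is a minimal sketch: build_live_url is a hypothetical helper rather than part of any SDK, and the parameter values simply mirror the JavaScript example above.

```python
from urllib.parse import urlencode

# Endpoint taken from the JavaScript example above.
BASE_URL = "wss://api.smallest.ai/waves/v1/stt/live"


def build_live_url(vad_events: bool = True, sample_rate: int = 16000) -> str:
    """Return the live STT WebSocket URL with VAD events switched on or off.

    Hypothetical helper for illustration; not part of any official client.
    """
    params = {
        "model": "pulse",
        "language": "en",
        "encoding": "linear16",
        "sample_rate": str(sample_rate),
        # The query value is the literal string "true", as in the example above.
        "vad_events": "true" if vad_events else "false",
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_live_url()
print(url)
```

Pass the resulting URL, together with the Authorization header, to whichever WebSocket client you use. The keyword for custom headers differs between client libraries, so check your client's documentation.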

Message sequence

Boundaries are acoustic, not lexical. They are computed from the audio signal independently of transcript finalization, so they do not coincide with is_final transcript turns.

Event payloads

The two message types are interleaved with transcription messages on the same socket. Discriminate on the type field.

speech_started

Emitted when speech onset is detected after silence.

{
  "type": "speech_started",
  "session_id": "a1b2c3d4",
  "timestamp": 1.84
}

speech_ended

Emitted when a voiced region ends and silence is detected.

{
  "type": "speech_ended",
  "session_id": "a1b2c3d4",
  "timestamp": 4.52
}

| Field | Type | Description |
| --- | --- | --- |
| type | string | Message-type discriminator. One of speech_started, speech_ended. |
| session_id | string | Session identifier. Matches the session_id returned on transcription messages from the same connection. |
| timestamp | number | Seconds, measured from the first audio frame received on the connection. |
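
The three fields above map naturally onto a small typed record. The following is a sketch only: VadEvent and parse_vad_event are hypothetical names introduced for illustration, and the code assumes each incoming message is a JSON object.

```python
import json
from dataclasses import dataclass
from typing import Optional

VAD_TYPES = {"speech_started", "speech_ended"}


@dataclass(frozen=True)
class VadEvent:
    type: str          # "speech_started" or "speech_ended"
    session_id: str    # same id as on transcription messages
    timestamp: float   # seconds since the first audio frame


def parse_vad_event(raw: str) -> Optional[VadEvent]:
    """Return a VadEvent for VAD messages, or None for any other message type."""
    msg = json.loads(raw)
    if msg.get("type") not in VAD_TYPES:
        return None
    return VadEvent(msg["type"], msg["session_id"], float(msg["timestamp"]))


event = parse_vad_event(
    '{"type": "speech_started", "session_id": "a1b2c3d4", "timestamp": 1.84}'
)
print(event)
```

Returning None for non-VAD messages lets the caller hand those to its existing transcript handling unchanged.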

Handling

Discriminate on the type field. Messages of type transcription fall through to the default branch unchanged.

ws.onmessage = (e) => {
  const m = JSON.parse(e.data);
  switch (m.type) {
    case "speech_started":
      onSpeechStart(m.timestamp);
      break;
    case "speech_ended":
      onSpeechEnd(m.timestamp);
      break;
    case "transcription":
    default:
      handleTranscript(m);
      break;
  }
};
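
import json


async def consume(ws):
    """Route each message by its type field; ws is an async-iterable WebSocket connection."""
    async for raw in ws:
        m = json.loads(raw)
        t = m.get("type")
        if t == "speech_started":
            on_speech_start(m["timestamp"])
        elif t == "speech_ended":
            on_speech_end(m["timestamp"])
        else:
            # Transcription messages (and any unrecognized type) land here.
            handle_transcript(m)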
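
A common next step is to pair each speech_started with the following speech_ended to obtain voiced segments. The sketch below shows one way to do that. SegmentTracker is a hypothetical class, and the sample messages are illustrative: only the 1.84 and 4.52 timestamps come from the payload examples above, and the transcription message is reduced to the fields this page mentions.

```python
from typing import List, Optional, Tuple


class SegmentTracker:
    """Pairs speech_started / speech_ended events into (start, end) segments."""

    def __init__(self) -> None:
        self.open_start: Optional[float] = None        # onset awaiting its end event
        self.segments: List[Tuple[float, float]] = []  # completed voiced regions

    def feed(self, msg: dict) -> None:
        kind = msg.get("type")
        if kind == "speech_started":
            self.open_start = msg["timestamp"]
        elif kind == "speech_ended" and self.open_start is not None:
            self.segments.append((self.open_start, msg["timestamp"]))
            self.open_start = None
        # Any other message type (for example transcription) is ignored here.


tracker = SegmentTracker()
for m in [
    {"type": "speech_started", "session_id": "a1b2c3d4", "timestamp": 1.84},
    {"type": "transcription", "session_id": "a1b2c3d4"},
    {"type": "speech_ended", "session_id": "a1b2c3d4", "timestamp": 4.52},
    {"type": "speech_started", "session_id": "a1b2c3d4", "timestamp": 6.10},
]:
    tracker.feed(m)

print(tracker.segments, tracker.open_start)
```

After the loop, one segment is complete and one onset is still open. An onset left open when the connection closes corresponds to a final voiced region for which no speech_ended was emitted.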

Notes

  • Opt-in. When vad_events is unset or false, the connection emits only transcription messages.
  • Acoustic, not lexical. Boundaries are derived from the audio signal, independent of is_final. They do not coincide with word boundaries.
  • speech_ended requires trailing silence. The event is emitted when silence is detected after a voiced region. If the connection closes while the audio still contains voiced energy, no speech_ended is emitted for the final voiced region. This happens, for example, when close_stream is sent immediately after the last word of an audio file. To force emission at the end of a finite audio file, append a short pad (about 1 second) of zero-valued PCM before sending close_stream.
  • timestamp is measured in seconds from the first audio frame received on the connection.
  • Sample rate. The sample_rate query parameter must equal the rate of the PCM frames sent on the connection. A mismatch causes timing drift in both transcript and VAD timestamps.
  • No tuning parameters. Sensitivity threshold and debounce use model-side defaults; the API exposes no per-connection tuning.
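
The trailing-silence note above can be made concrete by generating the zero-valued pad in code. This is a sketch under stated assumptions: silence_pad is a hypothetical helper, the audio is linear16 (2 bytes per sample), and the stream is mono unless channels is set otherwise.

```python
BYTES_PER_SAMPLE = 2  # linear16 is 16-bit signed PCM


def silence_pad(sample_rate: int, seconds: float = 1.0, channels: int = 1) -> bytes:
    """Return zero-valued linear16 PCM covering `seconds` of audio.

    sample_rate must match the sample_rate query parameter of the connection.
    The mono default is an assumption; adjust channels to match your stream.
    """
    n_frames = int(sample_rate * seconds)
    # bytes(n) yields n zero bytes, which is digital silence for signed PCM.
    return bytes(n_frames * channels * BYTES_PER_SAMPLE)


pad = silence_pad(16000)
print(len(pad))  # 32000 bytes for 1 s of 16 kHz mono audio
```

Send the pad as ordinary audio after the last chunk of the file and before close_stream, so the server sees silence following the final voiced region.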