---
title: Speaker diarization
description: Label each word and utterance with turn-by-turn speaker IDs
---

## Enabling speaker diarization

### Pre-Recorded API

Pass `diarize=true` when calling the Pulse STT POST endpoint. The parameter can be combined with other enrichment options (timestamps, emotions, etc.) without changing your audio payload.

```bash
curl --request POST \
  --url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&diarize=true" \
  --header "Authorization: Bearer $SMALLEST_API_KEY" \
  --header "Content-Type: audio/wav" \
  --data-binary "@/path/to/audio.wav"
```

### Real-Time WebSocket API

Add `diarize=true` to the query parameters when connecting to the Pulse STT WebSocket API.

```javascript
// Assumes Node.js with the `ws` package: the browser WebSocket
// constructor cannot set custom headers such as Authorization.
const WebSocket = require("ws");

const API_KEY = process.env.SMALLEST_API_KEY;

const url = new URL("wss://waves-api.smallest.ai/api/v1/pulse/get_text");
url.searchParams.append("language", "en");
url.searchParams.append("encoding", "linear16");
url.searchParams.append("sample_rate", "16000");
url.searchParams.append("diarize", "true"); // enable speaker diarization

const ws = new WebSocket(url.toString(), {
  headers: {
    Authorization: `Bearer ${API_KEY}`,
  },
});
```

## Output format & fields of interest

When diarization is enabled, every entry in `words` includes a `speaker` field. The real-time API returns an integer ID (`0`, `1`, …) together with a `speaker_confidence` score (0.0 to 1.0); the pre-recorded API returns a string label (`speaker_0`, `speaker_1`, …) and no confidence score. The `utterances` array also carries `speaker` labels, so you can reconstruct conversations, build turn-taking analytics, or display multi-speaker captions.
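
Because the two APIs label speakers differently, client code that consumes both may want to normalize the labels before using them. The following is a minimal sketch, not part of the Pulse API: `normalizeSpeaker` and `formatTranscript` are hypothetical helper names, and the `Speaker N:` output format is only an illustration.

```javascript
// Hypothetical helper: map either label style to an integer speaker ID.
function normalizeSpeaker(label) {
  // Real-time API: already an integer (0, 1, ...).
  if (Number.isInteger(label)) return label;
  // Pre-recorded API: string labels such as "speaker_0".
  const match = /^speaker_(\d+)$/.exec(String(label));
  if (!match) throw new Error(`Unrecognized speaker label: ${label}`);
  return Number(match[1]);
}

// Hypothetical helper: render an `utterances` array as one line per turn.
function formatTranscript(utterances) {
  return utterances
    .map((u) => `Speaker ${normalizeSpeaker(u.speaker)}: ${u.text}`)
    .join("\n");
}
```

Passing the `utterances` array from either response shape to `formatTranscript` yields the same speaker-labelled text, which keeps downstream caption or analytics code independent of which API produced the transcript.
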
## Sample request

### Pre-Recorded API

```bash
curl --request POST \
  --url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&diarize=true" \
  --header "Authorization: Bearer $SMALLEST_API_KEY" \
  --header "Content-Type: audio/wav" \
  --data-binary "@/path/to/two-speaker.wav"
```

## Sample response

### Pre-Recorded API Response

```json
{
  "transcription": "Agent: Hello world. Customer: Hi there.",
  "words": [
    { "start": 0.0, "end": 0.4, "speaker": "speaker_0", "word": "Hello" },
    { "start": 0.4, "end": 0.8, "speaker": "speaker_0", "word": "world." },
    { "start": 1.0, "end": 1.2, "speaker": "speaker_1", "word": "Hi" },
    { "start": 1.2, "end": 1.6, "speaker": "speaker_1", "word": "there." }
  ],
  "utterances": [
    { "text": "Hello world.", "start": 0.0, "end": 0.8, "speaker": "speaker_0" },
    { "text": "Hi there.", "start": 1.0, "end": 1.6, "speaker": "speaker_1" }
  ]
}
```

### Real-Time WebSocket API Response

```json
{
  "session_id": "sess_12345abcde",
  "transcript": "Hello world. Hi there.",
  "is_final": true,
  "is_last": false,
  "language": "en",
  "words": [
    { "word": "Hello", "start": 0.0, "end": 0.4, "confidence": 0.98, "speaker": 0, "speaker_confidence": 0.95 },
    { "word": "world.", "start": 0.4, "end": 0.8, "confidence": 0.97, "speaker": 0, "speaker_confidence": 0.92 },
    { "word": "Hi", "start": 1.0, "end": 1.2, "confidence": 0.99, "speaker": 1, "speaker_confidence": 0.88 },
    { "word": "there.", "start": 1.2, "end": 1.6, "confidence": 0.96, "speaker": 1, "speaker_confidence": 0.91 }
  ],
  "utterances": [
    { "text": "Hello world.", "start": 0.0, "end": 0.8, "speaker": 0 },
    { "text": "Hi there.", "start": 1.0, "end": 1.6, "speaker": 1 }
  ]
}
```

## Response Fields

| Field | Type | When Included | Description |
| --- | --- | --- | --- |
| `speaker` | integer (real-time) / string (pre-recorded) | `diarize=true` | Speaker label. The real-time API uses integer IDs (`0`, `1`, ...); the pre-recorded API uses string labels (`speaker_0`, `speaker_1`, ...). |
| `speaker_confidence` | number | `diarize=true` (real-time only) | Confidence score for the speaker assignment (0.0 to 1.0). |
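
As an illustration of how these fields might be consumed, the sketch below collapses consecutive words from the same speaker into turns and marks a turn as uncertain when any of its words has a low `speaker_confidence`. The function name `groupWordsIntoTurns`, the `uncertain` flag, and the default threshold of `0.9` are assumptions made for this example, not documented API behavior.

```javascript
// Hypothetical helper: build speaker turns from a diarized `words` array.
// `minConfidence` is an illustrative threshold, not a value defined by the API.
function groupWordsIntoTurns(words, minConfidence = 0.9) {
  const turns = [];
  for (const w of words) {
    let current = turns[turns.length - 1];
    if (current && current.speaker === w.speaker) {
      // Same speaker as the previous word: extend the current turn.
      current.text += ` ${w.word}`;
      current.end = w.end;
    } else {
      // Speaker changed (or first word): start a new turn.
      current = {
        speaker: w.speaker,
        start: w.start,
        end: w.end,
        text: w.word,
        uncertain: false,
      };
      turns.push(current);
    }
    // `speaker_confidence` is real-time only, so skip the check when absent.
    if (
      typeof w.speaker_confidence === "number" &&
      w.speaker_confidence < minConfidence
    ) {
      current.uncertain = true;
    }
  }
  return turns;
}
```

Applied to the real-time sample response above, this would produce two turns, with the second flagged as uncertain because the word "Hi" has a `speaker_confidence` of 0.88. For pre-recorded responses the flag simply stays `false`, since no confidence score is returned.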