---
title: Speaker diarization
description: Label each word and utterance with turn-by-turn speaker IDs
---
## Enabling speaker diarization
### Pre-Recorded API
Pass `diarize=true` when calling the Pulse STT POST endpoint. The parameter can be combined with other enrichment options (timestamps, emotions, etc.) without changing your audio payload.
```bash
curl --request POST \
--url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&diarize=true" \
--header "Authorization: Bearer $SMALLEST_API_KEY" \
--header "Content-Type: audio/wav" \
--data-binary "@/path/to/audio.wav"
```
### Real-Time WebSocket API
Add `diarize=true` to the query parameters of the connection URL when you open a Pulse STT WebSocket session.
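```javascript
// Node.js: the `headers` option is supported by the `ws` package,
// not by the browser's built-in WebSocket.
const WebSocket = require("ws");

// Read the API key from the environment (same variable as the curl examples).
const API_KEY = process.env.SMALLEST_API_KEY;

const url = new URL("wss://waves-api.smallest.ai/api/v1/pulse/get_text");
url.searchParams.append("language", "en");
url.searchParams.append("encoding", "linear16");
url.searchParams.append("sample_rate", "16000");
url.searchParams.append("diarize", "true"); // enable speaker diarization

const ws = new WebSocket(url.toString(), {
  headers: {
    Authorization: `Bearer ${API_KEY}`,
  },
});
```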
## Output format & fields of interest
When diarization is enabled, every entry in `words` includes a `speaker` field. The real-time API returns an integer ID (`0`, `1`, …) together with a `speaker_confidence` score (0.0 to 1.0); the pre-recorded API returns a string label (`speaker_0`, `speaker_1`, …) and no confidence score. The `utterances` array also carries `speaker` labels so you can reconstruct conversations, build turn-taking analytics, or display multi-speaker captions.
## Sample request (Pre-Recorded API)
```bash
curl --request POST \
--url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&diarize=true" \
--header "Authorization: Bearer $SMALLEST_API_KEY" \
--header "Content-Type: audio/wav" \
--data-binary "@/path/to/two-speaker.wav"
```
## Sample response
### Pre-Recorded API Response
```json
{
"transcription": "Agent: Hello world. Customer: Hi there.",
"words": [
{ "start": 0.0, "end": 0.4, "speaker": "speaker_0", "word": "Hello" },
{ "start": 0.4, "end": 0.8, "speaker": "speaker_0", "word": "world." },
{ "start": 1.0, "end": 1.2, "speaker": "speaker_1", "word": "Hi" },
{ "start": 1.2, "end": 1.6, "speaker": "speaker_1", "word": "there." }
],
"utterances": [
{ "text": "Hello world.", "start": 0.0, "end": 0.8, "speaker": "speaker_0" },
{ "text": "Hi there.", "start": 1.0, "end": 1.6, "speaker": "speaker_1" }
]
}
```
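To rebuild speaker turns from the word-level labels, one option is to merge consecutive words that share the same `speaker` value. The sketch below is illustrative only: `groupWordsBySpeaker` and `sampleWords` are hypothetical names, and the input is assumed to have the shape of the `words` array in the sample response above.

```javascript
// Merge consecutive words with the same speaker label into one turn.
// Assumes each word has `word`, `start`, `end`, and `speaker` fields,
// as in the sample response (hypothetical helper, not part of any SDK).
function groupWordsBySpeaker(words) {
  const turns = [];
  for (const w of words) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === w.speaker) {
      // Same speaker as the previous word: extend the current turn.
      last.text += " " + w.word;
      last.end = w.end;
    } else {
      // Speaker changed (or first word): start a new turn.
      turns.push({ speaker: w.speaker, start: w.start, end: w.end, text: w.word });
    }
  }
  return turns;
}

// Words copied from the sample response.
const sampleWords = [
  { start: 0.0, end: 0.4, speaker: "speaker_0", word: "Hello" },
  { start: 0.4, end: 0.8, speaker: "speaker_0", word: "world." },
  { start: 1.0, end: 1.2, speaker: "speaker_1", word: "Hi" },
  { start: 1.2, end: 1.6, speaker: "speaker_1", word: "there." },
];

for (const t of groupWordsBySpeaker(sampleWords)) {
  console.log(`[${t.start.toFixed(1)}-${t.end.toFixed(1)}] ${t.speaker}: ${t.text}`);
}
// [0.0-0.8] speaker_0: Hello world.
// [1.0-1.6] speaker_1: Hi there.
```

If the response already contains `utterances`, reading the turns from that array is simpler; grouping by word is mainly useful when you need custom turn boundaries.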
### Real-Time WebSocket API Response
```json
{
"session_id": "sess_12345abcde",
"transcript": "Hello world. Hi there.",
"is_final": true,
"is_last": false,
"language": "en",
"words": [
{
"word": "Hello",
"start": 0.0,
"end": 0.4,
"confidence": 0.98,
"speaker": 0,
"speaker_confidence": 0.95
},
{
"word": "world.",
"start": 0.4,
"end": 0.8,
"confidence": 0.97,
"speaker": 0,
"speaker_confidence": 0.92
},
{
"word": "Hi",
"start": 1.0,
"end": 1.2,
"confidence": 0.99,
"speaker": 1,
"speaker_confidence": 0.88
},
{
"word": "there.",
"start": 1.2,
"end": 1.6,
"confidence": 0.96,
"speaker": 1,
"speaker_confidence": 0.91
}
],
"utterances": [
{
"text": "Hello world.",
"start": 0.0,
"end": 0.8,
"speaker": 0
},
{
"text": "Hi there.",
"start": 1.0,
"end": 1.6,
"speaker": 1
}
]
}
```
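Because the real-time response includes `speaker_confidence`, a client can flag words whose speaker assignment looks uncertain. The following is a minimal sketch under stated assumptions: `lowConfidenceWords` and the `0.9` threshold are hypothetical, and it assumes each WebSocket message is a JSON string shaped like the sample above. Non-final messages are skipped here as a design choice, not because the API requires it.

```javascript
// Hypothetical threshold; tune it for your own audio and use case.
const MIN_SPEAKER_CONFIDENCE = 0.9;

// Return the words in a final message whose speaker assignment falls
// below the threshold. `rawMessage` is assumed to be a JSON string
// shaped like the sample real-time response.
function lowConfidenceWords(rawMessage, threshold = MIN_SPEAKER_CONFIDENCE) {
  const msg = JSON.parse(rawMessage);
  if (!msg.is_final || !Array.isArray(msg.words)) {
    return [];
  }
  return msg.words
    .filter((w) => typeof w.speaker_confidence === "number" && w.speaker_confidence < threshold)
    .map((w) => ({ word: w.word, speaker: w.speaker, speaker_confidence: w.speaker_confidence }));
}

// A trimmed copy of the sample response, serialized as it would arrive.
const sampleMessage = JSON.stringify({
  is_final: true,
  words: [
    { word: "Hello", speaker: 0, speaker_confidence: 0.95 },
    { word: "world.", speaker: 0, speaker_confidence: 0.92 },
    { word: "Hi", speaker: 1, speaker_confidence: 0.88 },
    { word: "there.", speaker: 1, speaker_confidence: 0.91 },
  ],
});

console.log(lowConfidenceWords(sampleMessage));
// [ { word: 'Hi', speaker: 1, speaker_confidence: 0.88 } ]
```

With the `ws` package, a handler such as `ws.on("message", (data) => lowConfidenceWords(data.toString()))` would apply the same check to each incoming message.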
## Response Fields
| Field | Type | When Included | Description |
| --- | --- | --- | --- |
| `speaker` | integer (real-time) / string (pre-recorded) | `diarize=true` | Speaker label. The real-time API uses integer IDs (`0`, `1`, ...); the pre-recorded API uses string labels (`speaker_0`, `speaker_1`, ...). |
| `speaker_confidence` | number | `diarize=true` (real-time only) | Confidence score for the speaker assignment (0.0 to 1.0). |
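
Since the two APIs use different types for `speaker`, code that consumes both may want a single representation. One possible approach, sketched here with a hypothetical `speakerKey` helper, is to convert the real-time integer IDs into the `speaker_N` string form used by the pre-recorded API.

```javascript
// Normalize a `speaker` value from either API into a "speaker_N" string.
// Hypothetical helper: integers (real-time) are converted, strings
// (pre-recorded) pass through, anything else yields null.
function speakerKey(speaker) {
  if (Number.isInteger(speaker)) {
    return `speaker_${speaker}`;
  }
  if (typeof speaker === "string") {
    return speaker;
  }
  return null; // field missing, e.g. diarization was not enabled
}

console.log(speakerKey(0));           // speaker_0
console.log(speakerKey("speaker_1")); // speaker_1
console.log(speakerKey(undefined));   // null
```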