Speaker diarization

Enabling speaker diarization

Pre-Recorded API

Pass diarize=true when calling the Pulse STT POST endpoint. The parameter can be combined with other enrichment options (timestamps, emotions, etc.) without changing your audio payload.

curl --request POST \
  --url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&diarize=true" \
  --header "Authorization: Bearer $SMALLEST_API_KEY" \
  --header "Content-Type: audio/wav" \
  --data-binary "@/path/to/audio.wav"
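
For callers working from Node.js rather than the shell, the same request might be sketched as follows. This is a minimal sketch, not an official client: `buildPulseUrl` and `transcribeAudio` are hypothetical helper names, the global `fetch` assumes Node.js 18 or later, and the JSON response body is assumed from the sample response in this page.

```javascript
// Sketch only: mirrors the curl request above.
// buildPulseUrl and transcribeAudio are hypothetical helpers, not part of any SDK.
const PULSE_URL = "https://waves-api.smallest.ai/api/v1/pulse/get_text";

// Assemble the query string; diarize=true turns speaker labels on.
function buildPulseUrl({ model = "pulse", language = "en", diarize = true } = {}) {
  const url = new URL(PULSE_URL);
  url.searchParams.set("model", model);
  url.searchParams.set("language", language);
  url.searchParams.set("diarize", String(diarize));
  return url.toString();
}

// POST raw WAV bytes (Buffer or Uint8Array) and return the parsed JSON body.
async function transcribeAudio(audioBytes, apiKey, options) {
  const response = await fetch(buildPulseUrl(options), {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "audio/wav",
    },
    body: audioBytes,
  });
  if (!response.ok) {
    throw new Error(`Pulse STT request failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Keeping the URL construction in its own function makes it easy to add other query parameters later without touching the upload logic.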

Real-Time WebSocket API

Add diarize=true to the query parameters when you open the Pulse STT WebSocket connection.

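// Assumes the Node.js "ws" package: the browser WebSocket constructor
// does not accept custom headers.
import WebSocket from "ws";

// Read the key from the environment (same variable as the curl example).
const API_KEY = process.env.SMALLEST_API_KEY;

const url = new URL("wss://waves-api.smallest.ai/api/v1/pulse/get_text");
url.searchParams.append("language", "en");
url.searchParams.append("encoding", "linear16");
url.searchParams.append("sample_rate", "16000");
url.searchParams.append("diarize", "true");

const ws = new WebSocket(url.toString(), {
  headers: {
    Authorization: `Bearer ${API_KEY}`,
  },
});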

Output format & field of interest

When diarization is enabled, every entry in words includes a speaker field. The real-time API returns an integer ID (0, 1, …) together with a speaker_confidence score (0.0 to 1.0); the pre-recorded API returns a string label (speaker_0, speaker_1, …) and no confidence score. The utterances array also carries speaker labels, so you can reconstruct conversations, build turn-taking analytics, or display multi-speaker captions.

Pre-Recorded API

Sample request

curl --request POST \
  --url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&diarize=true" \
  --header "Authorization: Bearer $SMALLEST_API_KEY" \
  --header "Content-Type: audio/wav" \
  --data-binary "@/path/to/two-speaker.wav"

Sample response

Pre-Recorded API Response

{
  "transcription": "Agent: Hello world. Customer: Hi there.",
  "words": [
    { "start": 0.0, "end": 0.4, "speaker": "speaker_0", "word": "Hello" },
    { "start": 0.4, "end": 0.8, "speaker": "speaker_0", "word": "world." },
    { "start": 1.0, "end": 1.2, "speaker": "speaker_1", "word": "Hi" },
    { "start": 1.2, "end": 1.6, "speaker": "speaker_1", "word": "there." }
  ],
  "utterances": [
    { "text": "Hello world.", "start": 0.0, "end": 0.8, "speaker": "speaker_0" },
    { "text": "Hi there.", "start": 1.0, "end": 1.6, "speaker": "speaker_1" }
  ]
}
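
To illustrate how the utterances array can be used to reconstruct a conversation, here is a minimal sketch that renders a pre-recorded response as a captioned transcript. `formatTranscript` and the optional `names` mapping are hypothetical; only the `utterances`, `speaker`, `start`, and `text` fields come from the sample response.

```javascript
// Sketch only: formatTranscript is a hypothetical helper.
// `names` optionally maps diarization labels to display names.
function formatTranscript(response, names = {}) {
  return (response.utterances ?? [])
    .map((utterance) => {
      // Fall back to the raw label when no display name is supplied.
      const who = names[utterance.speaker] ?? utterance.speaker;
      return `[${utterance.start.toFixed(1)}s] ${who}: ${utterance.text}`;
    })
    .join("\n");
}

// Usage with the utterances from the sample response.
const sampleResponse = {
  utterances: [
    { text: "Hello world.", start: 0.0, end: 0.8, speaker: "speaker_0" },
    { text: "Hi there.", start: 1.0, end: 1.6, speaker: "speaker_1" },
  ],
};
console.log(formatTranscript(sampleResponse, { speaker_0: "Agent" }));
```

Labels without an entry in `names` are printed unchanged, so the function works before you know who each speaker is.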

Real-Time WebSocket API Response

{
  "session_id": "sess_12345abcde",
  "transcript": "Hello world. Hi there.",
  "is_final": true,
  "is_last": false,
  "language": "en",
  "words": [
    {
      "word": "Hello",
      "start": 0.0,
      "end": 0.4,
      "confidence": 0.98,
      "speaker": 0,
      "speaker_confidence": 0.95
    },
    {
      "word": "world.",
      "start": 0.4,
      "end": 0.8,
      "confidence": 0.97,
      "speaker": 0,
      "speaker_confidence": 0.92
    },
    {
      "word": "Hi",
      "start": 1.0,
      "end": 1.2,
      "confidence": 0.99,
      "speaker": 1,
      "speaker_confidence": 0.88
    },
    {
      "word": "there.",
      "start": 1.2,
      "end": 1.6,
      "confidence": 0.96,
      "speaker": 1,
      "speaker_confidence": 0.91
    }
  ],
  "utterances": [
    {
      "text": "Hello world.",
      "start": 0.0,
      "end": 0.8,
      "speaker": 0
    },
    {
      "text": "Hi there.",
      "start": 1.0,
      "end": 1.6,
      "speaker": 1
    }
  ]
}
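
As one possible take on turn-taking analytics, the sketch below totals spoken time per speaker from a real-time message and collects words whose speaker assignment falls below a confidence threshold. `summarizeSpeakers` is a hypothetical helper, and the 0.9 default threshold is an arbitrary illustration rather than a documented recommendation.

```javascript
// Sketch only: summarizeSpeakers is a hypothetical helper.
// The 0.9 default threshold is an arbitrary choice for illustration.
function summarizeSpeakers(message, minSpeakerConfidence = 0.9) {
  const talkTime = new Map(); // speaker ID -> seconds of speech
  const uncertain = [];       // words with a weak speaker assignment

  for (const word of message.words ?? []) {
    const duration = word.end - word.start;
    talkTime.set(word.speaker, (talkTime.get(word.speaker) ?? 0) + duration);
    if (word.speaker_confidence < minSpeakerConfidence) {
      uncertain.push(word);
    }
  }
  return { talkTime, uncertain };
}

// Usage with the words from the sample real-time response.
const sampleMessage = {
  words: [
    { word: "Hello", start: 0.0, end: 0.4, speaker: 0, speaker_confidence: 0.95 },
    { word: "world.", start: 0.4, end: 0.8, speaker: 0, speaker_confidence: 0.92 },
    { word: "Hi", start: 1.0, end: 1.2, speaker: 1, speaker_confidence: 0.88 },
    { word: "there.", start: 1.2, end: 1.6, speaker: 1, speaker_confidence: 0.91 },
  ],
};
const summary = summarizeSpeakers(sampleMessage);
console.log(summary.talkTime, summary.uncertain.map((w) => w.word));
```

Durations are summed as floating-point seconds, so compare them with a small tolerance rather than exact equality.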

Response Fields

| Field | Type | When Included | Description |
| --- | --- | --- | --- |
| speaker | integer (real-time) / string (pre-recorded) | diarize=true | Speaker label. The real-time API uses integer IDs (0, 1, …); the pre-recorded API uses string labels (speaker_0, speaker_1, …). |
| speaker_confidence | number | diarize=true (real-time only) | Confidence score for the speaker assignment (0.0 to 1.0). |
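
Because the speaker field has a different type in each API, code that consumes both may want a single representation. The following is a minimal sketch under that assumption: `normalizeSpeaker` and `normalizeWords` are hypothetical helpers, and mapping integer `N` to `speaker_N` is a convention chosen here to match the pre-recorded labels, not documented API behavior.

```javascript
// Sketch only: hypothetical helpers for handling both response shapes.
// Mapping integer N to "speaker_N" is an assumed convention, not API behavior.
function normalizeSpeaker(speaker) {
  if (Number.isInteger(speaker)) {
    return `speaker_${speaker}`; // real-time API: 0, 1, ...
  }
  if (typeof speaker === "string") {
    return speaker; // pre-recorded API: "speaker_0", "speaker_1", ...
  }
  return null; // field absent, e.g. diarization was not requested
}

// Return copies of the word entries with a string speaker label.
function normalizeWords(words) {
  return words.map((word) => ({ ...word, speaker: normalizeSpeaker(word.speaker) }));
}
```

The helpers return new objects instead of mutating the response, so the original payload stays available for logging or debugging.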