Word timestamps

Word timestamps provide start and end times for each word in the transcription. Use these offsets to generate captions and subtitle tracks, to sync transcripts with audio playback, or to align transcripts with downstream analytics.

Enabling Word Timestamps

Pre-Recorded API

Add word_timestamps=true to your query parameters. Pulse Pro accepts raw-byte uploads only (Content-Type: application/octet-stream); Pulse also accepts JSON requests with a hosted audio URL.

Sample request

# Download sample audio (or use your own file)
curl -sL -o audio.wav "https://github.com/smallest-inc/cookbook/raw/main/speech-to-text/getting-started/samples/audio.wav"

curl --request POST \
  --url "https://api.smallest.ai/waves/v1/stt/?model=pulse-pro&language=en&word_timestamps=true" \
  --header "Authorization: Bearer $SMALLEST_API_KEY" \
  --header "Content-Type: application/octet-stream" \
  --data-binary "@audio.wav"
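The same request can be sketched in JavaScript for Node.js 18 or later, where fetch is built in. The endpoint, query parameters, and headers are taken from the curl sample above; the helper names buildSttUrl and transcribeFile are hypothetical, and the error handling is an assumption rather than documented API behavior.

```javascript
// Sketch: upload raw audio bytes to the pre-recorded endpoint from Node.js 18+.
// buildSttUrl and transcribeFile are hypothetical helper names, not part of any SDK.
const STT_ENDPOINT = "https://api.smallest.ai/waves/v1/stt/";

// Build the request URL with word_timestamps enabled by default.
function buildSttUrl({ model = "pulse-pro", language = "en", wordTimestamps = true } = {}) {
  const url = new URL(STT_ENDPOINT);
  url.searchParams.set("model", model);
  url.searchParams.set("language", language);
  url.searchParams.set("word_timestamps", String(wordTimestamps));
  return url.toString();
}

async function transcribeFile(path, apiKey, options) {
  // Dynamic import keeps the snippet usable from both CommonJS and ES modules.
  const { readFile } = await import("node:fs/promises");
  const audio = await readFile(path);

  const response = await fetch(buildSttUrl(options), {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/octet-stream",
    },
    body: audio,
  });

  // Assumption: any non-2xx status is treated as a failure.
  if (!response.ok) {
    throw new Error(`STT request failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Calling transcribeFile("audio.wav", process.env.SMALLEST_API_KEY) should resolve to the parsed JSON response, including the words array when the request succeeds.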

Real-Time WebSocket API

Add word_timestamps=true to your WebSocket connection query parameters when connecting to the Pulse STT WebSocket API.

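// The headers option is not available on the browser WebSocket;
// this example assumes the Node.js "ws" package.
import WebSocket from "ws";

// Read the key from the same environment variable used in the curl sample.
const API_KEY = process.env.SMALLEST_API_KEY;

const url = new URL("wss://api.smallest.ai/waves/v1/stt/live?model=pulse");
url.searchParams.append("language", "en");
url.searchParams.append("encoding", "linear16");
url.searchParams.append("sample_rate", "16000");
url.searchParams.append("word_timestamps", "true");

const ws = new WebSocket(url.toString(), {
  headers: {
    Authorization: `Bearer ${API_KEY}`,
  },
});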

Output format and fields of interest

Responses include a words array with word, start, end, and confidence fields. When diarization is enabled, the array also includes speaker (integer ID for realtime, string label for pre-recorded) and speaker_confidence (0.0 to 1.0, realtime only) fields.

Pre-Recorded API Response

The Pulse Pro response contains a words array with per-word timing and confidence, and no utterances field:

{
  "status": "success",
  "transcription": "Hello world.",
  "words": [
    { "word": "Hello", "start": 0.0, "end": 0.5, "confidence": 0.98 },
    { "word": "world.", "start": 0.6, "end": 0.9, "confidence": 0.96 }
  ],
  "language": "en",
  "metadata": { "duration": 0.9, "processing_time_ms": 42.5, "rtfx": 21.2, "num_chunks": 1 },
  "request_id": "8c355f4d-bd45-48ee-aa83-d00e4670f6bb"
}
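Because a Pulse Pro response has no utterances field, captions have to be built from the words array directly. The following is a minimal sketch that groups words into SRT cues; the function names and the grouping thresholds (maxWords, maxGap) are illustrative assumptions, not values recommended by the API.

```javascript
// Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n, width = 2) => String(n).padStart(width, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Group words into cues. A new cue starts when the current cue already holds
// maxWords words, or when the silence before the next word exceeds maxGap seconds.
// Both thresholds are hypothetical defaults chosen for illustration.
function wordsToSrt(words, { maxWords = 7, maxGap = 0.8 } = {}) {
  const cues = [];
  let current = [];
  for (const w of words) {
    const prev = current[current.length - 1];
    if (prev && (current.length >= maxWords || w.start - prev.end > maxGap)) {
      cues.push(current);
      current = [];
    }
    current.push(w);
  }
  if (current.length > 0) cues.push(current);

  // Each SRT cue is: index, time range, text, then a blank line.
  return cues
    .map((cue, i) => {
      const start = toSrtTime(cue[0].start);
      const end = toSrtTime(cue[cue.length - 1].end);
      const text = cue.map((w) => w.word).join(" ");
      return `${i + 1}\n${start} --> ${end}\n${text}\n`;
    })
    .join("\n");
}
```

Passing the words array from the response above to wordsToSrt yields a single cue spanning 00:00:00,000 to 00:00:00,900 with the text "Hello world.".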

The Pulse response also includes an utterances array with sentence-level timestamps:

{
  "status": "success",
  "transcription": "Hello world.",
  "words": [
    { "start": 0.0, "end": 0.5, "speaker": "speaker_0", "word": "Hello" },
    { "start": 0.6, "end": 0.9, "speaker": "speaker_0", "word": "world." }
  ],
  "utterances": [
    { "text": "Hello world.", "start": 0.0, "end": 0.9, "speaker": "speaker_0" }
  ]
}
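When utterances are present, each one maps naturally onto a subtitle cue. The sketch below converts the utterances array into a WebVTT document, using the optional WebVTT voice tag for the speaker label; the function names are hypothetical, and it assumes every utterance carries text, start, and end as in the sample above.

```javascript
// Format a time in seconds as a WebVTT timestamp: HH:MM:SS.mmm
function toVttTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n, width = 2) => String(n).padStart(width, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)}.${pad(ms, 3)}`;
}

// Build a WebVTT document with one cue per utterance.
// The speaker, when present, is emitted as a WebVTT voice tag: <v name>text
function utterancesToVtt(utterances) {
  const cues = utterances.map((u) => {
    const voice = u.speaker ? `<v ${u.speaker}>` : "";
    return `${toVttTime(u.start)} --> ${toVttTime(u.end)}\n${voice}${u.text}`;
  });
  // The "WEBVTT" header and each cue are separated by blank lines.
  return ["WEBVTT", ...cues].join("\n\n") + "\n";
}
```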

Real-Time WebSocket API Response

{
  "type": "transcription",
  "status": "success",
  "session_id": "00000000-0000-0000-0000-000000000001",
  "transcript": "Hello, how are you?",
  "is_final": true,
  "is_last": false,
  "language": "en",
  "words": [
    {
      "word": "Hello",
      "start": 0.0,
      "end": 0.5,
      "confidence": 0.98
    },
    {
      "word": "how",
      "start": 0.6,
      "end": 0.8,
      "confidence": 0.95
    },
    {
      "word": "are",
      "start": 0.8,
      "end": 1.0,
      "confidence": 0.97
    },
    {
      "word": "you?",
      "start": 1.0,
      "end": 1.3,
      "confidence": 0.99
    }
  ]
}

When diarize=true is set, each entry in the words array also includes speaker (integer ID) and speaker_confidence (0.0 to 1.0) fields.
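A real-time client typically needs to accumulate words as messages arrive. The following sketch parses each message and keeps the words from final results only. It assumes that messages with is_final set to false are interim results that a later final message replaces; that behavior is an assumption here, and createWordCollector is a hypothetical helper name.

```javascript
// Collect per-word timestamps from real-time transcription messages.
function createWordCollector() {
  const finalWords = [];
  return {
    // Feed each raw WebSocket message (string or Buffer) into this method.
    // Returns true when the message contributed words, false otherwise.
    handleMessage(raw) {
      const message = JSON.parse(raw.toString());
      // Assumption: only final transcription messages carry stable timings.
      if (message.type !== "transcription" || !message.is_final) return false;
      for (const w of message.words ?? []) finalWords.push(w);
      return true;
    },
    // Return a copy so callers cannot mutate the internal buffer.
    words() {
      return finalWords.slice();
    },
  };
}
```

With the Node.js ws package, the collector could be wired up as ws.on("message", (data) => collector.handleMessage(data)).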

Response Fields

| Field | Type | When Included | Description |
| --- | --- | --- | --- |
| word | string | word_timestamps=true | The transcribed word |
| start | number | word_timestamps=true | Start time in seconds |
| end | number | word_timestamps=true | End time in seconds |
| confidence | number | word_timestamps=true | Confidence score for the word (0.0 to 1.0). Returned by both pre-recorded and realtime. |
| speaker | integer (realtime) / string (pre-recorded) | diarize=true | Speaker label. Real-time API uses integer IDs (0, 1, …), pre-recorded API uses string labels (speaker_0, speaker_1, …) |
| speaker_confidence | number | diarize=true (realtime only) | Confidence score for the speaker assignment (0.0 to 1.0) |
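Since the speaker field is an integer in real-time responses and a string in pre-recorded ones, code that handles both may want a single representation. A small sketch, assuming the integer ID n corresponds to the label speaker_n (the source documents both formats but not a mapping between them); speakerLabel is a hypothetical helper name.

```javascript
// Return a string speaker label for a word from either API, or null when
// the word carries no speaker (for example, when diarization is disabled).
function speakerLabel(word) {
  if (word.speaker === undefined || word.speaker === null) return null;
  // Assumption: real-time integer ID n corresponds to the label "speaker_n".
  return typeof word.speaker === "number" ? `speaker_${word.speaker}` : String(word.speaker);
}
```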

Use Cases

  • Caption generation: Create synchronized captions for video or live streams
  • Subtitle tracks: Generate SRT or VTT subtitle files
  • Analytics: Align transcripts with audio playback for detailed analysis
  • Search: Enable time-based search within audio content
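As an illustration of the search use case, the sketch below finds every occurrence of a phrase in a words array and returns the time range of each match. The normalization (lowercasing and stripping punctuation) is an assumption made so that "you" matches "you?"; normalizeToken and findPhrase are hypothetical helper names.

```javascript
// Lowercase a token and drop everything except letters, digits, and apostrophes.
function normalizeToken(token) {
  return token.toLowerCase().replace(/[^\p{L}\p{N}']/gu, "");
}

// Return the { start, end } range, in seconds, of each occurrence of phrase.
function findPhrase(words, phrase) {
  const target = phrase.split(/\s+/).map(normalizeToken).filter(Boolean);
  const tokens = words.map((w) => normalizeToken(w.word));
  const hits = [];
  if (target.length === 0) return hits;

  // Slide a window of target.length words across the transcript.
  for (let i = 0; i + target.length <= tokens.length; i++) {
    if (target.every((t, j) => tokens[i + j] === t)) {
      hits.push({ start: words[i].start, end: words[i + target.length - 1].end });
    }
  }
  return hits;
}
```

For the real-time sample response above, findPhrase(words, "how are") returns one match from 0.6 to 1.0 seconds, which a player could use to seek to that position.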