Sentence-level timestamps

Sentence-level timestamps (utterances) are supported in both the Pre-Recorded and Real-Time transcription APIs. The utterances array aggregates contiguous words into sentence-level segments, giving you structured timing information for longer stretches of audio.

Enabling sentence-level timestamps

Pre-Recorded API

For the Pre-Recorded API, set word_timestamps=true in your query parameters. When word timestamps are enabled, the response includes both words and utterances arrays.

Required dependency: word_timestamps=true must be enabled for utterances to appear in the response. Without it, the utterances array will be empty. For the Real-Time API, also set sentence_timestamps=true.

# Download sample audio (or use your own file)
curl -sL -o audio.wav "https://github.com/smallest-inc/cookbook/raw/main/speech-to-text/getting-started/samples/audio.wav"

curl --request POST \
  --url "https://api.smallest.ai/waves/v1/pulse/get_text?language=en&word_timestamps=true&diarize=true" \
  --header "Authorization: Bearer $SMALLEST_API_KEY" \
  --header "Content-Type: audio/wav" \
  --data-binary "@audio.wav"
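
The same request can be sketched in JavaScript. This is a minimal illustration rather than an official client: buildPulseUrl and fetchUtterances are hypothetical helper names, the endpoint and headers are copied from the curl command above, and the global fetch is assumed to be available (Node.js 18 or later).

```javascript
// Endpoint taken from the curl example above.
const PULSE_URL = "https://api.smallest.ai/waves/v1/pulse/get_text";

// Hypothetical helper: build the request URL.
// word_timestamps=true is always set because utterances depend on it;
// diarize is optional and adds the speaker field.
function buildPulseUrl({ language = "en", diarize = false } = {}) {
  const url = new URL(PULSE_URL);
  url.searchParams.set("language", language);
  url.searchParams.set("word_timestamps", "true");
  if (diarize) url.searchParams.set("diarize", "true");
  return url.toString();
}

// Hypothetical helper: POST raw WAV bytes and return the utterances array.
// audioBytes is a Buffer or Uint8Array, for example the result of fs.readFile.
async function fetchUtterances(audioBytes, apiKey, options) {
  const response = await fetch(buildPulseUrl(options), {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "audio/wav",
    },
    body: audioBytes,
  });
  if (!response.ok) {
    throw new Error(`Transcription request failed: HTTP ${response.status}`);
  }
  const result = await response.json();
  // Fall back to an empty list if the field is missing.
  return result.utterances ?? [];
}
```

Keeping the URL construction in its own function makes the word_timestamps dependency hard to forget and easy to check without a network call.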

Real-Time API (WebSocket)

For the Real-Time WebSocket API, set sentence_timestamps=true as a query parameter when establishing the WebSocket connection.

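// The headers option below is supported by the Node.js "ws" package,
// not by the browser's built-in WebSocket.
const WebSocket = require("ws");

// Read the key from the same environment variable the curl example uses.
const API_KEY = process.env.SMALLEST_API_KEY;

const url = new URL("wss://api.smallest.ai/waves/v1/pulse/get_text");
url.searchParams.append("language", "en");
url.searchParams.append("sentence_timestamps", "true");

const ws = new WebSocket(url.toString(), {
  headers: {
    Authorization: `Bearer ${API_KEY}`,
  },
});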

Output format

Each utterances entry contains text, start, and end fields, plus an optional speaker field when diarization is enabled. Use these sentence-level timestamps when you need to display readable captions, synchronize larger chunks of audio, or store structured call summaries.
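
As an illustration of the caption use case, the sketch below turns an utterances array into WebVTT text. It assumes start and end are expressed in seconds, which the sample values suggest but this page does not state outright; formatVttTime and utterancesToVtt are hypothetical helper names, not part of any SDK.

```javascript
// Convert seconds (assumed unit) into a WebVTT timestamp, HH:MM:SS.mmm.
function formatVttTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n, width) => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms, 3)}`;
}

// Build one WebVTT cue per utterance. When a speaker field is present,
// it is emitted as a WebVTT voice span (<v name>).
function utterancesToVtt(utterances) {
  const cues = utterances.map((u) => {
    const voice = u.speaker !== undefined ? `<v ${u.speaker}>` : "";
    return `${formatVttTime(u.start)} --> ${formatVttTime(u.end)}\n${voice}${u.text}`;
  });
  return ["WEBVTT", ...cues].join("\n\n") + "\n";
}

// Example input in the shape described above.
const captions = utterancesToVtt([
  { text: "Hello world.", start: 0.0, end: 0.9 },
  { text: "How are you?", start: 1.0, end: 2.1 },
]);
console.log(captions);
```

Rounding to whole milliseconds before splitting the value avoids floating-point artifacts such as 0.9 * 1000 not being exactly 900.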

Sample response

Pre-Recorded API

{
  "status": "success",
  "transcription": "Hello world. How are you?",
  "words": {...},
  "utterances": [
    { "text": "Hello world.", "start": 0.0, "end": 0.9, "speaker": "speaker_0" },
    { "text": "How are you?", "start": 1.0, "end": 2.1, "speaker": "speaker_1" }
  ]
}

This response includes the speaker field because diarize=true was set in the query.
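
For structured call summaries, diarized utterances are often easier to read when consecutive sentences from the same speaker are merged into one turn. The sketch below shows one way to do that; mergeSpeakerTurns is a hypothetical helper, and the third utterance in the sample input is made up for illustration.

```javascript
// Merge consecutive utterances that share a speaker into a single turn.
// Each turn keeps the first start time and the last end time.
function mergeSpeakerTurns(utterances) {
  const turns = [];
  for (const u of utterances) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === u.speaker) {
      last.text += " " + u.text;
      last.end = u.end;
    } else {
      turns.push({ speaker: u.speaker, text: u.text, start: u.start, end: u.end });
    }
  }
  return turns;
}

// Illustrative input: same shape as the utterances array above,
// with an invented third entry so that a merge actually happens.
const sampleUtterances = [
  { text: "Hello world.", start: 0.0, end: 0.9, speaker: "speaker_0" },
  { text: "How are you?", start: 1.0, end: 2.1, speaker: "speaker_0" },
  { text: "Fine, thanks.", start: 2.4, end: 3.2, speaker: "speaker_1" },
];

for (const turn of mergeSpeakerTurns(sampleUtterances)) {
  console.log(`[${turn.start}-${turn.end}] ${turn.speaker}: ${turn.text}`);
}
```

The function copies each utterance into a new object before extending it, so the original response data is left unchanged.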

Real-Time API (WebSocket)

{
  "session_id": "sess_12345abcde",
  "transcript": "Hello world. How are you?",
  "is_final": true,
  "is_last": false,
  "language": "en",
  "utterances": [
    { "text": "Hello world.", "start": 0.0, "end": 0.9 },
    { "text": "How are you?", "start": 1.0, "end": 2.1 }
  ]
}

When diarize=true is set, each entry in the utterances array of a Real-Time API response also includes a speaker field (an integer ID). For example: { "text": "Hello world.", "start": 0.0, "end": 0.9, "speaker": 0 }
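
A message handler for the Real-Time API might extract utterances along these lines. This is a sketch under assumptions: it keeps segments only from messages where is_final is true, on the guess that earlier messages may still change (this page does not spell that out), and finalUtterancesFrom is a hypothetical helper name.

```javascript
// Parse one raw WebSocket message and return its utterances,
// or an empty array if the message is not final or has no utterances.
function finalUtterancesFrom(rawMessage) {
  const message = JSON.parse(rawMessage);
  if (message.is_final !== true || !Array.isArray(message.utterances)) {
    return [];
  }
  return message.utterances.map((u) => ({
    text: u.text,
    start: u.start,
    end: u.end,
    // speaker is an integer ID when diarize=true, otherwise absent.
    speaker: u.speaker ?? null,
  }));
}

// Example with a message shaped like the sample response above.
const rawSample = JSON.stringify({
  session_id: "sess_12345abcde",
  transcript: "Hello world. How are you?",
  is_final: true,
  is_last: false,
  language: "en",
  utterances: [
    { text: "Hello world.", start: 0.0, end: 0.9, speaker: 0 },
    { text: "How are you?", start: 1.0, end: 2.1 },
  ],
});
console.log(finalUtterancesFrom(rawSample));
```

With the Node.js ws package, this could be wired up as ws.on("message", (data) => segments.push(...finalUtterancesFrom(data.toString()))), where segments is an array you maintain for the session. The ?? operator is used instead of || so that a speaker ID of 0 is preserved.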