Language detection

Enabling language detection

Set the language query parameter to a regional aggregator when calling the API. Pulse will auto-detect the spoken language across the codes covered by that aggregator:

  • multi-eu — European set
  • multi-indic — Indic set
  • multi-asian — East Asian set (streaming + US region only)

Pick the aggregator that matches the audio you expect; multi-eu is the safer default for unknown European audio. When you already know the source language, pass the explicit code (en, hi, es, etc.) for the best accuracy.

For the exact codes inside each aggregator and the region notes, see the Pulse model card, which is the single source of truth for which languages are supported on streaming vs. batch.
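The choice between an explicit code and an aggregator can be sketched as a small URL builder. This is only an illustration: `buildPulseUrl` and its option names are hypothetical, the endpoints are the ones used in the examples on this page, and the region restriction on multi-asian is not checked because the region is not visible from the URL.

```javascript
// Aggregator names come from the list above; the helper itself is hypothetical.
const AGGREGATORS = new Set(["multi-eu", "multi-indic", "multi-asian"]);

function buildPulseUrl({ knownLanguage, aggregator = "multi-eu", realtime = false } = {}) {
  if (!knownLanguage) {
    if (!AGGREGATORS.has(aggregator)) {
      throw new Error(`Unknown aggregator: ${aggregator}`);
    }
    // multi-asian is documented as streaming only (the US-region rule is not checked here).
    if (aggregator === "multi-asian" && !realtime) {
      throw new Error("multi-asian is only available on the streaming API");
    }
  }
  const base = realtime
    ? "wss://api.smallest.ai/waves/v1/stt/live"
    : "https://api.smallest.ai/waves/v1/stt/";
  const url = new URL(base);
  url.searchParams.set("model", "pulse");
  // Prefer the explicit code when the source language is already known.
  url.searchParams.set("language", knownLanguage || aggregator);
  return url.toString();
}
```

Calling `buildPulseUrl()` with no arguments falls back to multi-eu on the pre-recorded endpoint, while `buildPulseUrl({ knownLanguage: "hi", realtime: true })` skips detection and targets the WebSocket endpoint.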

Pre-Recorded API

# Download sample audio (or use your own file)
curl -sL -o audio.wav "https://github.com/smallest-inc/cookbook/raw/main/speech-to-text/getting-started/samples/audio.wav"

curl --request POST \
  --url "https://api.smallest.ai/waves/v1/stt/?model=pulse&language=multi-eu&word_timestamps=true" \
  --header "Authorization: Bearer $SMALLEST_API_KEY" \
  --header "Content-Type: audio/wav" \
  --data-binary "@audio.wav"

Real-Time WebSocket API

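// Assumes Node.js with the "ws" package; the browser WebSocket does not accept a headers option.
import WebSocket from "ws";

// Assumes the key is exported as SMALLEST_API_KEY, as in the curl example.
const API_KEY = process.env.SMALLEST_API_KEY;

const url = new URL("wss://api.smallest.ai/waves/v1/stt/live?model=pulse");
url.searchParams.append("language", "multi-eu");
url.searchParams.append("encoding", "linear16");
url.searchParams.append("sample_rate", "16000");

const ws = new WebSocket(url.toString(), {
  headers: {
    Authorization: `Bearer ${API_KEY}`,
  },
});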

Output format & fields of interest

When language detection is enabled, the transcription field (transcript in real-time responses) and the words and utterances arrays are emitted in the detected language. The response includes a language field with the detected primary language code; real-time responses with is_final=true also include a languages array listing all detected languages. To keep track of the locale in your app, store the language parameter you supplied (for auditing) and inspect downstream metadata, such as subtitles or captions, that inherit the localized transcript.
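One way to keep the supplied parameter next to whatever the response reports is a small audit record. The sketch below is illustrative only: `toAuditRecord` and the record's property names are hypothetical, and it reads only the response fields named in this section, treating language and languages as optional.

```javascript
// Hypothetical helper: pairs the language parameter you sent with the fields
// this page describes (transcription / transcript, language, languages).
function toAuditRecord(requestedLanguage, response) {
  return {
    requestedLanguage,
    // language may be absent (for example on non-final real-time messages).
    detectedLanguage: response.language ?? null,
    detectedLanguages: Array.isArray(response.languages) ? response.languages : [],
    // Pre-recorded responses use "transcription"; real-time responses use "transcript".
    text: response.transcription ?? response.transcript ?? "",
  };
}
```

The same function accepts both response shapes, so a single storage path can serve the pre-recorded and real-time APIs.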

Sample response

Pre-Recorded API Response

{
  "status": "success",
  "transcription": "Hola mundo.",
  "words": [
    { "start": 0.0, "end": 0.4, "word": "Hola" },
    { "start": 0.5, "end": 0.9, "word": "mundo." }
  ],
  "utterances": [
    { "text": "Hola mundo.", "start": 0.0, "end": 0.9 }
  ]
}

Real-Time WebSocket API Response

{
  "session_id": "sess_12345abcde",
  "transcript": "Hola mundo.",
  "is_final": true,
  "is_last": false,
  "language": "es",
  "languages": ["es"]
}

The language field is only returned when is_final=true in real-time API responses. The languages array lists all languages detected in the audio and is also only included when is_final=true.
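Because the language fields appear only on final messages, a real-time client needs to ignore interim results when tracking the detected language. A minimal sketch, assuming messages shaped like the sample above; `createLanguageTracker` is a hypothetical helper, not part of any SDK.

```javascript
// Hypothetical tracker: remembers the latest primary language and the set of
// all languages reported on final messages.
function createLanguageTracker() {
  let primary = null;
  const seen = new Set();
  return {
    // Accepts a JSON string or an already parsed message object.
    handleMessage(raw) {
      const msg = typeof raw === "string" ? JSON.parse(raw) : raw;
      if (msg.is_final !== true) return; // interim results carry no language fields
      if (typeof msg.language === "string") primary = msg.language;
      for (const code of msg.languages ?? []) seen.add(code);
    },
    get primary() {
      return primary;
    },
    get languages() {
      return [...seen];
    },
  };
}
```

With the Node.js ws package, this could be wired up as `ws.on("message", (data) => tracker.handleMessage(data.toString()))`.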