Overview | Smallest AI Docs

The Speech-to-Text API transcribes audio via the unified endpoint https://api.smallest.ai/waves/v1/stt/. The model is selected via the ?model= query parameter:

?model=pulse: multilingual, supports streaming and pre-recorded transcription.
?model=pulse-pro: leaderboard-ranked English accuracy (5.42% ESB average WER, tied for #2 on the public Open ASR Leaderboard). Pre-recorded HTTP only; no streaming worker yet.

Live streaming runs over WebSocket at /waves/v1/stt/live?model=pulse. Requesting Pulse Pro on the live path returns a 400 with a directive to use HTTP instead.
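The model and transport rules above can be sketched as a small client-side helper. This is a minimal illustration rather than part of any official SDK: the function name stt_target and the constants are hypothetical, and only the endpoint, the live path, and the two model names come from this page. The live variant returns just the path and query, since the WebSocket host depends on the region.

```python
from urllib.parse import urlencode

# Values taken from this page.
HTTP_ENDPOINT = "https://api.smallest.ai/waves/v1/stt/"
LIVE_PATH = "/waves/v1/stt/live"
MODELS = {"pulse", "pulse-pro"}
STREAMING_MODELS = {"pulse"}  # pulse-pro has no streaming worker yet


def stt_target(model: str, live: bool = False) -> str:
    """Return the HTTP URL, or the WebSocket path and query, for a model.

    Hypothetical helper: it rejects pulse-pro on the live path locally,
    mirroring the 400 the service returns for that combination.
    """
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if live and model not in STREAMING_MODELS:
        raise ValueError(f"{model} is pre-recorded HTTP only; use {HTTP_ENDPOINT}")
    query = urlencode({"model": model})
    if live:
        return f"{LIVE_PATH}?{query}"
    return f"{HTTP_ENDPOINT}?{query}"
```

Failing early on the client avoids opening a WebSocket connection that the service would reject anyway.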

Pulse model card

Multilingual, streaming and non-streaming. 35 documented languages (21 on streaming, 26 on pre-recorded, with 12 available on both), plus regional aggregators.

Pulse Pro model card

English-only, leaderboard-ranked accuracy. Pre-recorded HTTP only.

Quickstart

Get started in minutes. Learn how to get your API key and transcribe your first audio file.

Transcription Modes

We offer two transcription modes to cover a wide range of use cases. Choose the one that best fits your needs:

Pre-Recorded

Transcribe audio files using synchronous HTTPS POST requests. Perfect for batch processing, archived media, and offline transcription workflows.

Real-Time

Stream audio and receive transcription results as the audio is processed. Ideal for live conversations, voice assistants, and low-latency applications.
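For the pre-recorded mode, a request could be assembled roughly as follows. This sketch only builds the request object and does not send it. The endpoint and the model query parameter come from this page; the Bearer authorization header, the raw-bytes body, and the audio/wav content type are assumptions made for illustration, so check the Pre-Recorded quickstart for the actual request format. The function name build_prerecorded_request is hypothetical.

```python
import urllib.request
from urllib.parse import urlencode

STT_ENDPOINT = "https://api.smallest.ai/waves/v1/stt/"  # from this page


def build_prerecorded_request(
    audio: bytes, api_key: str, model: str = "pulse"
) -> urllib.request.Request:
    """Build (but do not send) a synchronous HTTPS POST for one audio file."""
    query = urlencode({"model": model})
    return urllib.request.Request(
        url=f"{STT_ENDPOINT}?{query}",
        data=audio,  # assumption: the audio is sent as the raw request body
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumption: Bearer auth
            "Content-Type": "audio/wav",  # assumption: WAV input
        },
    )
```

The returned object can be passed to urllib.request.urlopen to perform the call. The response format is not described on this page, so parsing is left out.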

Feature highlights

Our models specialize in processing audio to preserve information that is often lost during conventional speech-to-text conversion.

35 Languages & Automatic Language Detection

35 documented languages across streaming and pre-recorded modes (21 on streaming, 26 on pre-recorded). Set language to a known code (en, hi, ta, etc.) for best accuracy. When the language is unknown, use a regional aggregator: multi-eu on pre-recorded, north_indic or multi-south-indic on streaming, and multi-asian on either mode (on the WebSocket surface, US region only). See the Pulse model card for the full list with regional notes.

Word Timestamps

Get precise timing information for each word in the transcription. Enables caption generation, subtitle tracks, and time-based search within audio content.

Sentence Timestamps (Utterances)

Receive sentence-level transcription segments with timing information. Perfect for displaying readable captions, synchronizing larger chunks of audio, or storing structured call summaries.

Diarization

Identify and separate generated text into speaker turns. Automatically label different speakers in multi-speaker audio, enabling speaker-attributed transcription.

Gender Detection

Detect the gender of each speaker alongside transcription. Provides demographic insights for analytics and content analysis.

Emotion Detection

Detect emotional tone in transcribed speech with strength indicators for 5 core emotion types. Analyze sentiment and emotional context in conversations.

PII & PCI Redaction

Automatically redact personally identifiable information (names, addresses, phone numbers) and payment card information (credit cards, CVV, account numbers) to protect privacy and ensure compliance.

Low Latency

Streaming pipeline tuned for a time to first transcript of about 64 ms. Optimized for real-time transcription with minimal delay.

Supported languages

The full per-mode language matrix lives on the Pulse model card. That page is the single source of truth; this page summarises the high-level shape.

Streaming (Real-Time, WebSocket) — 21 single-language codes + 3 regional aggregators:

en, hi, de, es, ru, it, fr, nl, pt, zh, yue, ja, ko, gu, mr, or, bn, ta, te, kn, ml

Regional aggregators:
north_indic: auto-detects across en/hi/gu/mr/bn/or.
multi-asian: auto-detects across zh/yue/ko/ja/en. US region only; contact sales for access in the India region.
multi-south-indic: auto-detects across ta/te/kn/ml plus English code-switching. India region only (wss://api.smallest.ai/...); the US endpoint returns LANGUAGE_NOT_ENABLED_IN_REGION.

Non-Streaming (Pre-Recorded, HTTP) — 26 single-language codes + 2 regional aggregators:

en, hi, de, es, ru, it, fr, nl, pt, uk, pl, cs, sk, lv, et, ro, fi, sv, bg, hu, da, lt, mt, zh, ja, ko

Regional aggregators:
multi-eu: auto-detects across all 21 European codes plus en.
multi-asian: auto-detects across zh/ko/ja/en.
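The two lists can be encoded as sets to check which mode accepts a given code before sending a request. The sketch below copies the codes from this page. The helper name modes_for is hypothetical, and the Pulse model card remains the authoritative list. The set arithmetic also shows how 21 streaming and 26 pre-recorded codes add up to 35 distinct languages.

```python
# Single-language codes, copied from the lists on this page.
STREAMING = {
    "en", "hi", "de", "es", "ru", "it", "fr", "nl", "pt", "zh", "yue",
    "ja", "ko", "gu", "mr", "or", "bn", "ta", "te", "kn", "ml",
}
PRERECORDED = {
    "en", "hi", "de", "es", "ru", "it", "fr", "nl", "pt", "uk", "pl", "cs",
    "sk", "lv", "et", "ro", "fi", "sv", "bg", "hu", "da", "lt", "mt", "zh",
    "ja", "ko",
}
# Regional aggregators per mode.
STREAMING_AGGREGATORS = {"north_indic", "multi-asian", "multi-south-indic"}
PRERECORDED_AGGREGATORS = {"multi-eu", "multi-asian"}


def modes_for(code: str) -> list[str]:
    """Return the transcription modes that accept a language or aggregator code."""
    modes = []
    if code in STREAMING or code in STREAMING_AGGREGATORS:
        modes.append("streaming")
    if code in PRERECORDED or code in PRERECORDED_AGGREGATORS:
        modes.append("pre-recorded")
    return modes


# 21 + 26 codes, 12 of them shared, give 35 distinct languages.
assert len(STREAMING | PRERECORDED) == 35
assert len(STREAMING & PRERECORDED) == 12
```

For example, modes_for("yue") yields only streaming, while modes_for("pl") yields only pre-recorded.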

East Asian streaming languages (zh, yue, ja, ko, multi-asian) are served from the US region only. Connect to wss://api.us.smallest.ai/... for these.

South Indian streaming languages (ta, te, kn, ml, multi-south-indic) are served from the India region only. Connect to wss://api.smallest.ai/... for these; the US endpoint rejects them with LANGUAGE_NOT_ENABLED_IN_REGION.
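The regional rules for streaming can be captured in a small lookup that picks a WebSocket host from the language code. This is a sketch under stated assumptions: the function name streaming_host is hypothetical, and the fallback host for languages without a regional restriction is an assumption, since this page does not say which region serves them. Only the host is returned, because the full connection URL is not spelled out here.

```python
US_HOST = "api.us.smallest.ai"  # US region, from this page
INDIA_HOST = "api.smallest.ai"  # India region, from this page

# Region-restricted streaming codes, copied from the notes above.
US_ONLY = {"zh", "yue", "ja", "ko", "multi-asian"}
INDIA_ONLY = {"ta", "te", "kn", "ml", "multi-south-indic"}


def streaming_host(language: str, default: str = INDIA_HOST) -> str:
    """Pick the WebSocket host for a streaming language code.

    The default for unrestricted languages is an assumption; pass the
    host for your own account's region instead if it differs.
    """
    if language in US_ONLY:
        return US_HOST
    if language in INDIA_ONLY:
        return INDIA_HOST
    return default
```

Routing on the client side avoids a round trip that would otherwise end in LANGUAGE_NOT_ENABLED_IN_REGION.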

Next steps

Send your first POST request in the Pulse STT Pre-Recorded quickstart.
Start your first WebSocket connection in the Pulse STT WebSocket quickstart.
See the Pulse model card for benchmarks, capabilities, and pricing.
Review best practices for audio preprocessing and request hygiene.
Use the troubleshooting guide when you need quick fixes.