Pulse STT

June 16, 2026

June 15, 2026

May 30, 2026

May 30, 2026

May 28, 2026

May 22, 2026

May 15, 2026

May 8, 2026

May 6, 2026

May 6, 2026


Pulse STT — streaming now supports ja / yue / zh / ko + multi-asian aggregator (US region)

The Pulse streaming Speech-to-Text API now supports four Asian languages: Japanese (ja), Cantonese (yue), Mandarin (zh), and Korean (ko). The multi-asian aggregator is also available for unknown East Asian audio — it auto-detects across the same four-language set.

These five language values are served from the US region only — connect to wss://api.us.smallest.ai/waves/v1/stt/live?model=pulse (or the legacy wss://api.us.smallest.ai/waves/v1/pulse/get_text) instead of the default wss://api.smallest.ai/... host. Requesting any of them on the default (ap-south-1) host closes the connection without a transcription frame.
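
The host selection described above can be sketched as a small URL builder. This is only an illustration: `build_stream_url` is a hypothetical helper, the `language` query parameter name and the `multi-asian` spelling of the aggregator value are assumptions, and reusing the `/waves/v1/stt/live` path on the default host is also an assumption.

```python
from urllib.parse import urlencode

# Language values that this entry says are served from the US region only.
# "multi-asian" as the literal aggregator value is an assumption.
US_ONLY_LANGUAGES = {"ja", "yue", "zh", "ko", "multi-asian"}

US_HOST = "api.us.smallest.ai"
DEFAULT_HOST = "api.smallest.ai"  # assumed host for every other language


def build_stream_url(language: str, model: str = "pulse") -> str:
    """Return a streaming WebSocket URL, choosing the host by language."""
    host = US_HOST if language in US_ONLY_LANGUAGES else DEFAULT_HOST
    # "language" as the query parameter name is an assumption.
    query = urlencode({"model": model, "language": language})
    return f"wss://{host}/waves/v1/stt/live?{query}"


print(build_stream_url("yue"))
print(build_stream_url("en"))
```

Routing by language in one place keeps a client from opening a session on the default host that would close without a transcription frame.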

All other Pulse streaming parameters work as before: punctuate, capitalize, numerals, word_timestamps, diarize, redact_pii, and redact_pci. Supported sample rates (8000, 16000, 22050, 24000, 44100, 48000) and encodings (linear16 by default, linear32, alaw, mulaw, opus, ogg_opus) are also unchanged.

Pre-recorded (batch) is unchanged. Cantonese (yue) is not enabled on the batch endpoint at all. Japanese, Mandarin, and Korean are not formally supported on batch and may behave differently there than on streaming, so use the streaming endpoint for all four languages.

Example response frame:

{
  "type": "transcription",
  "transcript": "讀書要從薄到厚再從厚到薄",
  "is_final": true,
  "is_last": false,
  "language": "zh"
}
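
A minimal sketch of how a client might consume frames shaped like the example above. It relies only on the fields shown in that frame; `collect_finals` is a hypothetical helper, and the interim frame in the sample input is made up for illustration.

```python
import json


def collect_finals(raw_frames):
    """Return (language, transcript) pairs for frames marked is_final."""
    finals = []
    for raw in raw_frames:
        frame = json.loads(raw)
        # Skip anything that is not a transcription frame or is still interim.
        if frame.get("type") != "transcription" or not frame.get("is_final"):
            continue
        finals.append((frame.get("language"), frame["transcript"]))
    return finals


frames = [
    # Illustrative interim frame (not taken from the changelog).
    '{"type": "transcription", "transcript": "讀書", "is_final": false, "is_last": false, "language": "zh"}',
    # The final frame from the example above.
    '{"type": "transcription", "transcript": "讀書要從薄到厚再從厚到薄", "is_final": true, "is_last": false, "language": "zh"}',
]
print(collect_finals(frames))
```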

→ Pulse model card — Supported Languages
→ Speech-to-Text overview

Streaming vs Pre-Recorded language framing across Pulse docs

The Pulse STT supported-languages story is now consistent end-to-end. Each surface labels its mode and points readers at the right neighbour:

  • Overview page (/waves/documentation/speech-to-text-pulse/overview): split the single Supported Languages table into two — Streaming (Real-Time, WebSocket) and Non-Streaming (Pre-Recorded, HTTP). Same 39-language set in both, matching the Pulse model card.
  • API Reference — Pre-Recorded (HTTP): language query param now declares an explicit enum of 39 single-language codes (en, hi, es, …) plus multi-eu / multi-indic aggregators. Description labels the endpoint as Pre-Recorded and points at the streaming WebSocket endpoint for the live mode.
  • API Reference — Streaming (WebSocket): same enum and labelling, applied via the AsyncAPI override so it renders correctly. The page-top Note labels it as Streaming and points at the HTTP endpoint for pre-recorded.

No protocol or wire-level change — purely a docs/spec consolidation.

Pulse STT WebSocket — finalize operation restored on the API reference

The Pulse STT WebSocket API reference now correctly renders both control-message operations on docs.smallest.ai:

  • sendFinalize — payload {"type":"finalize"} — flushes the current audio buffer and emits an is_final: true transcript while keeping the session open. Useful for per-turn finalization in agentic pipelines.
  • sendCloseStream — payload {"type":"close_stream"} — flushes any buffered audio, emits the terminal is_final: true + is_last: true transcript, then closes the session.
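
The two payloads above are small enough to build by hand. A sketch, with hypothetical helper names, that also shows how a client might recognize the terminal transcript:

```python
import json


def finalize_message() -> str:
    """Turn boundary: flush the buffer and keep the session open."""
    return json.dumps({"type": "finalize"}, separators=(",", ":"))


def close_stream_message() -> str:
    """Session end: flush, emit the terminal transcript, then close."""
    return json.dumps({"type": "close_stream"}, separators=(",", ":"))


def is_terminal(frame: dict) -> bool:
    """True for the last transcript of a session (is_final and is_last)."""
    return bool(frame.get("is_final")) and bool(frame.get("is_last"))


print(finalize_message())
print(close_stream_message())
```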

No wire change — the server has accepted both control messages since launch (StreamControlType.FINALIZE and STREAM_CONTROL_TYPE_END are sibling enum values in the internal gRPC schema, and the WS controller has explicit handlers for both parsed.type === "finalize" and parsed.type === "close_stream"). Only the rendered documentation was incomplete.

Root cause for the doc reviewers who want to know: the v4 docs override (fern/apis/waves-v4/overrides/pulse-stt-ws-overrides.yml) and the SDK override (fern/apis/waves/asyncapi/pulse-stt-ws-overrides.yml) both used unprefixed message keys (audioData.message, finalizeSignal.message, …) while the base spec used a pulse* prefix (pulseAudioData.message, pulseFinalizeSignal.message, …). Fern’s merge silently dropped one of the three send operations when it couldn’t resolve refs cleanly. Convention going forward: override message keys must be identical to the base spec’s keys — every other Waves spec layer (TTS WS, Lightning v3.1 WS) already follows this rule.

Migration: nothing for customers. The server-side contract has always allowed both control messages; this is purely a docs render fix.

Also clarified: the ITN feature page’s “Recommended Setup for Agentic Use Cases” section had been recommending close_stream per utterance, which is correct for single-shot transcription but misleading for the multi-turn voice agents the section is named after. Split the recommendation into two paths: finalize per user turn (session stays open, lowest latency between turns) vs close_stream at end of session (terminal). The Python example was split into two snippets that match these two patterns directly.

Also clarified: the finalize and close_stream control messages are now documented as a turn-boundary signal and a session-end signal, respectively, on the API reference (operation summaries) and on the ITN feature page. The base-spec operation summaries used to read “Flush current audio buffer” / “End the audio stream”, which was accurate but not actionable. They now say what each signal does to the session lifecycle: keep listening vs. hang up the socket.

CI lock-in: spec_drift_check.py was extended with an AsyncAPI override-key parity check. Any new override whose channels.<chan>.messages.<KEY> or operations.<KEY> doesn’t exist in the base will fail the gate. Existing deprecated-spec drift (Lightning v2, the legacy /streaming-tts/stream route) is allow-listed with a documented rationale and tracked separately. A new post-deploy smoke check (docs_render_smoke.py, wired into publish-docs.yml) re-fetches the rendered docs after every push to main and asserts every expected operation appears — the same check would have caught this exact bug the day it deployed.
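
The override-key parity idea can be sketched as follows. This is not the code in spec_drift_check.py. It is a minimal illustration that assumes both specs are already parsed into dictionaries, and the channel name `stt` in the sample data is hypothetical.

```python
def missing_override_keys(base: dict, override: dict) -> list:
    """List override message/operation keys that do not exist in the base spec."""
    problems = []
    # Check channels.<chan>.messages.<KEY>
    for chan, chan_body in override.get("channels", {}).items():
        base_msgs = base.get("channels", {}).get(chan, {}).get("messages", {})
        for key in chan_body.get("messages", {}):
            if key not in base_msgs:
                problems.append(f"channels.{chan}.messages.{key}")
    # Check operations.<KEY>
    for key in override.get("operations", {}):
        if key not in base.get("operations", {}):
            problems.append(f"operations.{key}")
    return problems


base = {
    "channels": {"stt": {"messages": {"pulseAudioData.message": {}, "pulseFinalizeSignal.message": {}}}},
    "operations": {"sendFinalize": {}},
}
# The override uses an unprefixed key, as in the bug described above.
override = {
    "channels": {"stt": {"messages": {"finalizeSignal.message": {}}}},
    "operations": {"sendFinalize": {}},
}
print(missing_override_keys(base, override))
```

A CI gate built on a check like this would fail whenever the returned list is non-empty.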

STT WebSocket — sendFinalize added to unified /stt/live API reference + doc fixes

The unified Speech-to-Text WebSocket API reference page now documents both control messages with their correct semantics:

  • sendFinalize — turn-boundary signal. Flushes the current audio buffer, runs ITN, and emits one is_final: true transcript for that turn. The WebSocket stays open for the next user turn. Use this once per turn in any multi-turn flow.
  • sendClose — session-end signal. Flushes remaining audio, emits the terminal is_final: true + is_last: true transcript, then closes the WebSocket. Use this once at the end of the session.

No wire change. The server has accepted both control messages on this endpoint since launch (confirmed via live probe and the platform repo’s pulse.asr.ws.controller.ts which has explicit handlers for both parsed.type === "finalize" and parsed.type === "close_stream"). Only the spec and docs were incomplete.

ITN feature page updated to match. The “Python — WebSocket with ITN” and “JavaScript — WebSocket with ITN” examples now clearly indicate they show the single-shot pattern, with a pointer to the multi-turn voice-agent example for any flow where the same WebSocket handles multiple user turns. Sending close_stream per turn forces a WebSocket reconnect on every next turn — that’s a documented anti-pattern for voice agents and adds hundreds of ms of connection overhead per turn.

Spec-level cleanup:

  • Removed the spurious is_last: true field from the close_stream client payload schema. is_last is a server-emitted response field; it’s meaningless in the client-sent control message and the server ignores it.
  • Added from_finalize, transcript, full_transcript, language, and languages fields to the TranscriptionEvent response schema (they were missing from the spec despite being emitted by the server — verified against pulse.asr.schema.ts’s lightningAsrWebsocketResponseDtoSchema).

Migration: none for customers. Any code that already sends close_stream per turn will keep working but pays a reconnect cost on every turn. Switch to finalize per turn plus a single close_stream at end of session to cut latency on every turn after the first by 200–800 ms.
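
The recommended message order for a multi-turn session can be sketched without a network connection. Here `send` stands in for whatever a WebSocket client exposes for sending frames, and sending audio as binary frames is an assumption. A real client would also read the is_final transcript after each finalize before starting the next turn.

```python
import json
from typing import Callable, Iterable, List


def run_session(send: Callable[[object], None], turns: Iterable[List[bytes]]) -> None:
    """Stream each turn's audio, finalize per turn, close once at the end."""
    for chunks in turns:
        for chunk in chunks:
            send(chunk)  # audio frame (assumed binary)
        # Turn boundary: the socket stays open for the next turn.
        send(json.dumps({"type": "finalize"}))
    # Session end: sent exactly once, after the last turn.
    send(json.dumps({"type": "close_stream"}))


# Record what would be sent for a two-turn session.
sent = []
run_session(sent.append, [[b"\x00\x01"], [b"\x02\x03", b"\x04\x05"]])
control = [m for m in sent if isinstance(m, str)]
print(control)
```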

Unified Speech-to-Text endpoint, Pulse Pro model

The Speech-to-Text API now lives at the unified path /waves/v1/stt/, mirroring the unified TTS shape. The model is selected via the ?model= query parameter. Two models are live today:

  • ?model=pulse: multilingual (17 streaming + 26 pre-recorded languages), HTTP + WebSocket streaming.
  • ?model=pulse-pro: leaderboard-ranked English STT (5.42% ESB avg WER, tied #2 on the public Open ASR Leaderboard). HTTP only.

Pulse Pro on the streaming endpoint (WS /waves/v1/stt/live?model=pulse-pro) is rejected with 400 before WebSocket upgrade because the streaming worker is not yet deployed. Use the HTTP endpoint and pass webhook_url for long files.
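
A sketch of model selection on the unified path that fails fast on the unsupported Pulse Pro streaming combination instead of waiting for the 400. The host, the `https` scheme, and the helper name `stt_url` are assumptions for illustration.

```python
from urllib.parse import urlencode

HOST = "api.smallest.ai"  # assumed host
STREAMING_MODELS = {"pulse"}  # pulse-pro is HTTP only per this entry
HTTP_MODELS = {"pulse", "pulse-pro"}


def stt_url(model: str, streaming: bool = False) -> str:
    """Build a unified STT URL, rejecting unsupported combinations early."""
    query = urlencode({"model": model})
    if streaming:
        if model not in STREAMING_MODELS:
            # The server would reject this with 400 before the WebSocket upgrade.
            raise ValueError(f"{model} is not available on the streaming endpoint")
        return f"wss://{HOST}/waves/v1/stt/live?{query}"
    if model not in HTTP_MODELS:
        raise ValueError(f"unknown model: {model}")
    return f"https://{HOST}/waves/v1/stt/?{query}"


print(stt_url("pulse", streaming=True))
print(stt_url("pulse-pro"))
```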

Customer pricing (Standard plan):

  • Pulse, streaming (WebSocket): $0.006 / minute
  • Pulse, non-streaming (HTTP): $0.0035 / minute
  • Pulse Pro, non-streaming (HTTP): $0.004 / minute
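
For rough budgeting, the listed rates can be turned into a small estimator. This sketch assumes purely linear per-minute billing with no rounding, minimums, or volume discounts, none of which this entry specifies.

```python
from decimal import Decimal

# USD per minute on the Standard plan, as listed above.
RATES = {
    ("pulse", "streaming"): Decimal("0.006"),
    ("pulse", "http"): Decimal("0.0035"),
    ("pulse-pro", "http"): Decimal("0.004"),
}


def estimate_cost(model: str, mode: str, minutes) -> Decimal:
    """Estimated USD cost, assuming linear per-minute billing."""
    return RATES[(model, mode)] * Decimal(str(minutes))


print(estimate_cost("pulse", "streaming", 1000))
print(estimate_cost("pulse", "http", 1000))
print(estimate_cost("pulse-pro", "http", 1000))
```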

Standard plan rate limits default to 25 RPM per model and 100 concurrent WebSocket sessions. Enterprise is unlimited and configurable per-customer.

The existing endpoints (POST /waves/v1/pulse/get_text and WS /waves/v1/pulse/get_text) continue to work alongside the new unified path. New integrations are encouraged to use /waves/v1/stt/ since it carries both models behind one path.

  • Pulse Pro model card
  • Speech-to-Text quickstart covers both models.

Pulse STT — full feature parity over HTTP and WebSocket

The full Pulse STT feature set is now available on both HTTP (pre-recorded) and WebSocket (realtime) modes with consistent flag names and response shapes.

What this means in practice:

  • The same query parameters and request body fields work identically on POST /waves/v1/pulse/get_text (pre-recorded) and wss://api.smallest.ai/waves/v1/pulse/get_text (realtime).
  • Speaker diarization, word timestamps, sentence-level utterances, emotion detection, gender detection, keyword boosting, redaction, punctuation, and inverse text normalization all behave identically across modes.
  • The full per-feature documentation is on the Features pages.

Migration: no action — additive. Existing integrations on either mode continue working.

Pulse STT — numerals query parameter removed from the WS API reference

The legacy numerals query parameter on the Pulse STT WebSocket has been removed from the API reference, navigation, and feature pages. The 2026-05-06 deprecation note on this changelog feed flagged this for two weeks; today’s clean-up drops it from the documented surface area.

What changed in the docs

  • numerals removed from the Pulse WS AsyncAPI spec + v4 override.
  • The “Numeric formatting” feature page and its nav entry are gone.
  • The “Convert numerals” cards on realtime/features.mdx and pre-recorded/features.mdx are gone.
  • The ITN feature page no longer lists numerals in its precedence table.

What changed on the server

Nothing in this release. The numerals query parameter still works as before — historic callers passing numerals=true will continue to get the same behavior. This is purely a docs cleanup; future deprecation/removal at the server layer will get its own changelog entry when it happens.

What to use instead

For new integrations, pass itn_normalize=true on the WebSocket connection. ITN covers digits as well as dates, currencies, phone numbers, and other spoken-form entities, and gives more consistent results across languages. See the Inverse Text Normalization page for details.
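
A sketch of switching an existing connection URL from numerals to itn_normalize. `prefer_itn` is a hypothetical helper, and the `language` parameter in the sample URL is an assumption. Because ITN formats more than digits, transcripts may differ from what numerals produced, so it is worth comparing output before switching.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def prefer_itn(ws_url: str) -> str:
    """Drop `numerals` from a WebSocket URL and set itn_normalize=true."""
    parts = urlsplit(ws_url)
    # Keep every parameter except the two this helper manages.
    params = [
        (k, v)
        for k, v in parse_qsl(parts.query)
        if k not in ("numerals", "itn_normalize")
    ]
    params.append(("itn_normalize", "true"))
    return urlunsplit(parts._replace(query=urlencode(params)))


print(prefer_itn("wss://api.smallest.ai/waves/v1/pulse/get_text?language=en&numerals=true"))
```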

Pulse STT — encoding query param now documented on the pre-recorded REST endpoint

The pre-recorded REST endpoint (POST /waves/v1/pulse/get_text) now lists the encoding query parameter alongside the streaming WebSocket. Same 6-value enum: linear16, linear32, alaw, mulaw, opus, ogg_opus. Default is linear16.

When omitted, the server falls back to detecting the format from the file’s container header (works for .wav, .mp3, .flac, .ogg, .m4a, .webm).
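
One way a client might decide whether to send encoding at all is sketched below. The policy is an assumption, not documented behavior: omit the parameter for the container formats listed above, and send the documented default explicitly for anything else.

```python
import os

# The 6-value enum listed above.
ENCODINGS = {"linear16", "linear32", "alaw", "mulaw", "opus", "ogg_opus"}
# Container formats the entry says the server can detect from the header.
CONTAINER_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".m4a", ".webm"}


def encoding_params(filename: str, encoding=None) -> dict:
    """Return query params for `encoding`, or {} to rely on server detection."""
    if encoding is not None:
        if encoding not in ENCODINGS:
            raise ValueError(f"unsupported encoding: {encoding}")
        return {"encoding": encoding}
    ext = os.path.splitext(filename)[1].lower()
    if ext in CONTAINER_EXTENSIONS:
        return {}  # let the server detect the format from the container header
    # Headerless audio: state the documented default explicitly (client-side choice).
    return {"encoding": "linear16"}


print(encoding_params("call.wav"))
print(encoding_params("call.raw"))
print(encoding_params("call.raw", "mulaw"))
```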

Pulse STT — age_detection removed from the pre-recorded HTTP API

The age_detection query parameter and the corresponding top-level age field in the response have been removed from the Pulse STT pre-recorded HTTP API (POST /waves/v1/pulse/get_text). Gender detection (gender_detection / gender) and emotion detection (emotion_detection / emotions) are unaffected.

Specs and reference docs updated:

  • fern/apis/waves/openapi/pulse-stt-openapi.yaml — age_detection query param dropped; age response field and example value removed.
  • fern/products/waves/pages/v4.0.0/api-references/pulse-stt.mdx (+ versions mirror) — cURL/Python/JavaScript samples for both raw-bytes and audio-URL methods no longer pass age_detection.
  • fern/products/waves/pages/v4.0.0/speech-to-text/pre-recorded/code-examples.mdx (+ versions mirror) — Python end-to-end sample no longer requests or prints age.
  • fern/products/waves/pages/v4.0.0/speech-to-text/features/age-and-gender-detection.mdx (+ versions mirror) — page retitled to Gender detection and trimmed to gender-only content. The file path is unchanged so existing /features/age-and-gender-detection links keep resolving.
  • fern/products/waves/pages/v4.0.0/integrations/n8n.mdx, speech-to-text/overview.mdx, speech-to-text/pre-recorded/features.mdx, speech-to-text/model-cards/pulse.mdx, and the STT benchmarks metrics-overview.mdx — surrounding tables, accordions, and feature cards updated to drop age references.
  • fern/products/waves/versions/v4.0.0.yml — sidebar entry retitled to Gender Detection.

If your code passes age_detection=true or reads response.age, drop both — the parameter is now ignored and the field will not be returned. No other Pulse STT request shape or response field changes.
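
A minimal sketch of the client-side cleanup. Both helpers are hypothetical, and the response is treated as an already-decoded JSON object.

```python
def drop_age(params: dict) -> dict:
    """Return a copy of the request params without age_detection."""
    return {k: v for k, v in params.items() if k != "age_detection"}


def summarize(response: dict) -> dict:
    """Read only fields that are still returned; do not index response['age']."""
    return {
        "gender": response.get("gender"),
        "emotions": response.get("emotions"),
    }


params = {"gender_detection": "true", "age_detection": "true"}
print(drop_age(params))
```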

→ Gender detection

Pulse STT — recommend itn_normalize over numerals for new integrations

The numerals query parameter on the Pulse STT WebSocket API still works and continues to behave as documented. For new integrations we now recommend itn_normalize=true instead — it covers digits as well as dates, currencies, phone numbers, and other spoken-form entities, and gives more consistent results across languages.

Existing code that uses numerals does not need to change.

→ Inverse Text Normalization