Inverse Text Normalization (ITN)
Inverse Text Normalization automatically converts spoken numbers, dates, currencies, and other entities into their written equivalents. When enabled, ITN runs as a post-processing step on every finalized transcript — no changes to your audio pipeline required.
Recommended Setup for Agentic Use Cases
Since ITN normalizes only finalized transcripts, it is recommended that you control finalization from your end for the best results:
- Set finalize_on_words=false so the server does not finalize internally based on word count.
- Set eou_timeout_ms to match your VAD (Voice Activity Detection) silence threshold.
- When your VAD detects the user has stopped speaking, send the control message that matches the lifecycle of your session:
  - Turn-boundary signal: {"type": "finalize"}. Flushes the current audio buffer, runs ITN over the full utterance, and emits one is_final: true transcript for that turn. The WebSocket stays open and accepts audio for the next turn. Send this once per user turn in any multi-turn flow.
  - Session-end signal: {"type": "close_stream"}. Flushes any remaining buffered audio, emits the terminal is_final: true + is_last: true transcript, then closes the WebSocket. Send this once at the actual end of the session: end of call, app shutdown, or after the buffer of a single-shot transcription is fully streamed.
- A multi-turn voice agent typically fires many finalize messages and exactly one close_stream per session. A one-off transcription of a fixed audio buffer fires only close_stream after the last chunk.
This approach gives you clean, fully-normalized utterances for downstream LLM processing without paying the WebSocket-reconnect cost between turns.
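The choice between the two control messages can be sketched as a small mapping from a VAD or session event to the message to send. The `VadEvent` enum and `control_message_for` function are hypothetical names introduced for illustration; only the two JSON payloads come from this page.

```python
import json
from enum import Enum


class VadEvent(Enum):
    """Hypothetical events your own VAD / call logic might raise."""
    SPEECH_ENDED = "speech_ended"    # user stopped talking; more turns may follow
    SESSION_ENDED = "session_ended"  # call ended, app shutdown, or buffer fully streamed


def control_message_for(event: VadEvent) -> str:
    """Return the JSON text message to send for the given event."""
    if event is VadEvent.SESSION_ENDED:
        # Terminal signal: flushes, emits is_last, and closes the WebSocket.
        return json.dumps({"type": "close_stream"})
    # Turn boundary: flushes and normalizes the turn, keeps the WebSocket open.
    return json.dumps({"type": "finalize"})
```

In a voice agent, `control_message_for(VadEvent.SPEECH_ENDED)` would be sent once per user turn and `control_message_for(VadEvent.SESSION_ENDED)` once at hang-up.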
Enabling ITN
Pass itn_normalize=true as a query parameter when connecting:
ITN is disabled by default. When disabled, transcripts are returned in spoken form as usual.
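As a rough sketch, the connection URL for the recommended agentic setup could be assembled like this. The host and path in `BASE_URL` are placeholders, not the real endpoint, and authentication is omitted; only the three query parameter names are taken from this page.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): substitute the real WebSocket URL.
BASE_URL = "wss://example.invalid/stream"


def build_stream_url(vad_silence_ms: int, base_url: str = BASE_URL) -> str:
    """Build a connection URL with ITN on and server-side word finalization off."""
    params = {
        "itn_normalize": "true",                 # ITN is off unless requested
        "finalize_on_words": "false",            # the client controls finalization
        "eou_timeout_ms": str(vad_silence_ms),   # match the VAD silence threshold
    }
    return f"{base_url}?{urlencode(params)}"
```

For example, a VAD configured with a 700 ms silence threshold would call `build_stream_url(700)`.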
Parameters
These parameters can be combined with itn_normalize to control transcription behavior:
Supported Semiotic Classes
ITN covers all standard semiotic classes:
Finalize Control
You can force-finalize the current utterance at any time by sending the JSON text message {"type": "finalize"} over the WebSocket.
This promotes all accumulated tokens to a final transcript immediately, without closing the stream. The stream stays open for subsequent audio.
This is especially useful with finalize_on_words=false, where automatic finalization is disabled and you control exactly when each utterance boundary occurs. Common use cases:
- Form filling — Finalize after each form field to get a clean ITN result per field
- Voice commands — Finalize on button release or keyword detection
- Turn-based conversations — Finalize when the other party starts speaking
The server responds with a final transcript for the pending chunk:
If there are no pending tokens when you send finalize, you still get a response (empty transcript) so your request/response indexing stays in sync.
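Because every finalize request gets a response, even an empty one, a client can pair requests with responses purely by order. The sketch below shows one way to do that for the form-filling case. `FinalizeTracker` is a hypothetical helper, and it assumes finalize_on_words=false so that every final transcript corresponds to a finalize you sent, with responses arriving in request order.

```python
from collections import deque
from typing import Deque, Tuple


class FinalizeTracker:
    """Pairs each finalize request with the next final transcript, in order."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()

    def sent(self, label: str) -> None:
        """Record that a finalize was sent for `label` (e.g. a form field name)."""
        self._pending.append(label)

    def received(self, transcript: str) -> Tuple[str, str]:
        """Match a final transcript (possibly empty) to the oldest pending label."""
        label = self._pending.popleft()
        return label, transcript
```

An empty transcript still consumes one pending label, so a silent field does not shift later results onto the wrong field.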
End Stream
To close the stream and flush remaining audio, send {"type": "close_stream"}.
This flushes any remaining audio, returns the final transcript with is_last: true, and closes the connection. { "type": "finalize" } does not close the stream — it only forces an immediate is_final=true transcript while keeping the session open.
Examples
The first two examples below, Python WebSocket and JavaScript WebSocket, show the single-shot pattern (transcribe a fixed audio buffer, then close the session). For a multi-turn voice agent that handles many user turns on the same WebSocket, see the multi-turn voice agent example further down: it sends {"type": "finalize"} per turn (the session stays open) and {"type": "close_stream"} once at the end of the call. Sending close_stream per turn forces a WebSocket reconnect on every turn, which is the wrong pattern for voice agents.
Python — WebSocket with ITN (single-shot file transcription)
JavaScript — WebSocket with ITN (single-shot file transcription)
Python — Multi-turn voice agent (recommended for Voice AI)
For voice agents that handle many user turns in a single session, send {"type": "finalize"} after each turn. The WebSocket stays open and you pay the connection cost only once per call:
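A minimal, transport-agnostic sketch of this pattern follows. It assumes a socket-like object with blocking `send` and `recv` methods (most WebSocket clients offer something similar) and JSON responses that carry the is_final and is_last flags described on this page. `SocketLike`, `wait_for`, and `run_session` are illustrative names, and a real agent would stream audio in real time rather than from pre-collected chunks.

```python
import json
from typing import Iterable, List, Protocol


class SocketLike(Protocol):
    """Minimal interface assumed by this sketch."""
    def send(self, data) -> None: ...
    def recv(self) -> str: ...


def wait_for(ws: SocketLike, flag: str) -> dict:
    """Read messages until one has `flag` set to true; skip interim results."""
    while True:
        message = json.loads(ws.recv())
        if message.get(flag):
            return message


def run_session(ws: SocketLike, turns: Iterable[Iterable[bytes]]) -> List[dict]:
    """Send each turn's audio, finalize per turn, and close the stream once."""
    finals = []
    for chunks in turns:
        for chunk in chunks:
            ws.send(chunk)                           # audio frames for this turn
        ws.send(json.dumps({"type": "finalize"}))    # turn boundary, socket stays open
        finals.append(wait_for(ws, "is_final"))      # normalized transcript for the turn
    ws.send(json.dumps({"type": "close_stream"}))    # exactly once per session
    finals.append(wait_for(ws, "is_last"))           # terminal transcript
    return finals
```

The connection is opened once, before `run_session` is called, so the per-turn cost is a single small JSON message rather than a reconnect.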
Python — Single-shot transcription
For one-off transcription of a complete audio buffer (file or single utterance) with no further audio coming, use close_stream directly. It flushes, normalizes, emits is_last: true, and closes — no extra round-trip:
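The sending side of the single-shot pattern can be sketched without committing to a particular WebSocket client: split the buffer into chunks, send them, then send close_stream. The `send` callable stands in for your client's send method, and the default chunk size of 3200 bytes (100 ms of 16 kHz, 16-bit mono audio) is an assumption, not a documented requirement.

```python
import json
from typing import Callable, Iterator


def iter_chunks(audio: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield consecutive slices of `audio`, each at most `chunk_size` bytes."""
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]


def stream_once(send: Callable[[object], None], audio: bytes,
                chunk_size: int = 3200) -> int:
    """Send the whole buffer, then close_stream. Returns the number of audio chunks."""
    count = 0
    for chunk in iter_chunks(audio, chunk_size):
        send(chunk)
        count += 1
    # No finalize needed: close_stream flushes, normalizes, and closes in one step.
    send(json.dumps({"type": "close_stream"}))
    return count
```

After `stream_once` returns, the client keeps reading until it receives the message with is_last: true.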
Combining ITN with Other Features
ITN works alongside all other post-processing features:
Processing order: ITN → Profanity Filter → PII/PCI Redaction
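To illustrate the ordering only, the sketch below chains three toy stages in the stated sequence. The stage functions are deliberately trivial stand-ins invented for this example; they do not reflect how the real ITN, profanity filter, or redaction features are implemented.

```python
import re
from typing import Callable, Sequence

Stage = Callable[[str], str]


def toy_itn(text: str) -> str:
    """Stand-in for ITN: one hard-coded spoken-to-written rule."""
    return text.replace("four one one one", "4111")


def toy_profanity_filter(text: str) -> str:
    """Stand-in for the profanity filter: masks one sample word."""
    return re.sub(r"\bdarn\b", "****", text)


def toy_redaction(text: str) -> str:
    """Stand-in for PII/PCI redaction: masks any four-digit run."""
    return re.sub(r"\d{4}", "[REDACTED]", text)


# Same order as stated above: ITN, then profanity filter, then redaction.
PIPELINE: Sequence[Stage] = (toy_itn, toy_profanity_filter, toy_redaction)


def post_process(text: str, stages: Sequence[Stage] = PIPELINE) -> str:
    """Apply each stage to the output of the previous one."""
    for stage in stages:
        text = stage(text)
    return text
```

In this toy setup the redaction stage sees the digits produced by the ITN stage, because it runs last.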
Response Format
When ITN is enabled, final responses contain the normalized transcript:
Key behaviors:
- Word timestamps are remapped. When multiple spoken words collapse into one written token (e.g., “twenty five dollars” → “$25”), the output word spans the full time range of all source words and takes the max confidence.
- Punctuation is preserved. Periods, commas, and other punctuation from ASR output are stripped before ITN and reattached to the correct output token afterward.
- Interim responses are not normalized. ITN only runs on finalized transcripts (is_final: true) to avoid unnecessary processing on text that may still change.
- Capitalization is preserved. ITN runs in cased mode, so proper nouns and sentence-initial caps from the ASR model are maintained.
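The timestamp remapping rule for collapsed words can be written out as a short sketch. The `Word` dataclass and its field names are assumptions made for illustration; the actual response schema may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float       # seconds
    end: float         # seconds
    confidence: float


def merge_words(written: str, sources: List[Word]) -> Word:
    """Collapse several spoken words into one written token.

    The result spans the full time range of the sources and takes
    the maximum confidence among them.
    """
    return Word(
        text=written,
        start=min(w.start for w in sources),
        end=max(w.end for w in sources),
        confidence=max(w.confidence for w in sources),
    )
```

For the "twenty five dollars" example, the merged "$25" token starts where "twenty" starts and ends where "dollars" ends.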
How It Works
1. End-of-utterance detection: The ASR pipeline detects a natural pause or hits the max_words limit, producing a finalized transcript. Or you send {"type": "finalize"} to force it.
2. Punctuation stripping: Trailing punctuation (.,!?;:) is stripped from each word before normalization, since the underlying FST engine cannot parse through punctuation.
3. ITN normalization: The text is passed through a Weighted Finite State Transducer (WFST) that converts spoken-form entities to written form with word alignment tracking.
4. Punctuation reattachment: Stripped punctuation is mapped back to the correct output token using the word alignment from step 3.
5. Timestamp remapping: Word-level timestamps from ASR are remapped to the ITN output using the alignment, spanning collapsed words.
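The strip, normalize, and reattach steps can be illustrated with a toy pipeline. This is only a sketch: a one-entry lookup table stands in for the WFST, and punctuation is reattached from the last source word of each output token, which is a simplifying assumption rather than the documented rule.

```python
from typing import List, Tuple

PUNCT = ".,!?;:"
# Toy stand-in for the WFST grammar: spoken phrase -> written form.
TOY_RULES = {("twenty", "five", "dollars"): "$25"}


def strip_punct(words: List[str]) -> Tuple[List[str], List[str]]:
    """Split each word into its core and its trailing punctuation."""
    cores, tails = [], []
    for word in words:
        core = word.rstrip(PUNCT)
        cores.append(core)
        tails.append(word[len(core):])
    return cores, tails


def toy_normalize(cores: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return output tokens plus, for each one, the input indices it covers."""
    out, align, i = [], [], 0
    while i < len(cores):
        for phrase, written in TOY_RULES.items():
            n = len(phrase)
            if tuple(c.lower() for c in cores[i:i + n]) == phrase:
                out.append(written)
                align.append(list(range(i, i + n)))
                i += n
                break
        else:  # no rule matched at position i: pass the word through unchanged
            out.append(cores[i])
            align.append([i])
            i += 1
    return out, align


def normalize(text: str) -> str:
    """Strip punctuation, normalize, then reattach punctuation via the alignment."""
    cores, tails = strip_punct(text.split())
    out, align = toy_normalize(cores)
    return " ".join(tok + tails[src[-1]] for tok, src in zip(out, align))
```

The same alignment lists (`align`) are what a timestamp remapping step would use to find the source words behind each output token.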

