Inverse Text Normalization (ITN)
Inverse Text Normalization automatically converts spoken numbers, dates, currencies, and other entities into their written equivalents. When enabled, ITN runs as a post-processing step on every finalized transcript — no changes to your audio pipeline required.
Recommended Setup for Agentic Use Cases
Because ITN normalizes only finalized transcripts, you get the best results when you control finalization from the client:
- Set finalize_on_words=false so the server does not finalize internally based on word count
- Set eou_timeout_ms to match your VAD (Voice Activity Detection) silence threshold
- When your VAD detects that the user has stopped speaking, send {"type": "finalize"}. This finalizes the entire chunk, and ITN normalizes it as a whole
This approach is especially useful for agentic use cases where you want clean, fully-normalized utterances for downstream LLM processing.
Enabling ITN
Pass itn_normalize=true as a query parameter in the connection URL.
ITN is disabled by default. When disabled, transcripts are returned in spoken form as usual.
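As a minimal sketch of adding the flag to a connection URL: the endpoint below is a placeholder and not the service's real address, and the helper name with_query_params is hypothetical. Only the parameter name itn_normalize comes from this page.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Placeholder endpoint for illustration only; substitute your real connection URL.
BASE_URL = "wss://example.invalid/v1/listen"


def with_query_params(url: str, **params: object) -> str:
    """Return `url` with `params` merged into its query string."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    for key, value in params.items():
        # Booleans are written as lowercase "true"/"false", as in itn_normalize=true.
        query[key] = str(value).lower() if isinstance(value, bool) else str(value)
    return urlunsplit(parts._replace(query=urlencode(query)))


url = with_query_params(BASE_URL, itn_normalize=True)
print(url)  # wss://example.invalid/v1/listen?itn_normalize=true
```

The same helper can carry the other query parameters mentioned on this page, for example with_query_params(BASE_URL, itn_normalize=True, finalize_on_words=False).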
Parameters
These parameters can be combined with itn_normalize to control transcription behavior:
Supported Semiotic Classes
ITN covers all standard semiotic classes:
Finalize Control
You can force-finalize the current utterance at any time by sending the JSON text message {"type": "finalize"} over the WebSocket.
This promotes all accumulated tokens to a final transcript immediately, without closing the stream. The stream stays open for subsequent audio.
This is especially useful with finalize_on_words=false, where automatic finalization is disabled and you control exactly when each utterance boundary occurs. Common use cases:
- Form filling — Finalize after each form field to get a clean ITN result per field
- Voice commands — Finalize on button release or keyword detection
- Turn-based conversations — Finalize when the other party starts speaking
The server responds with a final transcript marked from_finalize: true.
If there are no pending tokens when you send finalize, you still get a response (empty transcript with from_finalize: true) so your request/response indexing stays in sync.
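Because every finalize request gets exactly one reply, a client can pair requests and responses with a simple queue. The sketch below is one way to do that. It assumes responses are JSON objects and that the text lives in a field named transcript (an assumed name); the from_finalize flag is the one described above. The FinalizeTracker class and its labels are hypothetical.

```python
import json
from collections import deque
from typing import Callable, Deque, Optional, Tuple

FINALIZE_MESSAGE = json.dumps({"type": "finalize"})


class FinalizeTracker:
    """Pairs each finalize request with the response it triggers, in order."""

    def __init__(self, send_text: Callable[[str], None]) -> None:
        self._send_text = send_text  # e.g. a WebSocket's text-send method
        self._pending: Deque[str] = deque()

    def finalize(self, label: str) -> None:
        """Send a finalize request and remember which label it belongs to."""
        self._pending.append(label)
        self._send_text(FINALIZE_MESSAGE)

    def on_message(self, raw: str) -> Optional[Tuple[str, str]]:
        """Return (label, transcript) for finalize responses, otherwise None."""
        msg = json.loads(raw)
        if not msg.get("from_finalize") or not self._pending:
            return None
        # "transcript" is an assumed field name; an empty string is still a valid reply.
        return self._pending.popleft(), msg.get("transcript", "")
```

In a form-filling flow, calling tracker.finalize("amount") after the user finishes a field lets the next from_finalize response be attributed to that field, even when the transcript comes back empty.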
End Stream
To close the stream and flush remaining audio, send:
This flushes any remaining audio, returns the final transcript with is_last: true, and closes the connection.
Examples
Python — WebSocket with ITN
JavaScript — WebSocket with ITN
Python — Agentic Setup (Recommended for Voice AI)
Disable internal word-count finalization (finalize_on_words=false), set eou_timeout_ms to match your VAD silence threshold, and send {"type": "finalize"} when your VAD detects end of speech. This finalizes the entire chunk, and ITN normalizes it as a whole.
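A minimal sketch of the client-side logic for this setup follows. It is transport-agnostic: send_text stands for whatever function sends a text frame on your WebSocket, and the VAD is assumed to produce one speech/silence decision per audio frame. The VadFinalizer class, the 700 ms threshold, and the frame-based interface are illustrative assumptions; only the parameter names and the finalize message come from this page.

```python
import json
from typing import Callable

# Example threshold only; use your own VAD's silence setting.
VAD_SILENCE_MS = 700

# Query parameters for the connection, as described in the setup above.
QUERY_PARAMS = {
    "itn_normalize": "true",
    "finalize_on_words": "false",
    "eou_timeout_ms": str(VAD_SILENCE_MS),
}


class VadFinalizer:
    """Sends one finalize request per utterance once silence reaches the threshold."""

    def __init__(self, send_text: Callable[[str], None], silence_ms: int = VAD_SILENCE_MS) -> None:
        self._send_text = send_text
        self._silence_ms = silence_ms
        self._silent_for = 0
        self._in_utterance = False

    def on_frame(self, is_speech: bool, frame_ms: int) -> bool:
        """Feed one VAD decision; returns True if a finalize request was sent."""
        if is_speech:
            self._in_utterance = True
            self._silent_for = 0
            return False
        if not self._in_utterance:
            return False  # silence before any speech: nothing to finalize
        self._silent_for += frame_ms
        if self._silent_for < self._silence_ms:
            return False
        self._send_text(json.dumps({"type": "finalize"}))
        self._in_utterance = False
        self._silent_for = 0
        return True
```

Tracking whether an utterance is in progress keeps the client from sending repeated finalize requests during a long silence.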
Combining ITN with Other Features
ITN works alongside all other post-processing features:
Processing order: ITN → Numerals → Profanity Filter → PII/PCI Redaction
Response Format
When ITN is enabled, final responses contain the normalized transcript:
Key behaviors:
- Word timestamps are remapped. When multiple spoken words collapse into one written token (e.g., “twenty five dollars” → “$25”), the output word spans the full time range of all source words and takes the max confidence.
- Punctuation is preserved. Periods, commas, and other punctuation from ASR output are stripped before ITN and reattached to the correct output token afterward.
- Interim responses are not normalized. ITN only runs on finalized transcripts (is_final: true) to avoid unnecessary processing on text that may still change.
- Capitalization is preserved. ITN runs in cased mode, so proper nouns and sentence-initial caps from the ASR model are maintained.
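The timestamp behavior described above can be illustrated with a small sketch. The Word structure and its field names are assumptions made for the example and do not reflect the actual response schema; the merging rule (full time span, maximum confidence) is the one stated in the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float  # seconds
    confidence: float


def merge_words(written: str, sources: List[Word]) -> Word:
    """Collapse several spoken words into one written token."""
    if not sources:
        raise ValueError("need at least one source word")
    return Word(
        text=written,
        start=min(w.start for w in sources),  # earliest start among the sources
        end=max(w.end for w in sources),  # latest end among the sources
        confidence=max(w.confidence for w in sources),  # max confidence wins
    )


spoken = [
    Word("twenty", 1.00, 1.30, 0.91),
    Word("five", 1.30, 1.55, 0.97),
    Word("dollars", 1.55, 2.00, 0.88),
]
merged = merge_words("$25", spoken)
print(merged)  # Word(text='$25', start=1.0, end=2.0, confidence=0.97)
```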
How It Works
1. End-of-utterance detection: The ASR pipeline detects a natural pause or hits the max_words limit, producing a finalized transcript. Alternatively, you send {"type": "finalize"} to force it.
2. Punctuation stripping: Trailing punctuation (.,!?;:) is stripped from each word before normalization, since the underlying FST engine cannot parse through punctuation.
3. ITN normalization: The text is passed through a Weighted Finite State Transducer (WFST) that converts spoken-form entities to written form with word alignment tracking.
4. Punctuation reattachment: Stripped punctuation is mapped back to the correct output token using the word alignment from step 3.
5. Timestamp remapping: Word-level timestamps from ASR are remapped to the ITN output using the alignment, spanning collapsed words.
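The punctuation and alignment steps above can be sketched as follows. This is an illustration only: a tiny phrase table stands in for the WFST and is not the real grammar, and all function names are hypothetical. It shows how an alignment from output tokens back to source word indices lets stripped punctuation land on the right output token.

```python
from typing import Dict, List, Tuple

TRAILING_PUNCT = ".,!?;:"

# Toy phrase table standing in for the WFST; NOT the real ITN grammar.
TOY_RULES: Dict[Tuple[str, ...], str] = {
    ("twenty", "five", "dollars"): "$25",
    ("three", "percent"): "3%",
}


def strip_punct(word: str) -> Tuple[str, str]:
    """Split a word into (core, trailing punctuation)."""
    core = word.rstrip(TRAILING_PUNCT)
    return core, word[len(core):]


def toy_normalize(words: List[str]) -> List[Tuple[str, List[int]]]:
    """Return (output token, source word indices) pairs, longest match first."""
    out: List[Tuple[str, List[int]]] = []
    max_len = max(len(key) for key in TOY_RULES)
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(w.lower() for w in words[i:i + n])
            if key in TOY_RULES:
                out.append((TOY_RULES[key], list(range(i, i + n))))
                i += n
                break
        else:
            out.append((words[i], [i]))  # no rule matched: pass through, case kept
            i += 1
    return out


def itn_with_punctuation(text: str) -> str:
    pairs = [strip_punct(w) for w in text.split()]
    if not pairs:
        return ""
    cores = [core for core, _ in pairs]
    puncts = [punct for _, punct in pairs]
    aligned = toy_normalize(cores)
    # Reattach the punctuation of the last source word in each aligned group.
    return " ".join(token + puncts[src[-1]] for token, src in aligned)


print(itn_with_punctuation("It costs twenty five dollars, plus three percent."))
# It costs $25, plus 3%.
```

The same index lists could drive timestamp remapping, since each output token knows exactly which source words it replaced.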

