Inverse Text Normalization (ITN)

Inverse Text Normalization automatically converts spoken numbers, dates, currencies, and other entities into their written equivalents. When enabled, ITN runs as a post-processing step on every finalized transcript — no changes to your audio pipeline required.

| Spoken (ASR output) | Written (with ITN) |
| --- | --- |
| "the total is twenty five dollars" | "the total is $25" |
| "call me at nine one zero five five five twelve thirty four" | "call me at 910-555-1234" |
| "the meeting is on january fifteenth twenty twenty six" | "the meeting is on January 15th, 2026" |
| "it costs three point five percent" | "it costs 3.5%" |
| "send it to john at gmail dot com" | "send it to john@gmail.com" |
| "i live at one two three main street" | "i live at 123 Main Street" |

Because ITN normalizes only finalized transcripts, you get the best results when the client controls finalization:

  1. Set finalize_on_words=false so the server does not finalize internally based on word count.

  2. Set eou_timeout_ms to match your VAD (Voice Activity Detection) silence threshold.

  3. When your VAD detects the user has stopped speaking, send the control message that matches the lifecycle of your session:

    • Turn-boundary signal — {"type": "finalize"}. Flushes the current audio buffer, runs ITN over the full utterance, and emits one is_final: true transcript for that turn. The WebSocket stays open and accepts audio for the next turn. Send this once per user turn in any multi-turn flow.

    • Session-end signal — {"type": "close_stream"}. Flushes any remaining buffered audio, emits the terminal is_final: true + is_last: true transcript, then closes the WebSocket. Send this once at the actual end of the session — end of call, app shutdown, or after the buffer of a single-shot transcription is fully streamed.

A multi-turn voice agent typically fires many finalize messages and exactly one close_stream per session. A one-off transcription of a fixed audio buffer fires only close_stream after the last chunk.

This approach gives you clean, fully-normalized utterances for downstream LLM processing without paying the WebSocket-reconnect cost between turns.
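
As a minimal sketch of the choice in step 3, the helper below picks the control message for a VAD end-of-speech event. The function name and the `session_ending` flag are hypothetical and belong to your application, not to the API; only the two JSON payloads come from this page.

```python
import json


def control_message(session_ending: bool) -> str:
    """Pick the control message to send when the VAD reports end of speech.

    session_ending is a hypothetical flag owned by your application:
    True once the call (or single-shot buffer) is finished, False otherwise.
    """
    message_type = "close_stream" if session_ending else "finalize"
    return json.dumps({"type": message_type})


# Three user turns, then the caller hangs up.
sent = [control_message(session_ending=False) for _ in range(3)]
sent.append(control_message(session_ending=True))
print(sent)
```

Each returned string would be passed to the WebSocket send call as a text message, so a session produces many finalize messages and a single close_stream.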

Enabling ITN

Pass itn_normalize=true as a query parameter when connecting:

wss://api.smallest.ai/waves/v1/stt/live?model=pulse&language=en&itn_normalize=true

ITN is disabled by default. When disabled, transcripts are returned in spoken form as usual.

Parameters

These parameters can be combined with itn_normalize to control transcription behavior:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| itn_normalize | boolean | false | Enable inverse text normalization. |
| max_words | integer | pipeline default | Maximum words before forced finalization. Useful for keeping ITN chunks short and accurate. |
| eou_timeout_ms | integer | 800 | End-of-utterance silence timeout in milliseconds. Lower values finalize faster. |
| finalize_on_words | boolean | true | When false, disables automatic word-count-based finalization. Use this when you want full control over when to finalize via the finalize message. |
| word_timestamps | boolean | false | Include per-word timestamps. ITN remaps timestamps when words collapse (e.g., "one two three" → "123" spans all three source timestamps). |
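
A small helper can keep these query parameters consistent across connections. The sketch below is one possible approach: `build_stt_url` is a hypothetical name, and the defaults for `model` and `language` simply mirror the example URL above. It converts Python booleans to the lowercase `true`/`false` strings used in the examples on this page, since `str(True)` would otherwise produce `True`.

```python
from urllib.parse import urlencode

BASE_WS_URL = "wss://api.smallest.ai/waves/v1/stt/live"


def build_stt_url(**options) -> str:
    """Build the connection URL from keyword options (hypothetical helper)."""
    params = {"model": "pulse", "language": "en"}
    for key, value in options.items():
        if isinstance(value, bool):
            # Query strings on this page use lowercase booleans.
            value = "true" if value else "false"
        params[key] = str(value)
    return f"{BASE_WS_URL}?{urlencode(params)}"


url = build_stt_url(itn_normalize=True, finalize_on_words=False, eou_timeout_ms=600)
print(url)
```

With these arguments the query string carries `itn_normalize=true`, `finalize_on_words=false`, and `eou_timeout_ms=600`, which is the client-controlled finalization setup recommended earlier.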

Supported Semiotic Classes

ITN covers all standard semiotic classes:

| Class | Example (spoken → written) |
| --- | --- |
| Cardinal | "one hundred twenty three" → "123" |
| Ordinal | "twenty first" → "21st" |
| Money | "twenty five dollars" → "$25" |
| Telephone | "nine one zero five five five one two three four" → "910-555-1234" |
| Date | "january fifteenth twenty twenty six" → "January 15th, 2026" |
| Time | "three thirty p m" → "3:30 PM" |
| Decimal | "three point one four" → "3.14" |
| Measure | "five kilograms" → "5 kg" |
| Electronic | "john at gmail dot com" → "john@gmail.com" |
| Address | "one two three main street" → "123 Main Street" |
| Verbatim | "a b c" → "ABC" |
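
To make the Cardinal row concrete, here is a toy converter for small English cardinals. It is purely illustrative and is not how the service is implemented (the How It Works section describes a WFST); the function name and word tables are assumptions made for this sketch.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def spoken_cardinal_to_int(text: str) -> int:
    """Convert a spoken cardinal below one million to an integer (toy)."""
    total = 0    # value of completed thousands groups
    current = 0  # value of the group currently being read
    for word in text.lower().split():
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word == "thousand":
            total += max(current, 1) * 1000
            current = 0
        else:
            raise ValueError(f"not a cardinal word: {word!r}")
    return total + current


print(spoken_cardinal_to_int("one hundred twenty three"))
```

The toy also shows why class-aware handling matters: it reads "twenty twenty six" as 20 + 20 + 6 = 46, whereas the Date class in the table renders the same words as the year 2026.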

Finalize Control

You can force-finalize the current utterance at any time by sending a JSON text message over the WebSocket:

{ "type": "finalize" }

This promotes all accumulated tokens to a final transcript immediately, without closing the stream. The stream stays open for subsequent audio.

This is especially useful with finalize_on_words=false, where automatic finalization is disabled and you control exactly when each utterance boundary occurs. Common use cases:

  • Form filling — Finalize after each form field to get a clean ITN result per field
  • Voice commands — Finalize on button release or keyword detection
  • Turn-based conversations — Finalize when the other party starts speaking

The server responds with a final transcript for the pending chunk:

{
  "session_id": "sess_abc123",
  "transcript": "the total is $25.",
  "is_final": true,
  "is_last": false,
  "words": [...]
}

If there are no pending tokens when you send finalize, you still get a response (empty transcript) so your request/response indexing stays in sync.
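
One way to use that guarantee is to pair each finalize request with the next final transcript in order. The sketch below assumes finalize_on_words=false and that every is_final message answers exactly one finalize request; if the server also finalizes on its own (for example on the end-of-utterance timeout), the pairing would need extra handling. `FinalizeTracker` and the turn IDs are hypothetical application-side names.

```python
from collections import deque
from typing import Optional, Tuple


class FinalizeTracker:
    """Pair finalize requests with final transcripts in FIFO order (sketch)."""

    def __init__(self) -> None:
        self._pending = deque()

    def sent_finalize(self, turn_id: str) -> None:
        # Call right after sending {"type": "finalize"} for this turn.
        self._pending.append(turn_id)

    def on_message(self, data: dict) -> Optional[Tuple[str, str]]:
        # Interim messages, and finals with no outstanding request, are skipped.
        if not data.get("is_final") or not self._pending:
            return None
        return self._pending.popleft(), data.get("transcript", "")


tracker = FinalizeTracker()
tracker.sent_finalize("turn-1")
tracker.sent_finalize("turn-2")

incoming = [
    {"transcript": "the total is", "is_final": False},
    {"transcript": "the total is $25.", "is_final": True},
    {"transcript": "", "is_final": True},  # finalize sent with no pending tokens
]
results = [r for r in map(tracker.on_message, incoming) if r is not None]
print(results)
```

Because the empty response still arrives, the second turn is resolved with an empty transcript instead of leaving the queue one entry behind.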

End Stream

To close the stream and flush remaining audio, send:

{ "type": "close_stream" }

This flushes any remaining audio, returns the final transcript with is_last: true, and closes the connection. { "type": "finalize" } does not close the stream — it only forces an immediate is_final=true transcript while keeping the session open.

Examples

The Python and JavaScript WebSocket examples below show the single-shot pattern: transcribe a fixed audio buffer, then close the session. A multi-turn voice agent that handles many user turns on the same WebSocket should follow the Python multi-turn voice agent example further down instead, which sends {"type": "finalize"} per turn (session stays open) and {"type": "close_stream"} once at the end of the call. Sending close_stream per turn forces a WebSocket reconnect on every turn, which is the wrong pattern for voice agents.

Python — WebSocket with ITN (single-shot file transcription)

import asyncio
import websockets
import json
from urllib.parse import urlencode

BASE_WS_URL = "wss://api.smallest.ai/waves/v1/stt/live?model=pulse"
params = {
    "language": "en",
    "encoding": "linear16",
    "sample_rate": "16000",
    "word_timestamps": "true",
    "itn_normalize": "true",
}
WS_URL = f"{BASE_WS_URL}&{urlencode(params)}"

API_KEY = "YOUR_API_KEY"

async def transcribe_file(audio_file: str):
    """Single-shot: stream a fixed audio buffer, then close the session.
    For multi-turn voice agents, use the Multi-turn voice agent example
    further down; it keeps the WebSocket open across user turns.
    """
    headers = {"Authorization": f"Bearer {API_KEY}"}

    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        # Stream the file in 4096-byte chunks
        with open(audio_file, "rb") as f:
            while chunk := f.read(4096):
                await ws.send(chunk)

        # All audio sent. For one-off transcription with no more audio coming,
        # send close_stream: it flushes, runs ITN over the full buffer, emits
        # is_final + is_last, and closes the WebSocket.
        await ws.send(json.dumps({"type": "close_stream"}))

        async for message in ws:
            data = json.loads(message)
            if data.get("is_final"):
                print(f"Final: {data['transcript']}")
                # With ITN: "the total is $25."
                # Without:  "the total is twenty five dollars."
            else:
                print(f"Interim: {data['transcript']}")

            if data.get("is_last"):
                break

asyncio.run(transcribe_file("audio.wav"))

JavaScript — WebSocket with ITN (single-shot file transcription)

// Single-shot pattern. For a multi-turn voice agent, mirror the
// Python "Multi-turn voice agent" example below: send {"type":"finalize"}
// per user turn (session stays open) and {"type":"close_stream"} once at
// the end of the call.

// Assumes Node.js with the "ws" package: the browser WebSocket
// constructor does not accept custom headers.
import WebSocket from "ws";

const API_KEY = "YOUR_API_KEY";

const url = new URL("wss://api.smallest.ai/waves/v1/stt/live?model=pulse");
url.searchParams.append("language", "en");
url.searchParams.append("encoding", "linear16");
url.searchParams.append("sample_rate", "16000");
url.searchParams.append("word_timestamps", "true");
url.searchParams.append("itn_normalize", "true");

const ws = new WebSocket(url.toString(), {
  headers: { Authorization: `Bearer ${API_KEY}` },
});

ws.onopen = () => {
  console.log("Connected: streaming audio with ITN enabled");
  // ... send audio chunks as binary messages ...
  // When the buffer is fully streamed and no more audio is coming:
  // ws.send(JSON.stringify({ type: "close_stream" }));
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.is_final) {
    console.log("Final:", data.transcript);
    // "call me at 910-555-1234"
  } else {
    console.log("Interim:", data.transcript);
  }

  if (data.is_last) {
    ws.close();
  }
};

Python: Multi-turn voice agent

For voice agents that handle many user turns in a single session, send {"type": "finalize"} after each turn. The WebSocket stays open, so you pay the connection cost only once per call:

# Reuses the imports, BASE_WS_URL, and API_KEY from the single-shot example.
# session_active and mic_queue are placeholders supplied by your application.
params = {
    "language": "en",
    "encoding": "linear16",
    "sample_rate": "16000",
    "itn_normalize": "true",
    "finalize_on_words": "false",  # Disable internal word-count finalization
    "eou_timeout_ms": "600",       # Match your VAD silence threshold
    "word_timestamps": "true",
}
WS_URL = f"{BASE_WS_URL}&{urlencode(params)}"

async def run_voice_agent():
    headers = {"Authorization": f"Bearer {API_KEY}"}

    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        # Audio producer: stream mic frames as the user speaks across many turns
        async def stream_audio():
            while session_active:
                frame = await mic_queue.get()
                await ws.send(frame)

        # Per-turn finalizer: your VAD calls this when the user pauses
        async def on_user_end_of_turn():
            await ws.send(json.dumps({"type": "finalize"}))
            # Server emits is_final: true with the full ITN-normalized transcript;
            # the WS stays open so the user can speak again.

        # Consumer: receive transcripts (each turn yields one is_final: true frame)
        async def consume():
            async for message in ws:
                data = json.loads(message)
                if data.get("is_final"):
                    print(f"Turn final: {data['transcript']}")
                    # → hand to your LLM, generate the agent reply, speak via TTS,
                    #   and loop back to listening for the next user turn
                if data.get("is_last"):
                    break  # only fires after close_stream below

        # ... wire stream_audio / consume / VAD-driven on_user_end_of_turn together ...

        # End of call (user hung up, agent finished, etc.): close the session
        await ws.send(json.dumps({"type": "close_stream"}))

Python — Single-shot transcription

For one-off transcription of a complete audio buffer (file or single utterance) with no further audio coming, use close_stream directly. It flushes, normalizes, emits is_last: true, and closes — no extra round-trip:

async def transcribe_once(audio_file: str):
    headers = {"Authorization": f"Bearer {API_KEY}"}

    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        with open(audio_file, "rb") as f:
            while chunk := f.read(4096):
                await ws.send(chunk)

        # All audio sent → close_stream is the only message that emits is_last=true
        await ws.send(json.dumps({"type": "close_stream"}))

        async for message in ws:
            data = json.loads(message)
            if data.get("is_final"):
                print(f"Final: {data['transcript']}")
            if data.get("is_last"):
                break

Combining ITN with Other Features

ITN works alongside all other post-processing features:

params = {
    "language": "en",
    "encoding": "linear16",
    "sample_rate": "16000",
    "itn_normalize": "true",
    "redact_pii": "true",       # Redact names, SSN, emails, phone numbers
    "diarize": "true",          # Speaker diarization
    "word_timestamps": "true",
}

Processing order: ITN → Profanity Filter → PII/PCI Redaction

Response Format

When ITN is enabled, final responses contain the normalized transcript:

{
  "session_id": "sess_abc123",
  "transcript": "the total is $25.",
  "is_final": true,
  "is_last": false,
  "language": "en",
  "words": [
    { "word": "the", "start": 0.48, "end": 0.56, "confidence": 0.98 },
    { "word": "total", "start": 0.56, "end": 0.80, "confidence": 0.97 },
    { "word": "is", "start": 0.80, "end": 0.96, "confidence": 0.99 },
    { "word": "$25.", "start": 0.96, "end": 1.44, "confidence": 0.95 }
  ]
}

Key behaviors:

  • Word timestamps are remapped. When multiple spoken words collapse into one written token (e.g., “twenty five dollars” → “$25”), the output word spans the full time range of all source words and takes the max confidence.
  • Punctuation is preserved. Periods, commas, and other punctuation from ASR output are stripped before ITN and reattached to the correct output token afterward.
  • Interim responses are not normalized. ITN only runs on finalized transcripts (is_final: true) to avoid unnecessary processing on text that may still change.
  • Capitalization is preserved. ITN runs in cased mode, so proper nouns and sentence-initial caps from the ASR model are maintained.
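
The timestamp rule in the first bullet can be sketched in a few lines. This is an illustration of the stated behavior (span from the first source word's start to the last source word's end, maximum confidence), not the service's code; `merge_words` and the per-word values for "twenty five dollars" are made up for the example.

```python
def merge_words(source_words: list, written: str) -> dict:
    """Collapse several spoken-word entries into one written-token entry."""
    return {
        "word": written,
        "start": source_words[0]["start"],   # start of the first source word
        "end": source_words[-1]["end"],      # end of the last source word
        "confidence": max(w["confidence"] for w in source_words),
    }


# Hypothetical ASR entries before ITN collapses them.
spoken = [
    {"word": "twenty", "start": 0.96, "end": 1.12, "confidence": 0.95},
    {"word": "five", "start": 1.12, "end": 1.28, "confidence": 0.93},
    {"word": "dollars", "start": 1.28, "end": 1.44, "confidence": 0.94},
]
merged = merge_words(spoken, "$25")
print(merged)
```

The merged entry spans 0.96 to 1.44 with confidence 0.95, which lines up with the last word in the response example above.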

How It Works

  1. End-of-utterance detection — The ASR pipeline detects a natural pause or hits the max_words limit, producing a finalized transcript. Or you send {"type": "finalize"} to force it.
  2. Punctuation stripping — Trailing punctuation (. , ! ? ; :) is stripped from each word before normalization, since the underlying FST engine cannot parse through punctuation.
  3. ITN normalization — The text is passed through a Weighted Finite State Transducer (WFST) that converts spoken-form entities to written form with word alignment tracking.
  4. Punctuation reattachment — Stripped punctuation is mapped back to the correct output token using the word alignment from step 3.
  5. Timestamp remapping — Word-level timestamps from ASR are remapped to the ITN output using the alignment, spanning collapsed words.
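
Steps 2 through 4 can be sketched with a stand-in normalizer. In the code below, `toy_normalize` is a hypothetical placeholder for the WFST that only knows one phrase; what matters is its return shape, a list of (written token, source word indices) pairs, which plays the role of the word alignment. This simplified version keeps only the punctuation of the last source word in each group.

```python
TRAILING = ".,!?;:"


def strip_trailing(word: str):
    """Split a word into its core and any trailing punctuation."""
    core = word.rstrip(TRAILING)
    return core, word[len(core):]


def toy_normalize(cores: list) -> list:
    """Stand-in for the WFST: returns (written token, source indices) pairs."""
    aligned, i = [], 0
    while i < len(cores):
        if cores[i:i + 3] == ["twenty", "five", "dollars"]:
            aligned.append(("$25", [i, i + 1, i + 2]))
            i += 3
        else:
            aligned.append((cores[i], [i]))
            i += 1
    return aligned


def normalize_with_punctuation(words: list) -> list:
    stripped = [strip_trailing(w) for w in words]        # step 2
    cores = [core for core, _ in stripped]
    puncts = [punct for _, punct in stripped]
    aligned = toy_normalize(cores)                        # step 3
    # Step 4: reattach the punctuation of the last source word in each group.
    return [token + puncts[src[-1]] for token, src in aligned]


result = normalize_with_punctuation(
    ["the", "total", "is", "twenty", "five", "dollars."]
)
print(result)
```

The same source indices could then drive step 5, since each output token knows which ASR words, and therefore which timestamps, it came from.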