Inverse Text Normalization (ITN)

Real-Time

Inverse Text Normalization automatically converts spoken numbers, dates, currencies, and other entities into their written equivalents. When enabled, ITN runs as a post-processing step on every finalized transcript — no changes to your audio pipeline required.

| Spoken (ASR output) | Written (with ITN) |
| --- | --- |
| "the total is twenty five dollars" | "the total is $25" |
| "call me at nine one zero five five five twelve thirty four" | "call me at 910-555-1234" |
| "the meeting is on january fifteenth twenty twenty six" | "the meeting is on January 15th, 2026" |
| "it costs three point five percent" | "it costs 3.5%" |
| "send it to john at gmail dot com" | "send it to john@gmail.com" |
| "i live at one two three main street" | "i live at 123 Main Street" |

Because ITN normalizes only finalized transcripts, you get the best results by controlling finalization yourself:

  1. Set finalize_on_words=false so the server does not finalize internally based on word count
  2. Set eou_timeout_ms to match your VAD (Voice Activity Detection) silence threshold
  3. When your VAD detects that the user has stopped speaking, send {"type": "finalize"} — this finalizes the entire chunk and ITN normalizes it as a whole

This approach is especially useful for agentic use cases where you want clean, fully-normalized utterances for downstream LLM processing.

Enabling ITN

Pass itn_normalize=true as a query parameter when connecting:

wss://api.smallest.ai/waves/v1/pulse/get_text?language=en&itn_normalize=true

ITN is disabled by default. When disabled, transcripts are returned in spoken form as usual.

Parameters

These parameters can be combined with itn_normalize to control transcription behavior:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| itn_normalize | boolean | false | Enable inverse text normalization. |
| max_words | integer | pipeline default | Maximum number of words before forced finalization. Useful for keeping ITN chunks short and accurate. |
| eou_timeout_ms | integer | 800 | End-of-utterance silence timeout in milliseconds. Lower values finalize faster. |
| finalize_on_words | boolean | true | When false, disables automatic word-count-based finalization. Use this when you want full control over when to finalize via the finalize message. |
| word_timestamps | boolean | false | Include per-word timestamps. ITN remaps timestamps when words collapse (e.g., "one two three" → "123" spans all three source timestamps). |
| numerals | string | "auto" | Digit formatting. When ITN is enabled, this is typically left as "auto" since ITN handles number conversion. |
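
As a minimal sketch of how these parameters combine into a connection URL, the helper below serializes booleans as lowercase "true"/"false", matching the query string shown under Enabling ITN. The function name build_itn_url is hypothetical and not part of any SDK; only the endpoint and parameter names come from this page.

```python
from urllib.parse import urlencode

BASE_WS_URL = "wss://api.smallest.ai/waves/v1/pulse/get_text"


def build_itn_url(language="en", **options):
    """Build a connection URL with ITN enabled (hypothetical helper)."""
    params = {"language": language, "itn_normalize": True, **options}
    # Query strings carry text, so booleans become "true"/"false"
    encoded = {
        key: str(value).lower() if isinstance(value, bool) else str(value)
        for key, value in params.items()
    }
    return f"{BASE_WS_URL}?{urlencode(encoded)}"


url = build_itn_url(finalize_on_words=False, eou_timeout_ms=600)
print(url)
```

Keeping the parameters in a dictionary until the last step makes it easy to toggle finalize_on_words or eou_timeout_ms per session without editing a URL string by hand.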

Supported Semiotic Classes

ITN covers the following semiotic classes:

| Class | Example (spoken → written) |
| --- | --- |
| Cardinal | "one hundred twenty three" → "123" |
| Ordinal | "twenty first" → "21st" |
| Money | "twenty five dollars" → "$25" |
| Telephone | "nine one zero five five five one two three four" → "910-555-1234" |
| Date | "january fifteenth twenty twenty six" → "January 15th, 2026" |
| Time | "three thirty p m" → "3:30 PM" |
| Decimal | "three point one four" → "3.14" |
| Measure | "five kilograms" → "5 kg" |
| Electronic | "john at gmail dot com" → "john@gmail.com" |
| Address | "one two three main street" → "123 Main Street" |
| Verbatim | "a b c" → "ABC" |
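
To make the idea of a semiotic class concrete, here is a toy converter for the Cardinal class only. It is a hand-written illustration, not the service's implementation (which uses a WFST, as described under How It Works), and it handles only simple English cardinals up to the thousands.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def cardinal_to_digits(spoken: str) -> str:
    """Convert a simple spoken cardinal to digits (toy example)."""
    total = 0    # value of completed thousand groups
    current = 0  # value of the group being built
    for word in spoken.lower().split():
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word == "thousand":
            total += max(current, 1) * 1000
            current = 0
        elif word == "and":
            continue
        else:
            raise ValueError(f"not a cardinal word: {word!r}")
    return str(total + current)


print(cardinal_to_digits("one hundred twenty three"))  # 123
```

A production system needs far more than this: it must decide which span of a sentence is a cardinal at all, and whether "twenty twenty six" is a year or a sequence of numbers, which is why the other classes in the table exist.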

Finalize Control

You can force-finalize the current utterance at any time by sending a JSON text message over the WebSocket:

{ "type": "finalize" }

This promotes all accumulated tokens to a final transcript immediately, without closing the stream. The stream stays open for subsequent audio.

This is especially useful with finalize_on_words=false, where automatic finalization is disabled and you control exactly when each utterance boundary occurs. Common use cases:

  • Form filling — Finalize after each form field to get a clean ITN result per field
  • Voice commands — Finalize on button release or keyword detection
  • Turn-based conversations — Finalize when the other party starts speaking

The server responds with a final transcript that has from_finalize: true:

{
  "session_id": "sess_abc123",
  "transcript": "the total is $25.",
  "is_final": true,
  "is_last": false,
  "from_finalize": true,
  "words": [...],
  "full_transcript": "the total is $25."
}

If there are no pending tokens when you send finalize, you still get a response (empty transcript with from_finalize: true) so your request/response indexing stays in sync.
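
Because every finalize request produces exactly one from_finalize response, a client can pair them up in order. The sketch below assumes responses arrive in the same order the requests were sent; FinalizeTracker and the field labels are hypothetical names used only for illustration.

```python
from collections import deque


class FinalizeTracker:
    """Pairs each finalize request with its from_finalize response (FIFO)."""

    def __init__(self):
        self._pending = deque()

    def sent(self, label):
        """Call right after sending {"type": "finalize"} over the socket."""
        self._pending.append(label)

    def on_message(self, data):
        """Return (label, transcript) for a finalize response, else None."""
        if not data.get("from_finalize") or not self._pending:
            return None
        return self._pending.popleft(), data.get("transcript", "")


tracker = FinalizeTracker()
tracker.sent("amount")
tracker.sent("phone")

# Simulated server messages; the second finalize had no pending tokens
messages = [
    {"transcript": "the total is $25.", "is_final": True, "from_finalize": True},
    {"transcript": "call me", "is_final": False},
    {"transcript": "", "is_final": True, "from_finalize": True},
]
results = [r for r in map(tracker.on_message, messages) if r is not None]
print(results)  # [('amount', 'the total is $25.'), ('phone', '')]
```

The empty-transcript response is what keeps the queue aligned: without it, the "phone" label would be matched to whichever finalize response arrived next.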

End Stream

To close the stream and flush remaining audio, send:

{ "type": "finalize" }

This flushes any remaining audio, returns the final transcript with is_last: true, and closes the connection.

Examples

Python — WebSocket with ITN

import asyncio
import json
from urllib.parse import urlencode

import websockets

BASE_WS_URL = "wss://api.smallest.ai/waves/v1/pulse/get_text"
params = {
    "language": "en",
    "encoding": "linear16",
    "sample_rate": "16000",
    "word_timestamps": "true",
    "itn_normalize": "true",
}
WS_URL = f"{BASE_WS_URL}?{urlencode(params)}"

API_KEY = "YOUR_API_KEY"


async def transcribe(audio_file: str):
    headers = {"Authorization": f"Bearer {API_KEY}"}

    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        # Stream audio in 4096-byte chunks
        with open(audio_file, "rb") as f:
            while chunk := f.read(4096):
                await ws.send(chunk)

        # Signal end of audio
        await ws.send(json.dumps({"type": "finalize"}))

        # Receive transcriptions
        async for message in ws:
            data = json.loads(message)
            if data.get("is_final"):
                print(f"Final: {data['transcript']}")
                # With ITN: "the total is $25."
                # Without: "the total is twenty five dollars."
            else:
                print(f"Interim: {data['transcript']}")

            if data.get("is_last"):
                break


asyncio.run(transcribe("audio.wav"))

JavaScript — WebSocket with ITN

// Node.js: the `headers` option requires the `ws` package;
// the browser WebSocket API cannot set request headers.
const WebSocket = require("ws");

const API_KEY = "YOUR_API_KEY";

const url = new URL("wss://api.smallest.ai/waves/v1/pulse/get_text");
url.searchParams.append("language", "en");
url.searchParams.append("encoding", "linear16");
url.searchParams.append("sample_rate", "16000");
url.searchParams.append("word_timestamps", "true");
url.searchParams.append("itn_normalize", "true");

const ws = new WebSocket(url.toString(), {
  headers: { Authorization: `Bearer ${API_KEY}` },
});

ws.onopen = () => {
  console.log("Connected: streaming audio with ITN enabled");
  // Start sending audio chunks as binary messages
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.is_final) {
    console.log("Final:", data.transcript);
    // "call me at 910-555-1234"
  } else {
    console.log("Interim:", data.transcript);
  }

  if (data.is_last) {
    ws.close();
  }
};

Disable internal word-count finalization, set eou_timeout_ms to match your VAD, and send {"type": "finalize"} when your VAD detects end-of-speech. This finalizes the entire chunk and ITN normalizes it as a whole:

import asyncio
import json
from urllib.parse import urlencode

import websockets

BASE_WS_URL = "wss://api.smallest.ai/waves/v1/pulse/get_text"
API_KEY = "YOUR_API_KEY"

params = {
    "language": "en",
    "encoding": "linear16",
    "sample_rate": "16000",
    "itn_normalize": "true",
    "finalize_on_words": "false",  # Disable internal word-count finalization
    "eou_timeout_ms": "600",  # Match your VAD silence threshold
    "word_timestamps": "true",
}
WS_URL = f"{BASE_WS_URL}?{urlencode(params)}"


async def transcribe_agentic(audio_file: str):
    headers = {"Authorization": f"Bearer {API_KEY}"}

    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        with open(audio_file, "rb") as f:
            while chunk := f.read(4096):
                await ws.send(chunk)

        # VAD detected end-of-speech → send finalize
        # ITN normalizes the entire accumulated chunk
        await ws.send(json.dumps({"type": "finalize"}))

        async for message in ws:
            data = json.loads(message)
            if data.get("is_final"):
                print(f"Final: {data['transcript']}")
            if data.get("is_last"):
                break

Combining ITN with Other Features

ITN works alongside all other post-processing features:

params = {
    "language": "en",
    "encoding": "linear16",
    "sample_rate": "16000",
    "itn_normalize": "true",
    "redact_pii": "true",  # Redact names, SSN, emails, phone numbers
    "diarize": "true",  # Speaker diarization
    "word_timestamps": "true",
}

Processing order: ITN → Numerals → Profanity Filter → PII/PCI Redaction

Response Format

When ITN is enabled, final responses contain the normalized transcript:

{
  "session_id": "sess_abc123",
  "transcript": "the total is $25.",
  "is_final": true,
  "is_last": false,
  "language": "en",
  "words": [
    { "word": "the", "start": 0.48, "end": 0.56, "confidence": 0.98 },
    { "word": "total", "start": 0.56, "end": 0.80, "confidence": 0.97 },
    { "word": "is", "start": 0.80, "end": 0.96, "confidence": 0.99 },
    { "word": "$25.", "start": 0.96, "end": 1.44, "confidence": 0.95 }
  ],
  "full_transcript": "the total is $25."
}

Key behaviors:

  • Word timestamps are remapped. When multiple spoken words collapse into one written token (e.g., “twenty five dollars” → “$25”), the output word spans the full time range of all source words and takes the max confidence.
  • Punctuation is preserved. Periods, commas, and other punctuation from ASR output are stripped before ITN and reattached to the correct output token afterward.
  • Interim responses are not normalized. ITN only runs on finalized transcripts (is_final: true) to avoid unnecessary processing on text that may still change.
  • Capitalization is preserved. ITN runs in cased mode, so proper nouns and sentence-initial caps from the ASR model are maintained.
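
The timestamp remapping rule in the first bullet can be sketched as a small merge function: earliest start, latest end, maximum confidence. The per-word timings and confidences for "twenty", "five", and "dollars" below are invented for illustration; only the merged result matches the sample response above. The name merge_words is hypothetical.

```python
def merge_words(source_words, written):
    """Collapse several spoken words into one written token."""
    return {
        "word": written,
        "start": min(w["start"] for w in source_words),
        "end": max(w["end"] for w in source_words),
        "confidence": max(w["confidence"] for w in source_words),
    }


# Illustrative source words (values are assumptions, not real ASR output)
spoken = [
    {"word": "twenty", "start": 0.96, "end": 1.12, "confidence": 0.95},
    {"word": "five", "start": 1.12, "end": 1.28, "confidence": 0.93},
    {"word": "dollars", "start": 1.28, "end": 1.44, "confidence": 0.94},
]
merged = merge_words(spoken, "$25")
print(merged)
# {'word': '$25', 'start': 0.96, 'end': 1.44, 'confidence': 0.95}
```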

How It Works

  1. End-of-utterance detection — The ASR pipeline detects a natural pause or hits the max_words limit, producing a finalized transcript. Or you send {"type": "finalize"} to force it.
  2. Punctuation stripping — Trailing punctuation (. , ! ? ; :) is stripped from each word before normalization, since the underlying FST engine cannot parse through punctuation.
  3. ITN normalization — The text is passed through a Weighted Finite State Transducer (WFST) that converts spoken-form entities to written form with word alignment tracking.
  4. Punctuation reattachment — Stripped punctuation is mapped back to the correct output token using the word alignment from step 3.
  5. Timestamp remapping — Word-level timestamps from ASR are remapped to the ITN output using the alignment, spanning collapsed words.
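
Steps 2 and 4 can be sketched as follows. This is an illustration under assumptions, not the service's code: the WFST in step 3 is not reproduced, so the normalized tokens and the alignment (which source words produced each output token) are written out by hand, and the function names are hypothetical.

```python
TRAILING = ".,!?;:"


def strip_punct(words):
    """Step 2: split each word into (bare word, trailing punctuation)."""
    bare, punct = [], []
    for word in words:
        core = word.rstrip(TRAILING)
        bare.append(core)
        punct.append(word[len(core):])
    return bare, punct


def reattach(output_tokens, alignment, punct):
    """Step 4: give each output token the punctuation of its last source word.

    alignment[i] lists the source word indices that produced output token i.
    """
    return [
        token + punct[max(sources)]
        for token, sources in zip(output_tokens, alignment)
    ]


asr_words = ["the", "total", "is", "twenty", "five", "dollars."]
bare, punct = strip_punct(asr_words)

# Step 3 stand-in: hand-written ITN output and alignment for this sentence
itn_tokens = ["the", "total", "is", "$25"]
alignment = [[0], [1], [2], [3, 4, 5]]

final = " ".join(reattach(itn_tokens, alignment, punct))
print(final)  # the total is $25.
```

The same alignment list is what step 5 would use to merge timestamps across the collapsed words.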