Lightning v3.1 Pro is a premium 44.1 kHz text-to-speech pool with improved naturalness and a curated voice catalog. It runs on dedicated inference capacity, isolated from general traffic. Concurrency, latency, and rate limits are identical to standard Lightning v3.1; the differences are voice quality and the voice catalog.
Native sample rate: 44.1 kHz
TTFB at 40 concurrent requests: 200 ms
Languages: English + Hindi (Indian voices code-switch; British and American voices English-only)
Real-time factor (faster than playback): 3.3x
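As a rough worked example of the 3.3x figure, the sketch below assumes the real-time factor means audio is generated 3.3 times faster than it plays back, and it ignores network time and TTFB. The function name is hypothetical and not part of any Smallest AI SDK.

```python
REAL_TIME_FACTOR = 3.3  # headline figure quoted above

def estimated_synthesis_seconds(audio_seconds: float, rtf: float = REAL_TIME_FACTOR) -> float:
    """Estimate compute time for a clip of the given duration.

    Assumption: rtf = audio duration / synthesis time, so higher is faster.
    """
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / rtf
```

Under that reading, a 33-second clip would take about 10 seconds of synthesis time.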
Model Overview
Developed by: Smallest AI
Model type: Text-to-Speech / Speech Synthesis
Languages: English (en), Hindi (hi), auto
License: Proprietary
Version: v3.1 Pro
Model ID: lightning_v3.1_pro
Native sample rate: 44,100 Hz
Key Capabilities
Real-Time Optimized: Ultra-low latency architecture designed for conversational AI and live streaming.
Streaming: HTTP, SSE, and WebSocket support for real-time applications.
Multi-Language: Indian voices speak English + Hindi with automatic code-switching. British and American voices speak English.
High Fidelity: Broadcast-quality 44.1 kHz audio with natural prosody, intonation, and conversational rhythm.
Curated Voice Catalog: Premium voices across American, British, and Indian accents.
Pronunciation Control: Custom pronunciation dictionaries for specialized vocabulary, brand names, and domain-specific terms.
How to use it
Pro is selected via the model body parameter on the unified TTS routes — no separate endpoint to call.
curl -X POST "https://api.smallest.ai/waves/v1/tts" \
  -H "Authorization: Bearer $SMALLEST_API_KEY" \
  -H "Content-Type: application/json" \
  -H "Accept: audio/wav" \
  -d '{
    "text": "Hello from the Lightning v3.1 Pro pool.",
    "voice_id": "meher",
    "model": "lightning_v3.1_pro",
    "language": "en",
    "sample_rate": 24000,
    "output_format": "wav"
  }' --output speech.wav
The same "model": "lightning_v3.1_pro" body field also routes to the Pro pool on the WebSocket and SSE endpoints.
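For callers who prefer Python over curl, the same synchronous request might look like the sketch below. It uses only the standard library and mirrors the curl headers and body field for field. The helper names `build_tts_request` and `synthesize_to_file` are hypothetical, and treating the response body as the raw WAV file is an assumption drawn from the curl `--output` flag.

```python
import json
import urllib.request

TTS_URL = "https://api.smallest.ai/waves/v1/tts"

def build_tts_request(text: str, voice_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl example."""
    payload = {
        "text": text,
        "voice_id": voice_id,
        "model": "lightning_v3.1_pro",  # selects the Pro pool
        "language": "en",
        "sample_rate": 24000,
        "output_format": "wav",
    }
    return urllib.request.Request(
        TTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "Accept": "audio/wav",
        },
        method="POST",
    )

def synthesize_to_file(request: urllib.request.Request, path: str) -> None:
    """Send the request and save the body.

    Assumption: the response body is the audio file itself.
    """
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

A typical call would read the key from the `SMALLEST_API_KEY` environment variable, build the request with `build_tts_request("Hello from the Lightning v3.1 Pro pool.", "meher", key)`, and pass it to `synthesize_to_file(request, "speech.wav")`.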
On Atoms voice agents, open the agent’s voice picker and pick a Pro voice from the Pro filter chip. Atoms transparently routes to the Pro pool — no code change required.
Performance & Benchmarks
Pro improves on standard Lightning v3.1 in accuracy, expressiveness, delivery, and MOS quality. The tables below compare Pro with the same competitor set documented on the Lightning v3.1 model card; refer to that card for Pro-vs-Standard comparisons. The notes under each table explain what each metric measures.
Naturalness — higher is better
| Metric | Lightning v3.1 Pro | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Overall | 3.16 | 3.13 | 3.16 | 3.17 | 3.20 | 3.07 | 3.28 | 3.17 | 3.06 | 3.02 |
| Naturalness | 2.55 | 2.41 | 2.52 | 2.55 | 2.57 | 2.42 | 2.58 | 2.57 | 2.41 | 2.37 |
| Intonation | 3.06 | 3.06 | 3.07 | 3.06 | 3.12 | 2.90 | 3.28 | 3.04 | 2.91 | 2.86 |
| Prosody | 2.81 | 2.73 | 2.82 | 2.86 | 2.83 | 2.65 | 3.09 | 2.76 | 2.61 | 2.58 |
What each Naturalness metric measures
Overall — Holistic listener rating of how natural the voice sounds end-to-end.
Naturalness — How human-like the voice sounds; penalizes robotic or synthetic quality.
Intonation — Whether pitch rises and falls appropriately for the sentence type (question, statement, exclamation).
Prosody — The broader umbrella of rhythm, stress, and melody: how well the voice “reads” the sentence as a human would.
Expressiveness — higher is better
| Metric | Lightning v3.1 Pro | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Overall | 3.55 | 3.45 | 3.44 | 3.46 | 3.38 | 3.49 | 3.54 | 3.50 | 3.37 | 3.41 |
| Paralinguistics | 3.64 | 3.60 | 3.59 | 3.61 | 3.56 | 3.60 | 3.64 | 3.58 | 3.55 | 3.58 |
| Emotions | 3.47 | 3.30 | 3.28 | 3.31 | 3.19 | 3.38 | 3.44 | 3.41 | 3.19 | 3.23 |
What each Expressiveness metric measures
Overall — Holistic listener rating of how expressive the voice sounds given the context of the sentence.
Paralinguistics — Non-verbal vocal elements like laughter, sighs, or filler sounds (“um”, “uh”) and whether they’re rendered appropriately.
Emotions — How accurately the voice conveys the intended emotional tone (neutral, warm, urgent, etc.).
Delivery — higher is better
| Metric | Lightning v3.1 Pro | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Boundary Consistency | 4.96 | 4.94 | 4.93 | 4.95 | 4.93 | 4.88 | 4.99 | 4.77 | 4.90 | 4.88 |
| Pronunciation Style | 4.98 | 4.96 | 4.95 | 4.96 | 4.96 | 4.93 | 4.99 | 4.91 | 4.94 | 4.89 |
| Natural Pace | 4.72 | 4.57 | 4.51 | 4.51 | 4.01 | 4.23 | 4.66 | 4.47 | 4.33 | 3.74 |
| Pause Placement | 4.66 | 4.54 | 4.49 | 4.51 | 4.28 | 4.34 | 4.59 | 4.41 | 4.38 | 4.09 |
| Breathing Naturalness | 3.82 | 3.06 | 3.14 | 3.14 | 2.79 | 2.88 | 3.43 | 3.28 | 2.77 | 2.42 |
What each Delivery metric measures
Boundary Consistency — Whether phrase and sentence boundaries are marked consistently with pauses or pitch shifts, without arbitrary breaks mid-phrase.
Pronunciation Style — Not just correctness but also stylistic choices, i.e., formal vs. casual register, regional accent consistency, and honorific handling.
Natural Pace — Whether the speaking rate feels comfortable and appropriate for the content type, neither rushed nor dragging.
Pause Placement — Whether silences appear at semantically correct points (after commas, between clauses) rather than mid-word or mid-phrase.
Breathing Naturalness — Whether breath sounds occur at realistic points and with realistic frequency, not absent entirely or inserted randomly.
Accuracy
Mixed direction: for WER, CER, Hallucination, and Deletion, lower is better; for Pronunciation %, higher is better.
Whisper jiwer
| Metric | Direction | Lightning v3.1 Pro | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WER | lower | 1.36% | 1.26% | 1.35% | 1.33% | 1.43% | 1.26% | 1.37% | 1.25% | 1.10% | 2.83% |
| CER | lower | 0.40% | 0.52% | 0.60% | 0.54% | 0.59% | 0.62% | 0.61% | 0.50% | 0.47% | 1.16% |
| Hallucination | lower | 0.00% | 0.07% | 0.08% | 0.01% | 0.06% | 0.04% | 0.01% | 0.06% | 0.00% | 0.22% |
| Deletion | lower | 0.00% | 0.14% | 0.17% | 0.18% | 0.16% | 0.24% | 0.18% | 0.15% | 0.12% | 0.33% |
| Pronunciation % (Whisper jiwer) | higher | 98.68% | 98.94% | 98.90% | 98.87% | 98.79% | 99.02% | 98.82% | 98.95% | 99.02% | 97.72% |
Whisper LLM
| Metric | Direction | Lightning v3.1 Pro | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WER | lower | 0.96% | 0.82% | 0.72% | 0.57% | 0.88% | 0.70% | 0.72% | 0.60% | 0.55% | 2.15% |
| CER | lower | 0.34% | 0.30% | 0.28% | 0.21% | 0.30% | 0.35% | 0.33% | 0.23% | 0.18% | 1.03% |
| Hallucination | lower | 0.00% | 0.07% | 0.07% | 0.00% | 0.02% | 0.02% | 0.01% | 0.03% | 0.00% | 0.10% |
| Pronunciation % (Whisper LLM) | higher | 99.04% | 99.25% | 99.35% | 99.43% | 99.14% | 99.32% | 99.29% | 99.43% | 99.45% | 97.95% |
What each Accuracy metric measures
WER (Word Error Rate) — Percentage of words in the transcript that differ from the reference; measures how faithfully the TTS renders the input text.
CER (Character Error Rate) — Like WER but at the character level.
Hallucination — Words or sounds the TTS generates that have no basis in the input text: insertions, substitutions, or fabricated content.
Deletion — Words from the reference text that the TTS dropped entirely.
Pronunciation % — The proportion of words pronounced correctly out of total words.
Whisper jiwer vs Whisper LLM — Two judging methodologies. jiwer uses raw Whisper-decoded transcripts; LLM-judged uses a follow-on LLM to normalize transcription noise. Both report the same metric family; LLM-judged tends to give lower error rates by reducing false positives from punctuation/casing.
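To make the WER and CER definitions concrete, here is a dependency-free sketch of the standard edit-distance formulation. It is not the evaluation pipeline behind the tables above, which transcribes audio with Whisper first. The lowercasing and whitespace tokenization are simplifying assumptions, and the function names are illustrative.

```python
from typing import Sequence

def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: ref token missing from hyp
                curr[j - 1] + 1,     # insertion: extra token in hyp
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.lower().split()) / len(ref_words)

def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference.lower()), list(hypothesis.lower())) / len(reference)
```

For example, if the reference is "the cat sat on the mat" and the transcript is "the cat sat on mat", one of six reference words was dropped, so the WER is 1/6.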
MOS v2 — higher is better
| Metric | Lightning v3.1 Pro | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mean MOS | 4.22 | 4.16 | 3.98 | 4.02 | 3.76 | 4.11 | 4.24 | 3.97 | 3.73 | 3.99 |
| UTMOS | 3.76 | 3.76 | 3.37 | 3.41 | 2.77 | 3.57 | 3.71 | 3.33 | 2.54 | 3.50 |
| WV-MOS | 5.05 | 4.55 | 4.60 | 4.63 | 4.76 | 4.65 | 4.76 | 4.62 | 4.91 | 4.48 |
What each MOS metric measures
Mean MOS — Mean Opinion Score: average listener rating on a 1–5 scale across the test set; the canonical aggregate quality metric in TTS evaluation.
UTMOS — A predicted MOS from the UTMOS reference model — an automated proxy for subjective quality.
WV-MOS — A predicted MOS from the WavLM-based WV-MOS reference model — another automated proxy commonly reported alongside UTMOS for cross-validation.
Want to reproduce these results? See the TTS evaluation script to measure TTFB and synthesis quality in your own environment.
Supported Languages
Pro language support varies per voice — Indian Pro voices speak English with native Hindi code-switching; British and American Pro voices speak English only. For languages outside these, use the standard Lightning v3.1 model.
| Voice group | Languages | Code-switching |
| --- | --- | --- |
| Indian Pro voices | English (en), Hindi (hi) | English ↔ Hindi within a single utterance via language="auto" |
| British Pro voices | English (en) | — |
| American Pro voices | English (en) | — |
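One way to apply the per-voice language rules on the client side is sketched below. The Indian voice IDs are copied from the Pro catalog on this page. The script-detection heuristic (Devanagari block U+0900 to U+097F for Hindi, ASCII letters for English) and the function names are assumptions made for illustration. Any voice ID outside the Indian set is treated as English-only here.

```python
# Indian Pro voice IDs, copied from the Voice Catalog on this page.
INDIAN_PRO_VOICES = frozenset({
    "rhea", "zariya", "kareena", "mishka", "inaaya", "saira", "meher", "aarini",
    "aviraj", "vyom", "zoravar", "reyansh", "ahan",
})

def contains_devanagari(text: str) -> bool:
    """True if any character falls in the Devanagari Unicode block."""
    return any("\u0900" <= ch <= "\u097f" for ch in text)

def contains_latin_letters(text: str) -> bool:
    """True if the text has at least one ASCII letter."""
    return any(ch.isascii() and ch.isalpha() for ch in text)

def pick_language(voice_id: str, text: str) -> str:
    """Return "en", "hi", or "auto" for the request's language field."""
    has_hindi = contains_devanagari(text)
    if voice_id not in INDIAN_PRO_VOICES:
        # British and American Pro voices speak English only.
        if has_hindi:
            raise ValueError(f"voice {voice_id!r} is English-only but text contains Hindi")
        return "en"
    if has_hindi and contains_latin_letters(text):
        return "auto"  # mixed English and Hindi in one utterance
    return "hi" if has_hindi else "en"
```

Raising on Hindi text for an English-only voice is a design choice for this sketch; a caller could instead fall back to an Indian Pro voice.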
Voice Catalog
The Pro voice catalog is distinct from standard Lightning v3.1. Voices below are listed in recommended ranking per accent group.
Indian — Female
| Voice ID | Name |
| --- | --- |
| rhea | Rhea |
| zariya | Zariya |
| kareena | Kareena |
| mishka | Mishka |
| inaaya | Inaaya |
| saira | Saira |
| meher | Meher |
| aarini | Aarini |
Indian — Male
| Voice ID | Name |
| --- | --- |
| aviraj | Aviraj |
| vyom | Vyom |
| zoravar | Zoravar |
| reyansh | Reyansh |
| ahan | Ahan |
British — Female
| Voice ID | Name |
| --- | --- |
| cressida | Cressida |
| elowen | Elowen |
| ottilie | Ottilie |
| seraphina | Seraphina |
| tabitha | Tabitha |
| arabella | Arabella |
British — Male
| Voice ID | Name |
| --- | --- |
| benedict | Benedict |
| cormac | Cormac |
| everett | Everett |
| finley | Finley |
| rupert | Rupert |
| winston | Winston |
| caspian | Caspian |
American — Female
| Voice ID | Name |
| --- | --- |
| willow | Willow |
| autumn | Autumn |
| skylar | Skylar |
| savannah | Savannah |
| kennedy | Kennedy |
| reagan | Reagan |
| sierra | Sierra |
American — Male
| Voice ID | Name |
| --- | --- |
| maverick | Maverick |
| brooks | Brooks |
| hunter | Hunter |
| colton | Colton |
| wesley | Wesley |
| asher | Asher |
Need a voice not in this list? Use the standard Lightning v3.1 catalog (217 voices, more languages, voice cloning). Pass "model": "lightning_v3.1" (or omit the field) instead of lightning_v3.1_pro.
API Reference
Endpoints
| Endpoint | Method | Use Case |
| --- | --- | --- |
| https://api.smallest.ai/waves/v1/tts | POST | Synchronous synthesis |
| https://api.smallest.ai/waves/v1/tts/live | POST (SSE) | Server-sent events streaming |
| wss://api.smallest.ai/waves/v1/tts/live | WebSocket | Real-time streaming |
Request Parameters
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| text | string | Yes | — | Text to synthesize |
| voice_id | string | Yes | — | Voice identifier (Pro catalog above) |
| model | string | No | lightning_v3.1 | Pass lightning_v3.1_pro to route to the Pro pool. The field is optional, but the default routes to standard Lightning v3.1 — for Pro you must set it explicitly. |
| sample_rate | integer | No | 44100 | Output sample rate (Hz) |
| speed | float | No | 1.0 | Speech speed (0.5–2.0) |
| language | string | No | "auto" | en, hi, or auto (per voice’s tags.language) |
| output_format | string | No | "pcm" | pcm, mp3, wav, ulaw, alaw |
| pronunciation_dicts | array | No | — | List of pronunciation dictionary IDs — works on REST sync, SSE, and WebSocket |
Use native script for each language. English in Latin script; Hindi in Devanagari.
Break points: Natural punctuation (.!?,)
Mixed language: Use native script per language; avoid transliteration
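The parameter table above can be turned into a small client-side check before sending a request. The sketch below is one possible shape. `build_payload` is a hypothetical helper, the allowed values and defaults are copied from the table, and the `model` default is deliberately set to the Pro ID rather than the API default because Pro must be requested explicitly.

```python
from typing import Any, Dict, Optional, Sequence

ALLOWED_LANGUAGES = {"en", "hi", "auto"}
ALLOWED_FORMATS = {"pcm", "mp3", "wav", "ulaw", "alaw"}

def build_payload(
    text: str,
    voice_id: str,
    *,
    model: str = "lightning_v3.1_pro",  # API default is lightning_v3.1
    sample_rate: int = 44100,
    speed: float = 1.0,
    language: str = "auto",
    output_format: str = "pcm",
    pronunciation_dicts: Optional[Sequence[str]] = None,
) -> Dict[str, Any]:
    """Validate arguments against the documented ranges and return a JSON-ready dict."""
    if not text:
        raise ValueError("text is required")
    if not voice_id:
        raise ValueError("voice_id is required")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if language not in ALLOWED_LANGUAGES:
        raise ValueError(f"language must be one of {sorted(ALLOWED_LANGUAGES)}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(ALLOWED_FORMATS)}")
    payload: Dict[str, Any] = {
        "text": text,
        "voice_id": voice_id,
        "model": model,
        "sample_rate": sample_rate,
        "speed": speed,
        "language": language,
        "output_format": output_format,
    }
    if pronunciation_dicts:
        # Optional field: only sent when dictionary IDs are supplied.
        payload["pronunciation_dicts"] = list(pronunciation_dicts)
    return payload
```

The returned dict can be serialized with `json.dumps` and used as the request body on any of the three endpoints.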
Compute Infrastructure
Hardware
Recommended GPU: NVIDIA L40S
Recommended VRAM: 48 GB
Software
Server regions (AWS): India (Hyderabad), USA (Oregon)
Automatic geo-location based routing for lowest latency
Best Practices
Voice ID + model pairing
Pair Pro voice IDs above with "model": "lightning_v3.1_pro". The API does not currently reject mismatched pairings, but pairing a Pro voice with "model": "lightning_v3.1" (or omitting model) can produce wrong or hallucinated audio. Server-side validation is on the roadmap.
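Until server-side validation exists, a client can guard against mismatched pairings itself. The sketch below is one way to do that. The voice IDs are copied from the Pro catalog on this page, while `resolve_model` and its behavior (fill in the Pro model when `model` is omitted, raise on an explicit mismatch) are illustrative choices rather than documented API behavior.

```python
from typing import Optional

PRO_MODEL = "lightning_v3.1_pro"
STANDARD_MODEL = "lightning_v3.1"

# Pro voice IDs copied from the Voice Catalog on this page.
PRO_VOICE_IDS = frozenset({
    "rhea", "zariya", "kareena", "mishka", "inaaya", "saira", "meher", "aarini",
    "aviraj", "vyom", "zoravar", "reyansh", "ahan",
    "cressida", "elowen", "ottilie", "seraphina", "tabitha", "arabella",
    "benedict", "cormac", "everett", "finley", "rupert", "winston", "caspian",
    "willow", "autumn", "skylar", "savannah", "kennedy", "reagan", "sierra",
    "maverick", "brooks", "hunter", "colton", "wesley", "asher",
})

def resolve_model(voice_id: str, model: Optional[str] = None) -> str:
    """Return the model to send, refusing a Pro voice with a non-Pro model."""
    is_pro_voice = voice_id in PRO_VOICE_IDS
    if model is None:
        # Omitting model would route to standard Lightning v3.1, so fill it in.
        return PRO_MODEL if is_pro_voice else STANDARD_MODEL
    if is_pro_voice and model != PRO_MODEL:
        raise ValueError(
            f"voice {voice_id!r} is a Pro voice; use model {PRO_MODEL!r}, not {model!r}"
        )
    return model
```

The check only covers the case the text above warns about. It does not reject a non-Pro voice paired with the Pro model, since this page does not say how the API treats that combination.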
Language selection
Each voice’s supported languages live in tags.language on the voice catalog. Passing a language outside that list is accepted by the API but produces English-pronounced output, since the voice wasn’t trained on it. Pick a voice whose tags.language matches your target language.
Text Formatting
Chunk boundaries. Segment input at natural prosodic boundaries (.!?,). The maximum chunk size is 250 characters; throughput is best at 140 characters per request.
Script integrity. Use native script for each language. Mixed-script input within a single language token produces unpredictable phoneme mappings.
Lexicon overrides. Use pronunciation dictionaries for domain-specific terms, brand names, and acronyms where default grapheme-to-phoneme conversion is insufficient.
For comprehensive text formatting rules (numeric handling, date/time, symbols, chunking logic), see TTS Best Practices.
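The chunk-boundary guidance can be sketched as a small splitter that breaks text after `.`, `!`, `?`, or `,` and packs the pieces toward the 140-character target without exceeding 250. This is an illustrative approach, not the chunking logic the service uses. `chunk_text` is a hypothetical name, and splitting only where punctuation is followed by whitespace (so that "v3.1" stays intact) is an assumption of this sketch.

```python
import re
from typing import List

MAX_CHARS = 250     # maximum chunk size stated above
TARGET_CHARS = 140  # size cited above for best throughput

def chunk_text(text: str, target: int = TARGET_CHARS, limit: int = MAX_CHARS) -> List[str]:
    """Greedily pack punctuation-delimited segments into chunks of at most `limit` chars."""
    if not 0 < target <= limit:
        raise ValueError("need 0 < target <= limit")
    # Split on whitespace that follows . ! ? or , so decimals like 3.1 survive.
    segments = re.split(r"(?<=[.!?,])\s+", text.strip())
    pieces: List[str] = []
    for seg in filter(None, segments):
        # Fallback for a segment longer than the limit: break at the last
        # space that fits, or hard-cut if there is no space at all.
        while len(seg) > limit:
            cut = seg.rfind(" ", 0, limit + 1)
            cut = cut if cut > 0 else limit
            pieces.append(seg[:cut].strip())
            seg = seg[cut:].strip()
        if seg:
            pieces.append(seg)
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if current and len(candidate) > target:
            chunks.append(current)  # adding the piece would pass the target
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own request. Packing toward the target rather than the hard limit is a design choice that trades a few extra requests for per-request sizes near the stated optimum.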
Use Cases
Direct Use
Voice assistants and conversational AI
Interactive chatbots with voice output
Real-time narration and live streaming
Accessibility tools and screen readers
Customer service automation
Downstream Use
Multi-turn conversational agents
Audio content generation pipelines
Telephony and IVR systems
Podcast generation
Limitations & Safety
Known Limitations
No voice cloning. Voice cloning is not available on the Pro pool. Clones continue to use standard Lightning v3.1 and the existing voice-cloning flow.
Lightning v3.1 Pro must not be used for impersonation or fraud, generating deceptive audio content (deepfakes), creating content that violates consent or privacy, harassment or abuse, or any illegal or unethical purposes.
Safety & Compliance
No retention of synthesized audio
Usage monitoring for policy compliance
For compliance documentation (GDPR, SOC2, HIPAA), contact support@smallest.ai.