Lightning v3.1
Lightning v3.1 is a high-fidelity, low-latency text-to-speech model that delivers natural, expressive speech at 44.1 kHz. Optimized for real-time applications, it combines ultra-low latency and voice cloning support with broadcast-quality, conversational audio. It supports 12 languages: English, Hindi, Spanish, and 9 Indian languages.
Native sample rate
TTFB at 40 concurrent requests
Auto-detection + code-switching
Real-time factor (faster than playback)
Model Overview
Key Capabilities
Ultra-low latency architecture designed for conversational AI and live streaming.
Instant voice cloning with just 5–15 seconds of audio, via API or console.
HTTP, SSE, and WebSocket transports for real-time playback.
12 languages with automatic language detection and mid-utterance code-switching.
Broadcast-quality 44.1 kHz audio with natural prosody, intonation, and conversational rhythm.
Custom pronunciation dictionaries for specialized vocabulary, brand names, and domain-specific terms.
How to use it
See the Lightning quickstart for a working end-to-end example, including authentication, request shape, and audio playback. Lightning v3.1 is selected via the model body field on the unified TTS endpoints — POST /waves/v1/tts for synchronous synthesis, POST /waves/v1/tts/live for SSE streaming, and WSS /waves/v1/tts/live for WebSocket streaming. Pass "model": "lightning_v3.1" to route to the Standard pool, or "model": "lightning_v3.1_pro" to route to the Pro pool.
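As a rough sketch of the routing described above, the snippet below picks the path for each transport and builds the JSON body with the model field. Only the three paths and the two model values come from this page. The host `api.example.com`, the `text` field name, the Bearer authorization header, and the example voice name are assumptions for illustration, so check the quickstart for the real request shape. Nothing is sent over the network.

```python
import json

# Paths and model names are taken from this page; everything else is assumed.
ENDPOINTS = {
    "sync": ("POST", "/waves/v1/tts"),
    "sse": ("POST", "/waves/v1/tts/live"),
    "websocket": ("WSS", "/waves/v1/tts/live"),
}
MODELS = {"standard": "lightning_v3.1", "pro": "lightning_v3.1_pro"}


def build_tts_request(text, voice_id, transport="sync", pool="standard",
                      base_host="api.example.com", api_key="YOUR_API_KEY"):
    """Return a dict describing the request; nothing is sent."""
    method, path = ENDPOINTS[transport]
    scheme = "wss" if method == "WSS" else "https"
    body = {
        "model": MODELS[pool],  # selects the Standard or Pro pool
        "text": text,           # assumed field name
        "voice_id": voice_id,   # parameter name used in the Top Voices section
    }
    return {
        "method": method,
        "url": f"{scheme}://{base_host}{path}",  # placeholder host
        "headers": {
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        "body": json.dumps(body),
    }


if __name__ == "__main__":
    req = build_tts_request("Hello there", "example_voice",
                            transport="sse", pool="pro")
    print(req["method"], req["url"])
    print(req["body"])
```

Keeping the pool choice in one lookup table means switching between Standard and Pro is a one-argument change rather than a string edit scattered across call sites.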
Performance & Benchmarks
The results below come from a head-to-head listener evaluation against eight production TTS systems on the EmergentTTS benchmark, with 1,088 samples scored by the LLM-as-a-Judge framework. The first table gives the win-rate breakdown per competitor; the per-metric scores follow, split by category.
Win, tie, loss against each competitor
Direct head-to-head listener ratings. Lightning Wins % is the share of samples where Lightning v3.1 was preferred, Ties % is the share where listeners scored both equally, and Competitor Wins % is the share where the competitor was preferred. The three percentages for each competitor sum to 100%.
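The arithmetic behind a win, tie, loss breakdown can be sketched as follows. The outcome labels and the function name are hypothetical, and the sample data is made up purely to show the calculation; it is not benchmark data.

```python
from collections import Counter

LABELS = ("win", "tie", "loss")  # from Lightning v3.1's point of view


def win_tie_loss(outcomes):
    """Convert per-sample outcomes into percentages that sum to 100."""
    counts = Counter(outcomes)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        raise ValueError("no outcomes to summarize")
    return {label: 100.0 * counts[label] / total for label in LABELS}


if __name__ == "__main__":
    # Made-up outcomes against one hypothetical competitor.
    print(win_tie_loss(["win", "win", "tie", "loss"]))
```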
Per-metric scores
Mean listener score per metric across the same 1,088-sample test set. Tables are split by category, and the list under each one explains what each metric measures.
Naturalness — higher is better
What each Naturalness metric measures
- Overall — Holistic listener rating of how natural the voice sounds end-to-end.
- Naturalness — How human-like the voice sounds; penalizes robotic or synthetic quality.
- Intonation — Whether pitch rises and falls appropriately for the sentence type (question, statement, exclamation).
- Prosody — The broader umbrella of rhythm, stress, and melody: how well the voice “reads” the sentence as a human would.
- Pronunciation — Whether individual words are phonetically correct, especially names, loanwords, and domain-specific terms.
- Audio Quality — Technical cleanliness of the output; absence of artifacts, distortion, clipping, or background noise.
Expressiveness — higher is better
What each Expressiveness metric measures
- Overall — Holistic listener rating of how expressive the voice sounds given the context of the sentence.
- Paralinguistics — Non-verbal vocal elements like laughter, sighs, or filler sounds (“um”, “uh”) and whether they’re rendered appropriately.
- Emotions — How accurately the voice conveys the intended emotional tone (neutral, warm, urgent, etc.).
Delivery — higher is better
What each Delivery metric measures
- Boundary Consistency — Whether phrase and sentence boundaries are marked consistently with pauses or pitch shifts, without arbitrary breaks mid-phrase.
- Pronunciation Style — Not just correctness, but stylistic choices such as formal vs. casual register, regional accent consistency, and honorific handling.
- Natural Pace — Whether the speaking rate feels comfortable and appropriate for the content type, neither rushed nor dragging.
- Pause Placement — Whether silences appear at semantically correct points (after commas, between clauses) rather than mid-word or mid-phrase.
- Breathing Naturalness — Whether breath sounds occur at realistic points and with realistic frequency, not absent entirely or inserted randomly.
Accuracy
Mixed direction — most are lower is better; the Whisper-judged Pronunciation % is higher is better.
What each Accuracy metric measures
- WER (Word Error Rate) — Percentage of words in the transcript that differ from the reference; measures how faithfully the TTS renders the input text.
- CER (Character Error Rate) — Like WER but at the character level.
- Hallucination — Words or sounds the TTS generates that have no basis in the input text, such as insertions, substitutions, or fabricated content.
- Pronunciation % (Whisper jiwer) — The proportion of words pronounced correctly out of total words.
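For reference, WER and CER follow the standard edit-distance definition, sketched below in plain Python. This is not the benchmark's actual pipeline: the transcription step, the text normalization (lowercasing here is an assumption), and the tooling used for the published numbers are not reproduced.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.lower().split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: the same ratio at the character level."""
    ref_chars = list(reference.lower())
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.lower())) / len(ref_chars)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    transcript = "the cat sat on mat"  # one word dropped
    print(f"WER: {wer(reference, transcript):.3f}")
    print(f"CER: {cer(reference, transcript):.3f}")
```

Here `reference` is the input text given to the TTS system and `transcript` is what a speech recognizer heard in the synthesized audio.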
MOS v2 — higher is better
What WV-MOS measures
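- WV-MOS — A mean opinion score (MOS) on a 1–5 scale, averaged across the test set. WV-MOS scores are predicted by a wav2vec 2.0-based model trained to approximate listener ratings; MOS is the standard aggregate quality metric in TTS evaluation.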
*For Pronunciation and WER, the residual gap on Lightning v3.1 is concentrated in proper-noun rendering. Use a pronunciation dictionary to pin names, brands, and acronyms; with the dictionary applied, both metrics close to parity.
Want to reproduce these results? See the TTS evaluation script to measure TTFB and synthesis quality in your own environment.
Supported Languages
Pass the language code on each request to match the language of your input text. Each voice supports a subset of these languages — check tags.language on the voice via GET /waves/v1/lightning-v3.1/get_voices.
The list above reflects the voice catalog as the source of truth — languages for which Lightning v3.1 has at least one trained voice.
Top Voices
Curated short-list of the voices we’d recommend for production. Use these voice_id values directly in the voice_id parameter — no setup required. The full Voice Catalog below has the complete list across additional languages.
English (American)
English (Other accents)
Indic (Hindi + English, Indian accent)
Need something not in this short-list? Call GET /waves/v1/lightning-v3.1/get_voices (217 voices total) or browse the full catalog below. Each voice in the API response includes tags.language, tags.accent, tags.age, and tags.gender so you can filter programmatically.
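Filtering on those tags could look like the sketch below. The response shape is an assumption: each voice is taken to be a dict with a `voice_id` and a `tags` dict, and a tag value may be either a single string or a list. The sample voices are invented; inspect a real `get_voices` response before relying on any of these names.

```python
def filter_voices(voices, language=None, accent=None, gender=None, age=None):
    """Return the voices whose tags match every filter that is not None."""
    wanted = {"language": language, "accent": accent,
              "gender": gender, "age": age}
    matches = []
    for voice in voices:
        tags = voice.get("tags", {})
        keep = True
        for key, value in wanted.items():
            if value is None:
                continue
            tag = tags.get(key)
            # Accept both "en" and ["en", "hi"] style tag values.
            values = tag if isinstance(tag, (list, tuple)) else [tag]
            if value not in values:
                keep = False
                break
        if keep:
            matches.append(voice)
    return matches


if __name__ == "__main__":
    # Invented sample data standing in for an API response.
    sample = [
        {"voice_id": "voice_a", "tags": {"language": ["en", "hi"], "gender": "female"}},
        {"voice_id": "voice_b", "tags": {"language": "es", "gender": "male"}},
    ]
    print([v["voice_id"] for v in filter_voices(sample, language="hi")])
```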
Voice Catalog
English (US) — Best Voices
Hindi / English — Best Voices
Spanish — Best Voices
Other Indian Languages — Best Voices
Voice Cloning
Audio required: 5-15 seconds
Self-serve voice cloning is available via the API and console. It captures core voice characteristics for quick replication.
API Reference
Endpoints
Use the unified /waves/v1/tts* routes and select the model with "model": "lightning_v3.1". The model-named routes below remain live but are deprecated.
Recommended (unified routes — model selected via model body field):
Legacy (deprecated — kept for backwards compatibility):
Request Parameters
Throughput, Latency & Pricing
Rate limits, concurrency caps, and pricing tiers are documented on the Concurrency & Limits page. For enterprise pricing, contact sales@smallest.ai.
Word-level timestamps
Pass word_timestamps: true on a WebSocket request to receive per-word timing events interleaved with the audio stream. Each event tells you exactly when a word starts and ends in the generated audio — useful for captioning, karaoke-style highlighting, avatar lip-sync, and word-level analytics.
Request
Response frames
The server interleaves word_timestamp frames with chunk frames in audio-time order, followed by a single complete:
Each word_timestamp frame:
Voice + language support matrix
Word events are emitted only when the chosen voice family has the aligner checkpoint baked in.
For unsupported voice families the flag is accepted — audio works normally, but no word_timestamp frames are emitted. Detect this client-side by counting received word_timestamp frames after complete arrives.
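The client-side check described above can be sketched as a small frame consumer. The frame names (`word_timestamp`, `chunk`, `complete`) come from this page, but the `type` discriminator and the `word` field are assumptions about the payload, and the frames in the demo are invented. A real client would feed this from decoded WebSocket messages.

```python
def summarize_stream(frames):
    """Consume decoded frames until `complete`, then report what arrived."""
    words = []
    audio_chunks = 0
    completed = False
    for frame in frames:
        kind = frame.get("type")  # assumed discriminator field
        if kind == "word_timestamp":
            words.append(frame)
        elif kind == "chunk":
            audio_chunks += 1
        elif kind == "complete":
            completed = True
            break
    return {
        "completed": completed,
        "audio_chunks": audio_chunks,
        "words": words,
        # Zero word frames after `complete` suggests the voice family
        # does not emit word events.
        "has_word_timestamps": completed and len(words) > 0,
    }


if __name__ == "__main__":
    # Invented frames for illustration only.
    demo = [
        {"type": "chunk"},
        {"type": "word_timestamp", "word": "hello"},
        {"type": "chunk"},
        {"type": "complete"},
    ]
    result = summarize_stream(demo)
    print(result["audio_chunks"], result["has_word_timestamps"])
```

The decision is deferred until `complete` arrives because word frames are interleaved with audio, so an early count of zero proves nothing.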
Backward compatibility
word_timestamps defaults to false. Clients that don’t set the flag see no behavior change — same audio chunks, same completion frame, no new event type to handle.
JavaScript example
Where it works
If you need word timing, use the WebSocket path.
Best Practices
Code-Switching
Lightning v3.1 supports real-time intra-session language switching via two mutually exclusive language groups. Each group shares a unified phoneme space, enabling seamless mid-utterance transitions between member languages without session re-initialization. Cross-group switching is not supported within a single session.
Language Groups
Indic Group. Optimized for South Asian language pairs with English as the bridging language.
Global Group. Optimized for European language pairs with English and Hindi as bridging languages.
Intra-group switching is unrestricted. Any language within the same group can be interleaved at the token level. Cross-group switching (e.g., Tamil from Indic + French from Global) is architecturally unsupported and will produce undefined behavior.
en and hi exist in both groups. All other languages are exclusive to one group. The group is determined at session initialization based on the first non-shared language encountered. Design your session’s language set accordingly.
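A pre-flight check for a session's language set might look like the sketch below. The membership tables are deliberately partial: they contain only the languages this page names explicitly (Tamil in Indic, French in Global, with `en` and `hi` shared), and the codes `ta` and `fr` are assumed to follow the same two-letter convention as `en` and `hi`. Fill in the full group lists from the language tables before using anything like this.

```python
SHARED = {"en", "hi"}  # present in both groups, per this page

# Partial, illustrative membership. Extend from the official group tables.
EXCLUSIVE = {
    "indic": {"ta"},
    "global": {"fr"},
}


def resolve_group(languages):
    """Return 'indic', 'global', or None when only shared languages are used.

    Raises ValueError for an unknown code or a cross-group combination.
    """
    group = None
    for lang in languages:
        if lang in SHARED:
            continue
        owner = next((name for name, members in EXCLUSIVE.items()
                      if lang in members), None)
        if owner is None:
            raise ValueError(f"unknown language code: {lang}")
        if group is None:
            group = owner  # the first non-shared language fixes the group
        elif owner != group:
            raise ValueError(
                f"{lang} belongs to {owner}, but the session is already {group}")
    return group


if __name__ == "__main__":
    print(resolve_group(["en", "ta", "hi"]))
    print(resolve_group(["hi", "en"]))
```

Running this before opening a session turns the undefined behavior of a cross-group mix into an explicit error on the client.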
Routing Examples
Voice Cloning
Reference Audio
- Environment. Record in a quiet room with no background noise, hiss, or rumble. Ambient sound is captured in the clone and cannot be removed after the fact.
- Speaking style. Speak naturally in your normal conversational voice. The model captures timbre, accent, emotional tone, rhythm, and pacing automatically. Do not exaggerate unless a specific tone is intended.
- Audio length. Provide 5 to 15 seconds of clean, continuous speech.
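The length requirement can be checked locally before uploading a sample. The sketch below uses Python's standard `wave` module and assumes the reference is an uncompressed PCM WAV file; other formats would need a different reader. It checks duration only and says nothing about noise or recording quality. The function names are illustrative.

```python
import io
import wave

MIN_SECONDS = 5.0   # lower bound stated above
MAX_SECONDS = 15.0  # upper bound stated above


def wav_duration_seconds(source):
    """Duration of a PCM WAV given a path string or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_audio(source):
    """Return (ok, message) for the 5 to 15 second length guideline."""
    seconds = wav_duration_seconds(source)
    if seconds < MIN_SECONDS:
        return False, f"too short: {seconds:.1f}s"
    if seconds > MAX_SECONDS:
        return False, f"too long: {seconds:.1f}s"
    return True, f"ok: {seconds:.1f}s"


def make_silent_wav(seconds, rate=16000):
    """Build an in-memory mono 16-bit WAV of silence (demo helper only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(check_reference_audio(make_silent_wav(10)))
    print(check_reference_audio(make_silent_wav(2)))
```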
Multi-Lingual Cloning
- Language matching. For best results, record reference audio in the same language as your intended output. Cross-lingual cloning is supported (e.g., English reference used for Spanish output), but a language-matched reference produces higher fidelity.
- Accent retention. When synthesizing in a different language than the reference, the original accent is preserved. A clone from a South Indian English speaker will retain that accent in Hindi or Tamil output. This is by design: the clone reproduces your voice, including accent characteristics. For accent-neutral output in a specific language, provide reference audio from a native speaker of that language.
- Script encoding. Input text must use the native script for each language (Devanagari for Hindi and Marathi, Gujarati script for Gujarati, the respective native scripts for Dravidian languages, Latin for European languages). Transliterated input degrades synthesis quality.
- Group constraint. Cloned voices follow the same language group routing rules. A session initialized in the Indic group cannot switch to Global-exclusive languages, regardless of the voice’s source language.
For detailed recording examples and expressive cloning techniques, see Voice Cloning Best Practices.
Text Formatting
- Chunk boundaries. Segment input at natural prosodic boundaries (. ! ? ,). Maximum chunk size is 250 characters; optimal throughput at 140 characters per request.
- Script integrity. Avoid transliteration. Use native script for each language. Mixed-script input within a single language token produces unpredictable phoneme mappings.
- Numeric normalization. Use standard formats (DD/MM/YYYY, HH:MM). Phone numbers default to 3-4-3 digit grouping.
- Lexicon overrides. Use pronunciation dictionaries for domain-specific terms, brand names, and acronyms where default grapheme-to-phoneme conversion is insufficient.
For comprehensive text formatting rules (numeric handling, date/time, symbols, chunking logic), see TTS Best Practices.
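One way to apply the chunk-boundary guidance is a greedy packer: split after sentence or clause punctuation, then merge neighbouring segments until the next one would push a chunk past the 140-character target, never exceeding the 250-character maximum. This is a minimal sketch rather than the service's own chunking logic. It splits only where punctuation is followed by whitespace, so values like 3.14 stay intact, and the function names are illustrative.

```python
import re

MAX_CHARS = 250     # maximum chunk size stated above
TARGET_CHARS = 140  # size stated above as best for throughput


def split_segments(text):
    """Split after . ! ? or , when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?,])\s+", text.strip()) if s]


def split_long(segment, limit):
    """Fallback: break an over-long segment at spaces, then hard-cut tokens."""
    pieces, current = [], ""
    for word in segment.split():
        while len(word) > limit:  # a single token longer than the limit
            if current:
                pieces.append(current)
                current = ""
            pieces.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def chunk_text(text, target=TARGET_CHARS, limit=MAX_CHARS):
    """Greedily pack segments into chunks near `target`, never above `limit`."""
    chunks, current = [], ""
    for segment in split_segments(text):
        pieces = [segment] if len(segment) <= limit else split_long(segment, limit)
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if current and len(candidate) > target:
                chunks.append(current)
                current = piece
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("Hello there. " * 30):
        print(len(chunk), chunk[:40])
```

A single segment between 140 and 250 characters is sent as its own chunk rather than being cut mid-phrase, which trades a little throughput for cleaner prosody.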
Technical Specifications
Audio Output
Text Formatting Guidelines
Number & Date Handling
Compute Infrastructure
Hardware
- Recommended GPU: NVIDIA L40S
- Recommended VRAM: 48 GB
Software
- Server regions (AWS): India (Hyderabad), USA (Oregon)
- Automatic geo-location based routing for lowest latency
Use Cases
Direct Use
- Voice assistants and conversational AI
- Interactive chatbots with voice output
- Real-time narration and live streaming
- Accessibility tools and screen readers
- Gaming (dynamic character voices)
- Customer service automation
Downstream Use
- Multi-turn conversational agents
- Audio content generation pipelines
- Telephony and IVR systems
- Podcast and audiobook generation
Safety & Compliance
Known Limitations
- Transliterated or mixed-script text may produce suboptimal results. Hindi text should be in Devanagari script (e.g., “नमस्ते” rather than “namaste”), and English text should be in Latin script, not Devanagari. Each language should use its native script.
Recommendations: Use proper script for each language. Break long text at natural punctuation points. Use pronunciation dictionaries for specialized vocabulary. Test voice selection for your specific use case.
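A quick local check for the script recommendation can catch transliterated input before it is sent. The sketch below derives a coarse script label from each character's Unicode name using the standard `unicodedata` module. It is a heuristic with illustrative function names, not part of the API, and it should be applied per language segment, since code-switched text legitimately mixes scripts.

```python
import unicodedata


def script_of(ch):
    """Coarse script label such as 'DEVANAGARI' or 'LATIN', else None."""
    # Keep letters plus combining vowel signs; skip digits, spaces, punctuation.
    if not ch.isalpha() and unicodedata.category(ch) not in ("Mn", "Mc"):
        return None
    name = unicodedata.name(ch, "")
    if not name:
        return None
    first = name.split(" ")[0]
    return None if first == "COMBINING" else first


def scripts_used(text):
    """Set of script labels for the letters found in the text."""
    return {label for label in map(script_of, text) if label}


def has_transliteration_risk(text, expected_script):
    """True if the text contains letters outside the expected script."""
    return bool(scripts_used(text) - {expected_script})


if __name__ == "__main__":
    print(scripts_used("नमस्ते"))
    print(has_transliteration_risk("namaste", "DEVANAGARI"))
```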
Lightning v3.1 must not be used for impersonation or fraud, generating deceptive audio content (deepfakes), creating content that violates consent or privacy, harassment or abuse, or any illegal or unethical purposes.
Compliance
- Voice cloning requires explicit consent
- No retention of synthesized audio
- No storage of personal voice data beyond cloning scope
- Usage monitoring for policy compliance
For compliance documentation (GDPR, SOC2, HIPAA), contact support@smallest.ai.
Support
- Email: support@smallest.ai
- Community: Discord
- Documentation: docs.smallest.ai/waves
- Console: app.smallest.ai/dashboard

