Lightning v3.1 Pro
Lightning v3.1 Pro is a premium 44.1 kHz text-to-speech pool with improved naturalness and a curated voice catalog. It runs on dedicated inference capacity, isolated from general traffic. Concurrency, latency, and rate limits are identical to standard Lightning v3.1; the differences are voice quality and the catalog.
Native sample rate
TTFB at 40 concurrent requests
Indian voices code-switch; British and American voices English-only
Real-time factor (faster than playback)
Model Overview
Key Capabilities
Ultra-low latency architecture designed for conversational AI and live streaming.
HTTP, SSE, and WebSocket support for real-time applications.
Indian voices speak English + Hindi with automatic code-switching. British and American voices speak English.
Broadcast-quality 44.1 kHz audio with natural prosody, intonation, and conversational rhythm.
Premium voices across American, British, and Indian accents.
Custom pronunciation dictionaries for specialized vocabulary, brand names, and domain-specific terms.
How to use it
Pro is selected via the model body parameter on the unified TTS routes — no separate endpoint to call.
The same "model": "lightning_v3.1_pro" body field also routes to the Pro pool on the WebSocket and SSE endpoints.
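As a minimal sketch of that selection, the snippet below builds a JSON request body with the model field set to lightning_v3.1_pro. Only the model field and its two values come from this page; the "text" and "voice_id" field names, the placeholder voice ID, and the helper name build_tts_body are assumptions for illustration, so check the API Reference for the actual request schema and endpoint.

```python
import json

PRO_MODEL = "lightning_v3.1_pro"    # value documented on this page
STANDARD_MODEL = "lightning_v3.1"   # standard model, also used when the field is omitted


def build_tts_body(text: str, voice_id: str, use_pro: bool = True) -> str:
    """Return a JSON request body that selects the Pro or the standard pool.

    The "text" and "voice_id" keys are assumed field names, not confirmed
    by this page; only "model" is documented here.
    """
    body = {
        "text": text,
        "voice_id": voice_id,
        "model": PRO_MODEL if use_pro else STANDARD_MODEL,
    }
    return json.dumps(body)


if __name__ == "__main__":
    # "example_pro_voice" is a placeholder, not a real catalog entry.
    print(build_tts_body("Hello from the Pro pool.", "example_pro_voice"))
```

Because the model is chosen in the body rather than in the URL, the same helper could feed an HTTP, SSE, or WebSocket client without changes.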
Performance & Benchmarks
Pro improves on standard Lightning v3.1 across accuracy, expressiveness, delivery, and MOS quality. The tables below compare Pro with v3.1 Standard and the same competitor set documented on the Lightning v3.1 model card. The list under each category explains what each metric measures.
Naturalness — higher is better
What each Naturalness metric measures
- Overall — Holistic listener rating of how natural the voice sounds end-to-end.
- Naturalness — How human-like the voice sounds; penalizes robotic or synthetic quality.
- Intonation — Whether pitch rises and falls appropriately for the sentence type (question, statement, exclamation).
- Prosody — The broader umbrella of rhythm, stress, and melody, how well the voice “reads” the sentence as a human would.
Expressiveness — higher is better
What each Expressiveness metric measures
- Overall — Holistic listener rating of how expressive the voice sounds given the context of the sentence.
- Paralinguistics — Non-verbal vocal elements like laughter, sighs, or filler sounds (“um”, “uh”) and whether they’re rendered appropriately.
- Emotions — How accurately the voice conveys the intended emotional tone (neutral, warm, urgent, etc.).
Delivery — higher is better
What each Delivery metric measures
- Boundary Consistency — Whether phrase and sentence boundaries are marked consistently with pauses or pitch shifts, without arbitrary breaks mid-phrase.
- Pronunciation Style — Not just correctness, but stylistic choices such as formal vs. casual register, regional accent consistency, and honorific handling.
- Natural Pace — Whether the speaking rate feels comfortable and appropriate for the content type, neither rushed nor dragging.
- Pause Placement — Whether silences appear at semantically correct points (after commas, between clauses) rather than mid-word or mid-phrase.
- Breathing Naturalness — Whether breath sounds occur at realistic points and with realistic frequency, not absent entirely or inserted randomly.
Accuracy
Mixed direction: for WER, CER, Hallucination, and Deletion, lower is better; for Pronunciation %, higher is better.
Whisper jiwer
Whisper LLM
What each Accuracy metric measures
- WER (Word Error Rate) — Percentage of words in the transcript that differ from the reference; measures how faithfully the TTS renders the input text.
- CER (Character Error Rate) — Like WER but at the character level.
- Hallucination — Words or sounds the TTS generates that have no basis in the input text, such as insertions, substitutions, or fabricated content.
- Deletion — Words from the reference text that the TTS dropped entirely.
- Pronunciation % — The proportion of words pronounced correctly out of total words.
- Whisper jiwer vs Whisper LLM — Two judging methodologies. jiwer uses raw Whisper-decoded transcripts; LLM-judged uses a follow-on LLM to normalize transcription noise. Both report the same metric family; LLM-judged tends to give lower error rates by reducing false positives from punctuation/casing.
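To make the WER and CER definitions concrete, here is a small pure-Python sketch that computes both from a reference text and a transcript using Levenshtein edit distance. It is illustrative only: the benchmark figures on this page come from Whisper transcripts scored with jiwer or an LLM judge, and this sketch does not reproduce that pipeline or its text normalization. The function names are made up for this example.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion from the reference
                            curr[j - 1] + 1,       # insertion into the transcript
                            prev[j - 1] + cost))   # substitution or exact match
        prev = curr
    return prev[-1]


def wer(reference: str, transcript: str) -> float:
    """Word error rate: word-level edits divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, transcript.split()) / len(ref_words)


def cer(reference: str, transcript: str) -> float:
    """Character error rate: character-level edits divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(transcript)) / len(reference)


if __name__ == "__main__":
    # One dropped word out of four reference words.
    print(wer("the quick brown fox", "the quick fox"))  # 0.25
```

Note that this raw comparison treats casing and punctuation differences as errors, which is the kind of false positive the LLM-judged variant is described as reducing.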
MOS v2 — higher is better
What each MOS metric measures
- Mean MOS — Mean Opinion Score: average listener rating on a 1–5 scale across the test set; the canonical aggregate quality metric in TTS evaluation.
- UTMOS — A predicted MOS from the UTMOS reference model — an automated proxy for subjective quality.
- WV-MOS — A predicted MOS from the WavLM-based WV-MOS reference model — another automated proxy commonly reported alongside UTMOS for cross-validation.
Want to reproduce these results? See the TTS evaluation script to measure TTFB and synthesis quality in your own environment.
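For a rough idea of what such a measurement involves, the sketch below times any iterable of audio chunks and reports time to first byte plus a real-time factor. It is not the evaluation script referenced above. It assumes 16-bit mono PCM at the 44.1 kHz rate stated on this page, and it defines real-time factor as wall-clock time divided by audio duration, so values below 1.0 mean faster than playback; both are assumptions to adjust for your actual output format and metric convention. fake_stream is a stand-in for a real streaming response.

```python
import time
from typing import Iterable, Iterator, Tuple

SAMPLE_RATE_HZ = 44_100   # native sample rate stated on this page
BYTES_PER_SAMPLE = 2      # assumption: 16-bit mono PCM


def measure_stream(chunks: Iterable[bytes]) -> Tuple[float, float]:
    """Return (ttfb_seconds, real_time_factor) for a stream of audio chunks."""
    start = time.perf_counter()
    ttfb = None
    total_bytes = 0
    for chunk in chunks:
        if ttfb is None and chunk:
            ttfb = time.perf_counter() - start   # first non-empty chunk arrived
        total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    if ttfb is None:
        raise ValueError("stream produced no audio")
    audio_seconds = total_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    return ttfb, elapsed / audio_seconds


def fake_stream(delay_s: float = 0.02) -> Iterator[bytes]:
    """Stand-in for a streaming response: waits, then yields 1 s of silence."""
    time.sleep(delay_s)
    for _ in range(10):
        yield bytes(8_820)   # 10 x 8,820 bytes = 88,200 bytes = 1 s of audio


if __name__ == "__main__":
    ttfb, rtf = measure_stream(fake_stream())
    print(f"TTFB: {ttfb * 1000:.1f} ms, real-time factor: {rtf:.3f}")
```

In practice you would pass the chunk iterator from your HTTP, SSE, or WebSocket client in place of fake_stream, and start the timer at the moment the request is sent.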
Supported Languages
Pro language support varies per voice — Indian Pro voices speak English with native Hindi code-switching; British and American Pro voices speak English only. For languages outside these, use the standard Lightning v3.1 model.
Voice Catalog
The Pro voice catalog is distinct from standard Lightning v3.1. Voices below are listed in recommended ranking per accent group.
Indian — Female
Indian — Male
British — Female
British — Male
American — Female
American — Male
Need a voice not in this list? Use the standard Lightning v3.1 catalog (217 voices, more languages, voice cloning). Pass "model": "lightning_v3.1" (or omit the field) instead of lightning_v3.1_pro.
API Reference
Endpoints
Request Parameters
Technical Specifications
Audio Output
Text Formatting Guidelines
Compute Infrastructure
Hardware
- Recommended GPU: NVIDIA L40S
- Recommended VRAM: 48 GB
Software
- Server regions (AWS): India (Hyderabad), USA (Oregon)
- Automatic geo-location based routing for lowest latency
Best Practices
Voice ID + model pairing
Pair Pro voice IDs above with "model": "lightning_v3.1_pro". The API does not currently reject mismatched pairings, but pairing a Pro voice with "model": "lightning_v3.1" (or omitting model) can produce wrong or hallucinated audio. Server-side validation is on the roadmap.
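Until server-side validation exists, a client-side guard is one way to catch the mismatch described above. The sketch below is a hypothetical helper, not part of any SDK: the voice IDs are placeholders to replace with the Pro IDs from the Voice Catalog, and it checks only the direction this page documents (a Pro voice sent without the Pro model).

```python
from typing import Optional

PRO_MODEL = "lightning_v3.1_pro"
STANDARD_MODEL = "lightning_v3.1"

# Placeholder IDs; replace with the Pro voice IDs from the Voice Catalog.
PRO_VOICE_IDS = frozenset({"example_pro_voice_a", "example_pro_voice_b"})


def check_pairing(voice_id: str, model: Optional[str] = None) -> str:
    """Return the effective model, or raise if a Pro voice is paired with a non-Pro model."""
    effective = model or STANDARD_MODEL   # omitting model selects the standard model
    if voice_id in PRO_VOICE_IDS and effective != PRO_MODEL:
        raise ValueError(
            f"voice {voice_id!r} is a Pro voice; send it with model={PRO_MODEL!r}, "
            f"not {effective!r}"
        )
    return effective


if __name__ == "__main__":
    print(check_pairing("example_pro_voice_a", PRO_MODEL))
```

Calling check_pairing before every request turns a silent audio-quality problem into an immediate error in your own code.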
Language selection
Each voice’s supported languages live in tags.language on the voice catalog. Passing a language outside that list is accepted by the API but produces English-pronounced output, since the voice wasn’t trained on it. Pick a voice whose tags.language matches your target language.
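One way to apply this is to filter the catalog by tags.language before choosing a voice. The sketch below assumes each catalog entry is a dictionary with a voice_id and a tags.language list of language names; the entries, the voice IDs, and the exact spelling of the language values are hypothetical, so adapt them to what the voice catalog actually returns.

```python
from typing import Dict, List

# Hypothetical catalog entries; only the tags.language shape follows this page.
CATALOG: List[Dict] = [
    {"voice_id": "example_indian_voice", "tags": {"language": ["english", "hindi"]}},
    {"voice_id": "example_british_voice", "tags": {"language": ["english"]}},
]


def voices_for_language(catalog: List[Dict], language: str) -> List[str]:
    """Return the IDs of voices whose tags.language includes the target language."""
    wanted = language.strip().lower()
    matches = []
    for voice in catalog:
        supported = voice.get("tags", {}).get("language", [])
        if wanted in (lang.lower() for lang in supported):
            matches.append(voice["voice_id"])
    return matches


if __name__ == "__main__":
    print(voices_for_language(CATALOG, "hindi"))
```

An empty result is the signal to fall back to the standard Lightning v3.1 model rather than send an unsupported language to a Pro voice.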
Text Formatting
- Chunk boundaries. Segment input at natural prosodic boundaries (.!?,). Maximum chunk size is 250 characters; optimal throughput at 140 characters per request.
- Script integrity. Use native script for each language. Mixed-script input within a single language token produces unpredictable phoneme mappings.
- Lexicon overrides. Use pronunciation dictionaries for domain-specific terms, brand names, and acronyms where default grapheme-to-phoneme conversion is insufficient.
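The chunk-boundary guidance can be sketched as a greedy splitter: break the text after ., !, ?, or , and pack the resulting segments into chunks no longer than the limit. This is a minimal illustration under those stated limits, not the service's own chunking logic. The function names are hypothetical, and the fallback to word boundaries for an over-long segment is a design choice of this sketch.

```python
import re
from typing import List

MAX_CHARS = 250   # maximum chunk size stated above


def _split_words(segment: str, max_chars: int) -> List[str]:
    """Split an over-long segment at spaces; hard-cut any word that is still too long."""
    pieces: List[str] = []
    current = ""
    for word in segment.split():
        while len(word) > max_chars:   # pathological case: one giant token
            if current:
                pieces.append(current)
                current = ""
            pieces.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Greedily pack punctuation-delimited segments into chunks of at most max_chars."""
    # Split on whitespace that follows ., !, ? or , so the punctuation stays attached.
    segments = [s for s in re.split(r"(?<=[.!?,])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for segment in segments:
        pieces = [segment] if len(segment) <= max_chars else _split_words(segment, max_chars)
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("Welcome back. How can I help you today? " * 10):
        print(len(chunk), chunk)
```

Passing max_chars=140 targets the throughput sweet spot mentioned above instead of the hard limit.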
For comprehensive text formatting rules (numeric handling, date/time, symbols, chunking logic), see TTS Best Practices.
Use Cases
Direct Use
- Voice assistants and conversational AI
- Interactive chatbots with voice output
- Real-time narration and live streaming
- Accessibility tools and screen readers
- Customer service automation
Downstream Use
- Multi-turn conversational agents
- Audio content generation pipelines
- Telephony and IVR systems
- Podcast generation
Limitations & Safety
Known Limitations
- No voice cloning. Voice cloning is not available on the Pro pool. Clones continue to use standard Lightning v3.1 and the existing voice-cloning flow.
Lightning v3.1 Pro must not be used for impersonation or fraud, generating deceptive audio content (deepfakes), creating content that violates consent or privacy, harassment or abuse, or any illegal or unethical purposes.
Safety & Compliance
- No retention of synthesized audio
- Usage monitoring for policy compliance
For compliance documentation (GDPR, SOC2, HIPAA), contact support@smallest.ai.

