Lightning v3.1 Pro

Latest Release

Lightning v3.1 Pro is a premium 44.1 kHz text-to-speech pool with improved naturalness and a curated voice catalog. It runs on dedicated inference capacity, isolated from general traffic. Concurrency, latency, and rate limits are identical to standard Lightning v3.1; the differences are voice quality and the catalog.

  • 44.1 kHz: native sample rate
  • 200 ms: TTFB at 40 concurrent requests
  • English + Hindi: Indian voices code-switch; British and American voices are English-only
  • 3.3x: real-time factor (faster than playback)

Model Overview

Developed by: Smallest AI
Model type: Text-to-Speech / Speech Synthesis
Languages: English (en), Hindi (hi), auto
License: Proprietary
Version: v3.1 Pro
Model ID: lightning_v3.1_pro
Native sample rate: 44,100 Hz

Key Capabilities

Real-Time Optimized

Ultra-low latency architecture designed for conversational AI and live streaming.

Streaming

HTTP, SSE, and WebSocket support for real-time applications.

Multi-Language

Indian voices speak English + Hindi with automatic code-switching. British and American voices speak English.

High Fidelity

Broadcast-quality 44.1 kHz audio with natural prosody, intonation, and conversational rhythm.

Curated Voice Catalog

Premium voices across American, British, and Indian accents.

Pronunciation Control

Custom pronunciation dictionaries for specialized vocabulary, brand names, and domain-specific terms.


How to use it

Select Pro with the model body parameter on the unified TTS routes; there is no separate endpoint to call.

curl -X POST "https://api.smallest.ai/waves/v1/tts" \
  -H "Authorization: Bearer $SMALLEST_API_KEY" \
  -H "Content-Type: application/json" \
  -H "Accept: audio/wav" \
  -d '{
    "text": "Hello from the Lightning v3.1 Pro pool.",
    "voice_id": "meher",
    "model": "lightning_v3.1_pro",
    "language": "en",
    "sample_rate": 24000,
    "output_format": "wav"
  }' --output speech.wav

The same "model": "lightning_v3.1_pro" body field also routes to the Pro pool on the WebSocket and SSE endpoints.
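For callers who prefer Python over curl, the request above can be sketched with the standard library alone. This is a minimal sketch, not an official client: the function names `build_tts_request` and `synthesize_to_file` are hypothetical, and it assumes a successful response body is the raw audio, as the `--output` flag in the curl example suggests.

```python
import json
import urllib.request

TTS_URL = "https://api.smallest.ai/waves/v1/tts"  # sync endpoint from this page


def build_tts_request(text, voice_id, api_key, sample_rate=24000):
    """Build the same POST request as the curl example, without sending it."""
    payload = {
        "text": text,
        "voice_id": voice_id,
        "model": "lightning_v3.1_pro",  # routes the request to the Pro pool
        "language": "en",
        "sample_rate": sample_rate,
        "output_format": "wav",
    }
    return urllib.request.Request(
        TTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "Accept": "audio/wav",
        },
        method="POST",
    )


def synthesize_to_file(request, path, timeout=30):
    """Send the request and write the response body to `path`.

    Assumption: a successful response body is the audio itself.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

A typical call would read the key from the `SMALLEST_API_KEY` environment variable, build the request with `build_tts_request("Hello from the Lightning v3.1 Pro pool.", "meher", key)`, and pass it to `synthesize_to_file(request, "speech.wav")`. Building and sending are kept separate so the payload can be inspected or logged before any network call.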

On Atoms voice agents, open the agent’s voice picker and pick a Pro voice from the Pro filter chip. Atoms transparently routes to the Pro pool — no code change required.

Performance & Benchmarks

Pro improves on standard Lightning v3.1 across accuracy, expressiveness, delivery, and MOS quality. The tables below compare Pro with standard Lightning v3.1 and the same competitor set documented on the Lightning v3.1 model card. The definitions listed under each table explain what each metric measures.

Naturalness — higher is better

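| Metric | Lightning v3.1 Pro | Lightning v3.1 | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Overall | 3.16 | 3.25 | 3.13 | 3.16 | 3.17 | 3.20 | 3.07 | 3.28 | 3.17 | 3.06 | 3.02 |
| Naturalness | 2.55 | 2.61 | 2.41 | 2.52 | 2.55 | 2.57 | 2.42 | 2.58 | 2.57 | 2.41 | 2.37 |
| Intonation | 3.06 | 3.22 | 3.06 | 3.07 | 3.06 | 3.12 | 2.90 | 3.28 | 3.04 | 2.91 | 2.86 |
| Prosody | 2.81 | 3.01 | 2.73 | 2.82 | 2.86 | 2.83 | 2.65 | 3.09 | 2.76 | 2.61 | 2.58 |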
  • Overall — Holistic listener rating of how natural the voice sounds end-to-end.
  • Naturalness — How human-like the voice sounds; penalizes robotic or synthetic quality.
  • Intonation — Whether pitch rises and falls appropriately for the sentence type (question, statement, exclamation).
  • Prosody — The broader umbrella of rhythm, stress, and melody, how well the voice “reads” the sentence as a human would.

Expressiveness — higher is better

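| Metric | Lightning v3.1 Pro | Lightning v3.1 | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Overall | 3.55 | 3.45 | 3.45 | 3.44 | 3.46 | 3.38 | 3.49 | 3.54 | 3.50 | 3.37 | 3.41 |
| Paralinguistics | 3.64 | 3.61 | 3.60 | 3.59 | 3.61 | 3.56 | 3.60 | 3.64 | 3.58 | 3.55 | 3.58 |
| Emotions | 3.47 | 3.29 | 3.30 | 3.28 | 3.31 | 3.19 | 3.38 | 3.44 | 3.41 | 3.19 | 3.23 |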
  • Overall — Holistic listener rating of how expressive the voice sounds given the context of the sentence.
  • Paralinguistics — Non-verbal vocal elements like laughter, sighs, or filler sounds (“um”, “uh”) and whether they’re rendered appropriately.
  • Emotions — How accurately the voice conveys the intended emotional tone (neutral, warm, urgent, etc.).

Delivery — higher is better

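| Metric | Lightning v3.1 Pro | Lightning v3.1 | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Boundary Consistency | 4.96 | 4.94 | 4.94 | 4.93 | 4.95 | 4.93 | 4.88 | 4.99 | 4.77 | 4.90 | 4.88 |
| Pronunciation Style | 4.98 | 4.94 | 4.96 | 4.95 | 4.96 | 4.96 | 4.93 | 4.99 | 4.91 | 4.94 | 4.89 |
| Natural Pace | 4.72 | 4.47 | 4.57 | 4.51 | 4.51 | 4.01 | 4.23 | 4.66 | 4.47 | 4.33 | 3.74 |
| Pause Placement | 4.66 | 4.46 | 4.54 | 4.49 | 4.51 | 4.28 | 4.34 | 4.59 | 4.41 | 4.38 | 4.09 |
| Breathing Naturalness | 3.82 | 3.82 | 3.06 | 3.14 | 3.14 | 2.79 | 2.88 | 3.43 | 3.28 | 2.77 | 2.42 |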
  • Boundary Consistency — Whether phrase and sentence boundaries are marked consistently with pauses or pitch shifts, without arbitrary breaks mid-phrase.
  • Pronunciation Style — Not just correctness, but stylistic choices, e.g., formal vs. casual register, regional accent consistency, honorific handling.
  • Natural Pace — Whether the speaking rate feels comfortable and appropriate for the content type, neither rushed nor dragging.
  • Pause Placement — Whether silences appear at semantically correct points (after commas, between clauses) rather than mid-word or mid-phrase.
  • Breathing Naturalness — Whether breath sounds occur at realistic points and with realistic frequency, not absent entirely or inserted randomly.

Accuracy

Mixed direction: for WER, CER, Hallucination, and Deletion, lower is better; for Pronunciation %, higher is better.

Whisper jiwer

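| Metric | Direction | Lightning v3.1 Pro | Lightning v3.1 | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WER | lower | 1.36% | 1.57% | 1.26% | 1.35% | 1.33% | 1.43% | 1.26% | 1.37% | 1.25% | 1.10% | 2.83% |
| CER | lower | 0.40% | 0.67% | 0.52% | 0.60% | 0.54% | 0.59% | 0.62% | 0.61% | 0.50% | 0.47% | 1.16% |
| Hallucination | lower | 0.00% | 0.03% | 0.07% | 0.08% | 0.01% | 0.06% | 0.04% | 0.01% | 0.06% | 0.00% | 0.22% |
| Pronunciation % (Whisper jiwer) | higher | 98.68% | 98.61% | 98.94% | 98.90% | 98.87% | 98.79% | 99.02% | 98.82% | 98.95% | 99.02% | 97.72% |

Deletion (lower), values in source order: 0.00%, 0.14%, 0.17%, 0.18%, 0.16%, 0.24%, 0.18%, 0.15%, 0.12%, 0.33%. Only ten values are given for the eleven model columns, so this row is not aligned to models.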

Whisper LLM

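| Metric | Direction | Lightning v3.1 Pro | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WER | lower | 0.96% | 0.82% | 0.72% | 0.57% | 0.88% | 0.70% | 0.72% | 0.60% | 0.55% | 2.15% |
| CER | lower | 0.34% | 0.30% | 0.28% | 0.21% | 0.30% | 0.35% | 0.33% | 0.23% | 0.18% | 1.03% |
| Hallucination | lower | 0.00% | 0.07% | 0.07% | 0.00% | 0.02% | 0.02% | 0.01% | 0.03% | 0.00% | 0.10% |
| Pronunciation % (Whisper LLM) | higher | 99.04% | 99.25% | 99.35% | 99.43% | 99.14% | 99.32% | 99.29% | 99.43% | 99.45% | 97.95% |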
  • WER (Word Error Rate) — Percentage of words in the transcript that differ from the reference; measures how faithfully the TTS renders the input text.
  • CER (Character Error Rate) — Like WER but at the character level.
  • Hallucination — Words or sounds the TTS generates that have no basis in the input text. Insertions, substitutions, or fabricated content.
  • Deletion — Words from the reference text that the TTS dropped entirely.
  • Pronunciation % — The proportion of words pronounced correctly out of total words.
  • Whisper jiwer vs Whisper LLM — Two judging methodologies. jiwer uses raw Whisper-decoded transcripts; LLM-judged uses a follow-on LLM to normalize transcription noise. Both report the same metric family; LLM-judged tends to give lower error rates by reducing false positives from punctuation/casing.
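To make the WER and CER definitions concrete, here is a minimal sketch that computes both from a reference text and a transcript using edit distance. It assumes the conventional definition (substitutions + deletions + insertions, divided by the reference length), which is what jiwer reports; the helper names are hypothetical, and the text normalization used in the benchmark above is not specified on this page, so none is applied.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_item != hyp_item),   # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, with the reference "the cat sat on the mat" and the transcript "the cat sit on mat", there is one substitution and one deletion over six reference words, so `wer` returns 2/6. In practice both strings are usually lowercased and stripped of punctuation first, which is the kind of transcription noise the LLM-judged variant is described as reducing.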

MOS v2 — higher is better

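| Metric | Lightning v3.1 Pro | Lightning v3.1 | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WV-MOS | 5.05 | 4.71 | 4.55 | 4.60 | 4.63 | 4.76 | 4.65 | 4.76 | 4.62 | 4.91 | 4.48 |

Mean MOS, values in source order: 4.22, 4.16, 3.98, 4.02, 3.76, 4.11, 4.24, 3.97, 3.73, 3.99.
UTMOS, values in source order: 3.76, 3.76, 3.37, 3.41, 2.77, 3.57, 3.71, 3.33, 2.54, 3.50.
Only ten values are given for each of these two rows against eleven model columns, so they are not aligned to models.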
  • Mean MOS — Mean Opinion Score: average listener rating on a 1–5 scale across the test set; the canonical aggregate quality metric in TTS evaluation.
  • UTMOS — A predicted MOS from the UTMOS reference model — an automated proxy for subjective quality.
  • WV-MOS — A predicted MOS from the WavLM-based WV-MOS reference model — another automated proxy commonly reported alongside UTMOS for cross-validation.

Want to reproduce these results? See the TTS evaluation script to measure TTFB and synthesis quality in your own environment.


Supported Languages

Pro language support varies per voice — Indian Pro voices speak English with native Hindi code-switching; British and American Pro voices speak English only. For languages outside these, use the standard Lightning v3.1 model.

| Voice group | Languages | Code-switching |
|---|---|---|
| Indian Pro voices | English (en), Hindi (hi) | English ↔ Hindi within a single utterance via language="auto" |
| British Pro voices | English (en) | |
| American Pro voices | English (en) | |

Voice Catalog

The Pro voice catalog is distinct from the standard Lightning v3.1 catalog. Voices below are listed in recommended order within each accent group.

Indian — Female

| Voice ID | Name |
|---|---|
| rhea | Rhea |
| zariya | Zariya |
| kareena | Kareena |
| mishka | Mishka |
| inaaya | Inaaya |
| saira | Saira |
| meher | Meher |
| aarini | Aarini |

Indian — Male

| Voice ID | Name |
|---|---|
| aviraj | Aviraj |
| vyom | Vyom |
| zoravar | Zoravar |
| reyansh | Reyansh |
| ahan | Ahan |

British — Female

| Voice ID | Name |
|---|---|
| cressida | Cressida |
| elowen | Elowen |
| ottilie | Ottilie |
| seraphina | Seraphina |
| tabitha | Tabitha |
| arabella | Arabella |

British — Male

| Voice ID | Name |
|---|---|
| benedict | Benedict |
| cormac | Cormac |
| everett | Everett |
| finley | Finley |
| rupert | Rupert |
| winston | Winston |
| caspian | Caspian |

American — Female

| Voice ID | Name |
|---|---|
| willow | Willow |
| autumn | Autumn |
| skylar | Skylar |
| savannah | Savannah |
| kennedy | Kennedy |
| reagan | Reagan |
| sierra | Sierra |

American — Male

| Voice ID | Name |
|---|---|
| maverick | Maverick |
| brooks | Brooks |
| hunter | Hunter |
| colton | Colton |
| wesley | Wesley |
| asher | Asher |
asherAsher

Need a voice not in this list? Use the standard Lightning v3.1 catalog (217 voices, more languages, voice cloning). Pass "model": "lightning_v3.1" (or omit the field) instead of lightning_v3.1_pro.


API Reference

Endpoints

| Endpoint | Method | Use Case |
|---|---|---|
| https://api.smallest.ai/waves/v1/tts | POST | Synchronous synthesis |
| https://api.smallest.ai/waves/v1/tts/live | POST (SSE) | Server-sent events streaming |
| wss://api.smallest.ai/waves/v1/tts/live | WebSocket | Real-time streaming |

Request Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Yes | | Text to synthesize |
| voice_id | string | Yes | | Voice identifier (Pro catalog above) |
| model | string | No | lightning_v3.1 | Pass lightning_v3.1_pro to route to the Pro pool. The field is optional, but the default routes to standard Lightning v3.1, so for Pro you must set it explicitly. |
| sample_rate | integer | No | 44100 | Output sample rate (Hz) |
| speed | float | No | 1.0 | Speech speed (0.5–2.0) |
| language | string | No | "auto" | en, hi, or auto (per voice’s tags.language) |
| output_format | string | No | "pcm" | pcm, mp3, wav, ulaw, alaw |
| pronunciation_dicts | array | No | | List of pronunciation dictionary IDs; works on REST sync, SSE, and WebSocket |

Technical Specifications

Audio Output

| Specification | Details |
|---|---|
| Native sample rate | 44,100 Hz |
| Supported sample rates | 8,000 / 16,000 / 24,000 / 44,100 Hz |
| Output formats | PCM, MP3, WAV, ulaw, alaw |
| Audio channels | Mono |
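The parameter ranges under Request Parameters and the sample rates and formats listed under Audio Output can be checked client-side before a request is sent. The sketch below is one way to do that; `build_payload` is a hypothetical helper, and whether the API itself rejects out-of-range values is an assumption this page does not confirm.

```python
SUPPORTED_SAMPLE_RATES = {8000, 16000, 24000, 44100}       # from Audio Output
OUTPUT_FORMATS = {"pcm", "mp3", "wav", "ulaw", "alaw"}      # from Audio Output
LANGUAGES = {"en", "hi", "auto"}                            # from Request Parameters
SPEED_RANGE = (0.5, 2.0)                                    # from Request Parameters


def build_payload(text, voice_id, *, model="lightning_v3.1_pro",
                  sample_rate=44100, speed=1.0, language="auto",
                  output_format="pcm"):
    """Return a request body dict, or raise ValueError listing every problem."""
    errors = []
    if not text:
        errors.append("text is required")
    if not voice_id:
        errors.append("voice_id is required")
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        errors.append(f"unsupported sample_rate: {sample_rate}")
    if not SPEED_RANGE[0] <= speed <= SPEED_RANGE[1]:
        errors.append(f"speed must be between 0.5 and 2.0, got {speed}")
    if language not in LANGUAGES:
        errors.append(f"unsupported language: {language}")
    if output_format not in OUTPUT_FORMATS:
        errors.append(f"unsupported output_format: {output_format}")
    if errors:
        raise ValueError("; ".join(errors))
    return {
        "text": text,
        "voice_id": voice_id,
        "model": model,  # set explicitly; the API default is lightning_v3.1
        "sample_rate": sample_rate,
        "speed": speed,
        "language": language,
        "output_format": output_format,
    }
```

The helper defaults `model` to `lightning_v3.1_pro` rather than mirroring the API default, since omitting the field routes to standard Lightning v3.1. Collecting all errors before raising makes it easier to fix a bad configuration in one pass.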

Text Formatting Guidelines

| Aspect | Recommendation |
|---|---|
| Language scripts | Use native script for each language. English in Latin script; Hindi in Devanagari |
| Break points | Natural punctuation (. ! ? ,) |
| Mixed language | Use native script per language; avoid transliteration |

Hardware

  • Recommended GPU: NVIDIA L40S
  • Recommended VRAM: 48 GB

Software

  • Server regions (AWS): India (Hyderabad), USA (Oregon)
  • Automatic geo-location based routing for lowest latency

Best Practices

Voice ID + model pairing

Pair Pro voice IDs above with "model": "lightning_v3.1_pro". The API does not currently reject mismatched pairings, but pairing a Pro voice with "model": "lightning_v3.1" (or omitting model) can produce wrong or hallucinated audio. Server-side validation is on the roadmap.
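Until server-side validation exists, a mismatched pairing can be caught on the client. The following is a minimal sketch of such a guard, assuming the Pro voice IDs are exactly those listed in the Voice Catalog on this page; `check_pairing` is a hypothetical helper, not part of any SDK.

```python
PRO_MODEL = "lightning_v3.1_pro"
STANDARD_MODEL = "lightning_v3.1"  # what the API uses when model is omitted

# Voice IDs copied from the Voice Catalog section of this page.
PRO_VOICE_IDS = frozenset({
    # Indian
    "rhea", "zariya", "kareena", "mishka", "inaaya", "saira", "meher", "aarini",
    "aviraj", "vyom", "zoravar", "reyansh", "ahan",
    # British
    "cressida", "elowen", "ottilie", "seraphina", "tabitha", "arabella",
    "benedict", "cormac", "everett", "finley", "rupert", "winston", "caspian",
    # American
    "willow", "autumn", "skylar", "savannah", "kennedy", "reagan", "sierra",
    "maverick", "brooks", "hunter", "colton", "wesley", "asher",
})


def check_pairing(voice_id, model=None):
    """Return the effective model, or raise if a Pro voice would hit the standard pool."""
    effective_model = model or STANDARD_MODEL
    if voice_id in PRO_VOICE_IDS and effective_model != PRO_MODEL:
        raise ValueError(
            f"voice '{voice_id}' is a Pro voice; set model to '{PRO_MODEL}' "
            f"(got '{effective_model}')"
        )
    return effective_model
```

The guard only covers the case described above, a Pro voice sent with the standard model or with no model at all. It deliberately does not reject other voice IDs, because this page does not say how non-Pro voices behave on the Pro pool.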

Language selection

Each voice’s supported languages live in tags.language on the voice catalog. Passing a language outside that list is accepted by the API but produces English-pronounced output, since the voice wasn’t trained on it. Pick a voice whose tags.language matches your target language.
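Because an unsupported language is accepted silently, it can help to check the requested language against the voice's metadata before sending. The sketch below assumes a voice record shaped like `{"voice_id": ..., "tags": {"language": [...]}}`; that shape is an assumption based on the `tags.language` field named above, and treating `"auto"` as always acceptable is also an assumption.

```python
def check_language(voice, language="auto"):
    """Return `language` if the voice supports it, otherwise raise ValueError.

    `voice` is assumed to be a dict with a `tags.language` list,
    e.g. {"voice_id": "meher", "tags": {"language": ["en", "hi"]}}.
    """
    supported = voice.get("tags", {}).get("language", [])
    # Assumption: "auto" (the API default) is valid for every voice.
    if language == "auto" or language in supported:
        return language
    raise ValueError(
        f"voice '{voice.get('voice_id')}' does not list '{language}' "
        f"in tags.language ({sorted(supported)})"
    )
```

With a British voice record whose `tags.language` is `["en"]`, `check_language(voice, "hi")` raises instead of letting the API return English-pronounced output for Hindi text.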

Text Formatting

  • Chunk boundaries. Segment input at natural prosodic boundaries (. ! ? ,). The maximum chunk size is 250 characters; throughput is optimal at 140 characters per request.
  • Script integrity. Use native script for each language. Mixed-script input within a single language token produces unpredictable phoneme mappings.
  • Lexicon overrides. Use pronunciation dictionaries for domain-specific terms, brand names, and acronyms where default grapheme-to-phoneme conversion is insufficient.
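The chunk-boundary guidance above can be sketched as a small splitter that breaks text after `. ! ? ,`, packs the pieces greedily toward 140 characters, and never emits a chunk longer than 250 characters. This is an illustrative sketch only: `chunk_text` is a hypothetical helper, and the platform's own chunking logic may differ.

```python
import re

MAX_CHARS = 250     # maximum chunk size stated above
TARGET_CHARS = 140  # size at which throughput is stated to be optimal


def _split_long(piece, limit):
    """Split an over-long piece at spaces; hard-cut any single over-long token."""
    parts, current = [], ""
    for word in piece.split():
        while len(word) > limit:
            if current:
                parts.append(current)
                current = ""
            parts.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            parts.append(current)
            current = word
    if current:
        parts.append(current)
    return parts


def chunk_text(text, target=TARGET_CHARS, limit=MAX_CHARS):
    """Split text at punctuation boundaries into chunks of at most `limit` chars."""
    pieces = re.split(r"(?<=[.!?,])\s+", text.strip())
    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        parts = [piece] if len(piece) <= limit else _split_long(piece, limit)
        for part in parts:
            candidate = f"{current} {part}" if current else part
            if len(candidate) <= target:
                current = candidate      # keep packing toward the target size
            else:
                if current:
                    chunks.append(current)
                current = part           # start a new chunk at this boundary
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as the `text` of its own request. A clause between 140 and 250 characters is kept whole rather than split mid-phrase, which trades a little throughput for cleaner prosodic boundaries.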

For comprehensive text formatting rules (numeric handling, date/time, symbols, chunking logic), see TTS Best Practices.


Use Cases

Direct Use

  • Voice assistants and conversational AI
  • Interactive chatbots with voice output
  • Real-time narration and live streaming
  • Accessibility tools and screen readers
  • Customer service automation

Downstream Use

  • Multi-turn conversational agents
  • Audio content generation pipelines
  • Telephony and IVR systems
  • Podcast generation

Limitations & Safety

Known Limitations

  • No voice cloning. Voice cloning is not available on the Pro pool. Clones continue to use standard Lightning v3.1 and the existing voice-cloning flow.

Lightning v3.1 Pro must not be used for impersonation or fraud, generating deceptive audio content (deepfakes), creating content that violates consent or privacy, harassment or abuse, or any illegal or unethical purposes.

Safety & Compliance

  • No retention of synthesized audio
  • Usage monitoring for policy compliance

For compliance documentation (GDPR, SOC2, HIPAA), contact support@smallest.ai.


| Channel | Details |
|---|---|
| Support | support@smallest.ai |
| Documentation | docs.smallest.ai/waves |
| Console | app.smallest.ai |
| Community | Discord |