Lightning v3.1

Latest Release

Lightning v3.1 is a high-fidelity, low-latency text-to-speech model that delivers natural, expressive, and realistic speech at 44.1 kHz. Optimized for real-time applications with ultra-low latency and voice cloning support, it produces broadcast-quality audio with genuinely conversational characteristics. Now with 15 languages, automatic language detection, and code-switching.

  • 44.1 kHz: native sample rate
  • 100 ms: latency at 20 concurrent requests
  • 15 languages: auto-detection + code-switching
  • 3.3x: real-time factor (faster than playback)

Model Overview

| Attribute | Details |
| --- | --- |
| Developed by | Smallest AI |
| Model type | Text-to-Speech / Speech Synthesis |
| Languages | 15 (auto-detection + code-switching) |
| License | Proprietary |
| Version | v3.1 |
| Native sample rate | 44,100 Hz |

Key Capabilities

Real-Time Optimized

Ultra-low latency architecture designed for conversational AI and live streaming.

Voice Cloning

Instant voice cloning with just 5-15 seconds of audio via API and console.

Streaming

HTTP, SSE, and WebSocket support for real-time applications.

Multi-Language

15 languages with automatic detection and code-switching. No restarts or reconnections needed.

High Fidelity

Broadcast-quality 44.1 kHz audio with natural prosody, intonation, and conversational rhythm.

Pronunciation Control

Custom pronunciation dictionaries for specialized vocabulary, brand names, and domain-specific terms.


Performance & Benchmarks

Audio Generation Evaluation

Full-sentence audio generation. The entire text is synthesized in a single pass and then evaluated.

Evaluation: Seed TTS dataset, 1,088 English samples. LLM-as-a-Judge framework.

| Category | Metric | Score | Notes |
| --- | --- | --- | --- |
| Audio Quality | WVMOS | 5.06 | Broadcast-quality audio |
| | Naturalness | 4.33 | Predominantly human-like |
| | Overall Quality | 4.42 | Premium-tier experience |
| | Native Sample Rate | 44.1 kHz | Highest fidelity among Lightning models |
| Intelligibility | Word Error Rate (WER) | 6.3% | 93.7% word accuracy |
| | Character Error Rate (CER) | 1.6% | Excellent character-level accuracy |
| Latency & Speed | Latency | 100 ms | At 20 concurrent requests |
| | Real-Time Factor (RTF) | 0.3 | 3.3x faster than playback |
| | Speed Control | 0.5x - 2.0x | Adjustable playback speed |
| | Max Chunk Size | 250 chars | Optimal: 140 characters per request |
| Prosody | Pronunciation | 4.70 / 5.0 | Near-perfect articulation |
| | Intonation | 4.71 / 5.0 | Highly expressive pitch variation |
| | Prosody | 4.47 / 5.0 | Natural conversational rhythm |
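The latency rows state the same speed figure two ways: a real-time factor of 0.3 and a 3.3x speedup over playback. A minimal sketch of the arithmetic:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis time / audio duration; values below 1 are
    faster than real time, and the playback speedup is 1 / RTF."""
    return synthesis_seconds / audio_seconds

rtf = real_time_factor(3.0, 10.0)  # 10 s of audio synthesized in 3 s
speedup = 1 / rtf                  # ~3.3x faster than playback
```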

Agent Call Evaluation

Chunk-by-chunk audio generation. Simulates real-world voice agent behavior where text is streamed and synthesized incrementally, as it happens during live calls.

Evaluation: Seed TTS dataset, 1,088 English samples. LLM-as-a-Judge framework.

| Category | Metric | Score | Notes |
| --- | --- | --- | --- |
| Audio Quality | MOS | 3.89 | Highest among all models tested |
| | Audio Quality | 3.80 | Broadcast-quality audio |
| | Overall Naturalness | 3.33 | Most natural-sounding output |
| | Naturalness | 2.67 | |
| Intelligibility | Word Error Rate (WER) | 5.38% | 94.6% word accuracy |
| | Character Error Rate (CER) | 1.54% | Excellent character-level accuracy |
| Prosody | Pronunciation | 3.80 | Near-perfect articulation |
| | Intonation | 3.33 | Expressive pitch variation |
| | Prosody | 3.07 | Natural conversational rhythm |

Want to reproduce these results? See the TTS evaluation script to measure TTFB and synthesis quality in your own environment.


Supported Languages

Automatic Language Detection & Code-Switching: Set language to "auto" (default) and Lightning v3.1 will automatically detect the language from input text. The model also supports code-switching within a single session without requiring a restart or reconnection.

| Language | Code | Status |
| --- | --- | --- |
| English | en | Available |
| Spanish | es | Available |
| Hindi | hi | Available |
| Tamil | ta | Available |
| Kannada | kn | Available |
| Telugu | te | Available |
| Malayalam | ml | Available |
| Marathi | mr | Available |
| Gujarati | gu | Available |
| French | fr | Available (Beta) |
| Italian | it | Available (Beta) |
| Dutch | nl | Available (Beta) |
| Swedish | sv | Available (Beta) |
| Portuguese | pt | Available (Beta) |
| German | de | Available (Beta) |

Voice Catalog

English (US) — Best Voices

| Voice ID | Name | Gender |
| --- | --- | --- |
| quinn | Quinn | Female |
| mia | Mia | Female |
| magnus | Magnus | Male |
| olivia | Olivia | Female |
| daniel | Daniel | Male |
| rachel | Rachel | Female |
| nicole | Nicole | Female |
| elizabeth | Elizabeth | Female |

Hindi / English — Best Voices

| Voice ID | Name | Gender |
| --- | --- | --- |
| neel | Neel | Male |
| maithili | Maithili | Female |
| devansh | Devansh | Male |
| sameera | Sameera | Female |
| mihir | Mihir | Male |
| aarush | Aarush | Male |
| sakshi | Sakshi | Female |
| vivaan | Vivaan | Male |
| srishti | Srishti | Female |

Spanish — Best Voices

| Voice ID | Name | Gender |
| --- | --- | --- |
| daniella | Daniella | Female |
| sandra | Sandra | Female |
| carlos | Carlos | Male |
| jose | Jose | Male |
| luis | Luis | Male |
| mariana | Mariana | Female |
| miguel | Miguel | Male |

Other Indian Languages — Best Voices

| Language | Voice ID | Name | Gender |
| --- | --- | --- | --- |
| Tamil | jeevan | Jeevan | Male |
| Tamil | rajeshwari | Rajeshwari | Female |
| Malayalam | vaisakh | Vaisakh | Male |
| Malayalam | shibi | Shibi | Female |
| Telugu | srihari | Srihari | Male |
| Telugu | padmaja | Padmaja | Female |
| Marathi | rupali | Rupali | Female |
| Marathi | nilesh | Nilesh | Male |
| Gujarati | niharika | Niharika | Female |
| Gujarati | dhruvit | Dhruvit | Male |
| Kannada | deepashri | Deepashri | Female |
| Kannada | pranav | Pranav | Male |

Voice Cloning

Instant Voice Cloning

Audio required: 5-15 seconds

Self-serve voice cloning available via API and console. Captures core voice characteristics for quick replication.


API Reference

Endpoints

| Endpoint | Method | Use Case |
| --- | --- | --- |
| https://api.smallest.ai/waves/v1/lightning-v3.1/get_speech | POST | Synchronous synthesis |
| https://api.smallest.ai/waves/v1/lightning-v3.1/stream | POST (SSE) | Server-sent events streaming |
| wss://api.smallest.ai/waves/v1/lightning-v3.1/get_speech/stream | WebSocket | Real-time streaming |

Request Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| text | string | Yes | | Text to synthesize |
| voice_id | string | Yes | | Voice identifier |
| sample_rate | integer | No | 44100 | Output sample rate (Hz) |
| speed | float | No | 1.0 | Speech speed (0.5-2.0) |
| language | string | No | "auto" | Language code or "auto" for automatic detection |
| output_format | string | No | "pcm" | Audio format |
| pronunciation_dicts | array | No | | Custom pronunciation IDs (WebSocket only) |
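A minimal synchronous request built from the parameters above might look like the following sketch. The Bearer-token auth header and the WAV output format are assumptions; check the console for the exact authentication scheme.

```python
# Sketch of a synchronous synthesis call using only stdlib modules.
import json
import urllib.request

API_URL = "https://api.smallest.ai/waves/v1/lightning-v3.1/get_speech"

def build_request(text, voice_id, api_key, sample_rate=44100,
                  speed=1.0, language="auto", output_format="wav"):
    """Assemble the POST request from the documented parameters."""
    payload = {
        "text": text,
        "voice_id": voice_id,
        "sample_rate": sample_rate,
        "speed": speed,
        "language": language,
        "output_format": output_format,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request("Hello from Lightning.", "quinn", "YOUR_API_KEY")
# audio = urllib.request.urlopen(req).read()  # response body is audio data
```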

Technical Specifications

Audio Output

| Specification | Details |
| --- | --- |
| Native sample rate | 44,100 Hz |
| Supported sample rates | 8,000 / 16,000 / 24,000 / 44,100 Hz |
| Output formats | PCM, MP3, WAV, mulaw |
| Audio channels | Mono |

Text Formatting Guidelines

| Aspect | Recommendation |
| --- | --- |
| Language scripts | Use native script for each language: Latin for English/Spanish/French/Italian/Dutch/Swedish/Portuguese/German, Devanagari for Hindi/Marathi/Gujarati, and their native scripts for Tamil/Kannada/Telugu/Malayalam |
| Break points | Natural punctuation (. ! ? ,) |
| Mixed language | Avoid transliteration; use native script for each language |

Number & Date Handling

| Type | Format |
| --- | --- |
| Phone numbers | Default 3-4-3 grouping |
| Dates | DD/MM/YYYY or DD-MM-YYYY |
| Time | HH:MM or HH:MM:SS |
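If you want to pre-format numbers explicitly rather than rely on the default, the 3-4-3 grouping can be reproduced client-side. A sketch; `format_phone` is an illustrative helper, not part of the API:

```python
def format_phone(number: str) -> str:
    """Space a 10-digit phone number into the default 3-4-3 grouping
    so TTS reads the digits in natural chunks."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) != 10:
        raise ValueError("expected a 10-digit number")
    return f"{digits[:3]} {digits[3:7]} {digits[7:]}"

format_phone("(987) 654-3210")  # "987 6543 210"
```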

Hardware

  • Recommended GPU: NVIDIA L40S
  • Recommended VRAM: 48 GB

Software

  • Server regions (AWS): India (Hyderabad), USA (Oregon)
  • Automatic geo-location based routing for lowest latency

Best Practices

Code-Switching

Lightning v3.1 supports real-time intra-session language switching via two mutually exclusive language groups. Each group shares a unified phoneme space, enabling seamless mid-utterance transitions between member languages without session re-initialization. Cross-group switching is not supported within a single session.

Language Groups

Indic Group. Optimized for South Asian language pairs with English as the bridging language.

| Language | Code |
| --- | --- |
| English | en |
| Hindi | hi |
| Tamil | ta |
| Telugu | te |
| Malayalam | ml |
| Kannada | kn |
| Marathi | mr |
| Gujarati | gu |

Global Group. Optimized for European language pairs with English and Hindi as bridging languages.

| Language | Code |
| --- | --- |
| English | en |
| Hindi | hi |
| Spanish | es |
| French | fr |
| Italian | it |
| Portuguese | pt |
| German | de |
| Dutch | nl |
| Swedish | sv |

Intra-group switching is unrestricted. Any language within the same group can be interleaved at the token level. Cross-group switching (e.g., Tamil from Indic + French from Global) is architecturally unsupported and will produce undefined behavior.

en and hi exist in both groups. All other languages are exclusive to one group. The group is determined at session initialization based on the first non-shared language encountered. Design your session’s language set accordingly.

Routing Examples

  • Hindi <-> Tamil interleaving (Indic group): valid, all languages within the Indic group
  • Spanish <-> French interleaving (Global group): valid, all languages within the Global group
  • Tamil (Indic) + French (Global): invalid, cross-group switching is unsupported
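These routing rules can be enforced client-side before a session is opened. A minimal sketch, with group membership taken from the tables above; the helper name is illustrative:

```python
INDIC = {"en", "hi", "ta", "te", "ml", "kn", "mr", "gu"}
GLOBAL = {"en", "hi", "es", "fr", "it", "pt", "de", "nl", "sv"}

def resolve_group(langs):
    """Return the single group that can serve every requested language,
    or raise if the set spans both groups.

    Note: a set containing only the shared languages (en, hi) fits either
    group; this sketch resolves it to Indic, whereas the API determines
    the group from the first non-shared language encountered."""
    langs = set(langs)
    if langs <= INDIC:
        return "indic"
    if langs <= GLOBAL:
        return "global"
    raise ValueError(f"cross-group switching unsupported: {sorted(langs)}")
```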

Voice Cloning

Reference Audio

  • Environment. Record in a quiet room with no background noise, hiss, or rumble. Ambient sound is captured in the clone and cannot be removed after the fact.
  • Speaking style. Speak naturally in your normal conversational voice. The model captures timbre, accent, emotional tone, rhythm, and pacing automatically. Do not exaggerate unless a specific tone is intended.
  • Audio length. Provide 5 to 15 seconds of clean, continuous speech.

Multi-Lingual Cloning

  • Language matching. For best results, record reference audio in the same language as your intended output. Cross-lingual cloning is supported (e.g., English reference used for Spanish output), but a language-matched reference produces higher fidelity.
  • Accent retention. When synthesizing in a different language than the reference, the original accent is preserved. A clone from a South Indian English speaker will retain that accent in Hindi or Tamil output. This is by design: the clone reproduces your voice, including accent characteristics. For accent-neutral output in a specific language, provide reference audio from a native speaker of that language.
  • Script encoding. Input text must use native script for each language (Devanagari for Hindi/Marathi/Gujarati, respective Brahmic scripts for Dravidian languages, Latin for European languages). Transliterated input degrades synthesis quality.
  • Group constraint. Cloned voices follow the same language group routing rules. A session initialized in the Indic group cannot switch to Global-exclusive languages, regardless of the voice’s source language.

For detailed recording examples and expressive cloning techniques, see Voice Cloning Best Practices.

Text Formatting

  • Chunk boundaries. Segment input at natural prosodic boundaries (. ! ? ,). Maximum chunk size is 250 characters; optimal throughput at 140 characters per request.
  • Script integrity. Avoid transliteration. Use native script for each language. Mixed-script input within a single language token produces unpredictable phoneme mappings.
  • Numeric normalization. Use standard formats (DD/MM/YYYY, HH:MM). Phone numbers default to 3-4-3 digit grouping.
  • Lexicon overrides. Use pronunciation dictionaries for domain-specific terms, brand names, and acronyms where default grapheme-to-phoneme conversion is insufficient.
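The chunking guidance above can be sketched as a greedy splitter that packs clauses toward the optimal size without crossing the hard limit (illustrative, not an official SDK helper):

```python
import re

MAX_CHUNK = 250     # hard limit per request (characters)
TARGET_CHUNK = 140  # optimal throughput

def chunk_text(text, target=TARGET_CHUNK, limit=MAX_CHUNK):
    """Split text at natural prosodic boundaries (. ! ? ,) and greedily
    pack the pieces into chunks near the target size, never exceeding
    the hard limit. Assumes individual sentences fit within the limit;
    split longer sentences upstream."""
    pieces = re.split(r"(?<=[.!?,])\s+", text.strip())
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and (len(candidate) > limit or len(current) >= target):
            chunks.append(current)  # flush and start a new chunk
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```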

For comprehensive text formatting rules (numeric handling, date/time, symbols, chunking logic), see TTS Best Practices.


Use Cases

Direct Use

  • Voice assistants and conversational AI
  • Interactive chatbots with voice output
  • Real-time narration and live streaming
  • Accessibility tools and screen readers
  • Gaming (dynamic character voices)
  • Customer service automation

Downstream Use

  • Multi-turn conversational agents
  • Audio content generation pipelines
  • Telephony and IVR systems
  • Podcast and audiobook generation

Limitations & Safety

Known Limitations

  • Transliterated text (a language written outside its native script) may produce suboptimal results. Hindi should be written in Devanagari, not Latin script (e.g., "namaste" spelled in Devanagari), and English in Latin script, not Devanagari. Each language should use its native script.

Recommendations: Use proper script for each language. Break long text at natural punctuation points. Use pronunciation dictionaries for specialized vocabulary. Test voice selection for your specific use case.

Lightning v3.1 must not be used for impersonation or fraud, generating deceptive audio content (deepfakes), creating content that violates consent or privacy, harassment or abuse, or any illegal or unethical purposes.

Safety & Compliance

  • Voice cloning requires explicit consent
  • No retention of synthesized audio
  • No storage of personal voice data beyond cloning scope
  • Usage monitoring for policy compliance

For compliance documentation (GDPR, SOC2, HIPAA), contact support@smallest.ai.


| Channel | Details |
| --- | --- |
| Support | support@smallest.ai |
| Documentation | docs.smallest.ai/waves |
| Console | app.smallest.ai |
| Community | Discord |