Performance

Head-to-head listener evaluation against production TTS systems on the EmergentTTS benchmark: 1,088 samples scored by the LLM-as-a-Judge framework. The tables below compare Lightning v3.1 (Standard) and Lightning v3.1 Pro with the same competitor set across the Naturalness, Expressiveness, Delivery, Accuracy, and MOS v2 categories. Each table is followed by a short description of what each metric measures; see the Metrics Overview for full definitions.

Latency

Time-to-First-Byte (TTFB)

TTFB measures the wall-clock delay between sending the synthesis request and the first audio byte arriving on the wire. Lower is better for real-time and conversational use cases.

| Model | TTFB | Conditions |
| --- | --- | --- |
| Lightning v3.1 (Standard) | ~200 ms | 40 concurrent requests, WebSocket streaming |
| Lightning v3.1 Pro | ~200 ms | 40 concurrent requests, WebSocket streaming, dedicated Pro pool |
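
The TTFB definition above can be sketched as a small timing helper. This is a minimal illustration, not the benchmark harness: `measure_ttfb` and `fake_stream` are hypothetical names, and `fake_stream` stands in for whatever streaming client you actually use. It only assumes that the client yields audio as an iterable of byte chunks.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttfb(open_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, int]:
    """Return (ttfb_seconds, total_bytes) for one streamed synthesis call.

    The clock starts before the request is opened, so connection setup and
    request sending are included, matching the wall-clock definition.
    """
    start = time.perf_counter()
    ttfb = None
    total = 0
    for chunk in open_stream():
        if chunk and ttfb is None:
            # First non-empty chunk: this is the first audio byte.
            ttfb = time.perf_counter() - start
        total += len(chunk)
    if ttfb is None:
        raise RuntimeError("stream ended without producing any audio bytes")
    return ttfb, total


def fake_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for a real streaming client (assumption)."""
    time.sleep(0.05)       # simulated time until the first chunk
    yield b"\x00" * 320
    time.sleep(0.01)       # simulated gap between chunks
    yield b"\x00" * 320


ttfb_seconds, total_bytes = measure_ttfb(fake_stream)
print(f"TTFB: {ttfb_seconds * 1000:.0f} ms, {total_bytes} bytes received")
```

To compare against the figures in the table, the same helper would need to run under matching conditions (for example, 40 concurrent requests over WebSocket streaming); a single sequential call usually gives a more optimistic number.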

Real-Time Factor (RTF)

RTF = Audio Duration ÷ Processing Time. Values above 1.0 mean the model produces audio faster than playback. Both Standard and Pro run at 3.3× real-time on NVIDIA L40S, so a 10-second utterance is fully synthesized in ~3 seconds.
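
As a quick check of the arithmetic above, the formula can be written out directly. The function names here are illustrative, not part of any SDK.

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """RTF = audio duration / processing time (higher means faster)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Processing time implied by a given RTF for an utterance of known length."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


# A 10-second utterance at 3.3x real-time takes about 3 seconds to synthesize.
print(f"{synthesis_time(10.0, 3.3):.2f} s")
```

Note that some tools define RTF the other way round (processing time divided by audio duration, where lower is better), so check the convention before comparing numbers across sources.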

Head-to-head listener ratings (Lightning v3.1 Standard)

Direct head-to-head ratings on the EmergentTTS benchmark. Lightning Wins % is the share of samples where listeners preferred Lightning v3.1 over the competitor; Ties % is the share where both were rated equal; Competitor Wins % is the share where listeners preferred the competitor. Each competitor column sums to 100%, up to rounding.

| EmergentTTS | GPT-4o-mini (OpenAI) | Turbo v2.5 (ElevenLabs) | Multilingual v2 (ElevenLabs) | Sonic-3 (Cartesia) | Gemini 2.5 Pro (Google) | MAI-Voice-1 (Microsoft) | Inworld 1.5 (Inworld) | S2 Pro (Fish Audio) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Lightning Wins % (higher better) | 40.26% | 50.28% | 54.41% | 68.29% | 58.43% | 57.17% | 54.41% | 64.25% |
| Ties % | 24.17% | 25.00% | 23.81% | 17.00% | 8.29% | 17.00% | 18.11% | 13.60% |
| Competitor Wins % (lower better) | 35.57% | 24.72% | 21.78% | 14.71% | 33.27% | 25.83% | 27.48% | 22.15% |
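
The three shares in each column can be derived from per-sample preference labels. The sketch below assumes a hypothetical label format (`"lightning"`, `"tie"`, `"competitor"`) and uses toy data; it is not the evaluation pipeline itself.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical per-sample labels; the real evaluation format may differ.
OUTCOMES = ("lightning", "tie", "competitor")


def preference_shares(outcomes: Iterable[str]) -> Dict[str, float]:
    """Convert per-sample preference labels into percentage shares."""
    counts = Counter(outcomes)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no outcomes provided")
    return {label: 100.0 * counts[label] / total for label in OUTCOMES}


# Toy data: 5 Lightning wins, 3 ties, 2 competitor wins out of 10 samples.
toy = ["lightning"] * 5 + ["tie"] * 3 + ["competitor"] * 2
shares = preference_shares(toy)
print(shares)

# Sanity check on a published column (GPT-4o-mini): the shares sum to ~100%.
assert abs((40.26 + 24.17 + 35.57) - 100.0) < 0.05
```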

Naturalness — higher is better

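| Metric | Lightning v3.1 | Lightning v3.1 Pro | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Overall | 3.25 | 3.16 | 3.13 | 3.16 | 3.17 | 3.20 | 3.07 | 3.28 | 3.17 | 3.06 | 3.02 |
| Naturalness | 2.61 | 2.55 | 2.41 | 2.52 | 2.55 | 2.57 | 2.42 | 2.58 | 2.57 | 2.41 | 2.37 |
| Intonation | 3.22 | 3.06 | 3.06 | 3.07 | 3.06 | 3.12 | 2.90 | 3.28 | 3.04 | 2.91 | 2.86 |
| Prosody | 3.01 | 2.81 | 2.73 | 2.82 | 2.86 | 2.83 | 2.65 | 3.09 | 2.76 | 2.61 | 2.58 |
| Pronunciation* | 3.63 | NA | 3.67 | 3.64 | 3.65 | 3.67 | 3.67 | NA | 3.68 | 3.68 | 3.57 |
| Audio Quality | 3.76 | NA | 3.78 | 3.77 | 3.75 | 3.81 | 3.73 | NA | 3.79 | 3.70 | 3.75 |
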
What each Naturalness metric measures
  • Overall — Holistic listener rating of how natural the voice sounds end-to-end.
  • Naturalness — How human-like the voice sounds; penalizes robotic or synthetic quality.
  • Intonation — Whether pitch rises and falls appropriately for the sentence type (question, statement, exclamation).
  • Prosody — The broader umbrella of rhythm, stress, and melody: how well the voice “reads” the sentence as a human would.
  • Pronunciation — Whether individual words are phonetically correct, especially names, loanwords, and domain-specific terms.
  • Audio Quality — Technical cleanliness of the output; absence of artifacts, distortion, clipping, or background noise.
*Listener-rated Pronunciation and Audio Quality columns were measured only on the Standard evaluation; Pro’s Whisper-judged Pronunciation % appears under Accuracy below.

Expressiveness — higher is better

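| Metric | Lightning v3.1 | Lightning v3.1 Pro | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Overall | 3.45 | 3.55 | 3.45 | 3.44 | 3.46 | 3.38 | 3.49 | 3.54 | 3.50 | 3.37 | 3.41 |
| Paralinguistics | 3.61 | 3.64 | 3.60 | 3.59 | 3.61 | 3.56 | 3.60 | 3.64 | 3.58 | 3.55 | 3.58 |
| Emotions | 3.29 | 3.47 | 3.30 | 3.28 | 3.31 | 3.19 | 3.38 | 3.44 | 3.41 | 3.19 | 3.23 |
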
What each Expressiveness metric measures
  • Overall — Holistic listener rating of how expressive the voice sounds given the context of the sentence.
  • Paralinguistics — Non-verbal vocal elements like laughter, sighs, or filler sounds (“um”, “uh”) and whether they’re rendered appropriately.
  • Emotions — How accurately the voice conveys the intended emotional tone (neutral, warm, urgent, etc.).

Delivery — higher is better

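| Metric | Lightning v3.1 | Lightning v3.1 Pro | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Boundary Consistency | 4.94 | 4.96 | 4.94 | 4.93 | 4.95 | 4.93 | 4.88 | 4.99 | 4.77 | 4.90 | 4.88 |
| Pronunciation Style | 4.94 | 4.98 | 4.96 | 4.95 | 4.96 | 4.96 | 4.93 | 4.99 | 4.91 | 4.94 | 4.89 |
| Natural Pace | 4.47 | 4.72 | 4.57 | 4.51 | 4.51 | 4.01 | 4.23 | 4.66 | 4.47 | 4.33 | 3.74 |
| Pause Placement | 4.46 | 4.66 | 4.54 | 4.49 | 4.51 | 4.28 | 4.34 | 4.59 | 4.41 | 4.38 | 4.09 |
| Breathing Naturalness | 3.82 | 3.82 | 3.06 | 3.14 | 3.14 | 2.79 | 2.88 | 3.43 | 3.28 | 2.77 | 2.42 |
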
What each Delivery metric measures
  • Boundary Consistency — Whether phrase and sentence boundaries are marked consistently with pauses or pitch shifts, without arbitrary breaks mid-phrase.
  • Pronunciation Style — Not just correctness, but stylistic choices such as formal vs. casual register, regional accent consistency, and honorific handling.
  • Natural Pace — Whether the speaking rate feels comfortable and appropriate for the content type, neither rushed nor dragging.
  • Pause Placement — Whether silences appear at semantically correct points (after commas, between clauses) rather than mid-word or mid-phrase.
  • Breathing Naturalness — Whether breath sounds occur at realistic points and with realistic frequency, not absent entirely or inserted randomly.

Accuracy

Mixed direction: for WER, CER, Hallucination, and Deletion, lower is better; for Pronunciation %, higher is better.

Whisper jiwer

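| Metric | Direction | Lightning v3.1 | Lightning v3.1 Pro | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WER | lower | 1.57% | 1.36% | 1.26% | 1.35% | 1.33% | 1.43% | 1.26% | 1.37% | 1.25% | 1.10% | 2.83% |
| CER | lower | 0.67% | 0.40% | 0.52% | 0.60% | 0.54% | 0.59% | 0.62% | 0.61% | 0.50% | 0.47% | 1.16% |
| Hallucination | lower | 0.03% | 0.00% | 0.07% | 0.08% | 0.01% | 0.06% | 0.04% | 0.01% | 0.06% | 0.00% | 0.22% |
| Deletion | lower | NA | 0.00% | 0.14% | 0.17% | 0.18% | 0.16% | 0.24% | 0.18% | 0.15% | 0.12% | 0.33% |
| Pronunciation % (Whisper jiwer) | higher | 98.61% | 98.68% | 98.94% | 98.90% | 98.87% | 98.79% | 99.02% | 98.82% | 98.95% | 99.02% | 97.72% |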

Whisper LLM (Pro evaluation only)

This methodology scores Whisper transcripts with an LLM judge and was applied during the Pro benchmark run. The follow-on LLM normalizes punctuation, casing, and Whisper’s own transcription noise, which typically reduces false-positive errors compared to jiwer. Standard Lightning v3.1 was not evaluated with this methodology.

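| Metric | Direction | Lightning v3.1 Pro | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WER | lower | 0.96% | 0.82% | 0.72% | 0.57% | 0.88% | 0.70% | 0.72% | 0.60% | 0.55% | 2.15% |
| CER | lower | 0.34% | 0.30% | 0.28% | 0.21% | 0.30% | 0.35% | 0.33% | 0.23% | 0.18% | 1.03% |
| Hallucination | lower | 0.00% | 0.07% | 0.07% | 0.00% | 0.02% | 0.02% | 0.01% | 0.03% | 0.00% | 0.10% |
| Pronunciation % (Whisper LLM) | higher | 99.04% | 99.25% | 99.35% | 99.43% | 99.14% | 99.32% | 99.29% | 99.43% | 99.45% | 97.95% |
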
What each Accuracy metric measures
  • WER (Word Error Rate) — Percentage of words in the transcript that differ from the reference; measures how faithfully the TTS renders the input text.
  • CER (Character Error Rate) — Like WER but at the character level.
  • Hallucination — Words or sounds the TTS generates that have no basis in the input text, such as insertions, substitutions, or fabricated content.
  • Deletion — Words from the reference text that the TTS dropped entirely.
  • Pronunciation % — The proportion of words pronounced correctly out of total words.
  • Whisper jiwer vs Whisper LLM — Two judging methodologies. jiwer uses raw Whisper-decoded transcripts; LLM-judged uses a follow-on LLM to normalize transcription noise. Both report the same metric family; LLM-judged tends to give lower error rates by reducing false positives from punctuation/casing.
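
To make the WER and CER definitions concrete, here is a minimal pure-Python sketch based on Levenshtein edit distance. It is not the benchmark's scoring code: the benchmark uses jiwer and an LLM judge, whereas the `normalize` function below is a simple rule-based stand-in (an assumption for illustration) showing how stripping punctuation and casing removes false-positive errors.

```python
import re
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from reference
                            curr[j - 1] + 1,      # insertion into hypothesis
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def normalize(text: str) -> str:
    """Rule-based stand-in for LLM normalization: lowercase, drop punctuation."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())


reference = "Hello, Dr. Smith. Your order ships today."
transcript = "hello dr smith your order ships today"

raw_wer = wer(reference, transcript)                        # 5 of 7 words differ
clean_wer = wer(normalize(reference), normalize(transcript))  # no real errors
print(f"raw WER: {raw_wer:.2%}, normalized WER: {clean_wer:.2%}")
```

In this toy pair every word is spoken correctly, yet the raw comparison counts five of seven words as errors purely because of punctuation and casing. That gap is the kind of false positive the LLM-judged methodology is described as reducing.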

For Pronunciation and WER, the residual gap on Lightning v3.1 (Standard) is concentrated in proper-noun rendering. Use a pronunciation dictionary to pin names, brands, and acronyms; with the dictionary applied, both metrics close to parity.

MOS v2 — higher is better

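| Metric | Lightning v3.1 | Lightning v3.1 Pro | GPT-4o-mini | ElevenLabs Turbo v2.5 | ElevenLabs Multilingual v2 | Sonic-3 | Gemini 2.5 Pro | Gemini 2.5 Flash | MAI-Voice-1 | Inworld 1.5 | S2 Pro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mean MOS | NA | 4.22 | 4.16 | 3.98 | 4.02 | 3.76 | 4.11 | 4.24 | 3.97 | 3.73 | 3.99 |
| UTMOS | NA | 3.76 | 3.76 | 3.37 | 3.41 | 2.77 | 3.57 | 3.71 | 3.33 | 2.54 | 3.50 |
| WV-MOS | 4.71 | 5.05 | 4.55 | 4.60 | 4.63 | 4.76 | 4.65 | 4.76 | 4.62 | 4.91 | 4.48 |
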
What each MOS metric measures
  • Mean MOS — Mean Opinion Score: average listener rating on a 1–5 scale across the test set; the canonical aggregate quality metric in TTS evaluation.
  • UTMOS — A predicted MOS from the UTMOS reference model — an automated proxy for subjective quality.
  • WV-MOS — A predicted MOS from the WavLM-based WV-MOS reference model — another automated proxy commonly reported alongside UTMOS for cross-validation.
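
Mean MOS as defined above is a plain average of per-sample ratings. A minimal sketch follows, assuming ratings arrive as a list of numbers on the 1–5 scale; the function name and toy values are illustrative only. The range check applies to listener ratings, not to predicted scores such as UTMOS or WV-MOS, which come from models and are not computed this way.

```python
from statistics import mean
from typing import Iterable


def mean_opinion_score(ratings: Iterable[float]) -> float:
    """Average listener ratings on a 1-5 scale across a test set."""
    values = list(ratings)
    if not values:
        raise ValueError("at least one rating is required")
    out_of_range = [r for r in values if not 1 <= r <= 5]
    if out_of_range:
        raise ValueError(f"ratings outside the 1-5 scale: {out_of_range}")
    return mean(values)


# Toy ratings for four samples (not benchmark data).
print(mean_opinion_score([4, 5, 3, 4]))
```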

Want to reproduce these results? See the TTS evaluation script to measure TTFB and synthesis quality in your own environment.

Next Steps

  • Metrics Overview
  • Lightning v3.1 model card
  • Lightning v3.1 Pro model card