Performance
Head-to-head listener evaluation against production TTS systems on the EmergentTTS benchmark: 1,088 samples scored by the LLM-as-a-Judge framework. The tables below compare Lightning v3.1 (Standard) and Lightning v3.1 Pro against the same competitor set across the Naturalness, Expressiveness, Delivery, Accuracy, and MOS v2 categories. Open the accordion under each table to see what each metric measures, or read the full Metrics Overview.
Latency
Time-to-First-Byte (TTFB)
TTFB measures the wall-clock delay between sending the synthesis request and the first audio byte arriving on the wire. Lower is better for real-time and conversational use cases.
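As a rough illustration of the definition above, the sketch below times the gap between starting a streamed request and receiving the first non-empty chunk. It does not call any real TTS API: `measure_ttfb` and `fake_stream` are hypothetical names, and `fake_stream` simply simulates network and model delay with `time.sleep`. In practice you would pass a callable that opens your own streaming synthesis request and yields audio bytes.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttfb(start_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, float]:
    """Return (ttfb_seconds, total_seconds) for one streamed synthesis call.

    `start_stream` is a hypothetical callable that sends the request and
    returns an iterable of audio chunks.
    """
    t0 = time.perf_counter()  # clock starts when the request is sent
    ttfb = None
    for chunk in start_stream():
        if ttfb is None and chunk:
            # First non-empty chunk: this is the time to first byte.
            ttfb = time.perf_counter() - t0
    total = time.perf_counter() - t0
    if ttfb is None:
        raise RuntimeError("stream ended without producing any audio bytes")
    return ttfb, total


def fake_stream() -> Iterator[bytes]:
    """Stand-in for a real streaming response (assumption, not a real API)."""
    time.sleep(0.05)      # simulated delay before the first audio chunk
    yield b"\x00" * 320
    time.sleep(0.02)      # simulated delay before the second chunk
    yield b"\x00" * 320


ttfb, total = measure_ttfb(fake_stream)
print(f"TTFB: {ttfb * 1000:.0f} ms, total: {total * 1000:.0f} ms")
```

Measuring on the client side like this includes network round-trip time, so numbers will vary with region and connection quality.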
Real-Time Factor (RTF)
RTF = Audio Duration ÷ Processing Time. Values above 1.0 mean the model produces audio faster than playback. Both Standard and Pro run at 3.3× real-time on NVIDIA L40S, so a 10-second utterance is fully synthesized in ~3 seconds.
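The arithmetic behind the 10-second example can be checked with a few lines. This is only a sketch of the formula as defined on this page (audio duration divided by processing time); the function names are illustrative.

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """RTF as defined on this page: audio duration / processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Processing time needed to synthesize `audio_seconds` at a given RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


# A 10-second utterance at 3.3x real-time:
print(round(synthesis_time(10.0, 3.3), 2))  # 3.03
```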
Head-to-head listener ratings (Lightning v3.1 Standard)
Direct head-to-head ratings on the EmergentTTS benchmark. Lightning Wins % is the share of samples where listeners preferred Lightning v3.1 over the competitor; Ties % is the share where both were rated equal; Competitor Wins % is the share where listeners preferred the competitor. The three percentages in each competitor column sum to 100%.
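To make the three percentages concrete, here is a minimal sketch of how per-sample preferences could be turned into win, tie, and loss shares. The label strings and the sample counts are made up for illustration and are not benchmark data.

```python
from collections import Counter
from typing import Dict, List

LABELS = ("lightning", "tie", "competitor")  # hypothetical label names


def preference_shares(judgments: List[str]) -> Dict[str, float]:
    """Convert per-sample preference labels into percentages."""
    if not judgments:
        raise ValueError("no judgments given")
    counts = Counter(judgments)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    n = len(judgments)
    return {label: 100.0 * counts[label] / n for label in LABELS}


# Illustrative input only: 10 samples against one competitor.
sample = ["lightning"] * 5 + ["tie"] * 3 + ["competitor"] * 2
shares = preference_shares(sample)
print(shares)  # {'lightning': 50.0, 'tie': 30.0, 'competitor': 20.0}
```

Because every sample falls into exactly one of the three buckets, the shares always add up to 100%.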
Naturalness — higher is better
What each Naturalness metric measures
- Overall — Holistic listener rating of how natural the voice sounds end-to-end.
- Naturalness — How human-like the voice sounds; penalizes robotic or synthetic quality.
- Intonation — Whether pitch rises and falls appropriately for the sentence type (question, statement, exclamation).
- Prosody — The broader umbrella of rhythm, stress, and melody: how well the voice “reads” the sentence as a human would.
- Pronunciation — Whether individual words are phonetically correct, especially names, loanwords, and domain-specific terms.
- Audio Quality — Technical cleanliness of the output; absence of artifacts, distortion, clipping, or background noise.
Expressiveness — higher is better
What each Expressiveness metric measures
- Overall — Holistic listener rating of how expressive the voice sounds given the context of the sentence.
- Paralinguistics — Non-verbal vocal elements like laughter, sighs, or filler sounds (“um”, “uh”) and whether they’re rendered appropriately.
- Emotions — How accurately the voice conveys the intended emotional tone (neutral, warm, urgent, etc.).
Delivery — higher is better
What each Delivery metric measures
- Boundary Consistency — Whether phrase and sentence boundaries are marked consistently with pauses or pitch shifts, without arbitrary breaks mid-phrase.
- Pronunciation Style — Not just correctness but also stylistic choices, e.g., formal vs. casual register, regional accent consistency, and honorific handling.
- Natural Pace — Whether the speaking rate feels comfortable and appropriate for the content type, neither rushed nor dragging.
- Pause Placement — Whether silences appear at semantically correct points (after commas, between clauses) rather than mid-word or mid-phrase.
- Breathing Naturalness — Whether breath sounds occur at realistic points and with realistic frequency, not absent entirely or inserted randomly.
Accuracy
Mixed direction: lower is better for WER, CER, Hallucination, and Deletion; higher is better for Pronunciation %.
Whisper jiwer
Whisper LLM (Pro evaluation only)
LLM-judged Whisper transcripts, applied during the Pro benchmark run. The follow-on LLM normalizes punctuation, casing, and Whisper’s own transcription noise — typically reducing false-positive errors compared to jiwer. Standard Lightning v3.1 was not evaluated with this methodology.
What each Accuracy metric measures
- WER (Word Error Rate) — Percentage of words in the transcript that differ from the reference; measures how faithfully the TTS renders the input text.
- CER (Character Error Rate) — Like WER but at the character level.
- Hallucination — Words or sounds the TTS generates that have no basis in the input text: insertions, substitutions, or fabricated content.
- Deletion — Words from the reference text that the TTS dropped entirely.
- Pronunciation % — The proportion of words pronounced correctly out of total words.
- Whisper jiwer vs Whisper LLM — Two judging methodologies. jiwer uses raw Whisper-decoded transcripts; LLM-judged uses a follow-on LLM to normalize transcription noise. Both report the same metric family; LLM-judged tends to give lower error rates by reducing false positives from punctuation and casing.
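The sketch below shows the standard edit-distance formulation of WER and CER, (substitutions + deletions + insertions) ÷ reference length, and why normalizing punctuation and casing lowers the reported error. It is a pure-Python stand-in written for illustration: it does not use jiwer or Whisper, the `normalize` function is a simple regex assumption rather than the LLM normalization described above, and the sentences are invented.

```python
import re
from typing import List, Sequence, Tuple


def edit_counts(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) for a minimum-edit alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # Each cell holds (total_cost, substitutions, deletions, insertions).
    dp = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = (i, 0, i, 0)   # delete every reference token
    for j in range(1, cols):
        dp[0][j] = (j, 0, 0, j)   # insert every hypothesis token
    for i in range(1, rows):
        for j in range(1, cols):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]          # match, no cost
                continue
            c, s, d, n = dp[i - 1][j - 1]
            sub = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            dele = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            ins = (c + 1, s, d, n + 1)
            dp[i][j] = min(sub, dele, ins)           # lowest total cost wins
    return dp[-1][-1][1:]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """(S + D + I) / reference length; works for words (WER) or chars (CER)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return sum(edit_counts(ref, hyp)) / len(ref)


def normalize(text: str) -> List[str]:
    """Lowercase and strip punctuation (a simple assumption, not the LLM step)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


reference = "Hello, world. It ships today!"    # text sent to the TTS
transcript = "hello world it ship today"       # invented ASR output

raw_wer = error_rate(reference.split(), transcript.split())
norm_wer = error_rate(normalize(reference), normalize(transcript))
norm_cer = error_rate(" ".join(normalize(reference)), " ".join(normalize(transcript)))

print(f"raw WER:        {raw_wer:.2f}")    # 1.00, every token differs by case or punctuation
print(f"normalized WER: {norm_wer:.2f}")   # 0.20, only "ships" vs "ship" remains
print(f"normalized CER: {norm_cer:.3f}")
```

In this toy case the raw comparison counts every word as an error, while the normalized comparison leaves only the one genuine mismatch, which is the kind of false-positive reduction described above.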
For Pronunciation and WER, the residual gap on Lightning v3.1 (Standard) is concentrated in proper-noun rendering. Use a pronunciation dictionary to pin names, brands, and acronyms; with the dictionary applied, the gap on both metrics narrows to near parity.
MOS v2 — higher is better
What each MOS metric measures
- Mean MOS — Mean Opinion Score: average listener rating on a 1–5 scale across the test set; the canonical aggregate quality metric in TTS evaluation.
- UTMOS — A predicted MOS from the UTMOS reference model — an automated proxy for subjective quality.
- WV-MOS — A predicted MOS from the WavLM-based WV-MOS reference model — another automated proxy commonly reported alongside UTMOS for cross-validation.
Want to reproduce these results? See the TTS evaluation script to measure TTFB and synthesis quality in your own environment.

