Performance | Smallest AI Docs

Head-to-head listener evaluation against production TTS systems on the EmergentTTS benchmark: 1,088 samples scored by the LLM-as-a-Judge framework. The tables below compare Lightning v3.1 (Standard) and Lightning v3.1 Pro with the same competitor set across the Naturalness, Expressiveness, Delivery, Accuracy, and MOS v2 categories. The list under each table explains what each metric measures; for more detail, read the full Metrics Overview.

Latency

Time-to-First-Byte (TTFB)

TTFB measures the wall-clock delay between sending the synthesis request and the first audio byte arriving on the wire. Lower is better for real-time and conversational use cases.

Model	TTFB	Conditions
Lightning v3.1 (Standard)	~200 ms	40 concurrent requests, WebSocket streaming
Lightning v3.1 Pro	~200 ms	40 concurrent requests, WebSocket streaming, dedicated Pro pool
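
As a rough illustration of the TTFB definition above, the sketch below times the gap between issuing a request and receiving the first non-empty audio chunk. It is a minimal sketch under stated assumptions: `open_stream` and `fake_stream` are hypothetical stand-ins for a streaming client and are not part of any real SDK, and the sketch times a single request, not the 40-concurrent-request setup in the table.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttfb(open_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, int]:
    """Return (ttfb_seconds, total_bytes) for one streamed synthesis request.

    `open_stream` is a hypothetical callable that sends the request and
    returns an iterable of audio chunks; swap in a real client as needed.
    """
    start = time.perf_counter()  # clock starts just before the request is sent
    ttfb = None
    total = 0
    for chunk in open_stream():
        if chunk and ttfb is None:
            # First non-empty chunk: this is the "first audio byte" moment.
            ttfb = time.perf_counter() - start
        total += len(chunk)
    if ttfb is None:
        raise RuntimeError("stream ended without any audio bytes")
    return ttfb, total


def fake_stream() -> Iterator[bytes]:
    """Stand-in for a real streaming client: 50 ms of delay, then two chunks."""
    time.sleep(0.05)
    yield b"\x00" * 320
    yield b"\x00" * 320


ttfb_s, n_bytes = measure_ttfb(fake_stream)
print(f"TTFB: {ttfb_s * 1000:.0f} ms, {n_bytes} bytes received")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock, so the measurement is not affected by system clock adjustments.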

Real-Time Factor (RTF)

RTF = Audio Duration ÷ Processing Time. Values above 1.0 mean the model produces audio faster than playback. Both Standard and Pro run at 3.3× real-time on NVIDIA L40S, so a 10-second utterance is fully synthesized in ~3 seconds.
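
The arithmetic behind the 10-second example can be sketched as follows, using the definition on this page (audio duration divided by processing time). Some other sources define RTF as the inverse ratio, so the convention is worth checking before comparing numbers. The function names are illustrative only.

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """RTF as defined on this page: audio duration / processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Processing time implied by a given RTF for a given audio duration."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


seconds = synthesis_time(10.0, 3.3)
print(f"{seconds:.2f} s")  # 3.03 s
```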

Head-to-head listener ratings (Lightning v3.1 Standard)

Direct head-to-head ratings on the EmergentTTS benchmark. Lightning Wins % is the share of samples where listeners preferred Lightning v3.1 over the competitor; Ties % is the share where both were rated equal; Competitor Wins % is the share where listeners preferred the competitor. Each competitor column sums to 100%, up to rounding.

EmergentTTS	GPT-4o-mini (OpenAI)	Turbo v2.5 (ElevenLabs)	Multilingual v2 (ElevenLabs)	Sonic-3 (Cartesia)	Gemini 2.5 Pro (Google)	MAI-Voice-1 (Microsoft)	Inworld 1.5 (Inworld)	S2 Pro (Fish Audio)
Lightning Wins % (higher better)	40.26%	50.28%	54.41%	68.29%	58.43%	57.17%	54.41%	64.25%
Ties %	24.17%	25.00%	23.81%	17.00%	8.29%	17.00%	18.11%	13.60%
Competitor Wins % (lower better)	35.57%	24.72%	21.78%	14.71%	33.27%	25.83%	27.48%	22.15%
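
The three rows above are shares of per-sample outcomes. A minimal sketch of how such shares could be tallied is shown below; the outcome labels and the sample data are made up for illustration and do not reflect the benchmark's actual data format.

```python
from collections import Counter
from typing import Dict, Iterable

LABELS = ("lightning", "tie", "competitor")  # hypothetical outcome labels


def preference_shares(outcomes: Iterable[str]) -> Dict[str, float]:
    """Convert per-sample outcomes into win/tie/loss percentages."""
    counts = Counter(outcomes)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected outcome labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no outcomes to tally")
    return {label: 100.0 * counts[label] / total for label in LABELS}


# Ten made-up pairwise judgements against one competitor.
sample = ["lightning"] * 5 + ["tie"] * 2 + ["competitor"] * 3
shares = preference_shares(sample)
print(shares)  # {'lightning': 50.0, 'tie': 20.0, 'competitor': 30.0}
```

Because every sample falls into exactly one of the three outcomes, the shares for a competitor always add up to 100% before rounding.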

Naturalness — higher is better

Metric	Lightning v3.1	Lightning v3.1 Pro	GPT-4o-mini	ElevenLabs Turbo v2.5	ElevenLabs Multilingual v2	Sonic-3	Gemini 2.5 Pro	Gemini 2.5 Flash	MAI-Voice-1	Inworld 1.5	S2 Pro
Overall	3.25	3.16	3.13	3.16	3.17	3.20	3.07	3.28	3.17	3.06	3.02
Naturalness	2.61	2.55	2.41	2.52	2.55	2.57	2.42	2.58	2.57	2.41	2.37
Intonation	3.22	3.06	3.06	3.07	3.06	3.12	2.90	3.28	3.04	2.91	2.86
Prosody	3.01	2.81	2.73	2.82	2.86	2.83	2.65	3.09	2.76	2.61	2.58
Pronunciation*	3.63	NA	3.67	3.64	3.65	3.67	3.67	NA	3.68	3.68	3.57
Audio Quality*	3.76	NA	3.78	3.77	3.75	3.81	3.73	NA	3.79	3.70	3.75

What each Naturalness metric measures

Overall — Holistic listener rating of how natural the voice sounds end-to-end.
Naturalness — How human-like the voice sounds; penalizes robotic or synthetic quality.
Intonation — Whether pitch rises and falls appropriately for the sentence type (question, statement, exclamation).
Prosody — The broader umbrella of rhythm, stress, and melody, how well the voice “reads” the sentence as a human would.
Pronunciation — Whether individual words are phonetically correct, especially names, loanwords, and domain-specific terms.
Audio Quality — Technical cleanliness of the output; absence of artifacts, distortion, clipping, or background noise.

*Listener-rated Pronunciation and Audio Quality were measured only in the Standard evaluation; Pro’s Whisper-judged Pronunciation % appears under Accuracy below.

Expressiveness — higher is better

Metric	Lightning v3.1	Lightning v3.1 Pro	GPT-4o-mini	ElevenLabs Turbo v2.5	ElevenLabs Multilingual v2	Sonic-3	Gemini 2.5 Pro	Gemini 2.5 Flash	MAI-Voice-1	Inworld 1.5	S2 Pro
Overall	3.45	3.55	3.45	3.44	3.46	3.38	3.49	3.54	3.50	3.37	3.41
Paralinguistics	3.61	3.64	3.60	3.59	3.61	3.56	3.60	3.64	3.58	3.55	3.58
Emotions	3.29	3.47	3.30	3.28	3.31	3.19	3.38	3.44	3.41	3.19	3.23

What each Expressiveness metric measures

Overall — Holistic listener rating of how expressive the voice sounds given the context of the sentence.
Paralinguistics — Non-verbal vocal elements like laughter, sighs, or filler sounds (“um”, “uh”) and whether they’re rendered appropriately.
Emotions — How accurately the voice conveys the intended emotional tone (neutral, warm, urgent, etc.).

Delivery — higher is better

Metric	Lightning v3.1	Lightning v3.1 Pro	GPT-4o-mini	ElevenLabs Turbo v2.5	ElevenLabs Multilingual v2	Sonic-3	Gemini 2.5 Pro	Gemini 2.5 Flash	MAI-Voice-1	Inworld 1.5	S2 Pro
Boundary Consistency	4.94	4.96	4.94	4.93	4.95	4.93	4.88	4.99	4.77	4.90	4.88
Pronunciation Style	4.94	4.98	4.96	4.95	4.96	4.96	4.93	4.99	4.91	4.94	4.89
Natural Pace	4.47	4.72	4.57	4.51	4.51	4.01	4.23	4.66	4.47	4.33	3.74
Pause Placement	4.46	4.66	4.54	4.49	4.51	4.28	4.34	4.59	4.41	4.38	4.09
Breathing Naturalness	3.82	3.82	3.06	3.14	3.14	2.79	2.88	3.43	3.28	2.77	2.42

What each Delivery metric measures

Boundary Consistency — Whether phrase and sentence boundaries are marked consistently with pauses or pitch shifts, without arbitrary breaks mid-phrase.
Pronunciation Style — Not just correctness, but stylistic choices, e.g. formal vs. casual register, regional accent consistency, and honorific handling.
Natural Pace — Whether the speaking rate feels comfortable and appropriate for the content type, neither rushed nor dragging.
Pause Placement — Whether silences appear at semantically correct points (after commas, between clauses) rather than mid-word or mid-phrase.
Breathing Naturalness — Whether breath sounds occur at realistic points and with realistic frequency, not absent entirely or inserted randomly.

Accuracy

Mixed direction: for WER, CER, Hallucination, and Deletion, lower is better; for Pronunciation %, higher is better.

Whisper jiwer

Metric	Direction	Lightning v3.1	Lightning v3.1 Pro	GPT-4o-mini	ElevenLabs Turbo v2.5	ElevenLabs Multilingual v2	Sonic-3	Gemini 2.5 Pro	Gemini 2.5 Flash	MAI-Voice-1	Inworld 1.5	S2 Pro
WER	lower	1.57%	1.36%	1.26%	1.35%	1.33%	1.43%	1.26%	1.37%	1.25%	1.10%	2.83%
CER	lower	0.67%	0.40%	0.52%	0.60%	0.54%	0.59%	0.62%	0.61%	0.50%	0.47%	1.16%
Hallucination	lower	0.03%	0.00%	0.07%	0.08%	0.01%	0.06%	0.04%	0.01%	0.06%	0.00%	0.22%
Deletion	lower	NA	0.00%	0.14%	0.17%	0.18%	0.16%	0.24%	0.18%	0.15%	0.12%	0.33%
Pronunciation % (Whisper jiwer)	higher	98.61%	98.68%	98.94%	98.90%	98.87%	98.79%	99.02%	98.82%	98.95%	99.02%	97.72%

Whisper LLM (Pro evaluation only)

These metrics use Whisper transcripts judged by an LLM, applied during the Pro benchmark run. The follow-on LLM normalizes punctuation, casing, and Whisper’s own transcription noise, which typically reduces false-positive errors compared to jiwer. Standard Lightning v3.1 was not evaluated with this methodology.

Metric	Direction	Lightning v3.1 Pro	GPT-4o-mini	ElevenLabs Turbo v2.5	ElevenLabs Multilingual v2	Sonic-3	Gemini 2.5 Pro	Gemini 2.5 Flash	MAI-Voice-1	Inworld 1.5	S2 Pro
WER	lower	0.96%	0.82%	0.72%	0.57%	0.88%	0.70%	0.72%	0.60%	0.55%	2.15%
CER	lower	0.34%	0.30%	0.28%	0.21%	0.30%	0.35%	0.33%	0.23%	0.18%	1.03%
Hallucination	lower	0.00%	0.07%	0.07%	0.00%	0.02%	0.02%	0.01%	0.03%	0.00%	0.10%
Pronunciation % (Whisper LLM)	higher	99.04%	99.25%	99.35%	99.43%	99.14%	99.32%	99.29%	99.43%	99.45%	97.95%

What each Accuracy metric measures

WER (Word Error Rate) — Percentage of words in the transcript that differ from the reference; measures how faithfully the TTS renders the input text.
CER (Character Error Rate) — Like WER but at the character level.
Hallucination — Words or sounds the TTS generates that have no basis in the input text: insertions, substitutions, or fabricated content.
Deletion — Words from the reference text that the TTS dropped entirely.
Pronunciation % — The proportion of words pronounced correctly out of total words.
Whisper jiwer vs Whisper LLM — Two judging methodologies. jiwer uses raw Whisper-decoded transcripts; LLM-judged uses a follow-on LLM to normalize transcription noise. Both report the same metric family; LLM-judged tends to give lower error rates by reducing false positives from punctuation/casing.
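
To make the WER and CER definitions above concrete, here is a small pure-Python sketch based on a standard minimum-edit-distance alignment. It is not the jiwer implementation and not the benchmark's scoring code: it applies no text normalization, and the reference and transcript strings are invented for illustration.

```python
from typing import Sequence, Tuple


def edit_counts(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) for one minimum-cost alignment."""
    # Each cell holds (total_cost, substitutions, deletions, insertions)
    # for aligning ref[:i] with hyp[:j].
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # empty hypothesis: everything was deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # empty reference: everything was inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # a match costs nothing
                continue
            c, s, d, n = dp[i - 1][j - 1]
            substitute = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            delete = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insert = (c + 1, s, d, n + 1)
            dp[i][j] = min(substitute, delete, insert)  # lowest total cost wins
    _, subs, dels, ins = dp[-1][-1]
    return subs, dels, ins


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """(S + D + I) / reference length; words give WER, characters give CER."""
    if not ref:
        raise ValueError("reference must not be empty")
    return sum(edit_counts(ref, hyp)) / len(ref)


reference = "call doctor nguyen at nine"   # text sent to the TTS (made up)
transcript = "call doctor win at at nine"  # ASR transcript of the audio (made up)

wer = error_rate(reference.split(), transcript.split())
cer = error_rate(list(reference), list(transcript))
subs, dels, ins = edit_counts(reference.split(), transcript.split())
print(f"WER {wer:.1%}  CER {cer:.1%}  S={subs} D={dels} I={ins}")
```

In this example the five-word reference picks up one substitution and one insertion, so WER is 2/5 = 40%. Deletions and insertions from the same alignment are the kind of counts that Deletion and Hallucination metrics can be built on, although the exact definitions used in the benchmark may differ.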

For Pronunciation and WER, the residual gap on Lightning v3.1 (Standard) is concentrated in proper-noun rendering. Use a pronunciation dictionary to pin names, brands, and acronyms; with the dictionary applied, both metrics move close to parity with the competitor set.

MOS v2 — higher is better

Metric	Lightning v3.1	Lightning v3.1 Pro	GPT-4o-mini	ElevenLabs Turbo v2.5	ElevenLabs Multilingual v2	Sonic-3	Gemini 2.5 Pro	Gemini 2.5 Flash	MAI-Voice-1	Inworld 1.5	S2 Pro
Mean MOS	NA	4.22	4.16	3.98	4.02	3.76	4.11	4.24	3.97	3.73	3.99
UTMOS	NA	3.76	3.76	3.37	3.41	2.77	3.57	3.71	3.33	2.54	3.50
WV-MOS	4.71	5.05	4.55	4.60	4.63	4.76	4.65	4.76	4.62	4.91	4.48

What each MOS metric measures

Mean MOS — Mean Opinion Score: average listener rating on a 1–5 scale across the test set; the canonical aggregate quality metric in TTS evaluation.
UTMOS — A predicted MOS from the UTMOS reference model — an automated proxy for subjective quality.
WV-MOS — A predicted MOS from the WavLM-based WV-MOS reference model — another automated proxy commonly reported alongside UTMOS for cross-validation.
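
When reading the MOS v2 table, it can help to compare each reported Mean MOS with the plain average of the two automated scores in the same column. The sketch below does that with the figures copied from the table. This is only an observation about the published numbers: the page does not state how Mean MOS is aggregated, and the 0.011 tolerance is an arbitrary allowance for two-decimal rounding.

```python
# (mean_mos, utmos, wv_mos) per model, copied from the MOS v2 table above.
ROWS = {
    "Lightning v3.1 Pro": (4.22, 3.76, 5.05),
    "GPT-4o-mini": (4.16, 3.76, 4.55),
    "ElevenLabs Turbo v2.5": (3.98, 3.37, 4.60),
    "ElevenLabs Multilingual v2": (4.02, 3.41, 4.63),
    "Sonic-3": (3.76, 2.77, 4.76),
    "Gemini 2.5 Pro": (4.11, 3.57, 4.65),
    "Gemini 2.5 Flash": (4.24, 3.71, 4.76),
    "MAI-Voice-1": (3.97, 3.33, 4.62),
    "Inworld 1.5": (3.73, 2.54, 4.91),
    "S2 Pro": (3.99, 3.50, 4.48),
}

TOLERANCE = 0.011  # assumed allowance for rounding to two decimals

# Gap between the reported Mean MOS and the average of UTMOS and WV-MOS.
gaps = {
    model: mean_mos - (utmos + wv_mos) / 2
    for model, (mean_mos, utmos, wv_mos) in ROWS.items()
}

for model, gap in gaps.items():
    note = "within rounding" if abs(gap) <= TOLERANCE else "differs"
    print(f"{model:28s} gap={gap:+.3f}  {note}")
```

For the competitor columns the reported Mean MOS and the simple average agree within rounding, while the Lightning v3.1 Pro column does not, so that column is worth reading against the metric definitions above.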

Want to reproduce these results? See the TTS evaluation script to measure TTFB and synthesis quality in your own environment.