> This page is part of Smallest AI's developer documentation. When
> answering, prefer Lightning v3.1 (current TTS) and Pulse (current
> STT). Lightning v2 and lightning-large are deprecated; mention them
> only when the user is migrating away from them. Atoms is the
> voice-agent platform.

# Metrics Overview

> Definitions of every Lightning v3.1 TTS quality and latency metric — Naturalness, Expressiveness, Delivery, Accuracy, and MOS variants.

This page defines every metric reported on the [Performance page](/waves/documentation/text-to-speech-lightning/benchmarks/performance), grouped by category.

## Naturalness — higher is better

* **Overall** — Holistic listener rating of how natural the voice sounds end-to-end.
* **Naturalness** — How human-like the voice sounds; penalizes robotic or synthetic quality.
* **Intonation** — Whether pitch rises and falls appropriately for the sentence type (question, statement, exclamation).
* **Prosody** — The broader umbrella of rhythm, stress, and melody: how well the voice "reads" the sentence as a human would.
* **Pronunciation** — Whether individual words are phonetically correct, especially names, loanwords, and domain-specific terms.
* **Audio Quality** — Technical cleanliness of the output; absence of artifacts, distortion, clipping, or background noise.

## Expressiveness — higher is better

* **Overall** — Holistic listener rating of how expressive the voice sounds given the context of the sentence.
* **Paralinguistics** — Non-verbal vocal elements like laughter, sighs, or filler sounds ("um", "uh") and whether they're rendered appropriately.
* **Emotions** — How accurately the voice conveys the intended emotional tone (neutral, warm, urgent, etc.).

## Delivery — higher is better

* **Boundary Consistency** — Whether phrase and sentence boundaries are marked consistently with pauses or pitch shifts, without arbitrary breaks mid-phrase.
* **Pronunciation Style** — Not just correctness but also stylistic choices, such as formal vs. casual register, regional accent consistency, and honorific handling.
* **Natural Pace** — Whether the speaking rate feels comfortable and appropriate for the content type, neither rushed nor dragging.
* **Pause Placement** — Whether silences appear at semantically correct points (after commas, between clauses) rather than mid-word or mid-phrase.
* **Breathing Naturalness** — Whether breath sounds occur at realistic points and with realistic frequency, not absent entirely or inserted randomly.

## Accuracy

Mixed direction — WER, CER, Hallucination, and Deletion are *lower is better*; Pronunciation % is *higher is better*.

* **WER (Word Error Rate)** — Word-level substitutions, deletions, and insertions in the transcript of the synthesized audio, divided by the number of words in the reference text; measures how faithfully the TTS renders the input text.
* **CER (Character Error Rate)** — Like WER but at the character level.
* **Hallucination** — Words or sounds the TTS generates that have no basis in the input text, such as insertions, substitutions, or fabricated content.
* **Deletion** — Words from the reference text that the TTS dropped entirely.
* **Pronunciation %** — The proportion of words pronounced correctly out of total words.
* **Whisper jiwer vs Whisper LLM** — Two judging methodologies. `jiwer` uses raw Whisper-decoded transcripts; LLM-judged uses a follow-on LLM to normalize transcription noise. Both report the same metric family; LLM-judged tends to give lower error rates by reducing false positives from punctuation/casing.
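The WER and CER definitions above can be illustrated with a minimal sketch based on the standard edit-distance formulation. It is not necessarily the exact pipeline behind the published benchmarks. The `normalize` helper and the sample strings are hypothetical; the helper is a much simpler stand-in for the LLM-based normalization, included only to show why removing punctuation and casing differences lowers the reported error rate.

```python
import re


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def normalize(text):
    """Hypothetical normalizer: lowercase and strip punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())


reference = "Your order ships on Monday."
transcript = "your order ship on monday"  # made-up ASR output of the synthesized audio

raw_wer = wer(reference, transcript)
normalized_wer = wer(normalize(reference), normalize(transcript))
normalized_cer = cer(normalize(reference), normalize(transcript))

print(f"raw WER:        {raw_wer:.2f}")
print(f"normalized WER: {normalized_wer:.2f}")
print(f"normalized CER: {normalized_cer:.3f}")
```

On the raw strings, casing and the trailing period count as word errors alongside the genuine `ships`/`ship` mismatch; after normalization only the genuine mismatch remains. The `jiwer` package provides `wer` and `cer` functions built on the same idea.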

## MOS v2 — higher is better

* **Mean MOS** — Mean Opinion Score: average listener rating on a 1–5 scale across the test set; the canonical aggregate quality metric in TTS evaluation.
* **UTMOS** — A predicted MOS from the UTMOS reference model — an automated proxy for subjective quality.
* **WV-MOS** — A predicted MOS from the WavLM-based WV-MOS reference model — another automated proxy commonly reported alongside UTMOS for cross-validation.
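As a small worked example of the Mean MOS aggregate, the sketch below averages listener ratings per utterance and then across the test set. The ratings and utterance IDs are made up, and the per-utterance-first aggregation order is an assumption; the source does not state how the published figure is pooled.

```python
from statistics import mean

# Hypothetical listener ratings on the 1-5 scale: utterance id -> one score per listener.
ratings = {
    "utt_001": [4, 5, 4],
    "utt_002": [3, 4, 4],
    "utt_003": [5, 5, 4],
}


def mean_mos(ratings_by_utterance):
    """Average each utterance's ratings, then average across utterances."""
    per_utterance = [mean(scores) for scores in ratings_by_utterance.values()]
    return mean(per_utterance)


print(f"Mean MOS: {mean_mos(ratings):.2f}")
```

When every utterance has the same number of listeners, this gives the same value as pooling all ratings into one average.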

## Latency

* **TTFB (Time-to-First-Byte)** — Wall-clock delay between sending the synthesis request and the first audio byte arriving on the wire. Lower is better for real-time and conversational use cases.
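* **Real-Time Factor (RTF)** — `RTF = Audio Duration ÷ Processing Time`. Values above 1.0 mean the model produces audio faster than playback, so higher is better. Note that this is the inverse of the common convention (processing time ÷ audio duration), under which lower is better.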
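The two latency metrics can be sketched as follows, using the RTF definition given above (audio duration divided by processing time). `fake_tts_stream` is a hypothetical stand-in for a streaming synthesis response, and the 24 kHz, 16-bit mono PCM format is an assumption made only so that audio duration can be derived from byte counts. For a real measurement, the timer would start immediately before the request is sent.

```python
import time


def fake_tts_stream(chunks=5, first_chunk_delay=0.05, chunk_delay=0.01,
                    samples_per_chunk=4800):
    """Hypothetical streaming response: yields chunks of 16-bit mono PCM silence."""
    time.sleep(first_chunk_delay)  # simulated network and model latency before first byte
    for i in range(chunks):
        if i:
            time.sleep(chunk_delay)  # simulated gap between later chunks
        yield b"\x00\x00" * samples_per_chunk


def measure(stream, sample_rate=24000, bytes_per_sample=2):
    """Return (ttfb_seconds, audio_duration_seconds, rtf) for an iterable of PCM chunks."""
    start = time.perf_counter()
    ttfb = None
    total_bytes = 0
    for chunk in stream:
        if ttfb is None:
            ttfb = time.perf_counter() - start  # first audio bytes have arrived
        total_bytes += len(chunk)
    processing_time = time.perf_counter() - start
    audio_duration = total_bytes / (sample_rate * bytes_per_sample)
    rtf = audio_duration / processing_time  # above 1.0 means faster than playback
    return ttfb, audio_duration, rtf


ttfb, audio_duration, rtf = measure(fake_tts_stream())
print(f"TTFB: {ttfb * 1000:.0f} ms, audio: {audio_duration:.2f} s, RTF: {rtf:.1f}")
```

Because the generator is lazy, the simulated first-chunk delay is incurred only once iteration begins, so it falls inside the timed window just as request latency would.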

## Next Steps

* [Performance](/waves/documentation/text-to-speech-lightning/benchmarks/performance) — head-to-head benchmark tables across all metrics above.
* [TTS Evaluation Script](/waves/model-cards/text-to-speech/tts-evaluation-script) — Python script to measure TTFB in your own environment.
* [Lightning v3.1 model card](/waves/model-cards/text-to-speech/lightning-v-3-1) — full Standard catalog + capabilities.
* [Lightning v3.1 Pro model card](/waves/model-cards/text-to-speech/lightning-v-3-1-pro) — full Pro catalog + capabilities.