Metrics Overview

Definitions for every metric reported on the Performance page, grouped by category.

Naturalness — higher is better

  • Overall — Holistic listener rating of how natural the voice sounds end-to-end.
  • Naturalness — How human-like the voice sounds; penalizes robotic or synthetic quality.
  • Intonation — Whether pitch rises and falls appropriately for the sentence type (question, statement, exclamation).
  • Prosody — The broader umbrella of rhythm, stress, and melody: how well the voice “reads” the sentence as a human would.
  • Pronunciation — Whether individual words are phonetically correct, especially names, loanwords, and domain-specific terms.
  • Audio Quality — Technical cleanliness of the output; absence of artifacts, distortion, clipping, or background noise.

Expressiveness — higher is better

  • Overall — Holistic listener rating of how expressive the voice sounds given the context of the sentence.
  • Paralinguistics — Non-verbal vocal elements like laughter, sighs, or filler sounds (“um”, “uh”) and whether they’re rendered appropriately.
  • Emotions — How accurately the voice conveys the intended emotional tone (neutral, warm, urgent, etc.).

Delivery — higher is better

  • Boundary Consistency — Whether phrase and sentence boundaries are marked consistently with pauses or pitch shifts, without arbitrary breaks mid-phrase.
  • Pronunciation Style — Not just correctness, but stylistic choices such as formal vs. casual register, regional accent consistency, and honorific handling.
  • Natural Pace — Whether the speaking rate feels comfortable and appropriate for the content type, neither rushed nor dragging.
  • Pause Placement — Whether silences appear at semantically correct points (after commas, between clauses) rather than mid-word or mid-phrase.
  • Breathing Naturalness — Whether breath sounds occur at realistic points and with realistic frequency, not absent entirely or inserted randomly.

Accuracy

Mixed direction: lower is better for WER, CER, Hallucination, and Deletion; higher is better for Pronunciation %.

  • WER (Word Error Rate) — Word-level substitutions, deletions, and insertions in the transcript of the synthesized audio, divided by the number of words in the reference (input) text; measures how faithfully the TTS renders the input text.
  • CER (Character Error Rate) — Like WER but at the character level.
  • Hallucination — Words or sounds the TTS generates that have no basis in the input text, such as insertions, substitutions, or fabricated content.
  • Deletion — Words from the reference text that the TTS dropped entirely.
  • Pronunciation % — The proportion of words pronounced correctly out of total words.
  • Whisper jiwer vs Whisper LLM — Two judging methodologies. jiwer uses raw Whisper-decoded transcripts; LLM-judged uses a follow-on LLM to normalize transcription noise. Both report the same metric family; LLM-judged tends to give lower error rates by reducing false positives from punctuation/casing.
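
The WER and CER definitions above can be illustrated with a minimal pure-Python sketch. It is not the evaluation pipeline behind the Performance page: the `normalize` helper is a hypothetical stand-in for the kind of punctuation and casing cleanup that lowers error rates, and the hypothesis string stands in for an ASR transcript of the synthesized audio.

```python
import re


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from the reference
                            curr[j - 1] + 1,      # insertion into the hypothesis
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text):
    """Hypothetical normalizer: lowercase and strip punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "Hello, world. How are you?"   # text sent to the TTS
transcript = "hello world how are you"     # assumed ASR output for the audio

print(wer(reference, transcript))                        # 0.8
print(wer(normalize(reference), normalize(transcript)))  # 0.0
```

The raw comparison counts four of five words as errors purely because of punctuation and casing, which is the kind of false positive a normalization step is meant to remove.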

MOS v2 — higher is better

  • Mean MOS — Mean Opinion Score: average listener rating on a 1–5 scale across the test set; the canonical aggregate quality metric in TTS evaluation.
  • UTMOS — A predicted MOS from the UTMOS reference model — an automated proxy for subjective quality.
  • WV-MOS — A predicted MOS from the WavLM-based WV-MOS reference model — another automated proxy commonly reported alongside UTMOS for cross-validation.
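
As a small illustration of the Mean MOS definition, the sketch below averages listener ratings after checking that each one lies on the 1–5 scale. The function name and the sample ratings are made up for the example; UTMOS and WV-MOS come from trained predictor models and are not reproduced here.

```python
def mean_mos(ratings):
    """Average listener rating; every rating must lie on the 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the 1-5 scale")
    return sum(ratings) / len(ratings)


# Hypothetical ratings from four listeners for one test set.
print(mean_mos([4, 5, 3, 4]))  # 4.0
```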

Latency

  • TTFB (Time-to-First-Byte) — Wall-clock delay between sending the synthesis request and the first audio byte arriving on the wire. Lower is better for real-time and conversational use cases.
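  • Real-Time Factor (RTF) — RTF = Audio Duration ÷ Processing Time. Values above 1.0 mean the model produces audio faster than playback, so higher is better. Some other sources define RTF as the inverse ratio (Processing Time ÷ Audio Duration), where lower is better.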
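
Both latency metrics can be measured around any streaming response. The sketch below is illustrative only: `request_fn` is a hypothetical callable that sends the synthesis request and returns an iterator of audio byte chunks, the audio is assumed to be raw 16-bit mono PCM at an assumed 24 kHz sample rate, and total wall-clock time is used as a stand-in for processing time. `fake_tts_stream` simulates a server so the example runs without network access.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    request_fn: Callable[[], Iterable[bytes]],
) -> Tuple[Optional[float], float, bytes]:
    """Return (ttfb_seconds, total_seconds, audio_bytes) for one request."""
    start = time.perf_counter()
    ttfb = None
    audio = bytearray()
    for chunk in request_fn():
        if ttfb is None and chunk:
            ttfb = time.perf_counter() - start  # first non-empty audio chunk
        audio.extend(chunk)
    total = time.perf_counter() - start
    return ttfb, total, bytes(audio)


def pcm_duration_seconds(num_bytes, sample_rate, sample_width=2, channels=1):
    """Duration of raw PCM audio (assumes no container header)."""
    return num_bytes / (sample_rate * sample_width * channels)


def real_time_factor(audio_duration, processing_time):
    """RTF as defined on this page: audio duration / processing time."""
    if processing_time <= 0:
        raise ValueError("processing_time must be positive")
    return audio_duration / processing_time


def fake_tts_stream():
    """Simulated server: 50 ms startup delay, then 10 chunks of 0.1 s audio."""
    time.sleep(0.05)
    for _ in range(10):
        yield b"\x00" * 4800  # 0.1 s at 24 kHz, 16-bit, mono


ttfb, total, audio = measure_stream(fake_tts_stream)
duration = pcm_duration_seconds(len(audio), sample_rate=24000)
print(f"TTFB: {ttfb * 1000:.1f} ms")
print(f"RTF:  {real_time_factor(duration, total):.1f}")
```

With a real endpoint, `request_fn` would wrap the HTTP or WebSocket call, and the measured TTFB would include network round-trip time as well as model latency.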

Next Steps

  • Performance — head-to-head benchmark tables across all metrics above.
  • TTS Evaluation Script — Python script to measure TTFB in your own environment.
  • Lightning v3.1 model card — full Standard catalog + capabilities.
  • Lightning v3.1 Pro model card — full Pro catalog + capabilities.