---
title: Metrics Overview
description: Key Pulse STT metrics for quality and latency.
---

Pulse STT evaluations revolve around four pillars:

1. **Accuracy** – how close transcripts are to the ground truth.
2. **Latency & throughput** – how quickly and efficiently results arrive.
3. **Enrichment quality** – how reliable diarization, timestamps, and metadata are.
4. **Coverage & robustness** – how well performance holds up across languages, accents, and noise conditions.

## Accuracy metrics

### Word Error Rate (WER)

* Formula: `WER = (Substitutions + Deletions + Insertions) / Total Words`.
* Measures overall transcript fidelity; normalize casing and punctuation before computing so formatting differences aren't counted as errors.
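
A minimal WER implementation (word-level Levenshtein distance, with casing normalized as recommended above) might look like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level edit distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)
    return d[-1][-1] / max(len(ref), 1)
```

For production evaluations you would typically also strip punctuation and apply text normalization (numbers, contractions) before scoring.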

### Character Error Rate (CER)

* Parallel to WER but at the character level; useful for languages with compact scripts or heavy compounding.
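
The same edit-distance idea applies at the character level; a compact single-row DP sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance over characters."""
    ref, hyp = reference.lower(), hypothesis.lower()
    # Keep only one DP row to use O(len(hyp)) memory
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        curr = [i]
        for j, hc in enumerate(hyp, 1):
            cost = 0 if rc == hc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)
```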

### Sentence accuracy

* Percentage of sentences that match ground truth exactly.
* Tracks readability for QA/customer-support recaps.
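
Sentence accuracy reduces to an exact-match rate; a sketch with light normalization (lowercasing and whitespace collapsing, adjust to your QA rules):

```python
def sentence_accuracy(references: list[str], hypotheses: list[str]) -> float:
    """Percentage of sentences that match ground truth exactly."""
    assert len(references) == len(hypotheses)

    def norm(s: str) -> str:
        # Collapse whitespace and casing before comparing
        return " ".join(s.lower().split())

    exact = sum(norm(r) == norm(h) for r, h in zip(references, hypotheses))
    return 100.0 * exact / max(len(references), 1)
```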

## Latency & throughput

### Time to First Result (TTFR)

* Measures the delay between request start and first interim token.
* For real-time agents, keep TTFR below \~30 ms to maintain natural turn-taking.
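
TTFR can be clocked around any stream of interim results; in this sketch, `interim_results` is a placeholder for whatever iterator your streaming client yields, and timing starts when iteration begins:

```python
import time
from typing import Iterable

def time_to_first_result(interim_results: Iterable[str]):
    """Return (TTFR in ms, all interim chunks) for a result stream.

    Create the stream as close to the request as possible so the
    measurement includes connection and queueing time.
    """
    start = time.perf_counter()
    ttfr_ms = None
    results = []
    for chunk in interim_results:
        if ttfr_ms is None:
            # First interim token observed: record the delay
            ttfr_ms = (time.perf_counter() - start) * 1000.0
        results.append(chunk)
    return ttfr_ms, results
```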

### End-to-end latency

* Wall-clock time from submission to final transcript.
* Report p50/p90/p95 to capture outliers introduced by long files or retries.
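
The p50/p90/p95 summary can be computed from a list of measured latencies with the standard library alone:

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """p50/p90/p95 from end-to-end latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points between percentiles:
    # qs[49] is p50, qs[89] is p90, qs[94] is p95
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p90": qs[89], "p95": qs[94]}
```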

### Real-Time Factor (RTF)

* `RTF = Processing Time / Audio Duration`.
* Values less than 1 indicate faster-than-real-time processing; Pulse STT typically runs near 0.4 RTF on clean inputs.
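
A minimal way to measure RTF around any transcription call; the `process` callable here is a placeholder for your own request:

```python
import time
from typing import Callable

def real_time_factor(process: Callable[[], None], audio_duration_s: float) -> float:
    """RTF = processing time / audio duration; below 1 is faster than real time."""
    start = time.perf_counter()
    process()  # your transcription call goes here
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_s
```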

## Enrichment quality

<table>
  <thead>
    <tr>
      <th>
        Metric
      </th>

      <th>
        What to watch
      </th>

      <th>
        Why it matters
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        Diarization accuracy
      </td>

      <td>
        % of words with correct 

        <code>speaker_id</code>
      </td>

      <td>
        Call-center QA, coaching, compliance
      </td>
    </tr>

    <tr>
      <td>
        Word timestamp drift
      </td>

      <td>
        Gap between predicted and reference timestamps
      </td>

      <td>
        Subtitle alignment and editing
      </td>
    </tr>

    <tr>
      <td>
        Sentence-level timestamps
      </td>

      <td>
        % of audio covered by 

        <code>utterances</code>

         segments
      </td>

      <td>
        Chaptering, meeting notes
      </td>
    </tr>

    <tr>
      <td>
        Emotion/age/gender precision
      </td>

      <td>
        Confidence distribution
      </td>

      <td>
        Routing, analytics, compliance flags
      </td>
    </tr>
  </tbody>
</table>

## Coverage & robustness

* **Language detection accuracy**: share of files assigned the intended ISO 639-1 code, whether specified explicitly or auto-detected.
* **Noise robustness**: WER delta at multiple SNR levels (clean vs. +5 dB noise, etc.).
* **Accent/domain diversity**: track WER per accent or scenario (support, media, meetings) to avoid blind spots.

## Operational metrics

* **Requests per second / concurrent sessions**: validate you stay within quota and plan scaling needs.
* **Cost per minute**: Pulse STT bills per second at a \$0.025/minute list price; include enrichment toggles when modeling cost.
* **Retry volume**: differentiate infrastructure retries (HTTP 5xx) from transcription failures to spot upstream vs downstream issues.

## Reporting checklist

1. Describe dataset composition (language, accent, domain, duration).
2. Publish WER/CER, TTFR, and RTF with averages and percentiles.
3. Include enrichment coverage (how many segments include diarization/timestamps).
4. Summarize cost/latency impact when enabling optional features.
5. Link to reproducible scripts or notebooks for auditing.