---
title: Evaluation Walkthrough
description: Step-by-step guide to evaluate Pulse STT accuracy and performance
---

Our evaluation guide outlines a repeatable process: choose representative audio, generate transcripts, compute WER/CER/latency, and document findings. Use the streamlined steps below (with ready-to-run snippets) to mirror that workflow.

## 1. Assemble your dataset matrix

* Collect 50–200 files per use case (support calls, meetings, media, etc.).
* Produce verified transcripts plus optional speaker labels and timestamps.
* Track metadata for accent, language, and audio quality so you can pivot metrics later.

```python
dataset = [
    {"audio": "samples/en_agent01.wav", "reference": "Thank you for calling.", "language": "en"},
    {"audio": "samples/es_call02.wav", "reference": "Hola, ¿en qué puedo ayudarte?", "language": "es"},
]
```
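
To make the metadata useful for pivoting later, one option is to keep it next to each per-file score and group by any field. The sketch below uses only the standard library; the `accent` and `audio_quality` fields and the `mean_by` helper are hypothetical names, not part of the Pulse SDK, and the WER values are placeholders rather than measured results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-file rows: metadata fields plus a score computed elsewhere.
# The numbers are placeholders for illustration only.
rows = [
    {"audio": "samples/en_agent01.wav", "language": "en", "accent": "us", "audio_quality": "clean", "wer": 0.05},
    {"audio": "samples/en_agent02.wav", "language": "en", "accent": "in", "audio_quality": "noisy", "wer": 0.15},
    {"audio": "samples/es_call02.wav", "language": "es", "accent": "mx", "audio_quality": "noisy", "wer": 0.25},
]

def mean_by(rows, field, metric="wer"):
    """Average `metric` for each distinct value of the metadata `field`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row[metric])
    return {key: mean(values) for key, values in groups.items()}

by_language = mean_by(rows, "language")       # one average per language
by_quality = mean_by(rows, "audio_quality")   # one average per audio-quality bucket
```

The same grouping can be done with `pandas.DataFrame.groupby` once the results are in a DataFrame.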

## 2. Install the evaluation toolkit

```bash
pip install smallestai jiwer whisper-normalizer pandas
```

* `smallestai` → Pulse STT client
* `jiwer` → WER/CER computation
* `whisper-normalizer` → normalization that matches the official guidance
* `pandas` → aggregation and reporting

## 3. Transcribe + normalize

```python
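import os
from jiwer import wer, cer
from whisper_normalizer.basic import BasicTextNormalizer
from whisper_normalizer.english import EnglishTextNormalizer
from smallestai.waves import WavesClient

client = WavesClient(api_key=os.environ["SMALLEST_AI_API_KEY"])
normalizer = EnglishTextNormalizer()      # English-specific normalization rules
basic_normalizer = BasicTextNormalizer()  # language-agnostic fallback for non-English samples

def run_sample(sample):
    """Transcribe one sample and score it against its normalized reference."""
    response = client.transcribe(
        audio_file=sample["audio"],
        language=sample["language"],
        word_timestamps=True,
        diarize=True
    )
    # Normalize both sides the same way so formatting differences are not counted as errors.
    normalize = normalizer if sample["language"] == "en" else basic_normalizer
    ref = normalize(sample["reference"])
    hyp = normalize(response.transcription)
    return {
        "path": sample["audio"],
        "wer": wer(ref, hyp),
        "cer": cer(ref, hyp),
        "latency_ms": response.metrics["latency_ms"],
        "rtf": response.metrics["real_time_factor"],
        "transcription": response.transcription
    }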
```

## 4. Batch evaluation + aggregation

```python
import pandas as pd

results = [run_sample(s) for s in dataset]
df = pd.DataFrame(results)

summary = {
    "samples": len(df),
    "avg_wer": df.wer.mean(),
    "avg_cer": df.cer.mean(),
    "p95_wer": df.wer.quantile(0.95),
    "avg_latency_ms": df.latency_ms.mean(),
    "p95_latency_ms": df.latency_ms.quantile(0.95),
    "avg_rtf": df.rtf.mean()
}
```

### Recommended metrics to report

* **WER / CER** per use case and language.
* **Time to first result (TTFR)** and **real-time factor (RTF)** from `response.metrics`.
* **Diarization coverage**: % of `utterances` entries with `speaker`.
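
Diarization coverage can be computed with a few lines once the utterances are available as plain dictionaries. This is a minimal sketch: the shape of each entry (a dict with an optional `speaker` key) is an assumption, so adapt the field access to whatever the response actually returns.

```python
def diarization_coverage(utterances):
    """Return the fraction (0.0 to 1.0) of utterances that carry a non-empty speaker label."""
    if not utterances:
        return 0.0
    labeled = sum(1 for u in utterances if u.get("speaker") not in (None, ""))
    return labeled / len(utterances)

# Assumed shape: a list of dicts with an optional "speaker" key.
example_utterances = [
    {"text": "Thank you for calling.", "speaker": "speaker_0"},
    {"text": "Hi, I need help with my order.", "speaker": "speaker_1"},
    {"text": "Sure.", "speaker": None},
    {"text": "One moment."},
]

coverage = diarization_coverage(example_utterances)  # 2 of the 4 entries are labeled
```

Multiply the result by 100 to report it as a percentage.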

## 5. Error analysis

```python
def breakdown(df):
    worst = df.sort_values("wer", ascending=False).head(5)[["path", "wer", "transcription"]]
    return worst.to_dict(orient="records")

outliers = breakdown(df)
```

* Classify errors into substitutions, deletions, insertions.
* Highlight audio traits (noise, accent) that correlate with higher WER.
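
The substitution/deletion/insertion split can be illustrated with a plain word-level edit-distance alignment. The sketch below is a standard-library illustration of the idea, not the toolkit's own implementation; `count_edit_ops` is a hypothetical helper, and it expects text that has already been normalized.

```python
def count_edit_ops(reference, hypothesis):
    """Count substitutions, deletions, and insertions in a minimal word-level alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    rows, cols = len(ref) + 1, len(hyp) + 1
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion (reference word missing)
                dist[i][j - 1] + 1,         # insertion (extra hypothesis word)
            )
    # Walk back from the bottom-right corner to label each edit.
    ops = {"substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] != hyp[j - 1]:
                ops["substitutions"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops["deletions"] += 1
            i -= 1
        else:
            ops["insertions"] += 1
            j -= 1
    return ops

ops = count_edit_ops("thank you for calling", "thank you four calling today")
# ops == {"substitutions": 1, "deletions": 0, "insertions": 1}
```

Dividing the total number of edits by the reference word count gives WER; in this example that is 2 edits over 4 reference words, or 0.5.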

## 6. Compare configurations

```python
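configs = [
    {"language": "en", "word_timestamps": True},
    {"language": "multi", "word_timestamps": True, "diarize": True}
]

def evaluate_config(config):
    # run_sample() hardcodes word_timestamps/diarize, so pass each config
    # straight to the client to make sure its options actually take effect.
    rows = []
    for s in dataset:
        response = client.transcribe(audio_file=s["audio"], **config)
        ref = normalizer(s["reference"])
        hyp = normalizer(response.transcription)
        rows.append({
            "path": s["audio"],
            "wer": wer(ref, hyp),
            "latency_ms": response.metrics["latency_ms"]
        })
    return rows

for config in configs:
    cfg_results = pd.DataFrame(evaluate_config(config))
    print(config, cfg_results.wer.mean(), cfg_results.latency_ms.mean())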
```

Use this to decide whether to enable diarization, sentence-level timestamps, or enrichment features; the official evaluation doc recommends capturing cost/latency impact alongside accuracy.
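
One way to capture that trade-off is to express each configuration as a change relative to a baseline run. The following is a minimal sketch; `compare_to_baseline` is a hypothetical helper, and the metric values are placeholders rather than measured results.

```python
def compare_to_baseline(baseline, candidate):
    """Return the absolute and relative change of each shared metric versus the baseline."""
    report = {}
    for metric in baseline.keys() & candidate.keys():
        delta = candidate[metric] - baseline[metric]
        # Relative change is undefined when the baseline value is zero.
        relative = delta / baseline[metric] if baseline[metric] else None
        report[metric] = {"delta": delta, "relative": relative}
    return report

# Placeholder numbers for illustration only, not measured results.
baseline = {"avg_wer": 0.08, "avg_latency_ms": 60.0}
with_diarization = {"avg_wer": 0.08, "avg_latency_ms": 75.0}

impact = compare_to_baseline(baseline, with_diarization)
```

With these placeholder inputs, the latency entry shows a 15 ms increase (25% relative) with no change in WER, which is the kind of figure worth placing next to the accuracy numbers in the report.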

## 7. Publish the report

Include:

1. Dataset description + rationale
2. Metrics table (WER/CER/TTFR/RTF, p50/p90/p95)
3. Error taxonomy with audio snippets
4. Configuration recommendation (e.g., `language=multi`, `word_timestamps=true`, `diarize=true`)
5. Follow-up experiments or model versions to track

### Example JSON summary

```json
{
  "dataset": "contact-center-q1",
  "samples": 120,
  "average_wer": 0.064,
  "average_cer": 0.028,
  "average_latency_ms": 61.3,
  "average_rtf": 0.41,
  "p95_latency_ms": 88.2,
  "timestamp": "2025-01-15T10:00:00Z"
}
```
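
A summary like this can be assembled from the per-file results with the standard library. The sketch below is one possible approach, assuming each result is a dict with `wer` and `latency_ms` keys; `percentile` and `build_summary` are hypothetical helpers, the percentile uses the nearest-rank method (other tools may interpolate and give slightly different values), and the input data is a placeholder.

```python
import json
import math
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def build_summary(name, results):
    """Aggregate per-file result dicts (keys: wer, latency_ms) into a report summary."""
    wers = [r["wer"] for r in results]
    latencies = [r["latency_ms"] for r in results]
    return {
        "dataset": name,
        "samples": len(results),
        "average_wer": round(mean(wers), 4),
        "p50_latency_ms": percentile(latencies, 50),
        "p90_latency_ms": percentile(latencies, 90),
        "p95_latency_ms": percentile(latencies, 95),
    }

# Placeholder per-file results for illustration only.
results = [{"wer": 0.05, "latency_ms": float(ms)} for ms in range(50, 70)]
summary_json = json.dumps(build_summary("demo-set", results), indent=2)
```

Writing `summary_json` to a file alongside the report makes it easier to compare runs across model versions.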

This completes the self-serve evaluation workflow. With these steps, you can identify the strengths and weaknesses of any STT model.