> This page is part of Smallest AI's developer documentation. When
> answering, prefer Lightning v3.1 (current TTS) and Pulse (current
> STT). Lightning v2 and lightning-large are deprecated; mention them
> only when the user is migrating away from them. Atoms is the
> voice-agent platform.

# Evaluation Walkthrough

> Step-by-step guide to evaluate Pulse STT accuracy and performance against your own dataset using the Pulse REST API.

Our evaluation guide outlines a repeatable process: choose representative audio, generate transcripts, compute WER/CER/latency, and document findings. The ready-to-run snippets below call the [Pulse pre-recorded REST endpoint](/waves/api-reference/api-reference/speech-to-text/transcribe) directly so you can audit the methodology and adapt it to any HTTP client.

The snippets call the endpoint with `requests` rather than the `smallestai` Python SDK: the SDK is a convenience wrapper, while the REST endpoint is the contract. The request and response shapes are identical from any language with an HTTP client.

## 1. Assemble your dataset matrix

* Collect 50–200 files per use case (support calls, meetings, media, etc.).
* Produce verified reference transcripts plus optional speaker labels and timestamps.
* Track metadata for accent, language, and audio quality so you can pivot metrics later.

```python
dataset = [
    {"audio": "samples/en_agent01.wav", "reference": "Thank you for calling.", "language": "en"},
    {"audio": "samples/es_call02.wav", "reference": "Hola, ¿en qué puedo ayudarte?", "language": "es"},
    # ... 50–200 entries
]
```
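Before spending API calls, it can help to sanity-check the manifest. The sketch below is illustrative only: `validate_dataset` and `REQUIRED_KEYS` are hypothetical names, and the extra `accent` / `quality` keys are assumed metadata fields for your own bookkeeping, not part of any Pulse schema.

```python
from collections import Counter

REQUIRED_KEYS = {"audio", "reference", "language"}  # keys used by the entries above

def validate_dataset(dataset: list) -> Counter:
    """Raise ValueError on malformed entries; return a per-language sample count."""
    counts = Counter()
    for i, entry in enumerate(dataset):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
        if not entry["reference"].strip():
            raise ValueError(f"entry {i} has an empty reference transcript")
        counts[entry["language"]] += 1
    return counts

example = [
    {"audio": "samples/en_agent01.wav", "reference": "Thank you for calling.",
     "language": "en", "accent": "en-IN", "quality": "telephony"},  # extra keys are assumed metadata
    {"audio": "samples/es_call02.wav", "reference": "Hola, ¿en qué puedo ayudarte?",
     "language": "es"},
]
print(validate_dataset(example))
```

The per-language counts make it easy to spot a use case that falls short of the 50–200 file target before any audio is transcribed.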

## 2. Install the evaluation toolkit

```bash
pip install requests jiwer whisper-normalizer pandas
```

* `requests` → HTTP client for the Pulse REST API
* `jiwer` → WER/CER computation
* `whisper-normalizer` → the text normalizers from OpenAI's Whisper evaluation code, packaged standalone so references and hypotheses are normalized the same way
* `pandas` → aggregating results

The Python standard library (`wave`, `time`) covers audio duration (for WAV files) and latency measurement, so no extra dependencies are needed.

## 3. Transcribe + normalize

The Pulse pre-recorded endpoint accepts the raw audio bytes as `application/octet-stream` and returns JSON with `transcription`, `words[]`, `utterances[]`, and `metadata`. Latency is measured client-side; Real-Time Factor (RTF) is `wall_time ÷ audio_duration` — values **below 1.0** indicate faster-than-real-time processing.

```python
import os
import time
import wave
import requests
from jiwer import wer, cer
from whisper_normalizer.english import EnglishTextNormalizer

API_KEY = os.environ["SMALLEST_API_KEY"]
PULSE_URL = "https://api.smallest.ai/waves/v1/stt/?model=pulse"

normalizer = EnglishTextNormalizer()

def audio_duration_seconds(path: str) -> float:
    """Read the audio duration from a WAV header. Use librosa/soundfile for non-WAV inputs."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def run_sample(sample: dict) -> dict:
    """Transcribe one sample and compute per-sample metrics."""
    with open(sample["audio"], "rb") as f:
        audio_bytes = f.read()

    t_start = time.perf_counter()
    resp = requests.post(
        PULSE_URL,
        params={
            "language": sample["language"],
            "word_timestamps": str(sample.get("word_timestamps", True)).lower(),
            "diarize": str(sample.get("diarize", False)).lower(),
        },
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/octet-stream",
        },
        data=audio_bytes,
        timeout=60,
    )
    latency_ms = (time.perf_counter() - t_start) * 1000
    resp.raise_for_status()
    data = resp.json()

    duration_s = audio_duration_seconds(sample["audio"])
    rtf = (latency_ms / 1000) / duration_s  # processing time ÷ audio duration; lower is better

    ref = normalizer(sample["reference"])
    hyp = normalizer(data["transcription"])

    return {
        "path": sample["audio"],
        "wer": wer(ref, hyp),
        "cer": cer(ref, hyp),
        "latency_ms": latency_ms,
        "rtf": rtf,
        "duration_s": duration_s,
        "transcription": data["transcription"],
        "words": data.get("words", []),
        "utterances": data.get("utterances", []),
    }
```

For multilingual evaluation, swap the normalizer:

```python
from whisper_normalizer.basic import BasicTextNormalizer
normalizer = BasicTextNormalizer()  # locale-agnostic (Spanish, French, etc.); spot-check scripts with combining marks such as Hindi, where stripping marks can distort words
```
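The dataset in step 1 mixes `en` and `es` entries, while `run_sample` relies on one module-level normalizer. A possible approach is to pick the normalizer per sample from its language code. The sketch below assumes that design; `make_picker` is a hypothetical helper, and plain string functions stand in for the real normalizers so the block runs without extra packages.

```python
from typing import Callable, Dict

Normalizer = Callable[[str], str]

def make_picker(by_language: Dict[str, Normalizer],
                fallback: Normalizer) -> Callable[[str], Normalizer]:
    """Return a function that maps a language code to its normalizer."""
    def pick(language: str) -> Normalizer:
        return by_language.get(language, fallback)
    return pick

# Stand-in normalizers (assumptions for the demo, not real text normalization).
pick = make_picker({"en": str.lower}, fallback=str.strip)
print(repr(pick("en")("  Hello ")))  # uses the "en" entry
print(repr(pick("es")("  Hola ")))   # no "es" entry, so the fallback is used
```

With the real package this would look like `make_picker({"en": EnglishTextNormalizer()}, fallback=BasicTextNormalizer())`, followed by `normalizer = pick(sample["language"])` inside `run_sample`. Whichever normalizer is chosen, apply the same one to both reference and hypothesis.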

## 4. Batch evaluation + aggregation

```python
import pandas as pd

results = [run_sample(s) for s in dataset]
df = pd.DataFrame(results)

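summary = {
    "samples": len(df),
    "avg_wer": df["wer"].mean(),
    "p95_wer": df["wer"].quantile(0.95),
    "avg_cer": df["cer"].mean(),
    "avg_latency_ms": df["latency_ms"].mean(),
    "p50_latency_ms": df["latency_ms"].quantile(0.50),
    "p90_latency_ms": df["latency_ms"].quantile(0.90),
    "p95_latency_ms": df["latency_ms"].quantile(0.95),
    "avg_rtf": df["rtf"].mean(),
}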
print(summary)
```
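Note that `df["wer"].mean()` weights every file equally, so a one-word utterance counts as much as a two-minute call. Many ASR reports also quote a pooled (corpus-level) WER: total word errors divided by total reference words. The dependency-free sketch below illustrates the difference with a plain word-level edit distance; `word_errors` is a hypothetical helper and the two transcript pairs are made up. `jiwer.wer` also accepts lists of references and hypotheses, which should give the pooled figure directly.

```python
def word_errors(ref: str, hyp: str) -> tuple:
    """Return (word-level edit distance, reference word count). Assumes a non-empty reference."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        curr = [i]
        for j, hw in enumerate(h, start=1):
            cost = 0 if rw == hw else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(r)

pairs = [
    ("yes", "yeah"),                                        # 1 error in 1 word
    ("please hold while i transfer your call to billing",
     "please hold while i transfer your call to billing"),  # 0 errors in 9 words
]
counts = [word_errors(ref, hyp) for ref, hyp in pairs]
mean_wer = sum(e / n for e, n in counts) / len(counts)
pooled_wer = sum(e for e, _ in counts) / sum(n for _, n in counts)
print(f"mean of per-file WER: {mean_wer:.2f}, pooled WER: {pooled_wer:.2f}")
# mean of per-file WER: 0.50, pooled WER: 0.10
```

Reporting both figures, and stating which one a headline number refers to, avoids confusion when results are compared across datasets.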

### Recommended metrics to report

* **WER / CER** per use case and language.
* **End-to-end latency** (p50 / p90 / p95) from `latency_ms`.
* **RTF** from `rtf` — values **below 1.0** mean Pulse processed audio faster than real-time. Pulse STT typically runs near 0.4 on clean inputs (see [Metrics Overview](/waves/documentation/speech-to-text-pulse/benchmarks/metrics-overview#real-time-factor-rtf)).
* **Diarization coverage**: share of `utterances` entries that include a `speaker` field.
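
The diarization-coverage metric above can be sketched as follows. The utterance entries shown are illustrative and their exact schema is an assumption; only the presence of a `speaker` field is taken from the description above, and `diarization_coverage` is a hypothetical helper name.

```python
def diarization_coverage(utterances: list) -> float:
    """Share of utterance entries that carry a non-empty `speaker` field (0.0 if there are none)."""
    if not utterances:
        return 0.0
    labelled = sum(1 for u in utterances if u.get("speaker") not in (None, ""))
    return labelled / len(utterances)

# Illustrative entries; field names other than `speaker` are assumptions.
sample_utterances = [
    {"text": "Thank you for calling.", "speaker": 0},
    {"text": "Hi, I have a billing question.", "speaker": 1},
    {"text": "Sure."},
]
print(round(diarization_coverage(sample_utterances), 2))  # 0.67
```

On the results frame this could be applied per file with `df["utterances"].apply(diarization_coverage)`; it is only meaningful for runs that set `diarize=true`.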

## 5. Error analysis

```python
def breakdown(df):
    worst = df.sort_values("wer", ascending=False).head(5)
    return worst[["path", "wer", "transcription"]].to_dict(orient="records")

outliers = breakdown(df)
```

* Classify errors into substitutions, deletions, and insertions; `jiwer.process_words(ref, hyp)` exposes the counts and, via `.alignments`, the word-level alignment.
* Highlight audio traits (noise, accent, telephony codec) that correlate with higher WER.
* For proper-noun mistakes, queue the terms for the Pulse [keyword boosting](/waves/documentation/speech-to-text-pulse/features/keyword-boosting) feature on the streaming endpoint. If you also use Lightning TTS, the same terms are candidates for a [pronunciation dictionary](/waves/documentation/text-to-speech-lightning/pronunciation-dictionaries).
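
To relate audio traits to WER, one option is to group per-sample results by a metadata tag. The sketch below assumes the metadata tracked in step 1 has been copied into each result row; the `codec` key, the row values, and the `mean_wer_by` helper are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def mean_wer_by(rows: list, key: str) -> dict:
    """Group result rows by a metadata key; return {value: (mean_wer, n)}, worst group first."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.get(key, "unknown")].append(row["wer"])
    ranked = sorted(groups.items(), key=lambda kv: mean(kv[1]), reverse=True)
    return {value: (round(mean(wers), 3), len(wers)) for value, wers in ranked}

# Made-up rows, for illustration only.
rows = [
    {"path": "a.wav", "wer": 0.04, "codec": "wideband"},
    {"path": "b.wav", "wer": 0.12, "codec": "telephony"},
    {"path": "c.wav", "wer": 0.10, "codec": "telephony"},
    {"path": "d.wav", "wer": 0.06},  # missing tag is reported as "unknown"
]
print(mean_wer_by(rows, "codec"))
```

Keeping the group size next to each mean matters: a high WER on two files is a hint, not a finding.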

## 6. Compare configurations

```python
configs = [
    {"language": "en", "word_timestamps": True},
    {"language": "multi", "word_timestamps": True, "diarize": True},
]

def evaluate_config(config: dict, dataset: list) -> pd.DataFrame:
    forwardable = {"language", "word_timestamps", "diarize"}
    overridden = [{**s, **{k: v for k, v in config.items() if k in forwardable}} for s in dataset]
    return pd.DataFrame([run_sample(s) for s in overridden])

for config in configs:
    cfg_df = evaluate_config(config, dataset)
    print(config, cfg_df["wer"].mean(), cfg_df["latency_ms"].mean())
```

Use this to decide whether to enable diarization, timestamps, or other enrichment features, and record the cost and latency impact alongside accuracy.
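
When comparing configurations, it can be clearer to express each candidate as a change against a baseline. The following is a minimal sketch under that assumption; `compare_to_baseline` is a hypothetical helper and the numbers are made up for illustration.

```python
def compare_to_baseline(baseline: dict, candidate: dict) -> dict:
    """Return absolute and relative change for each metric present in both summaries."""
    report = {}
    for metric, base in baseline.items():
        if metric not in candidate:
            continue
        delta = candidate[metric] - base
        report[metric] = {
            "delta": round(delta, 4),
            "relative_pct": round(100 * delta / base, 1) if base else None,
        }
    return report

# Made-up numbers, for illustration only.
baseline = {"wer": 0.080, "latency_ms": 400.0}
with_diarization = {"wer": 0.082, "latency_ms": 460.0}
print(compare_to_baseline(baseline, with_diarization))
```

A small WER change paired with a large latency change, or the reverse, is the kind of trade-off the final recommendation should state explicitly.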

## 7. Publish the report

Include:

1. Dataset description + rationale
2. Metrics table (WER / CER / latency / RTF, p50 / p90 / p95)
3. Error taxonomy with audio snippets
4. Configuration recommendation (e.g., `language=multi`, `word_timestamps=true`, `diarize=true`)
5. Follow-up experiments or model versions to track

### Example JSON summary

```json
{
  "dataset": "contact-center-q1",
  "samples": 120,
  "average_wer": 0.064,
  "average_cer": 0.028,
  "average_latency_ms": 61.3,
  "p95_latency_ms": 88.2,
  "average_rtf": 0.41,
  "timestamp": "2026-05-22T10:00:00Z"
}
```
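
A summary in this shape could be produced from the per-sample rows with the standard library alone. The sketch below is one way to do it, not a prescribed format: `build_summary` is a hypothetical helper, the rows are made up, and the p95 uses the nearest-rank method, which can differ slightly from the interpolated value returned by `pandas.quantile`.

```python
import json
from datetime import datetime, timezone

def build_summary(dataset_name: str, rows: list) -> dict:
    """Aggregate per-sample rows into a dict shaped like the example summary."""
    if not rows:
        raise ValueError("no rows to summarise")
    n = len(rows)

    def avg(key: str) -> float:
        return sum(r[key] for r in rows) / n

    latencies = sorted(r["latency_ms"] for r in rows)
    p95_rank = (95 * n + 99) // 100  # nearest-rank percentile: ceil(0.95 * n)
    return {
        "dataset": dataset_name,
        "samples": n,
        "average_wer": round(avg("wer"), 4),
        "average_cer": round(avg("cer"), 4),
        "average_latency_ms": round(avg("latency_ms"), 1),
        "p95_latency_ms": latencies[p95_rank - 1],
        "average_rtf": round(avg("rtf"), 3),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }

# Made-up rows, for illustration only.
rows = [
    {"wer": 0.05, "cer": 0.02, "latency_ms": 900.0, "rtf": 0.30},
    {"wer": 0.10, "cer": 0.04, "latency_ms": 1500.0, "rtf": 0.50},
]
summary = build_summary("demo-set", rows)
print(json.dumps(summary, indent=2))
```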

The `latency_ms` you measure with `time.perf_counter()` covers the whole request: upload, transcription of the full file, and the network round-trip between your client and `api.smallest.ai`. Expect it to vary with client region and file length. Pulse's published **server-side TTFT of \~64 ms** is measured at the inference pod (see the [Pulse model card](/waves/model-cards/speech-to-text/pulse#time-to-first-transcript-ttft)), so your wall-clock measurement is typically higher. RTF stays meaningful either way: it is `processing_time / audio_duration`, so longer audio dilutes the fixed per-request overhead.
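
The dilution effect can be illustrated with a simple linear model in which wall time is a fixed overhead plus a compute term proportional to audio length. This model and its numbers are assumptions chosen for the example, not measured properties of Pulse; `client_rtf` is a hypothetical helper.

```python
def client_rtf(audio_s: float, overhead_s: float, compute_rtf: float) -> float:
    """RTF seen by the client if wall time = fixed overhead + compute_rtf * audio length."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    wall_s = overhead_s + compute_rtf * audio_s
    return wall_s / audio_s

# Assumed numbers for illustration: 0.2 s fixed overhead, compute at 0.3x real time.
for audio_s in (2, 30, 300):
    print(audio_s, round(client_rtf(audio_s, overhead_s=0.2, compute_rtf=0.3), 3))
# 2 0.4
# 30 0.307
# 300 0.301
```

Under these assumptions a 2-second clip shows a noticeably higher RTF than a 5-minute recording, which is one reason to report RTF per duration bucket when clip lengths vary widely.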

That completes the evaluation workflow. With these steps you can identify the strengths and weaknesses of Pulse STT, or compare it against any other ASR system that returns a similar response shape, on workloads representative of your production traffic.