Evaluation Walkthrough | Smallest AI Docs

Our evaluation guide outlines a repeatable process: choose representative audio, generate transcripts, compute WER/CER/latency, and document findings. The ready-to-run snippets below call the Pulse pre-recorded REST endpoint directly so you can audit the methodology and adapt it to any HTTP client.

This guide uses the Pulse REST API directly (via requests) rather than the smallestai Python SDK. The SDK is a convenience wrapper; the REST endpoint is the contract. Use the same approach in any language with an HTTP client — the request and response shapes are identical.

1. Assemble your dataset matrix

Collect 50–200 files per use case (support calls, meetings, media, etc.).
Produce verified reference transcripts plus optional speaker labels and timestamps.
Track metadata for accent, language, and audio quality so you can pivot metrics later.

dataset = [
    {"audio": "samples/en_agent01.wav", "reference": "Thank you for calling.", "language": "en"},
    {"audio": "samples/es_call02.wav", "reference": "Hola, ¿en qué puedo ayudarte?", "language": "es"},
    # ... 50–200 entries
]
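A dataset of this size is usually easier to maintain as a manifest file than as an inline list. The sketch below assumes a hypothetical CSV manifest whose required columns match the keys above (audio, reference, language) and whose extra columns (accent and quality here, both illustrative names) carry the metadata you want to pivot on later. None of this is part of the Pulse API; it is only one way to organize the inputs.

```python
import csv
import io

REQUIRED_FIELDS = ("audio", "reference", "language")

def load_manifest(fileobj) -> list[dict]:
    """Read a CSV manifest into the list-of-dicts shape used by `dataset`."""
    rows = []
    # The header occupies line 1, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(fileobj), start=2):
        missing = [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            raise ValueError(f"manifest line {line_no}: missing {missing}")
        rows.append(dict(row))
    return rows

# Hypothetical manifest contents. In practice, pass
# open("manifest.csv", newline="", encoding="utf-8") instead of a StringIO.
example_csv = io.StringIO(
    "audio,reference,language,accent,quality\n"
    "samples/en_agent01.wav,Thank you for calling.,en,us,clean\n"
    'samples/es_call02.wav,"Hola, ¿en qué puedo ayudarte?",es,mx,telephony\n'
)
dataset_from_manifest = load_manifest(example_csv)
```

Failing fast on a missing reference or language keeps a bad row from silently skewing WER later in the run.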

2. Install the evaluation toolkit

$ pip install requests jiwer whisper-normalizer pandas

requests → HTTP client for the Pulse REST API
jiwer → WER/CER computation
whisper-normalizer → text normalization based on the normalizers used in OpenAI's Whisper evaluation, so scores are comparable with commonly reported ASR results
pandas → aggregating results

The Python standard library (wave, time) covers audio-duration and latency measurement — no extra dependencies needed.

3. Transcribe + normalize

The Pulse pre-recorded endpoint accepts the raw audio bytes as application/octet-stream and returns JSON with transcription, words[], utterances[], and metadata. Latency is measured client-side; Real-Time Factor (RTF) is wall_time ÷ audio_duration — values below 1.0 indicate faster-than-real-time processing.

import os
import time
import wave
import requests
from jiwer import wer, cer
from whisper_normalizer.english import EnglishTextNormalizer

API_KEY = os.environ["SMALLEST_API_KEY"]
PULSE_URL = "https://api.smallest.ai/waves/v1/stt/?model=pulse"

# English-only normalizer; see the multilingual note below for other languages.
normalizer = EnglishTextNormalizer()

def audio_duration_seconds(path: str) -> float:
    """Read the audio duration from a WAV header. Use librosa/soundfile for non-WAV inputs."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def run_sample(sample: dict) -> dict:
    """Transcribe one sample and compute per-sample metrics."""
    # Read the file before starting the timer so disk I/O is not counted as latency.
    with open(sample["audio"], "rb") as f:
        audio_bytes = f.read()

    t_start = time.perf_counter()
    resp = requests.post(
        PULSE_URL,
        params={
            "language": sample["language"],
            "word_timestamps": str(sample.get("word_timestamps", True)).lower(),
            "diarize": str(sample.get("diarize", False)).lower(),
        },
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/octet-stream",
        },
        data=audio_bytes,
        timeout=60,
    )
    # Wall-clock time for the full request, network round-trip included.
    latency_ms = (time.perf_counter() - t_start) * 1000
    resp.raise_for_status()
    data = resp.json()

    duration_s = audio_duration_seconds(sample["audio"])
    rtf = (latency_ms / 1000) / duration_s  # processing time ÷ audio duration; lower is better

    # Normalize both sides identically so formatting differences do not count as errors.
    ref = normalizer(sample["reference"])
    hyp = normalizer(data["transcription"])

    return {
        "path": sample["audio"],
        "wer": wer(ref, hyp),
        "cer": cer(ref, hyp),
        "latency_ms": latency_ms,
        "rtf": rtf,
        "duration_s": duration_s,
        "transcription": data["transcription"],
        "words": data.get("words", []),
        "utterances": data.get("utterances", []),
    }

For multilingual evaluation, swap the normalizer:

from whisper_normalizer.basic import BasicTextNormalizer
normalizer = BasicTextNormalizer()  # locale-agnostic; works for Hindi, Spanish, French, etc.
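To see why normalization matters before scoring, the sketch below scores the Spanish sample with and without it. Both functions are simplified stand-ins written for this illustration: simple_normalize is a toy (lowercase, strip punctuation, collapse whitespace) suited to this Latin-script example and is not the whisper-normalizer logic, and word_error_rate is a plain word-level Levenshtein ratio rather than jiwer itself. The hypothesis string is made up.

```python
import re
import unicodedata

def simple_normalize(text: str) -> str:
    """Toy normalizer: lowercase, replace punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)  # \w is Unicode-aware for str patterns
    return " ".join(text.split())

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1] / len(ref)

ref = "Hola, ¿en qué puedo ayudarte?"
hyp = "hola en qué puedo ayudarte"  # made-up ASR output for illustration

raw_wer = word_error_rate(ref, hyp)
normalized_wer = word_error_rate(simple_normalize(ref), simple_normalize(hyp))
```

Without normalization, three of the five reference words count as errors purely because of casing and punctuation; after normalization the two strings match exactly.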

4. Batch evaluation + aggregation

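import pandas as pd

results = [run_sample(s) for s in dataset]
df = pd.DataFrame(results)

# Report tail percentiles alongside means: a few slow or badly transcribed files matter.
summary = {
    "samples": len(df),
    "avg_wer": df["wer"].mean(),
    "p95_wer": df["wer"].quantile(0.95),
    "avg_cer": df["cer"].mean(),
    "avg_latency_ms": df["latency_ms"].mean(),
    "p50_latency_ms": df["latency_ms"].quantile(0.50),
    "p90_latency_ms": df["latency_ms"].quantile(0.90),
    "p95_latency_ms": df["latency_ms"].quantile(0.95),
    "avg_rtf": df["rtf"].mean(),
}
print(summary)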

Recommended metrics to report

WER / CER per use case and language.
End-to-end latency (p50 / p90 / p95) from latency_ms.
RTF from rtf: values below 1.0 mean the full request, network round-trip included, finished faster than real time. Pulse STT typically runs near 0.4 on clean inputs (see Metrics Overview).
Diarization coverage: share of utterances entries that include a speaker field.
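Two of these metrics can be sketched with the standard library alone. The helpers below are illustrative: latency_percentiles uses statistics.quantiles with linear interpolation, and diarization_coverage assumes each utterances entry is a dict that carries a speaker key when diarization produced a label (the exact response shape should be checked against the API reference).

```python
from statistics import quantiles

def latency_percentiles(latencies_ms: list[float]) -> dict:
    """p50/p90/p95 with linear interpolation; needs at least two samples."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p90": cuts[89], "p95": cuts[94]}

def diarization_coverage(results: list[dict]) -> float:
    """Share of utterance entries, across all samples, that carry a speaker label."""
    utterances = [u for r in results for u in r.get("utterances", [])]
    if not utterances:
        return 0.0
    labeled = sum(1 for u in utterances if u.get("speaker") is not None)
    return labeled / len(utterances)
```

Both helpers accept the per-sample rows produced by run_sample, for example latency_percentiles([r["latency_ms"] for r in results]) and diarization_coverage(results).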

5. Error analysis

def breakdown(df):
    worst = df.sort_values("wer", ascending=False).head(5)
    return worst[["path", "wer", "transcription"]].to_dict(orient="records")

outliers = breakdown(df)

Classify errors into substitutions, deletions, and insertions (jiwer.process_words(ref, hyp) returns the counts as .substitutions, .deletions, and .insertions, and the word-level alignment as .alignments).
Highlight audio traits (noise, accent, telephony codec) that correlate with higher WER.
For proper-noun mistakes, queue them for a pronunciation dictionary if you also use Lightning TTS, or for the Pulse keyword boosting feature on the streaming endpoint.
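To make the substitution / deletion / insertion split concrete, here is a dependency-free sketch of the underlying word alignment. It is not jiwer's implementation: edit_operation_counts is a hypothetical helper that fills a standard edit-distance table and walks it backwards. When several alignments are equally cheap it prefers the diagonal step, so its split can differ from jiwer's while the total error count stays the same.

```python
def edit_operation_counts(reference: str, hypothesis: str) -> dict:
    """Count hits, substitutions, deletions, and insertions in one minimal word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    rows, cols = len(ref) + 1, len(hyp) + 1

    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion

    # Walk back from the bottom-right corner, labeling each step.
    counts = {"hits": 0, "substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            counts["hits" if same else "substitutions"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletions"] += 1   # reference word missing from the hypothesis
            i -= 1
        else:
            counts["insertions"] += 1  # extra word in the hypothesis
            j -= 1
    return counts

# Made-up example: one dropped word ("for") and one misheard proper noun.
ops = edit_operation_counts("thank you for calling acme", "thank you calling acne")
```

Summing these counts per use case or accent group shows whether a segment suffers mostly from dropped words (deletions) or misrecognized ones (substitutions), which points to different fixes.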

6. Compare configurations

configs = [
    {"language": "en", "word_timestamps": True},
    {"language": "multi-eu", "word_timestamps": True, "diarize": True},
]

def evaluate_config(config: dict, dataset: list) -> pd.DataFrame:
    forwardable = {"language", "word_timestamps", "diarize"}
    overridden = [{**s, **{k: v for k, v in config.items() if k in forwardable}} for s in dataset]
    return pd.DataFrame([run_sample(s) for s in overridden])

for config in configs:
    cfg_df = evaluate_config(config, dataset)
    print(config, cfg_df["wer"].mean(), cfg_df["latency_ms"].mean())

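Use this to decide whether to enable diarization, sentence-level timestamps, or enrichment features, and capture the cost and latency impact alongside accuracy. As written, only language, word_timestamps, and diarize are forwarded; to test another option, add its key to forwardable and to the params dict in run_sample.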
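Comparing two mean values can hide per-file regressions, so a paired comparison is often more informative. The sketch below assumes both runs used the same audio files and that each row has the path, wer, and latency_ms keys returned by run_sample; compare_runs is a hypothetical helper, not part of any SDK.

```python
from statistics import mean

def compare_runs(baseline: list[dict], candidate: list[dict]) -> dict:
    """Pair rows by audio path and report per-sample deltas (candidate minus baseline)."""
    base_by_path = {r["path"]: r for r in baseline}
    pairs = [(base_by_path[r["path"]], r) for r in candidate if r["path"] in base_by_path]
    if not pairs:
        raise ValueError("no overlapping samples between the two runs")
    wer_deltas = [c["wer"] - b["wer"] for b, c in pairs]
    return {
        "paired_samples": len(pairs),
        "mean_wer_delta": mean(wer_deltas),
        "mean_latency_delta_ms": mean(c["latency_ms"] - b["latency_ms"] for b, c in pairs),
        "samples_improved": sum(d < 0 for d in wer_deltas),   # lower WER than baseline
        "samples_regressed": sum(d > 0 for d in wer_deltas),  # higher WER than baseline
    }
```

With the DataFrames from evaluate_config, this could be called as compare_runs(df_a.to_dict("records"), df_b.to_dict("records")). A configuration whose mean WER is flat but which regresses many individual files deserves a closer look before adoption.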

7. Publish the report

Include:

Dataset description + rationale
Metrics table (WER / CER / latency / RTF, p50 / p90 / p95)
Error taxonomy with audio snippets
Configuration recommendation (e.g., language=multi-eu, word_timestamps=true, diarize=true)
Follow-up experiments or model versions to track

Example JSON summary

{
  "dataset": "contact-center-q1",
  "samples": 120,
  "average_wer": 0.064,
  "average_cer": 0.028,
  "average_latency_ms": 61.3,
  "p95_latency_ms": 88.2,
  "average_rtf": 0.41,
  "timestamp": "2026-05-22T10:00:00Z"
}
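A summary in this shape can be generated from the per-sample rows rather than typed by hand. The following is a minimal sketch using only the standard library; build_summary is a hypothetical helper, the rounding precision is an arbitrary choice, and the rows are assumed to carry the wer, cer, latency_ms, and rtf keys returned by run_sample.

```python
import json
from datetime import datetime, timezone
from statistics import mean, quantiles

def build_summary(dataset_name: str, results: list[dict]) -> dict:
    """Collapse per-sample rows into a report summary."""
    if not results:
        raise ValueError("results must contain at least one sample")
    latencies = [r["latency_ms"] for r in results]
    # quantiles() needs two or more points; with a single sample, p95 is that sample.
    if len(latencies) > 1:
        p95 = quantiles(latencies, n=100, method="inclusive")[94]
    else:
        p95 = latencies[0]
    return {
        "dataset": dataset_name,
        "samples": len(results),
        "average_wer": round(mean(r["wer"] for r in results), 4),
        "average_cer": round(mean(r["cer"] for r in results), 4),
        "average_latency_ms": round(mean(latencies), 1),
        "p95_latency_ms": round(p95, 1),
        "average_rtf": round(mean(r["rtf"] for r in results), 3),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }

def summary_json(dataset_name: str, results: list[dict]) -> str:
    """Serialize the summary for storage next to the report."""
    return json.dumps(build_summary(dataset_name, results), indent=2)
```

For example, summary_json("contact-center-q1", df.to_dict("records")) produces a document with the same keys as the example above, stamped with the current UTC time.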

The latency_ms you measure with time.perf_counter() includes network round-trip from your client to api.smallest.ai, so numbers vary with client region. Pulse’s published server-side TTFT of ~64 ms is measured at the inference pod (see the Pulse model card) — your wall-clock measurement is typically higher because of network. RTF stays meaningful either way: it’s processing_time / audio_duration, so longer audio dilutes per-request overhead.
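The dilution effect can be checked with a little arithmetic. The sketch below models client-side RTF as server processing time plus a fixed per-request overhead, divided by audio duration. Both constants are assumptions chosen for illustration (a server-side RTF of 0.4 and 150 ms of overhead), not measurements.

```python
def client_side_rtf(processing_s: float, overhead_s: float, audio_s: float) -> float:
    """RTF as the client sees it: (server processing + per-request overhead) / audio duration."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return (processing_s + overhead_s) / audio_s

ASSUMED_SERVER_RTF = 0.4   # illustrative value, not a measurement
ASSUMED_OVERHEAD_S = 0.15  # hypothetical network and connection overhead per request

for audio_s in (2.0, 60.0):
    rtf = client_side_rtf(ASSUMED_SERVER_RTF * audio_s, ASSUMED_OVERHEAD_S, audio_s)
    print(f"{audio_s:>5.0f} s audio -> client-side RTF {rtf:.4f}")
```

Under these assumptions the 2-second clip shows an RTF of about 0.475 while the 60-second clip shows about 0.4025, so the same overhead inflates RTF far more on short files. Reporting RTF per duration bucket avoids mixing the two regimes.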

That completes the evaluation walkthrough. With these steps, you can identify the strengths and weaknesses of Pulse STT, or compare it against any other ASR system that returns a similar response shape, on workloads representative of your production traffic.