Evaluation Walkthrough


Our evaluation guide outlines a repeatable process: choose representative audio, generate transcripts, compute WER/CER/latency, and document findings. Use the streamlined steps below (with ready-to-run snippets) to mirror that workflow.

1. Assemble your dataset matrix

  • Collect 50–200 files per use case (support calls, meetings, media, etc.).
  • Produce verified transcripts plus optional speaker labels and timestamps.
  • Track metadata for accent, language, and audio quality so you can pivot metrics later.
```python
dataset = [
    {"audio": "samples/en_agent01.wav", "reference": "Thank you for calling.", "language": "en"},
    {"audio": "samples/es_call02.wav", "reference": "Hola, ¿en qué puedo ayudarte?", "language": "es"},
]
```
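If you track your files in a manifest instead of inline Python, the same structure can be loaded from CSV so accent and quality metadata travel with each sample. A minimal sketch, assuming a hypothetical manifest with `accent` and `audio_quality` columns:

```python
import csv
import io

# Hypothetical manifest: one row per audio file; the extra columns
# (accent, audio_quality) let you pivot metrics by them later.
MANIFEST = """audio,reference,language,accent,audio_quality
samples/en_agent01.wav,Thank you for calling.,en,us,clean
samples/es_call02.wav,"Hola, ¿en qué puedo ayudarte?",es,es,noisy
"""

def load_dataset(text):
    """Parse a CSV manifest into the list-of-dicts shape used above."""
    return list(csv.DictReader(io.StringIO(text)))

dataset = load_dataset(MANIFEST)
```

In practice you would pass a real file handle to `csv.DictReader` instead of the inline string.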

2. Install the evaluation toolkit

```shell
pip install smallestai jiwer whisper-normalizer pandas
```
  • smallestai → Pulse STT client
  • jiwer → WER/CER computation
  • whisper-normalizer → normalization that matches the official guidance
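For intuition before running the toolkit: WER is (substitutions + deletions + insertions) divided by the number of reference words. jiwer computes this robustly for you; the pure-Python sketch below just makes the underlying Levenshtein recurrence concrete.

```python
def word_error_rate(reference, hypothesis):
    """(S + D + I) / reference length, via the classic word-level
    Levenshtein dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("thank you for calling", "thank you for call"))  # 0.25
```

One substitution over four reference words gives 0.25; note WER can exceed 1.0 when the hypothesis inserts many extra words.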

3. Transcribe + normalize

```python
import os

from jiwer import cer, wer
from smallestai.waves import WavesClient
from whisper_normalizer.english import EnglishTextNormalizer

client = WavesClient(api_key=os.environ["SMALLEST_AI_API_KEY"])
normalizer = EnglishTextNormalizer()

def run_sample(sample):
    response = client.transcribe(
        audio_file=sample["audio"],
        language=sample["language"],
        # Read the flags from the sample dict so per-config overrides
        # (step 6) actually take effect; both default to True.
        word_timestamps=sample.get("word_timestamps", True),
        diarize=sample.get("diarize", True),
    )
    # Normalize both sides identically so WER reflects real recognition
    # errors, not casing or punctuation differences.
    ref = normalizer(sample["reference"])
    hyp = normalizer(response.transcription)
    return {
        "path": sample["audio"],
        "wer": wer(ref, hyp),
        "cer": cer(ref, hyp),
        "latency_ms": response.metrics["latency_ms"],
        "rtf": response.metrics["real_time_factor"],
        "transcription": response.transcription,
    }
```

4. Batch evaluation + aggregation

```python
import pandas as pd

results = [run_sample(s) for s in dataset]
df = pd.DataFrame(results)

summary = {
    "samples": len(df),
    "avg_wer": df.wer.mean(),
    "p95_wer": df.wer.quantile(0.95),
    "avg_latency_ms": df.latency_ms.mean(),
    "avg_rtf": df.rtf.mean(),
}
```
  • WER / CER per use case and language.
  • Time to first result and RTF from response.metrics.
  • Diarization coverage: % of utterance entries that carry a speaker label.
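The metadata from step 1 pays off here: joining it onto the results frame lets you pivot any metric per language or use case with a `groupby`. A sketch with synthetic numbers (the column names are illustrative):

```python
import pandas as pd

# Synthetic results; in practice this is the df from the batch run,
# joined with the accent/quality metadata tracked in step 1.
df = pd.DataFrame([
    {"language": "en", "use_case": "support", "wer": 0.05, "latency_ms": 60},
    {"language": "en", "use_case": "meeting", "wer": 0.09, "latency_ms": 75},
    {"language": "es", "use_case": "support", "wer": 0.07, "latency_ms": 64},
])

# Mean WER and latency per language; swap in "use_case" or an
# accent column to slice along other axes.
by_lang = df.groupby("language")[["wer", "latency_ms"]].mean()
print(by_lang)
```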

5. Error analysis

```python
def breakdown(df):
    worst = df.sort_values("wer", ascending=False).head(5)[["path", "wer", "transcription"]]
    return worst.to_dict(orient="records")

outliers = breakdown(df)
```
  • Classify errors into substitutions, deletions, insertions.
  • Highlight audio traits (noise, accent) that correlate with higher WER.
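jiwer can report these error classes directly; if you want the taxonomy without extra dependencies, backtracing the word-level Levenshtein table yields the same counts. A self-contained sketch (when several minimal alignments exist, this one prefers substitutions):

```python
def edit_ops(reference, hypothesis):
    """Count substitutions, deletions, insertions between word sequences
    by backtracing the Levenshtein table."""
    ref, hyp = reference.split(), hypothesis.split()
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    subs = dels = ins = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1              # match: no error
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            subs += 1; i, j = i - 1, j - 1   # substitution
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1; i -= 1                # deletion: word missing from hyp
        else:
            ins += 1; j -= 1                 # insertion: extra word in hyp
    return {"substitutions": subs, "deletions": dels, "insertions": ins}

print(edit_ops("a b c", "a c"))  # {'substitutions': 0, 'deletions': 1, 'insertions': 0}
```

Aggregating these counts per metadata slice (noisy vs. clean, accent, etc.) turns the bullet points above into a concrete table.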

6. Compare configurations

```python
configs = [
    {"language": "en", "word_timestamps": True},
    {"language": "multi", "word_timestamps": True, "diarize": True},
]

def evaluate_config(config):
    # Merge the config over each sample so run_sample picks up the overrides.
    return [run_sample({**s, **config}) for s in dataset]

for config in configs:
    cfg_results = pd.DataFrame(evaluate_config(config))
    print(config, cfg_results.wer.mean())
```

Use this to decide whether to enable diarization, sentence-level timestamps, or enrichment features; the official evaluation doc recommends capturing cost/latency impact alongside accuracy.

7. Publish the report

Include:

  1. Dataset description + rationale
  2. Metrics table (WER/CER/TTFR/RTF, p50/p90/p95)
  3. Error taxonomy with audio snippets
  4. Configuration recommendation (e.g., language=multi, word_timestamps=true, diarize=true)
  5. Follow-up experiments or model versions to track
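The metrics table and the JSON summary can be generated mechanically from the per-sample results. A stdlib-only sketch using a nearest-rank percentile, with field names chosen to mirror the example summary in this guide:

```python
import json
import math

def percentile(values, p):
    """Nearest-rank percentile: the value at ceil(p/100 * n) in sorted order."""
    ordered = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

def build_report(dataset_name, wers, latencies):
    """Assemble a shareable summary dict from per-sample metrics."""
    return {
        "dataset": dataset_name,
        "samples": len(wers),
        "average_wer": sum(wers) / len(wers),
        "p50_wer": percentile(wers, 50),
        "p95_wer": percentile(wers, 95),
        "average_latency_ms": sum(latencies) / len(latencies),
        "p95_latency_ms": percentile(latencies, 95),
    }

# Toy numbers; in practice pass df.wer and df.latency_ms from step 4.
report = build_report("contact-center-q1", [0.04, 0.06, 0.08, 0.10], [55, 60, 70, 90])
print(json.dumps(report, indent=2))
```

For large result sets you may prefer interpolated percentiles (`df.quantile`), but nearest-rank is simple and deterministic for a report.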

Example JSON summary

```json
{
  "dataset": "contact-center-q1",
  "samples": 120,
  "average_wer": 0.064,
  "average_cer": 0.028,
  "average_latency_ms": 61.3,
  "average_rtf": 0.41,
  "p95_latency_ms": 88.2,
  "timestamp": "2025-01-15T10:00:00Z"
}
```

That completes the end-to-end evaluation workflow. With these steps you can quantify the strengths and weaknesses of any STT model, and rerun the same harness whenever a new model version ships.