Evaluation Walkthrough


Our evaluation guide outlines a repeatable process: choose representative audio, generate transcripts, compute WER/CER/latency, and document findings. Use the streamlined steps below (with ready-to-run snippets) to mirror that workflow.

1. Assemble your dataset matrix

  • Collect 50–200 files per use case (support calls, meetings, media, etc.).
  • Produce verified transcripts plus optional speaker labels and timestamps.
  • Track metadata for accent, language, and audio quality so you can pivot metrics later.
```python
dataset = [
    {"audio": "samples/en_agent01.wav", "reference": "Thank you for calling.", "language": "en"},
    {"audio": "samples/es_call02.wav", "reference": "Hola, ¿en qué puedo ayudarte?", "language": "es"},
]
```
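If you track your files in a manifest instead of inline Python, the same structure can be loaded from CSV so accent and quality metadata travel with each sample. A minimal sketch, assuming a hypothetical manifest with `accent` and `audio_quality` columns:

```python
import csv
import io

# Hypothetical manifest: one row per audio file; the extra columns
# (accent, audio_quality) let you pivot metrics by them later.
MANIFEST = """audio,reference,language,accent,audio_quality
samples/en_agent01.wav,Thank you for calling.,en,us,clean
samples/es_call02.wav,"Hola, ¿en qué puedo ayudarte?",es,es,noisy
"""

def load_dataset(text):
    """Parse a CSV manifest into the list-of-dicts shape used above."""
    return list(csv.DictReader(io.StringIO(text)))

dataset = load_dataset(MANIFEST)
```

In practice you would pass a real file handle to `csv.DictReader` instead of the inline string.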

2. Install the evaluation toolkit

```shell
pip install smallestai jiwer whisper-normalizer pandas
```
  • smallestai → Pulse STT client
  • jiwer → WER/CER computation
  • whisper-normalizer → normalization that matches the official guidance
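For intuition before running the toolkit: WER is (substitutions + deletions + insertions) divided by the number of reference words. jiwer computes this robustly for you; the pure-Python sketch below just makes the underlying Levenshtein recurrence concrete.

```python
def word_error_rate(reference, hypothesis):
    """(S + D + I) / reference length, via the classic word-level
    Levenshtein dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("thank you for calling", "thank you for call"))  # 0.25
```

One substitution over four reference words gives 0.25; note WER can exceed 1.0 when the hypothesis inserts many extra words.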

3. Transcribe + normalize

```python
import os

from jiwer import cer, wer
from smallestai.waves import WavesClient
from whisper_normalizer.english import EnglishTextNormalizer

client = WavesClient(api_key=os.environ["SMALLEST_AI_API_KEY"])
normalizer = EnglishTextNormalizer()

def run_sample(sample):
    response = client.transcribe(
        audio_file=sample["audio"],
        language=sample["language"],
        # Read the flags from the sample dict so per-config overrides
        # (step 6) actually take effect; both default to True.
        word_timestamps=sample.get("word_timestamps", True),
        diarize=sample.get("diarize", True),
    )
    # Normalize both sides identically so WER reflects real recognition
    # errors, not casing or punctuation differences.
    ref = normalizer(sample["reference"])
    hyp = normalizer(response.transcription)
    return {
        "path": sample["audio"],
        "wer": wer(ref, hyp),
        "cer": cer(ref, hyp),
        "latency_ms": response.metrics["latency_ms"],
        "rtf": response.metrics["real_time_factor"],
        "transcription": response.transcription,
    }
```

4. Batch evaluation + aggregation

```python
import pandas as pd

results = [run_sample(s) for s in dataset]
df = pd.DataFrame(results)

summary = {
    "samples": len(df),
    "avg_wer": df.wer.mean(),
    "p95_wer": df.wer.quantile(0.95),
    "avg_latency_ms": df.latency_ms.mean(),
    "avg_rtf": df.rtf.mean(),
}
```
  • WER / CER per use case and language.
  • Time to first result and RTF from response.metrics.
  • Diarization coverage: % of utterance entries that carry a speaker label.
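The metadata from step 1 pays off here: joining it onto the results frame lets you pivot any metric per language or use case with a `groupby`. A sketch with synthetic numbers (the column names are illustrative):

```python
import pandas as pd

# Synthetic results; in practice this is the df from the batch run,
# joined with the accent/quality metadata tracked in step 1.
df = pd.DataFrame([
    {"language": "en", "use_case": "support", "wer": 0.05, "latency_ms": 60},
    {"language": "en", "use_case": "meeting", "wer": 0.09, "latency_ms": 75},
    {"language": "es", "use_case": "support", "wer": 0.07, "latency_ms": 64},
])

# Mean WER and latency per language; swap in "use_case" or an
# accent column to slice along other axes.
by_lang = df.groupby("language")[["wer", "latency_ms"]].mean()
print(by_lang)
```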

5. Error analysis

```python
def breakdown(df):
    worst = df.sort_values("wer", ascending=False).head(5)[["path", "wer", "transcription"]]
    return worst.to_dict(orient="records")

outliers = breakdown(df)
```
  • Classify errors into substitutions, deletions, insertions.
  • Highlight audio traits (noise, accent) that correlate with higher WER.
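jiwer can report these error classes directly; if you want the taxonomy without extra dependencies, backtracing the word-level Levenshtein table yields the same counts. A self-contained sketch (when several minimal alignments exist, this one prefers substitutions):

```python
def edit_ops(reference, hypothesis):
    """Count substitutions, deletions, insertions between word sequences
    by backtracing the Levenshtein table."""
    ref, hyp = reference.split(), hypothesis.split()
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    subs = dels = ins = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1              # match: no error
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            subs += 1; i, j = i - 1, j - 1   # substitution
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1; i -= 1                # deletion: word missing from hyp
        else:
            ins += 1; j -= 1                 # insertion: extra word in hyp
    return {"substitutions": subs, "deletions": dels, "insertions": ins}

print(edit_ops("a b c", "a c"))  # {'substitutions': 0, 'deletions': 1, 'insertions': 0}
```

Aggregating these counts per metadata slice (noisy vs. clean, accent, etc.) turns the bullet points above into a concrete table.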

6. Compare configurations

```python
configs = [
    {"language": "en", "word_timestamps": True},
    {"language": "multi", "word_timestamps": True, "diarize": True},
]

def evaluate_config(config):
    # Merge the config over each sample so run_sample picks up the overrides.
    return [run_sample({**s, **config}) for s in dataset]

for config in configs:
    cfg_results = pd.DataFrame(evaluate_config(config))
    print(config, cfg_results.wer.mean())
```

Use this to decide whether to enable diarization, sentence-level timestamps, or enrichment features; the official evaluation doc recommends capturing cost/latency impact alongside accuracy.

7. Publish the report

Include:

  1. Dataset description + rationale
  2. Metrics table (WER/CER/TTFR/RTF, p50/p90/p95)
  3. Error taxonomy with audio snippets
  4. Configuration recommendation (e.g., language=multi, word_timestamps=true, diarize=true)
  5. Follow-up experiments or model versions to track
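The metrics table and the JSON summary can be generated mechanically from the per-sample results. A stdlib-only sketch using a nearest-rank percentile, with field names chosen to mirror the example summary in this guide:

```python
import json
import math

def percentile(values, p):
    """Nearest-rank percentile: the value at ceil(p/100 * n) in sorted order."""
    ordered = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

def build_report(dataset_name, wers, latencies):
    """Assemble a shareable summary dict from per-sample metrics."""
    return {
        "dataset": dataset_name,
        "samples": len(wers),
        "average_wer": sum(wers) / len(wers),
        "p50_wer": percentile(wers, 50),
        "p95_wer": percentile(wers, 95),
        "average_latency_ms": sum(latencies) / len(latencies),
        "p95_latency_ms": percentile(latencies, 95),
    }

# Toy numbers; in practice pass df.wer and df.latency_ms from step 4.
report = build_report("contact-center-q1", [0.04, 0.06, 0.08, 0.10], [55, 60, 70, 90])
print(json.dumps(report, indent=2))
```

For large result sets you may prefer interpolated percentiles (`df.quantile`), but nearest-rank is simple and deterministic for a report.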

Example JSON summary

```json
{
  "dataset": "contact-center-q1",
  "samples": 120,
  "average_wer": 0.064,
  "average_cer": 0.028,
  "average_latency_ms": 61.3,
  "average_rtf": 0.41,
  "p95_latency_ms": 88.2,
  "timestamp": "2025-01-15T10:00:00Z"
}
```

That completes the end-to-end evaluation workflow. With these steps you can quantify the strengths and weaknesses of any STT model, and rerun the same harness whenever a new model version ships.