Evaluation Walkthrough
Our evaluation guide outlines a repeatable process: choose representative audio, generate transcripts, compute WER/CER/latency, and document findings. Use the streamlined steps below (with ready-to-run snippets) to mirror that workflow.
1. Assemble your dataset matrix
- Collect 50–200 files per use case (support calls, meetings, media, etc.).
- Produce verified transcripts plus optional speaker labels and timestamps.
- Track metadata for accent, language, and audio quality so you can pivot metrics later.
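A dataset matrix like this can be tracked as a simple manifest. The sketch below is illustrative (the field names are not a required schema) and shows why per-file metadata pays off: grouping by any field for later metric pivots becomes trivial.

```python
from collections import defaultdict

# Illustrative manifest: one entry per audio file, with the metadata
# fields suggested above (accent, language, audio quality, reference).
manifest = [
    {"path": "calls/0001.wav", "use_case": "support", "language": "en",
     "accent": "indian", "snr_db": 22, "reference": "calls/0001.txt"},
    {"path": "meet/0001.wav", "use_case": "meetings", "language": "en",
     "accent": "us", "snr_db": 14, "reference": "meet/0001.txt"},
]

# Pivot by any metadata field, e.g. accent, to slice metrics later.
by_accent = defaultdict(list)
for item in manifest:
    by_accent[item["accent"]].append(item["path"])

print(sorted(by_accent))  # ['indian', 'us']
```

The same grouping works for `language`, `use_case`, or SNR buckets when you report WER per slice.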
2. Install the evaluation toolkit
- `smallestai` → Pulse STT client
- `jiwer` → WER/CER computation
- `whisper-normalizer` → normalization that matches the official guidance
3. Transcribe + normalize
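A minimal sketch of this step. The Pulse client call is hypothetical (method names may differ from the actual `smallestai` SDK), and the regex normalizer is a lightweight stand-in for `whisper-normalizer`'s text normalizers; both references and hypotheses must go through the same normalization before scoring.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace before scoring.
    Stand-in for whisper-normalizer; apply the same function to both
    the reference transcript and the model output."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def transcribe(path: str) -> str:
    # Hypothetical SDK usage -- adapt to the real smallestai client:
    # from smallestai import PulseClient
    # return PulseClient().transcribe(file=path).text
    raise NotImplementedError

hyp = normalize("Hello, WORLD!  This is  Pulse.")
print(hyp)  # hello world this is pulse
```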
4. Batch evaluation + aggregation
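The batch step loops over (reference, hypothesis) pairs, scores each, and averages. In practice `jiwer.wer` / `jiwer.cer` do the scoring; the Levenshtein helper below is a dependency-free stand-in so the aggregation logic is self-contained.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution/match
        prev = cur
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / len(ref)

# Aggregate over a batch of (reference, hypothesis) pairs.
pairs = [("the cat sat", "the cat sat"),
         ("turn volume up", "turn volume up please")]
avg_wer = sum(wer(r, h) for r, h in pairs) / len(pairs)
print(round(avg_wer, 3))
```

Report the average per use case and language, not just one global number, so regressions in a single slice stay visible.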
Recommended metrics to report
- WER / CER per use case and language.
- Time to first result (TTFR) and real-time factor (RTF) from response.metrics.
- Diarization coverage: % of entries with a speaker label.
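Latency metrics should be reported as percentiles, not means. A small sketch of nearest-rank p50/p90/p95 aggregation over per-file TTFR values (the sample numbers are illustrative; in practice they come from response.metrics):

```python
def percentile(values, p):
    """Nearest-rank percentile over a non-empty list of numbers."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Illustrative per-file time-to-first-result values, in seconds.
ttfr = [0.42, 0.31, 0.55, 0.38, 0.61, 0.47, 0.29, 0.52, 0.44, 0.36]
summary = {p: percentile(ttfr, p) for p in (50, 90, 95)}
print(summary)  # {50: 0.42, 90: 0.55, 95: 0.61}
```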
5. Error analysis
- Classify errors into substitutions, deletions, insertions.
- Highlight audio traits (noise, accent) that correlate with higher WER.
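The substitution/deletion/insertion split falls out of the same dynamic program that computes WER, if you track which edit each cell chose. A self-contained word-level sketch:

```python
def error_counts(ref: str, hyp: str):
    """Return (substitutions, deletions, insertions) for a minimal-cost
    word alignment of hyp against ref."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = (cost, (S, D, I)) for aligning r[:i] with h[:j]
    dp = [[None] * (len(h) + 1) for _ in range(len(r) + 1)]
    dp[0][0] = (0, (0, 0, 0))
    for i in range(1, len(r) + 1):          # first column: all deletions
        c, (s, d, ins) = dp[i - 1][0]
        dp[i][0] = (c + 1, (s, d + 1, ins))
    for j in range(1, len(h) + 1):          # first row: all insertions
        c, (s, d, ins) = dp[0][j - 1]
        dp[0][j] = (c + 1, (s, d, ins + 1))
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            options = []
            c, (s, d, ins) = dp[i - 1][j - 1]      # match or substitution
            sub = int(r[i - 1] != h[j - 1])
            options.append((c + sub, (s + sub, d, ins)))
            c, (s, d, ins) = dp[i - 1][j]          # deletion
            options.append((c + 1, (s, d + 1, ins)))
            c, (s, d, ins) = dp[i][j - 1]          # insertion
            options.append((c + 1, (s, d, ins + 1)))
            dp[i][j] = min(options)
    return dp[-1][-1][1]

print(error_counts("turn the volume up", "turn volume up please"))
# (0, 1, 1): "the" deleted, "please" inserted
```

Tallying these counts per metadata slice (noise level, accent) is what surfaces the correlations called out above.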
6. Compare configurations
Use this to decide whether to enable diarization, sentence-level timestamps, or enrichment features; the official evaluation doc recommends capturing cost/latency impact alongside accuracy.
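A comparison is easiest to reason about as a small table of accuracy, latency, and cost per configuration. The sketch below uses made-up numbers and configuration names purely to illustrate the shape of the tradeoff:

```python
# Illustrative per-configuration results; populate from your own runs.
runs = {
    "baseline":         {"wer": 0.112, "p90_ttfr_s": 0.48, "cost_per_hr": 0.10},
    "+diarize":         {"wer": 0.115, "p90_ttfr_s": 0.63, "cost_per_hr": 0.14},
    "+word_timestamps": {"wer": 0.112, "p90_ttfr_s": 0.51, "cost_per_hr": 0.11},
}

best = min(runs, key=lambda k: runs[k]["wer"])
for name, m in runs.items():
    print(f"{name:18s} WER={m['wer']:.3f} "
          f"p90 TTFR={m['p90_ttfr_s']:.2f}s cost=${m['cost_per_hr']:.2f}/hr")
print("lowest WER:", best)
```

Seeing all three columns side by side makes it clear when a feature (e.g. diarization) costs more in latency and price than it returns in accuracy.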
7. Publish the report
Include:
- Dataset description + rationale
- Metrics table (WER/CER/TTFR/RTF, p50/p90/p95)
- Error taxonomy with audio snippets
- Configuration recommendation (e.g., language=multi, word_timestamps=true, diarize=true)
- Follow-up experiments or model versions to track
Example JSON summary
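A possible shape for the summary (all field names and values here are illustrative, not a prescribed schema):

```json
{
  "dataset": "support_calls_v1",
  "files": 120,
  "config": {"language": "multi", "word_timestamps": true, "diarize": true},
  "metrics": {
    "wer": 0.112,
    "cer": 0.054,
    "ttfr_s": {"p50": 0.42, "p90": 0.55, "p95": 0.61},
    "rtf": 0.18,
    "diarization_coverage": 0.97
  },
  "errors": {"substitutions": 412, "deletions": 178, "insertions": 96}
}
```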
This completes the self-serve metric evaluation workflow. With these steps, you can quantify the strengths and weaknesses of any STT model on your own data.

