Evaluation Walkthrough
Our evaluation guide outlines a repeatable process: choose representative audio, generate transcripts, compute WER/CER/latency, and document findings. The ready-to-run snippets below call the Pulse pre-recorded REST endpoint directly so you can audit the methodology and adapt it to any HTTP client.
This guide uses the Pulse REST API directly (via requests) rather than the smallestai Python SDK. The SDK is a convenience wrapper; the REST endpoint is the contract. Use the same approach in any language with an HTTP client — the request and response shapes are identical.
1. Assemble your dataset matrix
- Collect 50–200 files per use case (support calls, meetings, media, etc.).
- Produce verified reference transcripts plus optional speaker labels and timestamps.
- Track metadata for accent, language, and audio quality so you can pivot metrics later.
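One way to keep the dataset matrix auditable is a flat manifest with one row per audio file. The sketch below is a minimal example using only the standard library; the column names, file paths, and label values are illustrative assumptions, not a format that Pulse requires.

```python
import csv
import io
from collections import Counter
from dataclasses import asdict, dataclass, fields


@dataclass
class ManifestRow:
    # All field names are hypothetical; rename them to match your own pipeline.
    audio_path: str       # path to the audio file
    reference_path: str   # path to the verified reference transcript
    use_case: str         # e.g. "support_call", "meeting", "media"
    language: str         # e.g. "en", "hi"
    accent: str           # free-form label chosen by your team
    audio_quality: str    # e.g. "clean", "noisy", "telephony"


def write_manifest(rows, handle):
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(ManifestRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))


def coverage(rows, *keys):
    """Count files per combination of metadata keys to spot thin cells."""
    return Counter(tuple(getattr(row, key) for key in keys) for row in rows)


# Illustrative rows only; the paths do not need to exist for this demo.
rows = [
    ManifestRow("audio/call_001.wav", "refs/call_001.txt", "support_call", "en", "indian", "telephony"),
    ManifestRow("audio/call_002.wav", "refs/call_002.txt", "support_call", "en", "us", "clean"),
    ManifestRow("audio/meet_001.wav", "refs/meet_001.txt", "meeting", "hi", "indian", "noisy"),
]
buffer = io.StringIO()
write_manifest(rows, buffer)
cells = coverage(rows, "use_case", "language")
```

Checking `cells` before you transcribe anything shows which use case and language combinations fall short of the 50–200 file target.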
2. Install the evaluation toolkit
- requests → HTTP client for the Pulse REST API
- jiwer → WER/CER computation
- whisper-normalizer → text normalization that matches official ASR evaluation guidance
- pandas → aggregating results
The Python standard library (wave, time) covers audio-duration and latency measurement — no extra dependencies needed.
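As a rough sketch of how those two standard-library modules can be used, the helpers below read the duration of a WAV file and time an arbitrary call. The helper names are hypothetical, and the demo writes a short synthetic tone to a temporary file so it runs without any real recordings. Note that `wave` only reads WAV containers; other formats would need a different duration source.

```python
import math
import struct
import tempfile
import time
import wave


def wav_duration_seconds(path):
    """Audio duration = number of frames / frames per second."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0


def write_test_tone(path, seconds=0.5, rate=16000):
    """Write a mono 16-bit 440 Hz tone; used only to make the demo self-contained."""
    samples = (int(8000 * math.sin(2 * math.pi * 440 * i / rate)) for i in range(int(seconds * rate)))
    frames = b"".join(struct.pack("<h", s) for s in samples)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(frames)


with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tone_path = tmp.name
write_test_tone(tone_path)

duration_s = wav_duration_seconds(tone_path)
_, elapsed_ms = timed(time.sleep, 0.01)
```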
3. Transcribe + normalize
The Pulse pre-recorded endpoint accepts the raw audio bytes as application/octet-stream and returns JSON with transcription, words[], utterances[], and metadata. Latency is measured client-side; Real-Time Factor (RTF) is wall_time ÷ audio_duration — values below 1.0 indicate faster-than-real-time processing.
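The request described above could look roughly like the following sketch. Several details are assumptions rather than documented facts: the endpoint URL is passed in by the caller (take it from the Pulse API reference), the `Authorization: Bearer` header is a guess at the auth scheme, and the `post` parameter is a hypothetical hook that lets you dry-run the function with a stub instead of `requests.post`.

```python
import time


def transcribe(audio_bytes, audio_duration_s, url, api_key, params=None, post=None):
    """POST raw audio bytes and return the transcript plus client-side timing."""
    if post is None:
        import requests  # imported lazily so a stub can be used without the package
        post = requests.post
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumption: bearer-token auth
        "Content-Type": "application/octet-stream",
    }
    start = time.perf_counter()
    response = post(url, params=params or {}, headers=headers, data=audio_bytes, timeout=120)
    wall_s = time.perf_counter() - start
    response.raise_for_status()
    body = response.json()
    return {
        "transcription": body.get("transcription", ""),
        "words": body.get("words", []),
        "utterances": body.get("utterances", []),
        "latency_ms": wall_s * 1000.0,
        "rtf": wall_s / audio_duration_s,  # wall_time / audio_duration
    }


# Dry run with a stub so no network call is made; the URL is a placeholder.
class StubResponse:
    def raise_for_status(self):
        pass

    def json(self):
        return {"transcription": "hello world", "words": [], "utterances": []}


sent = {}


def stub_post(url, **kwargs):
    sent.update(kwargs, url=url)
    return StubResponse()


result = transcribe(b"\x00\x00", 2.0, "https://example.invalid/pulse", "KEY", post=stub_post)
```

Keeping the timer tight around the HTTP call means `latency_ms` excludes file reading and normalization, which makes runs comparable across machines with different disks.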
For multilingual evaluation, swap the normalizer:
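A minimal sketch of that swap, assuming the `whisper-normalizer` package exposes `EnglishTextNormalizer` and `BasicTextNormalizer` as in its README. The `make_normalizer` helper and the language codes are hypothetical. If the package is not installed, the code falls back to a crude stand-in that only lowercases and strips punctuation, which is not equivalent to the real normalizers.

```python
import unicodedata

try:
    from whisper_normalizer.basic import BasicTextNormalizer
    from whisper_normalizer.english import EnglishTextNormalizer
except ImportError:  # package not installed: use the rough fallback below
    BasicTextNormalizer = EnglishTextNormalizer = None


def _fallback(text):
    """Crude stand-in: lowercase and replace punctuation/symbols with spaces."""
    return "".join(" " if unicodedata.category(c)[0] in "PS" else c for c in text.lower())


def make_normalizer(language):
    """English gets the English-specific rules; everything else gets the basic ones."""
    if EnglishTextNormalizer is None:
        base = _fallback
    elif language == "en":
        base = EnglishTextNormalizer()
    else:
        base = BasicTextNormalizer()
    # Collapse whitespace so references and hypotheses tokenize identically.
    return lambda text: " ".join(base(text).split())


normalize_en = make_normalizer("en")
normalize_multi = make_normalizer("multi")

english = normalize_en("Hello, World!")
spanish = normalize_multi("Hola,  Mundo!")
```

Whichever normalizer you pick, apply the same one to both the reference and the hypothesis, otherwise formatting differences show up as word errors.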
4. Batch evaluation + aggregation
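The sketch below illustrates the aggregation arithmetic with the standard library only. The word-level edit distance is a stand-in for `jiwer`, written out so the calculation can be audited; in a real run you would likely use `jiwer` and `pandas` instead. The record keys and the sample sentences are made up for illustration. WER is pooled per group (total errors divided by total reference words) rather than averaged per file, so short files do not dominate.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level Levenshtein distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)


def aggregate(records, key):
    """Pool errors and reference words per group, then compute WER per group."""
    totals = defaultdict(lambda: {"errors": 0, "ref_words": 0, "files": 0})
    for rec in records:
        errors, ref_words = word_errors(rec["reference"], rec["hypothesis"])
        group = totals[rec[key]]
        group["errors"] += errors
        group["ref_words"] += ref_words
        group["files"] += 1
    return {
        name: {
            "files": t["files"],
            "wer": t["errors"] / t["ref_words"] if t["ref_words"] else 0.0,
        }
        for name, t in totals.items()
    }


# Synthetic, already-normalized text; not real Pulse output.
records = [
    {"use_case": "support_call", "reference": "please reset my password", "hypothesis": "please reset my password"},
    {"use_case": "support_call", "reference": "the invoice is overdue", "hypothesis": "the invoice was overdue"},
    {"use_case": "meeting", "reference": "let us start the meeting now", "hypothesis": "let us start meeting now"},
]
summary = aggregate(records, "use_case")
```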
Recommended metrics to report
- WER / CER per use case and language.
- End-to-end latency (p50 / p90 / p95) from latency_ms.
- RTF from rtf: values below 1.0 mean Pulse processed audio faster than real-time. Pulse STT typically runs near 0.4 on clean inputs (see Metrics Overview).
- Diarization coverage: share of utterances entries that include a speaker field.
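The latency percentiles and the diarization coverage can be computed along these lines. This is a sketch using the nearest-rank percentile definition; `pandas` quantiles interpolate by default, so its numbers can differ slightly. The latency values and utterance entries are synthetic, and the `text` key inside each utterance is an assumption about the response shape.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def diarization_coverage(utterances):
    """Share of utterance entries that carry a speaker field."""
    if not utterances:
        return 0.0
    return sum(1 for u in utterances if "speaker" in u) / len(utterances)


# Synthetic client-side latencies in milliseconds, one per request.
latencies_ms = [310.0, 295.0, 420.0, 380.0, 305.0, 900.0, 330.0, 350.0, 315.0, 340.0]
report = {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 95)}

# Synthetic utterances; one entry is missing its speaker label.
utterances = [
    {"text": "hi", "speaker": 0},
    {"text": "hello", "speaker": 1},
    {"text": "ok"},
    {"text": "bye", "speaker": 0},
]
coverage_share = diarization_coverage(utterances)
```

With only ten samples, p95 is simply the slowest request, which is one reason to evaluate on the larger file counts suggested in step 1.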
5. Error analysis
- Classify errors into substitutions, deletions, insertions (jiwer.process_words(ref, hyp).alignments gives the alignment).
- Highlight audio traits (noise, accent, telephony codec) that correlate with higher WER.
- For proper-noun mistakes, queue them for a pronunciation dictionary if you also use Lightning TTS, or for the Pulse keyword boosting feature on the streaming endpoint.
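A possible way to turn an alignment into an error taxonomy is sketched below. It assumes each alignment chunk has `type` ("equal", "substitute", "delete", "insert") plus `ref_start_idx`, `ref_end_idx`, `hyp_start_idx`, and `hyp_end_idx` attributes, which is how `jiwer` alignment chunks are shaped to the best of my knowledge. The demo chunks are hand-built with `SimpleNamespace` so the example runs without `jiwer`; `summarize_chunks` is a hypothetical helper name.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_chunks(ref_words, hyp_words, chunks):
    """Count error words by type and collect (type, reference span, hypothesis span) examples."""
    counts = Counter(substitute=0, delete=0, insert=0)
    examples = []
    for chunk in chunks:
        if chunk.type == "equal":
            continue
        ref_span = ref_words[chunk.ref_start_idx:chunk.ref_end_idx]
        hyp_span = hyp_words[chunk.hyp_start_idx:chunk.hyp_end_idx]
        counts[chunk.type] += max(len(ref_span), len(hyp_span))
        examples.append((chunk.type, " ".join(ref_span), " ".join(hyp_span)))
    return counts, examples


def chunk(kind, rs, re_, hs, he):
    """Build a stand-in for an alignment chunk (demo only)."""
    return SimpleNamespace(type=kind, ref_start_idx=rs, ref_end_idx=re_, hyp_start_idx=hs, hyp_end_idx=he)


ref_words = "call doctor mehta at nine".split()
hyp_words = "call doctor meta at at nine".split()
chunks = [
    chunk("equal", 0, 2, 0, 2),
    chunk("substitute", 2, 3, 2, 3),  # mehta -> meta (proper noun)
    chunk("equal", 3, 4, 3, 4),
    chunk("insert", 4, 4, 4, 5),      # extra "at"
    chunk("equal", 4, 5, 5, 6),
]
counts, examples = summarize_chunks(ref_words, hyp_words, chunks)

# With jiwer installed, the same helper would be fed from its output, e.g.:
#   out = jiwer.process_words(ref, hyp)
#   summarize_chunks(out.references[0], out.hypotheses[0], out.alignments[0])
```

The collected examples are the raw material for the proper-noun queue: substitutions whose reference span is a name or product term are candidates for the pronunciation dictionary or keyword boosting.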
6. Compare configurations
Run the same dataset under each candidate configuration to decide whether to enable diarization, sentence-level timestamps, or enrichment features, and record the cost and latency impact alongside accuracy.
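One way to organize such a comparison is to enumerate the configurations as query-parameter dictionaries and report each run as a delta against a baseline. The sketch below assumes the `language` and `diarize` parameters are passed as query strings; the `"en"` value, the metric names, and all numbers are synthetic placeholders, not measured Pulse results.

```python
from itertools import product


def config_grid(options):
    """Expand {param: [values]} into one dict per combination."""
    keys = list(options)
    return [dict(zip(keys, combo)) for combo in product(*(options[k] for k in keys))]


def deltas_vs_baseline(results, baseline_name):
    """Subtract the baseline's metrics from every other configuration's metrics."""
    base = results[baseline_name]
    return {
        name: {metric: round(values[metric] - base[metric], 4) for metric in base}
        for name, values in results.items()
        if name != baseline_name
    }


grid = config_grid({"language": ["en", "multi"], "diarize": ["false", "true"]})

# Synthetic metrics for two runs over the same dataset.
results = {
    "baseline": {"wer": 0.10, "latency_p50_ms": 300.0},
    "diarize_on": {"wer": 0.10, "latency_p50_ms": 360.0},
}
deltas = deltas_vs_baseline(results, "baseline")
```

Reporting deltas rather than absolute values makes it easier to see whether a feature's accuracy gain justifies its latency cost on your own traffic.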
7. Publish the report
Include:
- Dataset description + rationale
- Metrics table (WER / CER / latency / RTF, p50 / p90 / p95)
- Error taxonomy with audio snippets
- Configuration recommendation (e.g., language=multi, word_timestamps=true, diarize=true)
- Follow-up experiments or model versions to track
Example JSON summary
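A summary along these lines can be assembled and serialized with the standard library. The structure and key names below are a hypothetical layout, not a schema defined by Pulse, and the metric values are deliberately left as null placeholders to be filled from your own run.

```python
import json


def build_summary(dataset, configuration, metrics):
    """Bundle the report sections into one JSON-serializable dict."""
    return {"dataset": dataset, "configuration": configuration, "metrics": metrics}


summary = build_summary(
    dataset={"use_case": "support_call", "files": 120, "languages": ["en"]},  # illustrative
    configuration={"language": "multi", "word_timestamps": True, "diarize": True},
    metrics={
        # Placeholders: replace None with the values computed in steps 3 to 5.
        "wer": None,
        "cer": None,
        "latency_ms": {"p50": None, "p90": None, "p95": None},
        "rtf": {"p50": None, "p90": None, "p95": None},
    },
)
summary_json = json.dumps(summary, indent=2)
```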
The latency_ms you measure with time.perf_counter() includes the network round-trip from your client to api.smallest.ai, so numbers vary with client region. Pulse's published server-side TTFT of ~64 ms is measured at the inference pod (see the Pulse model card), so your wall-clock measurement is typically higher because it adds network time. RTF stays meaningful either way: as measured here it is wall_time / audio_duration, so longer audio dilutes the fixed per-request overhead.
This completes the evaluation walkthrough. With these steps, you can identify the strengths and weaknesses of Pulse STT, or compare it against any other ASR system that returns a similar response shape, on workloads representative of your production traffic.

