Our evaluation guide outlines a repeatable process: choose representative audio, generate transcripts, compute WER/CER/latency, and document findings. The ready-to-run snippets below call the Pulse pre-recorded REST endpoint directly so you can audit the methodology and adapt it to any HTTP client.
This guide uses the Pulse REST API directly (via requests) rather than the smallestai Python SDK. The SDK is a convenience wrapper; the REST endpoint is the contract. Use the same approach in any language with an HTTP client — the request and response shapes are identical.
requests → HTTP client for the Pulse REST API
jiwer → WER/CER computation
whisper-normalizer → text normalization that matches official ASR evaluation guidance
pandas → aggregating results
The Python standard library (wave, time) covers audio-duration and latency measurement, so no extra dependencies are needed.
The Pulse pre-recorded endpoint accepts the raw audio bytes as application/octet-stream and returns JSON with transcription, words[], utterances[], and metadata. Latency is measured client-side; Real-Time Factor (RTF) is wall_time ÷ audio_duration — values below 1.0 indicate faster-than-real-time processing.
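The request-and-timing flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the guide's official snippet: the helper names (audio_duration_seconds, post_audio, timed_transcribe) are hypothetical, the Bearer authorization header is an assumed auth scheme, and the endpoint URL is left as a caller-supplied argument because the exact path should come from the Pulse API reference. The input is assumed to be a WAV file so the standard-library wave module can read its duration.

```python
import io
import time
import wave


def audio_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of a WAV payload, read from its header (frames / sample rate)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def post_audio(url: str, api_key: str, wav_bytes: bytes, params: dict) -> dict:
    """Send raw audio bytes as application/octet-stream and return the JSON body."""
    import requests  # third-party; imported here so the helpers stay stdlib-only

    response = requests.post(
        url,  # caller-supplied; take the exact path from the API reference
        params=params,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/octet-stream",
        },
        data=wav_bytes,
        timeout=120,
    )
    response.raise_for_status()
    return response.json()


def timed_transcribe(send, wav_bytes: bytes) -> dict:
    """Call send(wav_bytes), time it client-side, and attach latency and RTF."""
    started = time.perf_counter()
    body = send(wav_bytes)
    wall_time = time.perf_counter() - started
    duration = audio_duration_seconds(wav_bytes)
    return {
        "transcription": body.get("transcription", ""),
        "audio_seconds": duration,
        # RTF = wall_time / audio_duration; below 1.0 is faster than real time.
        "latency_ms": {"total": wall_time * 1000.0, "rtf": wall_time / duration},
    }
```

To use it against the live API, pass a small wrapper as send, for example lambda b: post_audio(url, api_key, b, {"language": "en"}). Keeping the transport injectable also lets you exercise the timing logic offline with a stub.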
For multilingual evaluation, swap the normalizer:
latency_ms.rtf: values below 1.0 mean Pulse processed audio faster than real-time. Pulse STT typically runs near 0.4 on clean inputs (see Metrics Overview).
utterances entries that include a speaker field.
jiwer.process_words(ref, hyp).alignments gives the alignment.
Use this to decide whether to enable diarization, sentence-level timestamps, or enrichment features, and capture cost / latency impact alongside accuracy.
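To make concrete what a word-level alignment counts, here is a standard-library sketch of the substitution/deletion/insertion breakdown behind WER. It is an illustration only, not jiwer's implementation, and the function names are hypothetical; in practice jiwer remains the tool for the job. When several optimal alignments exist, the split between error types may differ from jiwer's, though the total error count should agree.

```python
def align_counts(ref_tokens, hyp_tokens):
    """Return (substitutions, deletions, insertions) from a Levenshtein alignment."""
    rows, cols = len(ref_tokens) + 1, len(hyp_tokens) + 1
    # cell[i][j] = (distance, subs, dels, ins) for ref[:i] versus hyp[:j]
    cell = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        cell[i][0] = (i, 0, i, 0)  # all reference tokens deleted
    for j in range(1, cols):
        cell[0][j] = (j, 0, 0, j)  # all hypothesis tokens inserted
    for i in range(1, rows):
        for j in range(1, cols):
            if ref_tokens[i - 1] == hyp_tokens[j - 1]:
                cell[i][j] = cell[i - 1][j - 1]  # match costs nothing
                continue
            d, s, de, ins = cell[i - 1][j - 1]
            substitute = (d + 1, s + 1, de, ins)
            d, s, de, ins = cell[i - 1][j]
            delete = (d + 1, s, de + 1, ins)
            d, s, de, ins = cell[i][j - 1]
            insert = (d + 1, s, de, ins + 1)
            cell[i][j] = min(substitute, delete, insert)
    _, subs, dels, inss = cell[-1][-1]
    return subs, dels, inss


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """WER when unit == 'word', CER when unit == 'char'."""
    split = str.split if unit == "word" else list
    ref_tokens, hyp_tokens = split(reference), split(hypothesis)
    subs, dels, inss = align_counts(ref_tokens, hyp_tokens)
    return (subs + dels + inss) / len(ref_tokens)


subs, dels, inss = align_counts(
    "the cat sat on the mat".split(), "the cat sit on mat".split()
)
print(subs, dels, inss)
```

Here one word is substituted (sat became sit) and one is dropped (the second the), so two errors over six reference words. A system dominated by deletions on noisy audio calls for different remediation than one dominated by substitutions on domain terms, which is why the breakdown is worth recording alongside the headline WER.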
Include:
language=multi, word_timestamps=true, diarize=true)
The latency_ms you measure with time.perf_counter() includes the network round-trip from your client to api.smallest.ai, so numbers vary with client region. Pulse’s published server-side TTFT of ~64 ms is measured at the inference pod (see the Pulse model card); your wall-clock measurement is typically higher because of network latency. RTF stays meaningful either way: it’s wall_time ÷ audio_duration, so longer audio dilutes the fixed per-request overhead.
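The dilution effect can be seen with a small worked example. The model here is a simplifying assumption: wall time is a fixed per-request overhead plus processing time proportional to audio length. The 0.2 s overhead is a made-up figure for illustration, and 0.4 is the typical clean-input RTF quoted earlier; client_rtf is a hypothetical helper name.

```python
def client_rtf(audio_seconds: float, overhead_seconds: float, processing_rtf: float) -> float:
    """Client-side RTF when a fixed overhead is added to proportional processing time."""
    wall_time = overhead_seconds + processing_rtf * audio_seconds
    return wall_time / audio_seconds


# Hypothetical inputs: 0.2 s of network and request overhead, processing RTF of 0.4.
for seconds in (5, 30, 300):
    print(f"{seconds:>4} s audio -> client RTF {client_rtf(seconds, 0.2, 0.4):.3f}")
```

Under these assumptions the measured RTF drifts from about 0.44 on a 5-second clip toward 0.40 on a 5-minute one. When comparing systems, keep clip lengths matched, or report RTF per duration bucket, so that overhead does not masquerade as a speed difference.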
This completes the process of self-metric evaluation. With these steps, you can identify strengths and weaknesses of Pulse STT — or compare it against any other ASR system that returns a similar response shape — on workloads representative of your production traffic.