Measuring Pulse Latency
Pulse streams transcripts back while you’re still sending audio. The two numbers that matter for most workloads are:
- End-of-turn (EOT) latency — time between a speaker stopping and the final transcript arriving. The metric voice agents are judged on. Also called Time-to-Final-Segment (TTFS) or Time-to-Complete-Transcript (TTCT) in the broader voice-agent industry.
- Time to first partial (TTFT) — startup health: how long after audio starts does the first interim transcript appear?
A third metric, RTFx, applies to pre-recorded HTTP transcription.
Track p50 and p95 for whichever metric matches your workload — averages hide tail behaviour, and the tail is what users notice.
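As a minimal sketch of that summary, assuming you have collected per-utterance latencies in milliseconds in a plain list: the function names and the nearest-rank percentile method below are illustrative choices, not something Pulse prescribes.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(samples):
    """Report the mean next to p50 and p95 so the tail is visible."""
    return {
        "mean": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
    }

# Made-up sample data: nine ordinary turns and one slow outlier.
latencies_ms = [210, 220, 225, 230, 240, 245, 250, 260, 270, 1400]
print(summarize(latencies_ms))
```

With this made-up data the mean lands well above the median because of a single slow turn, while p95 reports that slow turn directly. That is the tail an average would blur.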
This page defines each metric, attributes total latency to the four components of the streaming pipeline, and lists the common measurement pitfalls. For copy-pasteable measurement scripts (WER + latency at p50/p90/p95) see the cookbook reference scripts at the end of the page.
What “streaming latency” actually means
For streaming STT, latency is per-utterance, not per-chunk. The number that matters is the time between an event the user can name — they finished speaking, they started speaking, the first word appeared — and the transcript arriving for that event. We document two industry-standard metrics on this page plus a pre-recorded one:
- Time to first partial — startup health: how long after audio starts does the first interim transcript appear?
- End-of-turn (EOT) latency — also called Time-to-Final-Segment (TTFS) or Time-to-Complete-Transcript (TTCT): time between a speaker stopping and the final transcript arriving. The metric voice agents are judged on.
- RTFx — for pre-recorded HTTP transcription.
Each metric below has a definition and a measurement recipe.
Set word_timestamps=true on every connection where you want per-word latency attribution. It’s optional for TTFT / EOT (they don’t need word-level position information) but required if you want to use the cookbook benchmark scripts that report per-word emission timing.
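The three definitions above reduce to simple arithmetic over timestamps taken on the client. The sketch below assumes you record four moments per utterance with one monotonic clock; the field and function names are hypothetical, and RTFx is taken in its usual sense of audio duration divided by processing time, which this page does not spell out.

```python
from dataclasses import dataclass

@dataclass
class UtteranceTimeline:
    # All values are seconds on the same client-side monotonic clock.
    audio_start: float     # first audio chunk sent
    first_partial: float   # first interim transcript received
    speech_end: float      # speaker stopped (known from the test audio)
    final_received: float  # final transcript received

def ttft_ms(t: UtteranceTimeline) -> float:
    """Time to first partial: audio start to first interim transcript."""
    return (t.first_partial - t.audio_start) * 1000

def eot_ms(t: UtteranceTimeline) -> float:
    """End-of-turn latency: speaker stops to final transcript."""
    return (t.final_received - t.speech_end) * 1000

def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Pre-recorded throughput: seconds of audio transcribed per second of wall-clock time."""
    return audio_seconds / processing_seconds

# Made-up timestamps for one utterance.
timeline = UtteranceTimeline(
    audio_start=0.0, first_partial=0.125, speech_end=3.0, final_received=3.5
)
print(ttft_ms(timeline), eot_ms(timeline), rtfx(60.0, 1.5))
```

Keeping the raw timestamps rather than only the derived numbers makes it possible to recompute either metric later if you change how you mark the end of speech.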
Which metric matches your workload
There are two useful streaming metrics and one pre-recorded metric. Pick the one that answers your product question.
EOT is sometimes called “end of utterance” — same metric, different vocabulary. Voice-agent builders typically search for EOT, so we lead with that name on this page.
Latency components
When a latency number is higher than expected, the cause lives in one of four places. Measure each in isolation to attribute.
Connection setup
A one-time cost paid on the first WebSocket frame of every new session: TLS handshake, HTTP upgrade, and server worker selection. Subsequent messages on the same socket don’t pay this.
Approximate with curl:
Typical: 20–80 ms depending on geographic distance to api.smallest.ai (Mumbai) or api.us.smallest.ai (Oregon). First request after a long idle window may also re-warm a server worker — exclude it from averages.
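If you would rather time the handshake from inside your own client, the following is a rough sketch using only the Python standard library. It covers DNS + TCP connect and the TLS handshake, but not the HTTP upgrade or server worker selection, so treat the result as a lower bound on connection setup. The function names are hypothetical.

```python
import socket
import ssl
import time

def time_tls_setup(host, port=443, timeout=5.0):
    """Return (tcp_ms, tls_ms) for one fresh connection to host:port.

    tcp_ms includes DNS resolution and the TCP connect;
    tls_ms is the TLS handshake on the already-connected socket.
    """
    context = ssl.create_default_context()
    t0 = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        t1 = time.perf_counter()
        # wrap_socket performs the handshake immediately on a connected socket.
        with context.wrap_socket(raw, server_hostname=host):
            t2 = time.perf_counter()
    return (t1 - t0) * 1000, (t2 - t1) * 1000

def steady_state(samples):
    """Drop the first sample, which may include one-time warm-up cost."""
    return samples[1:] if len(samples) > 1 else list(samples)
```

A typical use is to call time_tls_setup("api.smallest.ai") several times from the machine that will run your client, then pass the results through steady_state before averaging.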
Network transit
Per-message TCP transit plus any server-side queue wait. Measure the round trip by sending a chunk and timing until the first transcript arrives back, then halving:
Typical: 5–30 ms one-way within the same region, 80–200 ms cross-region. If your client is in us-east-1 and you’re connecting to api.smallest.ai (Mumbai), you’re paying for cross-region network transit on every message; pin the region to remove it.
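The send-and-halve measurement described above can be sketched as follows. send_chunk and wait_for_transcript are hypothetical callables that wrap whatever WebSocket client you use; they are not part of any Pulse SDK. Halving assumes the two directions are symmetric, and the round trip also contains server-side time, so read the result as an upper bound on one-way transit.

```python
import time

def one_way_transit_ms(send_chunk, wait_for_transcript, chunk):
    """Time one send -> first-transcript round trip and return half of it in ms.

    send_chunk(chunk) should transmit one audio chunk;
    wait_for_transcript() should block until the first transcript message arrives.
    """
    start = time.perf_counter()
    send_chunk(chunk)
    wait_for_transcript()
    round_trip_ms = (time.perf_counter() - start) * 1000
    return round_trip_ms / 2
```

Run it on an already-open socket so that connection setup is not counted, and repeat it enough times to report p50 and p95 rather than a single reading.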
Server transcription
The model inference time. It contributes to both EOT and first-partial latency.
Typical: 150 ms at a concurrency of 1, scaling to ~300 ms at 100 concurrent requests on the production deployment. See the Pulse model card for headline numbers and concurrency curves.
Client overhead
Encoding, the send loop, the receive loop, and any UI render between ws.recv() and the transcript appearing on screen. Easy to overlook because it doesn’t show up on the wire.
Measure as wall-clock between chunk_ready and ws.send(), plus between ws.recv() and on_transcript(). Typical: 1–10 ms for a well-written client; can balloon to hundreds of milliseconds if you’re doing CPU-heavy work on the main thread or rendering through a slow framework.
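One way to capture those two intervals is a small timer wrapped around the send and delivery calls. This is a sketch with plain synchronous callables standing in for ws.send() and on_transcript(); the class and method names are hypothetical, and it measures until each call returns, which is one reasonable reading of the intervals described above.

```python
import time

class OverheadTimer:
    """Collects client-side overhead on the send path and the receive path."""

    def __init__(self):
        self.send_ms = []
        self.recv_ms = []

    def timed_send(self, send, chunk, chunk_ready_at):
        """Call send(chunk) and record time elapsed since the chunk was ready."""
        send(chunk)
        self.send_ms.append((time.perf_counter() - chunk_ready_at) * 1000)

    def timed_deliver(self, on_transcript, message, received_at):
        """Call on_transcript(message) and record time elapsed since it was received."""
        on_transcript(message)
        self.recv_ms.append((time.perf_counter() - received_at) * 1000)

    def total_ms(self):
        """Mean send-side overhead plus mean receive-side overhead."""
        if not self.send_ms or not self.recv_ms:
            raise ValueError("need at least one sample on each path")
        return (sum(self.send_ms) / len(self.send_ms)
                + sum(self.recv_ms) / len(self.recv_ms))
```

Take chunk_ready_at with time.perf_counter() at the moment the encoder hands you a chunk, and received_at immediately after ws.recv() returns, so that both intervals use the same monotonic clock.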
Chunk size affects two components at once. Chunks below 10 ms flood the server with WebSocket frames; per-frame overhead dominates. Chunks above 500 ms turn the stream into a batch and add their own buffer latency. 100–300 ms per chunk is the practical range for low-latency streaming.
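To turn a chunk duration into a byte count, multiply it out against your audio format. The sketch below assumes 16 kHz, 16-bit, mono PCM purely for illustration; substitute the sample rate and encoding you actually send. The helper names are hypothetical.

```python
def chunk_bytes(chunk_ms, sample_rate_hz=16000, bytes_per_sample=2, channels=1):
    """Bytes of raw PCM in one chunk of chunk_ms milliseconds."""
    frames = sample_rate_hz * chunk_ms // 1000  # whole frames only, keeps samples aligned
    return frames * bytes_per_sample * channels

def in_practical_range(chunk_ms):
    """True when the chunk duration sits in the 100-300 ms range suggested above."""
    return 100 <= chunk_ms <= 300

def split_pcm(pcm, chunk_ms, **fmt):
    """Yield consecutive chunks of raw PCM bytes; the last one may be shorter."""
    size = chunk_bytes(chunk_ms, **fmt)
    if size <= 0:
        raise ValueError("chunk_ms is too small for this audio format")
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]

one_second = bytes(32000)  # 1 s of silence at the assumed format
chunks = list(split_pcm(one_second, 100))
print(chunk_bytes(100), len(chunks))
```

Under the assumed format a 100 ms chunk is 3200 bytes, so one second of audio becomes ten WebSocket frames.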
Pulse vs Pulse Pro for latency-sensitive workloads
Smallest STT ships two models. Only one of them does streaming — pick by the question you’re answering.
See the Pulse model card and Pulse Pro model card for benchmark detail and per-language coverage.
Common measurement pitfalls
- Sampling latency from is_final=true. Finals are intentionally delayed by Pulse’s accuracy buffer. They look slow because they are, by design. Use interims.
- Forgetting word_timestamps=true. Without it the end field on words is missing, so you can’t look up the send-time for the audio a transcript covers, and measurements collapse to nonsense.
- Including the first session in averages. It pays a one-time TLS + worker warm-up tax that no real user will experience after the first turn.
- Comparing across regions without pinning. Same audio + same model + different client region can swing EOT by 100+ ms. Pin the region when benchmarking the model.
- Treating TTFT and EOT as the same number. They measure different events — first-partial-arrival vs final-after-silence. A 100 ms TTFT and a 400 ms EOT can coexist and usually do.
- Trying to derive a per-chunk “live lag” number on the client. For live-captioning workloads, use the cookbook scripts (per-word emission timing) or live side-by-side testing in your production pipeline. Empirical end-to-end measurement beats a synthetic per-chunk formula.
Summary
- Per-utterance, not per-chunk. Streaming STT latency is measured between named events — speaker stops, first word appears, file uploaded — not as a continuous gap. The voice-agent industry standardised on TTFT + EOT for this reason; no widely-adopted per-chunk formula exists.
- TTFT and EOT are different metrics. First-partial timing matters for startup health and live-captioning first-paint; EOT is what voice agents are judged on. They can move independently.
- Four components, attributed individually. Connection setup, network transit, server transcription, client overhead. If a number is bad, walk the four to find which one to fix.
- Pulse for streaming, Pulse Pro for offline English accuracy. Pulse Pro has no streaming worker.
- Set word_timestamps=true on every streaming connection you want to measure. Without it the end field is missing and the send-time lookup has nothing to key on.
- Track p50 and p95. Single-shot averages hide the tail. The tail is what users notice.
Reference scripts and cross-references
Two reference scripts in the Smallest AI Cookbook implement the methodology end-to-end. Both take a folder of audio and a CSV of ground-truth transcripts, call the production API, and report WER alongside latency at p50, p90, and p95.
See the benchmarks README for dataset layout, install steps, and output format.
Related pages
- Realtime response format: wire signals (is_final, words, finalize) this page measures against
- Realtime quickstart: minimum working WebSocket client
- Pre-recorded quickstart: source endpoint for the RTFx code
- Pulse model card: headline TTFT numbers and concurrency curves

