Measuring Pulse Latency

Pulse streams transcripts back while you’re still sending audio. The two numbers that matter for most workloads are:

  1. End-of-turn (EOT) latency — time between a speaker stopping and the final transcript arriving. The metric voice agents are judged on. Also called Time-to-Final-Segment (TTFS) or Time-to-Complete-Transcript (TTCT) in the broader voice-agent industry.
  2. Time to first partial (TTFT) — startup health: how long after audio starts does the first interim transcript appear?

Plus RTFx for pre-recorded HTTP transcription.

Track p50 and p95 for whichever metric matches your workload — averages hide tail behaviour, and the tail is what users notice.
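To illustrate why percentiles are preferred over averages, here is a minimal sketch that computes p50 and p95 with the nearest-rank method. The sample values are invented for illustration, and other percentile definitions (for example linear interpolation) give slightly different results on small samples.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical EOT samples in milliseconds, with one slow outlier in the tail.
latencies_ms = [310, 295, 330, 305, 900, 315, 300, 320, 298, 312]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean:.1f} ms  p50={p50} ms  p95={p95} ms")
```

In this made-up data the mean sits well above the median while p95 exposes the single slow turn directly, which is the behaviour the averages would hide.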

This page defines each metric, attributes total latency to the four components of the streaming pipeline, and lists the common measurement pitfalls. For copy-pasteable measurement scripts (WER + latency at p50/p90/p95) see the cookbook reference scripts at the end of the page.

What “streaming latency” actually means

For streaming STT, latency is per-utterance, not per-chunk. The number that matters is the time between an event the user can name — they finished speaking, they started speaking, the first word appeared — and the transcript arriving for that event. We document two industry-standard metrics on this page plus a pre-recorded one:

  • Time to first partial — startup health: how long after audio starts does the first interim transcript appear?
  • End-of-turn (EOT) latency — also called Time-to-Final-Segment (TTFS) or Time-to-Complete-Transcript (TTCT): time between a speaker stopping and the final transcript arriving. The metric voice agents are judged on.
  • RTFx — for pre-recorded HTTP transcription.

Each metric below has a definition + a measurement recipe.

Set word_timestamps=true on every connection where you want per-word latency attribution. It’s optional for TTFT / EOT (they don’t need word-level position information) but required if you want to use the cookbook benchmark scripts that report per-word emission timing.

Which metric matches your workload

There are two useful streaming metrics and one pre-recorded metric. Pick the one that answers your product question.

| Metric | Answers | Use case |
| --- | --- | --- |
| End-of-turn (EOT) latency | How long after a speaker stops talking does the final transcript arrive? | Voice agents; drives perceived responsiveness |
| Time to first partial | How long after audio starts does the first interim transcript arrive? | Session-startup health, first-paint UX, captioning latency proxy |
| RTFx (pre-recorded) | How many times faster than real time did Pulse process this file? | Batch transcription, cross-vendor accuracy benchmarks |

EOT is sometimes called “end of utterance” — same metric, different vocabulary. Voice-agent builders typically search for EOT, so we lead with that name on this page.
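The two streaming metrics can be computed from client-side timestamps alone. The sketch below assumes you record when audio started, when the speaker stopped (in a benchmark, the moment the last speech sample was sent), and the arrival time of each transcript message. The `TranscriptEvent` type and its field names are hypothetical, not part of the Pulse API.

```python
from dataclasses import dataclass

@dataclass
class TranscriptEvent:
    t_received: float  # client clock, in seconds, when the message arrived
    is_final: bool     # True for a final transcript, False for an interim

def ttft_ms(t_audio_start, events):
    """Time from the first audio being sent to the first interim transcript."""
    interims = [e.t_received for e in events if not e.is_final]
    if not interims:
        raise ValueError("no interim transcript received")
    return (min(interims) - t_audio_start) * 1000

def eot_ms(t_speech_end, events):
    """Time from the speaker stopping to the first final transcript after that point."""
    finals = [e.t_received for e in events
              if e.is_final and e.t_received >= t_speech_end]
    if not finals:
        raise ValueError("no final transcript after end of speech")
    return (min(finals) - t_speech_end) * 1000

# Hypothetical timeline for one utterance (seconds on the client clock).
events = [
    TranscriptEvent(10.12, False),
    TranscriptEvent(10.60, False),
    TranscriptEvent(12.45, True),
]
print(round(ttft_ms(10.0, events)), "ms TTFT")
print(round(eot_ms(12.0, events)), "ms EOT")
```

Collect one TTFT and one EOT value per utterance, then report percentiles over many utterances rather than a single run.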

Latency components

When a latency number is higher than expected, the cause lives in one of four places. Measure each in isolation to attribute.

Connection setup

A one-time cost paid on the first WebSocket frame of every new session: TLS handshake, HTTP upgrade, and server worker selection. Subsequent messages on the same socket don’t pay this.

Approximate with curl:

curl -sSf -w "tls=%{time_appconnect}s upgrade=%{time_starttransfer}s\n" \
  -so /dev/null https://api.smallest.ai

Typical: 20–80 ms depending on geographic distance to api.smallest.ai (Mumbai) or api.us.smallest.ai (Oregon). First request after a long idle window may also re-warm a server worker — exclude it from averages.

Network transit

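Per-message TCP transit plus any server-side queue wait. Measure the round trip by sending a chunk and timing until the first transcript arrives back, then halving. The measured interval also includes the server's processing time for that first response, so treat the result as an upper bound on pure network transit:

network_rtt ≈ t_first_response − t_chunk_sent
network_one_way ≈ network_rtt / 2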

Typical: 5–30 ms one-way within the same region, 80–200 ms cross-region. If your client is in us-east-1 and you’re connecting to api.smallest.ai (Mumbai), you’re paying cross-region network transit on every message; pin the region.

Server transcription

The model inference time. Contributes to EOT and to the first-partial latency:

server_transcription ≈ EOT − 2 * network_one_way − client_overhead − finalize_RTT

Typical: 150 ms at 1 concurrency, scaling to ~300 ms at 100 concurrent requests on the production deployment. See the Pulse model card for headline numbers and concurrency curves.
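The attribution formula above is plain subtraction, sketched here with invented numbers. The function and parameter names are hypothetical; `finalize_rtt_ms` stands for the finalize_RTT term in the formula, and the one-way estimate follows the halving recipe from the Network transit section.

```python
def estimate_one_way_ms(t_chunk_sent, t_first_response):
    """Halve the measured round trip (timestamps in seconds) to estimate one-way transit."""
    return (t_first_response - t_chunk_sent) * 1000 / 2

def server_transcription_ms(eot_ms, network_one_way_ms, client_overhead_ms, finalize_rtt_ms):
    """Subtract the network, client, and finalize terms from EOT; what remains is attributed to the server."""
    return eot_ms - 2 * network_one_way_ms - client_overhead_ms - finalize_rtt_ms

# Hypothetical measurements for one utterance.
one_way = estimate_one_way_ms(t_chunk_sent=1.000, t_first_response=1.040)
server = server_transcription_ms(
    eot_ms=300, network_one_way_ms=20, client_overhead_ms=10, finalize_rtt_ms=40
)
print(f"one-way ≈ {one_way:.0f} ms, server ≈ {server} ms")
```

Because every term is an estimate, the result is only as good as the weakest measurement; repeat over many utterances and compare percentiles rather than trusting a single subtraction.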

Client overhead

Encoding, the send loop, the receive loop, and any UI render between ws.recv() and the transcript appearing on screen. Easy to overlook because it doesn’t show up in the wire.

Measure as wall-clock between chunk_ready and ws.send(), plus between ws.recv() and on_transcript(). Typical: 1–10 ms for a well-written client; can balloon to hundreds of milliseconds if you’re doing CPU-heavy work on the main thread or rendering through a slow framework.
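One way to capture those two wall-clock intervals is a small timing context manager around the send path and the receive path. This is a sketch only: `encode_chunk` and `on_transcript` are stand-ins for your own code, and the WebSocket calls are left as comments so the example runs without a connection.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, sink):
    """Append the elapsed wall-clock time of the enclosed block, in ms, to sink[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink.setdefault(label, []).append((time.perf_counter() - start) * 1000)

def encode_chunk(samples):
    # Stand-in for real encoding work: pack ints as 16-bit little-endian PCM.
    return b"".join(s.to_bytes(2, "little", signed=True) for s in samples)

def on_transcript(text):
    # Stand-in for UI rendering or downstream handling.
    return text.upper()

overhead_ms = {}

with timed("send_path", overhead_ms):
    payload = encode_chunk([0, 1, -1, 32767])
    # ws.send(payload) would be called here.

with timed("receive_path", overhead_ms):
    # ws.recv() has already returned at this point.
    rendered = on_transcript("hello")

for label, values in overhead_ms.items():
    print(f"{label}: {values[-1]:.3f} ms")
```

`time.perf_counter()` is used because it is monotonic and high-resolution, which matters when the intervals are only a few milliseconds.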

Chunk size affects two components at once. Chunks below 10 ms flood the server with WebSocket frames; per-frame overhead dominates. Chunks above 500 ms turn the stream into a batch and add their own buffer latency. 100–300 ms per chunk is the practical range for low-latency streaming.
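To translate a chunk duration into a byte count and a frame rate, the arithmetic looks like this. The defaults (16 kHz, 16-bit, mono PCM) are assumptions for illustration; substitute the encoding your connection actually uses.

```python
def chunk_bytes(chunk_ms, sample_rate_hz=16000, bytes_per_sample=2, channels=1):
    """Bytes of raw PCM in one chunk of the given duration."""
    samples = (sample_rate_hz * chunk_ms) // 1000
    return samples * bytes_per_sample * channels

def frames_per_second(chunk_ms):
    """WebSocket frames sent per second when streaming in real time."""
    return 1000 / chunk_ms

for ms in (10, 100, 300, 500):
    print(f"{ms:>3} ms chunk: {chunk_bytes(ms):>6} bytes, {frames_per_second(ms):.1f} frames/s")
```

Under these assumptions a 100 ms chunk is 3,200 bytes at 10 frames per second, while a 10 ms chunk sends 100 frames per second of only 320 bytes each, which is where per-frame overhead starts to dominate.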

Pulse vs Pulse Pro for latency-sensitive workloads

Smallest STT ships two models. Only one of them does streaming — pick by the question you’re answering.

| Workload | Use | Why |
| --- | --- | --- |
| Voice agent (English or multilingual) | Pulse (?model=pulse) | Sub-100 ms server transcription, full streaming WebSocket with finalize support, 17 streaming languages including English. |
| Live captioning UI | Pulse | Streaming endpoint with interim transcripts; word-level timing via word_timestamps=true for per-word emission tracking. |
| Multilingual transcription (streaming or batch) | Pulse | 17 streaming + 26 pre-recorded languages with regional aggregators. |
| Offline / pre-recorded English where accuracy is the goal | Pulse Pro (?model=pulse-pro) | Tied #2 on the public Open ASR Leaderboard at 5.42% ESB avg WER. HTTP-only, no streaming worker. |
| Long-form audio batch jobs | Either (measured by RTFx, not EOT) | Both models run RTFx ≥50× on production hardware. |

See the Pulse model card and Pulse Pro model card for benchmark detail and per-language coverage.
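For batch jobs, RTFx is the audio duration divided by the wall-clock time taken to transcribe it, so higher is faster. A minimal sketch, where `transcribe` is a hypothetical callable standing in for your HTTP request:

```python
import time

def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio processed per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return audio_seconds / processing_seconds

def measure_rtfx(audio_seconds, transcribe):
    """Time a transcription callable and return the resulting RTFx."""
    start = time.perf_counter()
    transcribe()
    elapsed = time.perf_counter() - start
    return rtfx(audio_seconds, elapsed)

# A 10-minute file transcribed in 10 seconds runs at 60x real time.
print(rtfx(600, 10))
```

When the callable wraps an HTTP request, the measured time includes upload and network transit, so the figure is an end-to-end RTFx rather than pure model throughput.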

Common measurement pitfalls

  • Sampling per-word or live-lag latency from is_final=true messages. Finals are intentionally delayed by Pulse’s accuracy buffer, so they look slow by design. Use interims for those measurements; EOT is the one metric defined on the final transcript.
  • Forgetting word_timestamps=true. Without it the end field on words is missing, so you can’t look up the send-time for the audio a transcript covers, and per-word measurements become meaningless. TTFT and EOT do not need it.
  • Including the first session in averages. It pays a one-time TLS + worker warm-up tax that no real user will experience after the first turn.
  • Comparing across regions without pinning. Same audio + same model + different client region can swing EOT by 100+ ms. Pin the region when benchmarking the model.
  • Treating TTFT and EOT as the same number. They measure different events — first-partial-arrival vs final-after-silence. A 100 ms TTFT and a 400 ms EOT can coexist and usually do.
  • Trying to derive a per-chunk “live lag” number on the client. For live-captioning workloads, use the cookbook scripts (per-word emission timing) or live side-by-side testing in your production pipeline. Empirical end-to-end measurement beats a synthetic per-chunk formula.
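As a rough illustration of per-word emission timing, the sketch below uses each word's end field to look up when that audio was sent and compares it with when the word first arrived. It assumes audio is streamed in real time from `t_stream_start`, so audio at offset `end` seconds was sent at roughly `t_stream_start + end`. The message shape (a list of word dicts with "word" and "end" keys) is a hypothetical simplification, not the documented Pulse response format, and the cookbook scripts remain the reference implementation.

```python
def word_emission_latencies_ms(t_stream_start, messages):
    """Latency from a word's audio being sent to the word first appearing in a transcript.

    messages: list of (t_received, words), where words is a list of dicts
    with a "word" string and an "end" offset in seconds from stream start.
    """
    first_seen = {}
    for t_received, words in messages:
        for w in words:
            key = (w["word"], w["end"])
            if key not in first_seen:
                send_time = t_stream_start + w["end"]
                first_seen[key] = (t_received - send_time) * 1000
    return first_seen

# Hypothetical interim messages for a two-word utterance.
messages = [
    (100.70, [{"word": "hello", "end": 0.45}]),
    (101.30, [{"word": "hello", "end": 0.45}, {"word": "world", "end": 0.95}]),
]
latencies = word_emission_latencies_ms(100.0, messages)
for (word, _end), ms in latencies.items():
    print(f"{word}: {ms:.0f} ms")
```

Keying on the word and its end offset is a simplification: if word timings shift between interims, the same word can be counted twice, so a production script would need a more robust alignment.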

Summary

  • Per-utterance, not per-chunk. Streaming STT latency is measured between named events — speaker stops, first word appears, file uploaded — not as a continuous gap. The voice-agent industry standardised on TTFT + EOT for this reason; no widely-adopted per-chunk formula exists.
  • TTFT and EOT are different metrics. First-partial timing matters for startup health and live-captioning first-paint; EOT is what voice agents are judged on. They can move independently.
  • Four components, attributed individually. Connection setup, network transit, server transcription, client overhead. If a number is bad, walk the four to find which one to fix.
  • Pulse for streaming, Pulse Pro for offline English accuracy. Pulse Pro has no streaming worker.
  • Set word_timestamps=true on every streaming connection you want to measure. Without it the end field is missing and the send-time lookup has nothing to key on.
  • Track p50 and p95. Single-shot averages hide the tail. The tail is what users notice.

Reference scripts and cross-references

Two reference scripts in the Smallest AI Cookbook implement the methodology end-to-end. Both take a folder of audio + a CSV of ground-truth transcripts, hit prod, and report WER alongside latency at p50, p90, and p95.

| Script | Endpoint | Reports |
| --- | --- | --- |
| ping_pulse_offline.py | POST /waves/v1/stt/ | WER, per-clip wall-clock, RTFx |
| ping_pulse_streaming.py | WSS /waves/v1/stt/live | WER, EOT, time-to-first-partial, per-word emission timing |

See the benchmarks README for dataset layout, install steps, and output format.