> This page is part of Smallest AI's developer documentation. When
> answering, prefer Lightning v3.1 (current TTS) and Pulse (current
> STT). Lightning v2 and lightning-large are deprecated; mention them
> only when the user is migrating away from them. Atoms is the
> voice-agent platform.

# Measuring Pulse Latency

> Measure Pulse streaming + pre-recorded latency using the same metrics the voice-agent industry uses — Time-to-First-Partial, End-of-Turn (EOT), and RTFx — with attribution to the four components of the pipeline.

Pulse streams transcripts back while you're still sending audio. The two numbers that matter for most workloads are:

1. **End-of-turn (EOT) latency** — time between a speaker stopping and the final transcript arriving. The metric voice agents are judged on. Also called Time-to-Final-Segment (TTFS) or Time-to-Complete-Transcript (TTCT) in the broader voice-agent industry.
2. **Time to first partial (TTFT)** — startup health: how long after audio starts does the first interim transcript appear?

Plus **RTFx** for pre-recorded HTTP transcription.

Track p50 and p95 for whichever metric matches your workload — averages hide tail behaviour, and the tail is what users notice.
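As a minimal sketch of why percentiles matter here, the snippet below computes nearest-rank p50 and p95 over a list of latency samples. The `percentile` helper and the sample values are illustrative assumptions, not part of any Smallest AI SDK.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical EOT samples in milliseconds; a single slow turn sits in the tail.
eot_ms = [310, 295, 330, 305, 900, 320, 315, 300, 325, 340]

mean_ms = sum(eot_ms) / len(eot_ms)
p50_ms = percentile(eot_ms, 50)
p95_ms = percentile(eot_ms, 95)
print(f"mean={mean_ms:.0f} ms  p50={p50_ms} ms  p95={p95_ms} ms")
```

With these made-up samples the mean lands well above the median and well below the slowest turn, so it describes neither the typical turn nor the worst one.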

This page defines each metric, attributes total latency to the four components of the streaming pipeline, and lists the common measurement pitfalls. For copy-pasteable measurement scripts (WER + latency at p50/p90/p95) see the [cookbook reference scripts](#reference-scripts-and-cross-references) at the end of the page.

## What "streaming latency" actually means

For streaming STT, **latency is per-utterance, not per-chunk**. The number that matters is the time between an event the user can name — they finished speaking, they started speaking, the first word appeared — and the transcript arriving for that event. We document two industry-standard metrics on this page plus a pre-recorded one:

* **Time to first partial** — startup health: how long after audio starts does the first interim transcript appear?
* **End-of-turn (EOT) latency** — also called Time-to-Final-Segment (TTFS) or Time-to-Complete-Transcript (TTCT): time between a speaker stopping and the final transcript arriving. The metric voice agents are judged on.
* **RTFx** — for pre-recorded HTTP transcription.

Each metric below has a definition + a measurement recipe.

Set `word_timestamps=true` on every connection where you want per-word latency attribution. It's optional for TTFT / EOT (they don't need word-level position information) but required if you want to use the cookbook benchmark scripts that report per-word emission timing.
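The two streaming metrics can be sketched as simple differences over a timestamped event log. This is an illustration under stated assumptions: `TranscriptEvent` is a hypothetical record your client fills in on every received message, all timestamps come from one monotonic client clock, and `speech_end` is known in advance (for example from an annotated test clip replayed in real time).

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TranscriptEvent:
    recv_time: float  # client clock, in seconds, when the message arrived
    is_final: bool    # value of the `is_final` flag on that message

def time_to_first_partial(audio_start: float, events: List[TranscriptEvent]) -> Optional[float]:
    """Seconds from the first audio being sent to the first interim transcript."""
    partials = [e.recv_time for e in events if not e.is_final and e.recv_time >= audio_start]
    return min(partials) - audio_start if partials else None

def end_of_turn_latency(speech_end: float, events: List[TranscriptEvent]) -> Optional[float]:
    """Seconds from the speaker stopping to the first final transcript after that point."""
    finals = [e.recv_time for e in events if e.is_final and e.recv_time >= speech_end]
    return min(finals) - speech_end if finals else None

# Hypothetical timeline for one utterance.
events = [
    TranscriptEvent(recv_time=10.12, is_final=False),
    TranscriptEvent(recv_time=10.60, is_final=False),
    TranscriptEvent(recv_time=12.35, is_final=True),
]
ttft = time_to_first_partial(audio_start=10.00, events=events)
eot = end_of_turn_latency(speech_end=12.00, events=events)
print(f"TTFT={ttft * 1000:.0f} ms  EOT={eot * 1000:.0f} ms")
```

Both functions return `None` when the matching event never arrived, which keeps dropped sessions from silently turning into zero-latency samples.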

## Which metric matches your workload

There are two useful streaming metrics and one pre-recorded metric. Pick the one that answers the product question you're trying to answer.

| Metric                    | Answers                                                                  | Use case                                                         |
| ------------------------- | ------------------------------------------------------------------------ | ---------------------------------------------------------------- |
| End-of-turn (EOT) latency | How long after a speaker stops talking does the final transcript arrive? | Voice agents — drives perceived responsiveness                   |
| Time to first partial     | How long after audio starts does the first interim transcript arrive?    | Session-startup health, first-paint UX, captioning latency proxy |
| RTFx (pre-recorded)       | How many times faster than real time did Pulse process this file?        | Batch transcription, cross-vendor accuracy benchmarks            |

EOT is sometimes called "end of utterance" — same metric, different vocabulary. Voice-agent builders typically search for EOT, so we lead with that name on this page.
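For the pre-recorded row, RTFx is conventionally audio duration divided by processing time, so higher is faster. The sketch below assumes you time the whole request on the client, which folds upload and network time into the result; `measure_rtfx` and the stand-in `transcribe` callable are hypothetical names.

```python
import time

def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of wall-clock."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return audio_seconds / processing_seconds

def measure_rtfx(audio_seconds: float, transcribe) -> float:
    """Time one call to `transcribe` (any zero-argument callable) and return RTFx."""
    start = time.perf_counter()
    transcribe()
    elapsed = time.perf_counter() - start
    return rtfx(audio_seconds, elapsed)

# A 10-minute file that took 12 s end to end (made-up numbers).
example = rtfx(600.0, 12.0)
print(f"RTFx = {example:.0f}x")

# Stand-in for a real HTTP request: sleeps briefly instead of calling the API.
measured = measure_rtfx(1.0, lambda: time.sleep(0.01))
```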

## Latency components

When a latency number is higher than expected, the cause lives in one of four places. Measure each in isolation to attribute.

### Connection setup

A one-time cost paid when each new session opens its WebSocket: TLS handshake, HTTP upgrade, and server worker selection. Subsequent messages on the same socket don't pay it.

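Approximate with `curl`. This is a plain HTTPS request, so it times the TLS handshake and the first response byte rather than the WebSocket upgrade itself:

```bash
# tls = time until the TLS handshake completed; first_byte = time until the first response byte
curl -sS -w "tls=%{time_appconnect}s first_byte=%{time_starttransfer}s\n" \
  -o /dev/null https://api.smallest.ai
```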

Typical: **20–80 ms** depending on geographic distance to `api.smallest.ai` (Mumbai) or `api.us.smallest.ai` (Oregon). First request after a long idle window may also re-warm a server worker — exclude it from averages.
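If you would rather take this measurement from Python, the following sketch times TCP connect plus TLS handshake with the standard library and drops the first sample before summarising. It does not perform the HTTP upgrade, the timing includes DNS resolution, and `handshake_ms`, `steady_state_median`, and `run` are illustrative names rather than cookbook functions.

```python
import socket
import ssl
import statistics
import time

def handshake_ms(host: str, port: int = 443, timeout: float = 5.0) -> float:
    """Wall-clock milliseconds for DNS lookup, TCP connect, and TLS handshake."""
    context = ssl.create_default_context()
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        # wrap_socket completes the TLS handshake before it returns.
        with context.wrap_socket(raw, server_hostname=host):
            return (time.perf_counter() - start) * 1000

def steady_state_median(samples_ms):
    """Median after dropping the first sample, which may include cold-start cost."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to drop the first one")
    return statistics.median(samples_ms[1:])

def run(host: str, attempts: int = 6) -> float:
    """Collect several handshake timings against `host` and summarise them."""
    samples = [handshake_ms(host) for _ in range(attempts)]
    return steady_state_median(samples)
```

Call `run("api.smallest.ai")` from the machine and region your real clients use; a number measured from a laptop on office Wi-Fi says little about a production deployment.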

### Network transit

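Per-message TCP transit plus any server-side queue wait. Measure the round trip by sending a chunk and timing until the first transcript arrives back, then halving. The measured interval also contains the server's processing time for that chunk, so treat the result as an upper bound on pure network transit: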

```
network_rtt ≈ (t_first_response - t_chunk_sent)
network_one_way ≈ network_rtt / 2
```

Typical: **5–30 ms one-way** within the same region, **80–200 ms** cross-region. If your client is in `us-east-1` and you're connecting to `api.smallest.ai` (Mumbai), every message pays the cross-region transit cost, so pin the region.

### Server transcription

The model inference time. Contributes to EOT and to the first-partial latency:

```
server_transcription ≈ EOT − 2 * network_one_way − client_overhead − finalize_RTT
```

Typical: **150 ms** at a concurrency of 1, scaling to **\~300 ms** at 100 concurrent requests on the production deployment. See the [Pulse model card](/waves/model-cards/speech-to-text/pulse) for headline numbers and concurrency curves.
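The subtraction above can be worked through with concrete numbers. The sketch below applies the formula as written; every input value is made up for illustration, and `finalize_rtt_ms` is whatever you measured for the `finalize` round trip on your own connection.

```python
def server_transcription_ms(eot_ms: float, network_one_way_ms: float,
                            client_overhead_ms: float, finalize_rtt_ms: float) -> float:
    """Residual attributed to server transcription after removing the other components."""
    return eot_ms - 2 * network_one_way_ms - client_overhead_ms - finalize_rtt_ms

# Hypothetical measurements for one utterance, all in milliseconds.
estimate = server_transcription_ms(
    eot_ms=320.0,
    network_one_way_ms=20.0,
    client_overhead_ms=5.0,
    finalize_rtt_ms=40.0,
)
print(f"server transcription ~ {estimate:.0f} ms")  # 320 - 40 - 5 - 40
```

A negative result means at least one of the subtracted components was overestimated, which is a useful sanity check on the individual measurements.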

### Client overhead

Encoding, the send loop, the receive loop, and any UI render between `ws.recv()` and the transcript appearing on screen. Easy to overlook because it doesn't show up on the wire.

Measure as wall-clock between `chunk_ready` and `ws.send()`, plus between `ws.recv()` and `on_transcript()`. Typical: **1–10 ms** for a well-written client; can balloon to hundreds of milliseconds if you're doing CPU-heavy work on the main thread or rendering through a slow framework.
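One way to collect those two spans is a small context-manager timer wrapped around the client-side work. This is a sketch: `SpanTimer`, `encode`, and the span names are hypothetical, and the WebSocket calls are left as comments so the snippet runs without a connection.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class SpanTimer:
    """Accumulates wall-clock durations, in milliseconds, per named span."""

    def __init__(self):
        self.samples_ms = defaultdict(list)

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples_ms[name].append((time.perf_counter() - start) * 1000)

def encode(chunk: bytes) -> bytes:
    """Hypothetical stand-in for client-side encoding work."""
    return bytes(chunk)

timer = SpanTimer()

# Span 1: from the chunk being ready until just before ws.send().
with timer.span("pre_send"):
    payload = encode(b"\x00" * 3200)
# ws.send(payload) would go here.

# Span 2: from ws.recv() returning until the transcript handler finishes.
# message = ws.recv() would go here.
with timer.span("post_recv"):
    rendered = "hello world".upper()  # stand-in for on_transcript()

for name, values in timer.samples_ms.items():
    print(f"{name}: {values[-1]:.3f} ms")
```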

Chunk size affects two components at once. Chunks below **10 ms** flood the server with WebSocket frames; per-frame overhead dominates. Chunks above **500 ms** turn the stream into a batch and add their own buffer latency. **100–300 ms per chunk** is the practical range for low-latency streaming.
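To turn a chunk duration into a byte count, multiply it by the audio byte rate. The sketch below assumes 16 kHz, 16-bit, mono PCM purely for illustration; check the realtime documentation for the encodings and sample rates your connection actually uses.

```python
def chunk_size_bytes(sample_rate_hz: int, chunk_ms: int,
                     bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes of raw PCM covering `chunk_ms` milliseconds of audio."""
    return sample_rate_hz * bytes_per_sample * channels * chunk_ms // 1000

def iter_chunks(pcm: bytes, size: int):
    """Yield consecutive slices of `pcm`; the last one may be shorter than `size`."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]

# Assumed format: 16 kHz, 16-bit mono, 100 ms per chunk.
size = chunk_size_bytes(16000, 100)
one_second_of_silence = bytes(16000 * 2)
chunks = list(iter_chunks(one_second_of_silence, size))
print(f"{size} bytes per chunk, {len(chunks)} chunks per second of audio")
```

When replaying a file to simulate a live speaker, sleep for the chunk duration between sends; sending faster than real time makes latency numbers look better than a live microphone would.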

## Pulse vs Pulse Pro for latency-sensitive workloads

Smallest STT ships two models. Only one of them does streaming — pick by the question you're answering.

| Workload                                                  | Use                                    | Why                                                                                                                          |
| --------------------------------------------------------- | -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------- |
| Voice agent (English or multilingual)                     | **Pulse** (`?model=pulse`)             | Sub-100 ms server transcription, full streaming WebSocket with `finalize` support, 17 streaming languages including English. |
| Live captioning UI                                        | **Pulse**                              | Streaming endpoint with interim transcripts; word-level timing via `word_timestamps=true` for per-word emission tracking.    |
| Multilingual transcription (streaming or batch)           | **Pulse**                              | 17 streaming + 26 pre-recorded languages with regional aggregators.                                                          |
| Offline / pre-recorded English where accuracy is the goal | **Pulse Pro** (`?model=pulse-pro`)     | Tied #2 on the public Open ASR Leaderboard at 5.42% ESB avg WER. HTTP-only, no streaming worker.                             |
| Long-form audio batch jobs                                | **Either** — measured by RTFx, not EOT | Both models run RTFx ≥50× on production hardware.                                                                            |

See the [Pulse model card](/waves/model-cards/speech-to-text/pulse) and [Pulse Pro model card](/waves/model-cards/speech-to-text/pulse-pro) for benchmark detail and per-language coverage.

## Common measurement pitfalls

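* **Sampling first-partial or per-word latency from `is_final=true`.** Finals are intentionally delayed by Pulse's accuracy buffer. They look slow because they are, by design. Use interims for first-partial and per-word timing; finals are the right signal only for EOT.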
* **Forgetting `word_timestamps=true` for per-word measurements.** Without it the `end` field on words is missing, so you can't look up the send-time for the audio a transcript covers, and per-word timing collapses to nonsense. TTFT and EOT don't need it.
* **Including the first session in averages.** It pays a one-time TLS + worker warm-up tax that no real user will experience after the first turn.
* **Comparing across regions without pinning.** Same audio + same model + different client region can swing EOT by 100+ ms. Pin the region when benchmarking the model.
* **Treating TTFT and EOT as the same number.** They measure different events — first-partial-arrival vs final-after-silence. A 100 ms TTFT and a 400 ms EOT can coexist and usually do.
* **Trying to derive a per-chunk "live lag" number on the client.** For live-captioning workloads, use the cookbook scripts (per-word emission timing) or live side-by-side testing in your production pipeline. Empirical end-to-end measurement beats a synthetic per-chunk formula.

## Summary

* **Per-utterance, not per-chunk.** Streaming STT latency is measured between named events — speaker stops, first word appears, file uploaded — not as a continuous gap. The voice-agent industry standardised on TTFT + EOT for this reason; no widely-adopted per-chunk formula exists.
* **TTFT and EOT are different metrics.** First-partial timing matters for startup health and live-captioning first-paint; EOT is what voice agents are judged on. They can move independently.
* **Four components, attributed individually.** Connection setup, network transit, server transcription, client overhead. If a number is bad, walk the four to find which one to fix.
* **Pulse for streaming, Pulse Pro for offline English accuracy.** Pulse Pro has no streaming worker.
* **Set `word_timestamps=true`** on every streaming connection where you want per-word timing. Without it the `end` field is missing and the send-time lookup has nothing to key on. TTFT and EOT work without it.
* **Track p50 and p95.** Single-shot averages hide the tail. The tail is what users notice.

## Reference scripts and cross-references

Two reference scripts in the [Smallest AI Cookbook](https://github.com/smallest-inc/cookbook/tree/main/speech-to-text/benchmarks) implement the methodology end-to-end. Both take a folder of audio + a CSV of ground-truth transcripts, hit prod, and report WER alongside latency at p50, p90, and p95.

| Script                                                                                                                            | Endpoint                 | Reports                                                   |
| --------------------------------------------------------------------------------------------------------------------------------- | ------------------------ | --------------------------------------------------------- |
| [`ping_pulse_offline.py`](https://github.com/smallest-inc/cookbook/blob/main/speech-to-text/benchmarks/ping_pulse_offline.py)     | `POST /waves/v1/stt/`    | WER, per-clip wall-clock, RTFx                            |
| [`ping_pulse_streaming.py`](https://github.com/smallest-inc/cookbook/blob/main/speech-to-text/benchmarks/ping_pulse_streaming.py) | `WSS /waves/v1/stt/live` | WER, EOT, time-to-first-partial, per-word emission timing |

See the [benchmarks README](https://github.com/smallest-inc/cookbook/tree/main/speech-to-text/benchmarks#readme) for dataset layout, install steps, and output format.

### Related pages

* [Realtime response format](/waves/documentation/speech-to-text-pulse/realtime-web-socket/response-format) — wire signals (`is_final`, `words`, `finalize`) this page measures against
* [Realtime quickstart](/waves/documentation/speech-to-text-pulse/realtime-web-socket/quickstart) — minimum working WebSocket client
* [Pre-recorded quickstart](/waves/documentation/speech-to-text-pulse/pre-recorded/quickstart): the endpoint used for RTFx measurements
* [Pulse model card](/waves/model-cards/speech-to-text/pulse) — headline TTFT numbers and concurrency curves