---
title: Performance
description: 'Latency, accuracy, and throughput benchmarks for Pulse STT'
---
## Latency Metrics
### Time-to-First-Transcript (TTFT)
Pulse STT delivers a state-of-the-art TTFT of \~**64ms**, which is among the lowest in the world.
TTFT (Time-to-First-Transcript) measures the latency between when a user stops speaking and when the model returns the complete transcript. A lower TTFT means faster responses and a better user experience in real-time applications.
| Model | Latency (ms) |
| --- | --- |
| Smallest Pulse STT | 64 |
| Deepgram Nova 2 | 76 |
| Deepgram Nova 3 | 71 |
| Assembly AI Universal | 698 |
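
To reproduce a TTFT number under the definition above, it helps to log two timestamps per utterance: when speech ended and when the transcript arrived. The sketch below is illustrative only; the function names and the `(speech_end_s, transcript_received_s)` event format are assumptions, not part of the Pulse STT API. Both timestamps are assumed to come from the same monotonic clock (for example `time.monotonic()`).

```python
from statistics import median
from typing import Dict, Sequence, Tuple


def ttft_ms(speech_end_s: float, transcript_received_s: float) -> float:
    """Milliseconds between the end of speech and arrival of the transcript."""
    if transcript_received_s < speech_end_s:
        raise ValueError("transcript cannot arrive before speech ends")
    return (transcript_received_s - speech_end_s) * 1000.0


def summarize_ttft(events: Sequence[Tuple[float, float]]) -> Dict[str, float]:
    """Median and worst-case TTFT over (speech_end_s, transcript_received_s) pairs."""
    samples = [ttft_ms(end, received) for end, received in events]
    return {"median_ms": median(samples), "max_ms": max(samples)}
```

Reporting the median alongside the worst case keeps a single slow request from hiding behind a good average.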
## Accuracy Metrics
### Word Error Rate (WER)
All models were evaluated on the FLEURS dataset, a standardised multilingual speech benchmark ensuring fair cross-model comparison.
| Language | WER |
| --- | --- |
| English | 5.1% |
| Italian | 4.2% |
| Spanish | 5.4% |
| Hindi | 11.4% |
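
WER is conventionally the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference. The sketch below shows that standard calculation on whitespace-separated words. It is not the evaluation harness used for the numbers above, and any text normalisation (casing, punctuation) is left to the caller as an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, dropping one word from a six-word reference gives a WER of 1/6, or about 16.7%.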
## Throughput
### Requests Per Second
| Audio Length | HTTP POST (requests/s) |
| --- | --- |
| Short (< 5s) | 50-100 |
| Medium (5-30s) | 20-50 |
| Long (30s+) | 10-20 |
*Throughput varies based on audio length, format, and server load*
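
As a rough planning aid, the ranges above can be turned into a best-case and worst-case wall-clock estimate for a batch of similar-length files. This is a back-of-the-envelope sketch: the tier names and the handling of clips that are exactly 5s or 30s long are assumptions, and real throughput depends on format and server load as noted above.

```python
from typing import Tuple

# Requests-per-second ranges copied from the table above: (low, high).
THROUGHPUT_RPS = {
    "short": (50, 100),
    "medium": (20, 50),
    "long": (10, 20),
}


def tier_for(duration_s: float) -> str:
    """Map a clip duration to a throughput tier (boundary handling is assumed)."""
    if duration_s < 5:
        return "short"
    if duration_s < 30:
        return "medium"
    return "long"


def estimate_batch_seconds(n_requests: int, duration_s: float) -> Tuple[float, float]:
    """Return (best_case_s, worst_case_s) to process n_requests clips."""
    low_rps, high_rps = THROUGHPUT_RPS[tier_for(duration_s)]
    return n_requests / high_rps, n_requests / low_rps
```

Under these assumptions, 1,000 three-second clips would take roughly 10 to 20 seconds.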
## Performance by Audio Format
### Linear16 (PCM)
* **Latency**: Lowest (\~64ms)
* **Accuracy**: Highest
* **Bandwidth**: Highest
* **Best for**: High-quality applications
### Opus
* **Latency**: Low (\~70-80ms)
* **Accuracy**: High
* **Bandwidth**: Low
* **Best for**: Browser/mobile applications
### FLAC
* **Latency**: Medium (\~80-90ms)
* **Accuracy**: Highest
* **Bandwidth**: Medium
* **Best for**: Archival/quality-critical use cases
### μ-law
* **Latency**: Low (\~65-75ms)
* **Accuracy**: Good
* **Bandwidth**: Lowest
* **Best for**: Telephony applications
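
For the two uncompressed formats, bandwidth follows directly from the sample rate and sample width: linear16 uses 16 bits per sample and μ-law uses 8. The helper below is a simple illustration of that arithmetic. Opus and FLAC are compressed, so their bitrates depend on encoder settings and audio content and are not modelled here.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Raw bitrate of an uncompressed audio stream in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


# Mono linear16 at 16 kHz: 16,000 samples/s * 16 bits = 256 kbps.
linear16_16k = pcm_bitrate_kbps(16_000, 16)

# Mono mu-law at the 8 kHz rate typical of telephony: 8,000 * 8 bits = 64 kbps.
mulaw_8k = pcm_bitrate_kbps(8_000, 8)
```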
## Performance by Language
### High-Performance Languages
* **Italian**: 4.2% WER, \~64ms latency
* **English**: 5.1% WER, \~64ms latency
* **Spanish**: 5.4% WER, \~64ms latency
* **Portuguese**: 7.1% WER, \~64ms latency
* **German**: 8.5% WER, \~64ms latency
* **French**: 9.2% WER, \~64ms latency
### Regional Variations
* **Indian Languages**: 10-15% WER, \~90-100ms latency
* **Eastern European**: 9-12% WER, \~85-95ms latency
## Feature Impact on Performance
### Diarization
* **Latency Impact**: +10-20ms
* **Accuracy Impact**: Minimal
* **Use When**: Multiple speakers present
### Word Timestamps
* **Latency Impact**: +5-10ms
* **Accuracy Impact**: None
* **Use When**: Timing information needed
### Emotion Detection
* **Latency Impact**: +15-25ms
* **Accuracy Impact**: None
* **Use When**: Emotion analysis required
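
To budget latency when several features are enabled, one simple approach is to add each feature's overhead range to the \~64ms baseline. The sketch below assumes the overheads add up linearly, which this page does not state. The dictionary keys are illustrative labels, not API parameter names.

```python
from typing import Iterable, Tuple

BASE_LATENCY_MS = 64  # baseline TTFT quoted on this page

# (low, high) latency overhead in milliseconds, from the lists above.
FEATURE_OVERHEAD_MS = {
    "diarization": (10, 20),
    "word_timestamps": (5, 10),
    "emotion_detection": (15, 25),
}


def estimate_latency_ms(features: Iterable[str]) -> Tuple[int, int]:
    """Return a (low, high) latency estimate, assuming overheads are additive."""
    low = high = BASE_LATENCY_MS
    for name in features:
        extra_low, extra_high = FEATURE_OVERHEAD_MS[name]
        low += extra_low
        high += extra_high
    return low, high
```

With all three features enabled, this estimate comes to roughly 94 to 119ms.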
## Optimization Tips
* Use 16kHz sample rate for optimal balance
* Choose linear16 format for lowest latency
* Enable only needed features to reduce latency
* Batch process when latency isn't critical
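
When latency is not critical, files can be submitted concurrently from a worker pool instead of one at a time. The sketch below uses Python's standard `concurrent.futures` module. `transcribe` is a hypothetical caller-supplied function that sends one file and returns its transcript; no specific Pulse STT client call is assumed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence


def batch_transcribe(
    paths: Sequence[str],
    transcribe: Callable[[str], str],  # hypothetical: sends one file, returns text
    max_workers: int = 8,
) -> Dict[str, str]:
    """Transcribe many files concurrently and map each path to its transcript."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input paths.
        transcripts = list(pool.map(transcribe, paths))
    return dict(zip(paths, transcripts))
```

Keep `max_workers` modest so the request rate stays within the throughput ranges listed earlier on this page.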
## Next Steps
* [Metrics Overview](/waves/documentation/speech-to-text/benchmarks/metrics-overview)
* [Evaluation Walkthrough](/waves/documentation/speech-to-text/benchmarks/evaluation-walkthrough)
* [Best Practices](/waves/documentation/speech-to-text/pre-recorded/best-practices)