---
title: Performance
description: 'Latency, accuracy, and throughput benchmarks for Pulse STT'
---

## Latency Metrics

### Time-to-First-Transcript (TTFT)

Pulse STT delivers a state-of-the-art TTFT of \~**64ms**, the lowest of the models compared in the table below.

TTFT (Time to First Transcript) measures the latency between when a user stops speaking and when the model returns the complete transcript. Lower TTFT means faster response times and a better user experience in real-time applications.

| Model | Latency (ms) |
| --- | --- |
| Smallest Pulse STT | 64 |
| Deepgram Nova 2 | 76 |
| Deepgram Nova 3 | 71 |
| Assembly AI Universal | 698 |
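
To check a TTFT figure against your own traffic, one option is to log, for each utterance, when speech ended and when the transcript arrived, then summarize the differences. The sketch below assumes you already have those timestamp pairs in seconds. `summarize_ttft` and its input shape are hypothetical names chosen for illustration, not part of any Pulse STT SDK, and the source does not say which statistic the table reports.

```python
import math
import statistics


def summarize_ttft(events):
    """Summarize TTFT from (speech_end_s, transcript_received_s) pairs.

    Both timestamps are assumed to be in seconds on the same clock.
    """
    # Convert each pair to a latency in milliseconds and sort for percentiles.
    latencies_ms = sorted((received - ended) * 1000.0 for ended, received in events)
    if not latencies_ms:
        raise ValueError("at least one measurement is required")

    # Nearest-rank 95th percentile: smallest value with >= 95% of samples at or below it.
    rank = math.ceil(0.95 * len(latencies_ms))
    return {
        "count": len(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[rank - 1],
    }


if __name__ == "__main__":
    # Illustrative timestamps only, not measured data.
    sample = [(0.0, 0.060), (1.0, 1.064), (2.0, 2.070)]
    print(summarize_ttft(sample))
```

Reporting a tail percentile alongside the median helps when comparing providers, since a single average can hide occasional slow responses.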

## Accuracy Metrics

### Word Error Rate (WER)

All models were evaluated on the FLEURS dataset, a standardised multilingual speech benchmark ensuring fair cross-model comparison.

| Language | WER |
| --- | --- |
| English | 5.1% |
| Italian | 4.2% |
| Spanish | 5.4% |
| Hindi | 11.4% |
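
For readers who want to see how a WER figure is derived: WER is conventionally computed as (S + D + I) / N, where S, D, and I are the word-level substitutions, deletions, and insertions needed to turn the reference transcript into the model output, and N is the number of words in the reference. The following is a minimal sketch of that calculation using a standard edit-distance table. It splits on whitespace and applies no text normalization. The source does not state which normalization was used for the FLEURS results, so numbers from this sketch may not match the table exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(f"{word_error_rate('the cat sat on the mat', 'the cat sit on mat'):.3f}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.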

## Throughput

### Requests Per Second

| Audio Length | HTTP POST (requests/sec) |
| --- | --- |
| Short (< 5s) | 50-100 |
| Medium (5-30s) | 20-50 |
| Long (30s+) | 10-20 |

*Throughput varies based on audio length, format, and server load*

## Performance by Audio Format

### Linear16 (PCM)

* **Latency**: Lowest (\~64ms)
* **Accuracy**: Highest
* **Bandwidth**: Highest
* **Best for**: High-quality applications

### Opus

* **Latency**: Low (\~70-80ms)
* **Accuracy**: High
* **Bandwidth**: Low
* **Best for**: Browser/mobile applications

### FLAC

* **Latency**: Medium (\~80-90ms)
* **Accuracy**: Highest
* **Bandwidth**: Medium
* **Best for**: Archival/quality-critical use cases

### μ-law

* **Latency**: Low (\~65-75ms)
* **Accuracy**: Good
* **Bandwidth**: Lowest
* **Best for**: Telephony applications

## Performance by Language

### High-Performance Languages

* **Italian**: 4.2% WER, \~64ms latency
* **English**: 5.1% WER, \~64ms latency
* **Spanish**: 5.4% WER, \~64ms latency
* **Portuguese**: 7.1% WER, \~64ms latency
* **German**: 8.5% WER, \~64ms latency
* **French**: 9.2% WER, \~64ms latency

### Regional Variations

* **Indian Languages**: 10-15% WER, \~90-100ms latency
* **Eastern European**: 9-12% WER, \~85-95ms latency

## Feature Impact on Performance

### Diarization

* **Latency Impact**: +10-20ms
* **Accuracy Impact**: Minimal
* **Use When**: Multiple speakers present

### Word Timestamps

* **Latency Impact**: +5-10ms
* **Accuracy Impact**: None
* **Use When**: Timing information needed

### Emotion Detection

* **Latency Impact**: +15-25ms
* **Accuracy Impact**: None
* **Use When**: Emotion analysis required

## Optimization Tips

* Use 16kHz sample rate for optimal balance
* Choose linear16 format for lowest latency
* Enable only needed features to reduce latency
* Batch process when latency isn't critical

## Next Steps

* [Metrics Overview](/waves/documentation/speech-to-text/benchmarks/metrics-overview)
* [Evaluation Walkthrough](/waves/documentation/speech-to-text/benchmarks/evaluation-walkthrough)
* [Best Practices](/waves/documentation/speech-to-text/pre-recorded/best-practices)
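
As a worked example of the figures in "Performance by Audio Format" and "Feature Impact on Performance", the sketch below adds the per-feature overhead ranges to a format's base latency to get a rough latency budget. It assumes overheads stack additively, which the source does not confirm. The dictionary keys and the `estimate_latency_ms` helper are hypothetical labels for illustration, not API parameter values.

```python
# Base latency ranges (ms) per audio format, copied from the figures above.
FORMAT_LATENCY_MS = {
    "linear16": (64, 64),
    "opus": (70, 80),
    "flac": (80, 90),
    "mulaw": (65, 75),
}

# Added latency ranges (ms) per optional feature, copied from the figures above.
FEATURE_OVERHEAD_MS = {
    "diarization": (10, 20),
    "word_timestamps": (5, 10),
    "emotion_detection": (15, 25),
}


def estimate_latency_ms(audio_format, features=()):
    """Return a (low, high) latency estimate in ms.

    Assumption: feature overheads simply add to the format's base latency.
    """
    low, high = FORMAT_LATENCY_MS[audio_format]
    for feature in features:
        extra_low, extra_high = FEATURE_OVERHEAD_MS[feature]
        low += extra_low
        high += extra_high
    return low, high


if __name__ == "__main__":
    # Opus (70-80) + diarization (10-20) + word timestamps (5-10)
    print(estimate_latency_ms("opus", ["diarization", "word_timestamps"]))  # (85, 110)
```

Under this assumption, enabling all three features on Linear16 would add 30-55ms to the \~64ms baseline, which is why the optimization tips suggest enabling only the features you need.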