---
title: Performance
description: Latency, accuracy, and throughput benchmarks for Pulse STT
---

## Latency Metrics

### Time-to-First-Transcript (TTFT)

Pulse STT delivers a state-of-the-art TTFT of \~**64ms**, the lowest among the models compared below.

TTFT (Time to First Transcript) measures the latency between when a user stops speaking and when the model returns the complete transcript. A lower TTFT means faster responses and a better user experience in real-time applications.

| Model | Latency (ms) |
| --- | --- |
| Smallest Pulse STT | 64 |
| Deepgram Nova 2 | 76 |
| Deepgram Nova 3 | 71 |
| Assembly AI Universal | 698 |
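To compare these figures against your own setup, it can help to time requests yourself. The following is a minimal sketch of a timing harness, not an official benchmark script: `measure_latency_ms` and `fake_transcribe` are hypothetical names, and `fake_transcribe` is only a stand-in for whatever client call you actually use. Note that it measures the full round trip of a call, including network time, so the numbers it produces are not directly comparable to the TTFT values above.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(call: Callable[[], object], runs: int = 20) -> Dict[str, float]:
    """Time `call` repeatedly and summarize wall-clock latency in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile over the sorted samples.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "min": samples[0],
        "median": statistics.median(samples),
        "p95": samples[p95_index],
        "max": samples[-1],
    }


def fake_transcribe() -> str:
    # Hypothetical stand-in for a real STT request; replace with your own client call.
    time.sleep(0.005)
    return "hello world"


if __name__ == "__main__":
    stats = measure_latency_ms(fake_transcribe, runs=10)
    for name, value in stats.items():
        print(f"{name}: {value:.1f} ms")
```

Reporting the median and a high percentile rather than a single run gives a steadier picture, since individual requests can be skewed by network jitter.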

## Accuracy Metrics

### Word Error Rate (WER)

All models were evaluated on the FLEURS dataset, a standardised multilingual speech benchmark ensuring fair cross-model comparison.

| Language | WER |
| --- | --- |
| English | 5.1% |
| Italian | 4.2% |
| Spanish | 5.4% |
| Hindi | 11.4% |
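For reference, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of words in the reference. The sketch below computes it with a word-level edit distance. The function name `word_error_rate` is illustrative, and the text normalization used here (lowercasing and splitting on whitespace) is an assumption; the normalization applied in the benchmark above is not stated, so results from this sketch may differ from the published figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"{wer:.1%}")  # prints 33.3% (one substitution and one deletion over six words)
```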

## Throughput

### Requests Per Second

| Audio Length | Requests per second (HTTP POST) |
| --- | --- |
| Short (< 5s) | 50-100 |
| Medium (5-30s) | 20-50 |
| Long (30s+) | 10-20 |
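As a rough planning aid, the ranges in the table above can be turned into a best-case and worst-case estimate for processing a batch of files. This is a sketch under stated assumptions: the function names are hypothetical, the ranges are assumed to hold under sustained load (which the table does not state), and a clip of exactly 30 seconds is assigned to the "long" bucket because the table leaves that boundary ambiguous.

```python
from typing import Dict, Tuple

# Requests-per-second ranges copied from the throughput table (HTTP POST).
THROUGHPUT_RPS: Dict[str, Tuple[int, int]] = {
    "short": (50, 100),  # audio shorter than 5 s
    "medium": (20, 50),  # audio of 5-30 s
    "long": (10, 20),    # audio of 30 s or more
}


def bucket_for(duration_s: float) -> str:
    """Map an audio duration in seconds to a row of the throughput table."""
    if duration_s < 5:
        return "short"
    if duration_s < 30:
        return "medium"
    return "long"  # assumption: exactly 30 s counts as "long"


def estimate_batch_seconds(num_files: int, duration_s: float) -> Tuple[float, float]:
    """Return (best_case, worst_case) seconds to submit num_files requests."""
    low_rps, high_rps = THROUGHPUT_RPS[bucket_for(duration_s)]
    return num_files / high_rps, num_files / low_rps


if __name__ == "__main__":
    best, worst = estimate_batch_seconds(1000, duration_s=12.0)
    print(f"1000 medium clips: {best:.0f}-{worst:.0f} s")  # prints 1000 medium clips: 20-50 s
```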

*Throughput varies based on audio length, format, and server load.*

## Performance by Audio Format

### Linear16 (PCM)

* **Latency**: Lowest (\~64ms)
* **Accuracy**: Highest
* **Bandwidth**: Highest
* **Best for**: High-quality applications

### Opus

* **Latency**: Low (\~70-80ms)
* **Accuracy**: High
* **Bandwidth**: Low
* **Best for**: Browser/mobile applications

### FLAC

* **Latency**: Medium (\~80-90ms)
* **Accuracy**: Highest
* **Bandwidth**: Medium
* **Best for**: Archival/quality-critical use cases

### μ-law

* **Latency**: Low (\~65-75ms)
* **Accuracy**: Good
* **Bandwidth**: Lowest
* **Best for**: Telephony applications

## Performance by Language

### High-Performance Languages

* **Italian**: 4.2% WER, \~64ms latency
* **English**: 5.1% WER, \~64ms latency
* **Spanish**: 5.4% WER, \~64ms latency
* **Portuguese**: 7.1% WER, \~64ms latency
* **German**: 8.5% WER, \~64ms latency
* **French**: 9.2% WER, \~64ms latency

### Regional Variations

* **Indian Languages**: 10-15% WER, \~90-100ms latency
* **Eastern European**: 9-12% WER, \~85-95ms latency

## Feature Impact on Performance

### Diarization

* **Latency Impact**: +10-20ms
* **Accuracy Impact**: Minimal
* **Use When**: Multiple speakers present

### Word Timestamps

* **Latency Impact**: +5-10ms
* **Accuracy Impact**: None
* **Use When**: Timing information needed

### Emotion Detection

* **Latency Impact**: +15-25ms
* **Accuracy Impact**: None
* **Use When**: Emotion analysis required

## Optimization Tips

* Use a 16kHz sample rate for an optimal balance
* Choose the linear16 format for the lowest latency
* Enable only the features you need to reduce latency
* Batch process when latency isn't critical

## Next Steps

* [Metrics Overview](/waves/documentation/speech-to-text-pulse/benchmarks/metrics-overview)
* [Evaluation Walkthrough](/waves/documentation/speech-to-text-pulse/benchmarks/evaluation-walkthrough)
* [Best Practices](/waves/documentation/speech-to-text-pulse/pre-recorded/best-practices)
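
### Appendix: Estimating a Latency Budget

The per-format latencies and per-feature impacts listed under Performance by Audio Format and Feature Impact on Performance can be combined into a rough latency budget. The sketch below assumes the feature impacts simply add on top of the format's base latency, which this page does not confirm, so treat the result as an estimate only. The dictionary keys and the function name `estimate_latency_ms` are hypothetical labels for illustration, not API parameter names.

```python
from typing import Dict, Iterable, Tuple

# Base latency ranges in milliseconds, copied from Performance by Audio Format.
FORMAT_LATENCY_MS: Dict[str, Tuple[int, int]] = {
    "linear16": (64, 64),
    "opus": (70, 80),
    "flac": (80, 90),
    "mulaw": (65, 75),
}

# Added latency ranges in milliseconds, copied from Feature Impact on Performance.
FEATURE_IMPACT_MS: Dict[str, Tuple[int, int]] = {
    "diarization": (10, 20),
    "word_timestamps": (5, 10),
    "emotion_detection": (15, 25),
}


def estimate_latency_ms(audio_format: str, features: Iterable[str] = ()) -> Tuple[int, int]:
    """Return an estimated (low, high) latency range in milliseconds.

    Assumption: feature impacts are additive and independent of the format.
    """
    low, high = FORMAT_LATENCY_MS[audio_format]
    for feature in features:
        extra_low, extra_high = FEATURE_IMPACT_MS[feature]
        low += extra_low
        high += extra_high
    return low, high


if __name__ == "__main__":
    low, high = estimate_latency_ms("opus", ["diarization", "word_timestamps"])
    print(f"Estimated latency: {low}-{high} ms")  # prints Estimated latency: 85-110 ms
```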