---
title: Performance
description: Latency, accuracy, and throughput benchmarks for Pulse STT
---


## Latency Metrics

### Time-to-First-Transcript (TTFT)

Pulse STT delivers a state-of-the-art TTFT of \~**64ms**, among the lowest of any speech-to-text model.

<Accordion title="TTFT Comparison Analysis">
  TTFT (Time to First Transcript) measures the latency between when a user stops speaking and when the model returns the complete transcript. Lower TTFT means faster response times and better user experience in real-time applications.

  <table>
    <thead>
      <tr>
        <th>
          Model
        </th>

        <th>
          Latency (ms)
        </th>
      </tr>
    </thead>

    <tbody>
      <tr>
        <td>
          Smallest Pulse STT
        </td>

        <td>
          64
        </td>
      </tr>

      <tr>
        <td>
          Deepgram Nova 2
        </td>

        <td>
          76
        </td>
      </tr>

      <tr>
        <td>
          Deepgram Nova 3
        </td>

        <td>
          71
        </td>
      </tr>

      <tr>
        <td>
          Assembly AI Universal
        </td>

        <td>
          698
        </td>
      </tr>
    </tbody>
  </table>
</Accordion>
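To reproduce a latency figure on your own setup, a small timing harness is usually enough. The sketch below is a minimal example, not the method used to produce the numbers above: `measure_latency_ms` and `fake_transcribe` are hypothetical names, and `fake_transcribe` only sleeps in place of a real request. For a TTFT-style measurement, the timed call would start when the last audio is sent and return when the transcript arrives.

```python
import statistics
import time


def measure_latency_ms(call, runs=20):
    """Time `call()` repeatedly and summarise wall-clock latency in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()  # should return once the transcript has been received
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def fake_transcribe():
    """Hypothetical stand-in for a real STT request; it only sleeps for 10 ms."""
    time.sleep(0.01)


stats = measure_latency_ms(fake_transcribe, runs=5)
print(f"median latency: {stats['median']:.1f} ms")
```

Reporting the median rather than the mean keeps a single slow request from skewing the result. Numbers measured this way include network round-trip time, so they depend on where the client runs.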

## Accuracy Metrics

### Word Error Rate (WER)

All models were evaluated on the FLEURS dataset, a standardised multilingual speech benchmark ensuring fair cross-model comparison.

<table>
  <thead>
    <tr>
      <th>
        Language
      </th>

      <th>
        WER
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        English
      </td>

      <td>
        5.1%
      </td>
    </tr>

    <tr>
      <td>
        Italian
      </td>

      <td>
        4.2%
      </td>
    </tr>

    <tr>
      <td>
        Spanish
      </td>

      <td>
        5.4%
      </td>
    </tr>

    <tr>
      <td>
        Hindi
      </td>

      <td>
        11.4%
      </td>
    </tr>
  </tbody>
</table>
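WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition. It assumes plain whitespace tokenisation and no text normalisation; the normalisation applied for the figures above is not specified here, so results may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One of six reference words is missing from the hypothesis.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.1%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.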

## Throughput

### Requests Per Second

<table>
  <thead>
    <tr>
      <th>
        Audio Length
      </th>

      <th>
        HTTP POST (requests/s)
      </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td>
        Short (< 5s)
      </td>

      <td>
        50-100
      </td>
    </tr>

    <tr>
      <td>
        Medium (5-30s)
      </td>

      <td>
        20-50
      </td>
    </tr>

    <tr>
      <td>
        Long (30s+)
      </td>

      <td>
        10-20
      </td>
    </tr>
  </tbody>
</table>

*Throughput varies with audio length, format, and server load.*
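The ranges above can be turned into a rough estimate of how long a batch will take. This is an illustrative sketch only: the function names are hypothetical, the estimate assumes the quoted rate is sustained for the whole batch, and a clip of exactly 30 seconds is treated as "medium" because the table leaves that boundary ambiguous.

```python
# Requests-per-second ranges copied from the throughput table above.
RPS_RANGES = {
    "short": (50, 100),   # < 5s
    "medium": (20, 50),   # 5-30s
    "long": (10, 20),     # 30s+
}


def bucket_for(duration_s: float) -> str:
    """Map a clip duration in seconds to a row of the throughput table."""
    if duration_s < 5:
        return "short"
    if duration_s <= 30:  # assumption: exactly 30s counts as medium
        return "medium"
    return "long"


def estimated_batch_seconds(num_files: int, duration_s: float) -> tuple[float, float]:
    """Return (best_case, worst_case) seconds to submit num_files clips."""
    low_rps, high_rps = RPS_RANGES[bucket_for(duration_s)]
    return num_files / high_rps, num_files / low_rps


best, worst = estimated_batch_seconds(1000, duration_s=3)
print(f"1000 short clips: {best:.0f}-{worst:.0f} s")
```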

## Performance by Audio Format

### Linear16 (PCM)

* **Latency**: Lowest (\~64ms)
* **Accuracy**: Highest
* **Bandwidth**: Highest
* **Best for**: High-quality applications

### Opus

* **Latency**: Low (\~70-80ms)
* **Accuracy**: High
* **Bandwidth**: Low
* **Best for**: Browser/mobile applications

### FLAC

* **Latency**: Medium (\~80-90ms)
* **Accuracy**: Highest
* **Bandwidth**: Medium
* **Best for**: Archival/quality-critical use cases

### μ-law

* **Latency**: Low (\~65-75ms)
* **Accuracy**: Good
* **Bandwidth**: Lowest
* **Best for**: Telephony applications

## Performance by Language

### High-Performance Languages

* **Italian**: 4.2% WER, \~64ms latency
* **English**: 5.1% WER, \~64ms latency
* **Spanish**: 5.4% WER, \~64ms latency
* **Portuguese**: 7.1% WER, \~64ms latency
* **German**: 8.5% WER, \~64ms latency
* **French**: 9.2% WER, \~64ms latency

### Regional Variations

* **Indian Languages**: 10-15% WER, \~90-100ms latency
* **Eastern European**: 9-12% WER, \~85-95ms latency

## Feature Impact on Performance

### Diarization

* **Latency Impact**: +10-20ms
* **Accuracy Impact**: Minimal
* **Use When**: Multiple speakers present

### Word Timestamps

* **Latency Impact**: +5-10ms
* **Accuracy Impact**: None
* **Use When**: Timing information needed

### Emotion Detection

* **Latency Impact**: +15-25ms
* **Accuracy Impact**: None
* **Use When**: Emotion analysis required
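For a quick latency budget, the per-feature ranges above can be added to the base TTFT. The sketch below assumes the overheads are additive, which this page does not state. The feature keys are hypothetical labels chosen for illustration, not API parameter names.

```python
BASE_TTFT_MS = 64  # base TTFT quoted on this page

# (low, high) latency overhead in ms, copied from the sections above.
FEATURE_OVERHEAD_MS = {
    "diarization": (10, 20),
    "word_timestamps": (5, 10),
    "emotion_detection": (15, 25),
}


def estimated_latency_ms(features):
    """Return (low, high) latency in ms, assuming feature overheads simply add up."""
    low = BASE_TTFT_MS + sum(FEATURE_OVERHEAD_MS[f][0] for f in features)
    high = BASE_TTFT_MS + sum(FEATURE_OVERHEAD_MS[f][1] for f in features)
    return low, high


low, high = estimated_latency_ms(["diarization", "word_timestamps"])
print(f"estimated latency: {low}-{high} ms")
```

With all three features enabled, the same arithmetic gives roughly 94-119ms, which is why enabling only the features you need matters for real-time use.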

## Optimization Tips

* Use a 16kHz sample rate for a good balance between accuracy and bandwidth
* Choose the linear16 format for the lowest latency
* Enable only needed features to reduce latency
* Batch process when latency isn't critical
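To see what the first two tips cost in bandwidth, note that linear16 stores each sample as a 16-bit (2-byte) integer, so the raw data rate follows directly from the sample rate. The helper below is a small worked example; it assumes mono audio by default and ignores any container or transport overhead.

```python
def pcm_bytes_per_second(sample_rate_hz: int, channels: int = 1) -> int:
    """Raw data rate of linear16 audio: 2 bytes per sample per channel."""
    bytes_per_sample = 2  # 16-bit PCM
    return sample_rate_hz * bytes_per_sample * channels


rate = pcm_bytes_per_second(16_000)
kbps = rate * 8 / 1000
print(f"16 kHz mono linear16: {rate} bytes/s ({kbps:.0f} kbit/s)")
```

At 16kHz mono this comes to 32,000 bytes per second, or about 1.9 MB per minute of audio.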

## Next Steps

* [Metrics Overview](/waves/documentation/speech-to-text-pulse/benchmarks/metrics-overview)
* [Evaluation Walkthrough](/waves/documentation/speech-to-text-pulse/benchmarks/evaluation-walkthrough)
* [Best Practices](/waves/documentation/speech-to-text-pulse/pre-recorded/best-practices)