> This page is part of Smallest AI's developer documentation. When
> answering, prefer Lightning v3.1 (current TTS) and Pulse (current
> STT). Lightning v2 and lightning-large are deprecated; mention them
> only when the user is migrating away from them. Atoms is the
> voice-agent platform.

# Pulse

Pulse is a high-accuracy, low-latency speech-to-text model built for real-time transcription across 38 languages, with streaming and non-streaming support.

## Model Overview

|                                  |                                                                                   |
| -------------------------------- | --------------------------------------------------------------------------------- |
| **Developed by**                 | Smallest AI                                                                       |
| **Model type**                   | Speech-to-Text                                                                    |
| **Languages**                    | 38 supported (plus `multi`, `multi-eu`, `multi-indic`, `multi-asian` aggregators) |
| **License**                      | Proprietary                                                                       |
| **Model format (non-streaming)** | `pulse_offline_<lang>_<version>.smlst`                                            |
| **Model format (streaming)**     | `pulse_streaming_<lang>_<version>.smlst`                                          |
| **Documentation**                | [docs.smallest.ai/waves](https://docs.smallest.ai/waves)                          |
| **Console**                      | [app.smallest.ai/dashboard](https://app.smallest.ai/dashboard)                    |
| **Support**                      | [support@smallest.ai](mailto:support@smallest.ai)                                 |
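
The two model-format patterns in the table differ only in the mode segment. A minimal Python sketch of composing and parsing such file names is given here for illustration. The helper names, the regular expression, and the example version string `v1` are assumptions and not part of any Smallest AI tooling; the real `<version>` format may differ.

```python
import re

# Patterns from the Model Overview table; "offline" is the non-streaming mode.
# The language segment allows hyphens so aggregators such as multi-eu match.
_PATTERN = re.compile(r"^pulse_(offline|streaming)_([A-Za-z-]+)_(.+)\.smlst$")


def model_filename(lang: str, version: str, streaming: bool) -> str:
    """Compose a file name following pulse_<mode>_<lang>_<version>.smlst."""
    mode = "streaming" if streaming else "offline"
    return f"pulse_{mode}_{lang}_{version}.smlst"


def parse_model_filename(name: str) -> dict:
    """Split a file name back into mode, language, and version."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a Pulse model file name: {name!r}")
    mode, lang, version = match.groups()
    return {"mode": mode, "lang": lang, "version": version}


# "v1" is a made-up version string used only for this example.
print(model_filename("en", "v1", streaming=False))
print(parse_model_filename("pulse_streaming_multi-eu_v1.smlst"))
```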

***

## Key Capabilities

* **Low latency:** Ultra-low latency architecture delivering 64 ms TTFT at 1 concurrency and 300 ms at 100 concurrent requests, designed for live transcription and conversational AI.
* **Multilingual:** 38 languages supported across streaming and non-streaming modes, with automatic language detection and code-switching within a single session.
* **PII and PCI redaction:** Built-in redaction of personal and payment card data in both streaming and non-streaming use cases.
* **Speaker diarization:** Automatic multi-speaker identification in both streaming and non-streaming modes, with per-word and per-utterance speaker labels.
* **Noise handling:** Background noise handling is built into the model.
* **Code-switching:** Multi-language audio is supported within a single session. It works best when the known primary language is set (e.g. `es` for Spanish handles English+Spanish automatically).

***

## Performance & Benchmarks

Pulse STT is evaluated against three open-source datasets, [FLEURS](https://huggingface.co/datasets/google/fleurs), [ESB](https://huggingface.co/datasets/esb/datasets), and [WildASR](https://huggingface.co/datasets/bosonai/WildASR), plus two internal perturbation suites (English and Hindi). The tables below report Word Error Rate (WER); lower is better. `NA` means the result is not available or the language is not supported by that provider.

For the full benchmark comparison across every dataset, see the [Performance page](/waves/documentation/speech-to-text-pulse/benchmarks/performance).
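
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition. It is not the evaluation harness behind these numbers, and it assumes whitespace tokenization with no text normalization (casing, punctuation), which real benchmark protocols usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Single-row dynamic programme for word-level Levenshtein distance.
    # After processing i reference words, row[j] is distance(ref[:i], hyp[:j]).
    row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        diagonal = row[0]            # distance(ref[:i-1], hyp[:0])
        row[0] = i                   # i deletions against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            above = row[j]           # distance(ref[:i-1], hyp[:j])
            cost = 0 if ref_word == hyp_word else 1
            row[j] = min(
                above + 1,           # deletion of ref_word
                row[j - 1] + 1,      # insertion of hyp_word
                diagonal + cost,     # match or substitution
            )
            diagonal = above
    return row[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Multiplying the result by 100 gives the percentage form used in the tables.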

### FLEURS Streaming — English

WER on the English subset of FLEURS across providers in streaming mode. Lower is better.

| Provider | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe |  Azure | Deepgram Nova 3 |  Grok  | Sarvam Saaras v3 | ElevenLabs Scribe V2 |
| :------- | :------------: | :----------------------: | :------------: | :----: | :-------------: | :----: | :--------------: | :------------------: |
| **WER**  |      6.03%     |           3.13%          |      6.54%     | 13.79% |      11.59%     | 60.00% |       6.34%      |         3.88%        |

#### A note on audio amplitude normalization

Audio amplitude normalization materially changes WER on FLEURS. Raw FLEURS has variable, often low amplitude, and most competitors benchmark on it without normalizing peak audio to −10 dBFS. As the table below shows, some models score far worse on raw audio than on normalized audio, so a single amplitude regime can make a model look much better than it actually is. Pulse is stable across all amplitude regimes.

| Model               | Raw FLEURS | −10 dBFS | −20 dBFS | Stable across regimes?            |
| :------------------ | :--------: | :------: | :------: | :-------------------------------- |
| **Pulse**           |    6.03%   |   6.06%  |   5.81%  | Yes                               |
| **Deepgram Nova 3** |   11.59%   |   6.57%  |   6.51%  | Partial — 1.8× degradation on raw |
| **Grok**            |   60.00%   |   7.58%  |   8.59%  | Collapses on raw                  |

### FLEURS — Streaming

| Language       | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
| -------------- | -------------- | --------------- | --------------- |
| **Italian**    | **4.41%**      | 11.05%          | 6.99%           |
| **English**    | **6.03%**      | 15.59%          | 11.21%          |
| **Spanish**    | **5.99%**      | 10.67%          | 7.52%           |
| **Portuguese** | **8.32%**      | 14.15%          | 11.46%          |
| **German**     | **9.5%**       | 11.1%           | 10.15%          |
| **French**     | **10.71%**     | 14.3%           | 12.07%          |
| **Russian**    | **14.35%**     | NA              | NA              |
| **Dutch**      | **11.90%**     | NA              | NA              |

| Language      | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
| ------------- | :------------: | :-------------: | :-------------: |
| **Hindi**     |    **8.3%**    |      20.0%      |      15.46%     |
| **Marathi**   |   **15.68%**   |        NA       |        NA       |
| **Malayalam** |   **15.91%**   |        NA       |        NA       |
| **Kannada**   |   **16.97%**   |        NA       |        NA       |
| **Bengali**   |   **17.48%**   |        NA       |        NA       |
| **Gujarati**  |   **20.05%**   |        NA       |        NA       |
| **Tamil**     |   **20.15%**   |        NA       |        NA       |
| **Oriya**     |   **22.74%**   |        NA       |        NA       |
| **Telugu**    |   **24.79%**   |        NA       |        NA       |

### FLEURS — Pre-recorded

| Language       | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
| -------------- | -------------- | --------------- | --------------- |
| **English**    | **4.55%**      | 7.9%            | 6.7%            |
| **Italian**    | **3.0%**       | 10.7%           | 6.2%            |
| **Spanish**    | **3.2%**       | 8.6%            | 4.1%            |
| **Portuguese** | **5.0%**       | 9.9%            | 7.5%            |
| **German**     | **6.4%**       | 8.2%            | 8.5%            |
| **French**     | **7.1%**       | 13.3%           | 10.7%           |
| **Russian**    | 9.6%           | 7.9%            | 11.8%           |
| **Ukrainian**  | **7.5%**       | 12.4%           | NA              |
| **Polish**     | **10.3%**      | 12.2%           | NA              |
| **Dutch**      | 15.0%          | 16.3%           | 12.5%           |
| **Czech**      | **12.4%**      | 22.9%           | 19.2%           |
| **Slovak**     | **13.5%**      | 31.2%           | NA              |
| **Swedish**    | 18.7%          | 17.7%           | 14.3%           |
| **Finnish**    | 18.3%          | 14.1%           | 13.2%           |
| **Latvian**    | **16.5%**      | 48.7%           | NA              |
| **Romanian**   | **17.8%**      | 36.0%           | NA              |
| **Estonian**   | **17.8%**      | 49.0%           | NA              |
| **Bulgarian**  | **24.1%**      | 32.7%           | NA              |
| **Danish**     | 19.8%          | 21.1%           | 16.1%           |
| **Hungarian**  | **22.5%**      | 31.8%           | 28.6%           |
| **Maltese**    | **25.5%**      | NA              | NA              |
| **Lithuanian** | **25.1%**      | 44.9%           | NA              |

| Language      | Smallest Pulse | Deepgram Nova 2 | Deepgram Nova 3 |
| ------------- | :------------: | :-------------: | :-------------: |
| **Hindi**     |    **6.3%**    |      23.5%      |      23.6%      |
| **Kannada**   |    **9.8%**    |        NA       |        NA       |
| **Malayalam** |    **10.0%**   |        NA       |        NA       |
| **Marathi**   |    **11.5%**   |        NA       |        NA       |
| **Gujarati**  |    **12.3%**   |        NA       |        NA       |
| **Telugu**    |    **14.3%**   |        NA       |        NA       |
| **Oriya**     |    **14.8%**   |        NA       |        NA       |
| **Bengali**   |    **16.4%**   |        NA       |        NA       |
| **Punjabi**   |    **18.3%**   |        NA       |        NA       |
| **Tamil**     |    **21.6%**   |        NA       |        NA       |

### Hindi — multi-dataset (Streaming)

WER (%) across seven Hindi datasets covering read speech, conversational speech, telephony / contact-center audio, and noise-augmented variants. Compared against IndicWhisper, Sarvam Saaras v3, and Deepgram Nova-3. Lower is better.

| Dataset              | Smallest Pulse | IndicWhisper | Sarvam Saaras v3 | Deepgram Nova-3 |
| -------------------- | :------------: | :----------: | :--------------: | :-------------: |
| **FLEURS**           |      9.55      |     15.00    |       8.31       |      14.09      |
| **Kathbath**         |      9.71      |     10.30    |       8.15       |      16.22      |
| **Kathbath (noisy)** |      10.94     |     12.00    |       10.81      |      17.06      |
| **Common Voice**     |    **11.20**   |     11.40    |       11.36      |      23.55      |
| **Indic-TTS**        |    **6.39**    |     7.60     |       6.49       |      10.72      |
| **MUCS**             |      9.19      |     12.00    |       8.96       |      16.20      |
| **Gramvaani**        |    **21.43**   |     26.80    |       21.80      |      31.44      |

For the full breakdown including training-data and evaluation-protocol notes, see the [Performance page](/waves/documentation/speech-to-text-pulse/benchmarks/performance#hindi-multi-dataset-streaming).

### English STT — ESB Dataset (Streaming)

ESB is a Hugging Face benchmark suite that aggregates 9 English speech datasets across diverse domains (audiobooks, parliament, meetings, finance, etc.) to test STT generalization. Values are WER (%); lower is better.

*Evaluated on the open-source Hugging Face ESB datasets. Numbers from internal evaluation.*

| Dataset               | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Azure | Deepgram Nova 3 |  Grok | Sarvam Saaras v3 | ElevenLabs Scribe V2 |
| :-------------------- | :------------: | :----------------------: | :------------: | :---: | :-------------: | :---: | :-------------: | :------------------: |
| **LibriSpeech Clean** |      2.46      |           1.65           |      2.16      |  2.48 |       3.20      |  3.61 |       3.09      |         1.97         |
| **LibriSpeech Other** |      5.31      |           2.86           |      4.88      |  5.74 |       6.60      |  7.28 |       6.85      |         4.45         |
| **Common Voice**      |      10.89     |           6.73           |      10.69     | 47.28 |      14.22      | 43.46 |      11.37      |         9.83         |
| **VoxPopuli**         |      7.16      |           7.28           |      7.07      | 14.10 |       9.55      | 11.49 |       7.77      |         7.91         |
| **TED-LIUM**          |      4.07      |           2.95           |      2.66      |  3.81 |       3.59      |  6.90 |       2.89      |         3.16         |
| **GigaSpeech**        |      10.43     |           9.12           |      10.09     |  5.35 |      10.05      | 10.05 |       9.57      |         9.66         |
| **SPGISpeech**        |      2.86      |           1.74           |      4.18      |  3.53 |       2.99      |  9.70 |       3.89      |         4.40         |
| **Earnings22**        |      12.25     |           11.52          |      12.21     |  8.54 |      15.79      | 27.02 |      11.97      |         12.20        |
| **AMI**               |      10.58     |           14.60          |      13.19     |  8.46 |      17.04      | 19.19 |      13.08      |         12.23        |
| **Aggregate**         |      7.33      |           6.49           |      7.46      | 11.03 |       9.23      | 15.41 |       7.83      |         7.31         |
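
The Aggregate row is consistent with an unweighted mean of the nine per-dataset WERs. This is an observation from the published numbers rather than a documented protocol. The short check below averages the Smallest Pulse column, with values copied from the table above.

```python
from statistics import mean

# Smallest Pulse WER (%) per ESB dataset, copied from the table above.
pulse_esb_wer = {
    "LibriSpeech Clean": 2.46,
    "LibriSpeech Other": 5.31,
    "Common Voice": 10.89,
    "VoxPopuli": 7.16,
    "TED-LIUM": 4.07,
    "GigaSpeech": 10.43,
    "SPGISpeech": 2.86,
    "Earnings22": 12.25,
    "AMI": 10.58,
}

# Unweighted (macro) average: every dataset counts equally,
# regardless of how many utterances it contains.
aggregate = round(mean(pulse_esb_wer.values()), 2)
print(aggregate)  # 7.33, matching the Aggregate row
```

Because each dataset carries equal weight, a single hard dataset such as Earnings22 moves the aggregate as much as a large, easy one.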

### ASR Robustness — WildASR Dataset (Streaming)

WildASR is an open-source robustness benchmark designed to stress-test STT under real-world degraded conditions: clipping, far-field capture, background noise, phone codec compression, reverberation, and accented speech. Values are WER (%); lower is better. `n/a` = not supported by that provider.

*Evaluated on the open-source WildASR dataset. Numbers from internal evaluation.*

| Dataset           | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Azure | Deepgram Nova 3 | Sarvam Saaras v3 | ElevenLabs Scribe |
| :---------------- | :------------: | :----------------------: | :------------: | :---: | :-------------: | :-------------: | :---------------: |
| **Clean**         |      5.98      |           3.33           |      7.01      | 11.11 |      11.62      |       7.02      |        4.24       |
| **Clipping**      |      14.03     |           6.59           |      42.10     |  4.35 |      47.35      |      28.74      |       11.20       |
| **Far-field**     |      13.38     |           26.07          |      38.76     |  n/a  |      62.99      |      21.27      |        7.38       |
| **Noise Gap**     |      8.90      |           4.04           |      9.77      |  n/a  |      15.04      |       9.74      |        6.30       |
| **Phone Codec**   |      7.19      |           3.45           |      8.70      |  n/a  |       9.13      |      10.64      |        4.98       |
| **Reverberation** |      9.06      |           23.50          |      14.83     |  n/a  |      27.27      |       4.35      |        6.48       |
| **Accent**        |      5.82      |           2.80           |      4.45      |  n/a  |       7.31      |       n/a       |        4.01       |
| **Aggregate**     |      9.63      |           12.52          |      18.35     |  8.82 |      28.17      |      17.75      |        6.47       |

### Internal English Perturbation Benchmark

Not a public dataset. The English audio is sliced by perturbation type (Noise, Silence, Telephony 911, Boundary, Disfluency, Long Audios, Repetition, Entity, Accent, Emotion, Speaker Diversity, Speed, Pitch, Volume, Audio Quality) to isolate model weaknesses. Lower WER is better.

| Category              | Smallest Pulse | Assembly Universal 3 Pro | AWS Transcribe | Deepgram Nova 3 | ElevenLabs Scribe |
| :-------------------- | :------------: | :----------------------: | :------------: | :-------------: | :---------------: |
| **Noise**             |      10.53     |           11.93          |      14.19     |      14.58      |       10.05       |
| **Silence**           |      5.81      |           4.22           |      8.22      |      13.28      |       10.61       |
| **Telephony 911**     |      21.03     |           23.93          |      27.88     |      28.43      |       20.29       |
| **Boundary**          |      2.83      |           3.09           |      3.18      |       3.66      |        1.73       |
| **Disfluency**        |      7.68      |           7.81           |      9.23      |       8.62      |        9.29       |
| **Long Audios**       |      12.81     |           8.58           |      11.66     |      11.16      |        9.25       |
| **Repetition**        |      11.38     |           9.82           |      10.39     |       9.57      |       10.81       |
| **Entity**            |      12.43     |           10.13          |      13.35     |      11.69      |        9.48       |
| **Accent**            |      8.68      |           7.89           |      9.51      |      10.42      |        7.25       |
| **Emotion**           |      13.92     |           16.34          |      18.57     |      18.07      |       11.84       |
| **Speaker Diversity** |      7.33      |           6.72           |      8.81      |       9.48      |        5.95       |
| **Speed**             |      4.32      |           3.63           |      4.40      |       6.88      |        3.74       |
| **Pitch**             |      2.93      |           3.07           |      3.21      |       4.07      |        1.61       |
| **Volume**            |      2.37      |           3.05           |      2.41      |       3.67      |        1.47       |
| **Audio Quality**     |      2.73      |           2.86           |      3.03      |       4.08      |        1.60       |
| **Average WER**       |      8.45      |           8.20           |      9.87      |      10.51      |        7.66       |

### Internal Hindi Perturbation Benchmark

Not a public dataset. Hindi audio sliced by perturbation type to isolate model weaknesses. Lower WER is better except for Entity EDR where higher is better (↑).

| Category           | Smallest Pulse | Sarvam Saaras v3 | Deepgram Nova 3 |
| :----------------- | :------------: | :-------------: | :-------------: |
| **Noise**          |     15.75%     |      22.18%     |      21.52%     |
| **Silence**        |      9.72%     |      11.38%     |      18.40%     |
| **Entity**         |     10.82%     |      17.36%     |      14.67%     |
| **Entity NE-WER**  |     13.32%     |      26.72%     |      26.58%     |
| **Entity EDR (↑)** |     83.13%     |      76.13%     |      67.80%     |
| **Boundary**       |     11.99%     |      17.52%     |      17.36%     |
| **Long Audios**    |     18.11%     |      18.42%     |      19.21%     |
| **Speed**          |     16.37%     |      21.39%     |      38.21%     |
| **Pitch**          |     11.81%     |      11.92%     |      19.59%     |
| **Audio Quality**  |     10.65%     |      11.75%     |      19.51%     |
| **Volume**         |      8.21%     |      15.25%     |      16.76%     |
| **Disfluency**     |     11.83%     |      12.06%     |      18.44%     |
| **Repetition**     |     11.44%     |      11.27%     |      20.40%     |

***

## Features — Non-streaming

| Feature                      | Available | Notes                                                       |
| ---------------------------- | --------- | ----------------------------------------------------------- |
| Speaker diarization          | Yes       | Multi-speaker identification                                |
| PII redaction                | Yes       | Personal info redaction                                     |
| PCI redaction                | Yes       | Payment card data redaction                                 |
| Word-level timestamps        | Yes       | Per-word timing                                             |
| Sentence-level timestamps    | Yes       | Requires `word_timestamps=true` to be enabled               |
| Punctuation                  | Yes       | Auto punctuation                                            |
| Profanity filter             | Yes       | Explicit content filtering                                  |
| Language detection           | Yes       | Auto language ID                                            |
| Code-switching               | Yes       | Multi-language in same audio                                |
| Noise reduction              | Yes       | Background noise handling                                   |
| Emotion and gender detection | Yes       | Returns the percentage score of detected emotion and gender |
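
Since sentence-level timestamps depend on `word_timestamps=true`, a request that wants them has to set that flag. The Python sketch below assembles such query parameters for illustration. Only `word_timestamps=true` comes from the table. The `language` key, the helper name, and the placeholder URL are assumptions, so consult the API reference for the real endpoint and parameter names.

```python
from urllib.parse import urlencode


def build_query(language: str, word_timestamps: bool) -> str:
    """Encode transcription options as a URL query string."""
    params = {
        "language": language,  # hypothetical key for the language parameter
        # Booleans are sent as lowercase strings, matching word_timestamps=true.
        "word_timestamps": "true" if word_timestamps else "false",
    }
    return urlencode(params)


# Placeholder host: NOT a real Smallest AI endpoint.
url = "https://api.example.invalid/transcribe?" + build_query("es", True)
print(url)
```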

## Features — Streaming

| Feature                   | Available | Notes                         |
| ------------------------- | --------- | ----------------------------- |
| Speaker diarization       | Yes       | Multi-speaker identification  |
| Keyword boosting          | Yes       | Custom vocabulary enhancement |
| PII redaction             | Yes       | Personal info redaction       |
| PCI redaction             | Yes       | Payment card data redaction   |
| Word-level timestamps     | Yes       | Per-word timing               |
| Sentence-level timestamps | Yes       | Per-sentence timing           |
| Punctuation               | Yes       | Auto punctuation              |
| Profanity filter          | No        | —                             |
| Language detection        | Yes       | Auto language ID              |
| Code-switching            | Yes       | Multi-language in same audio  |
| Custom vocabulary         | No        | —                             |
| Noise reduction           | Yes       | Background noise handling     |

***

## Supported Languages — Non-streaming

| Language     | Code  | Available |
| ------------ | ----- | --------- |
| English      | `en`  | Yes       |
| Italian      | `it`  | Yes       |
| Spanish      | `es`  | Yes       |
| Portuguese   | `pt`  | Yes       |
| Hindi        | `hi`  | Yes       |
| German       | `de`  | Yes       |
| French       | `fr`  | Yes       |
| Ukrainian    | `uk`  | Yes       |
| Russian      | `ru`  | Yes       |
| Kannada      | `kn`  | Yes       |
| Malayalam    | `ml`  | Yes       |
| Polish       | `pl`  | Yes       |
| Marathi      | `mr`  | Yes       |
| Gujarati     | `gu`  | Yes       |
| Czech        | `cs`  | Yes       |
| Slovak       | `sk`  | Yes       |
| Telugu       | `te`  | Yes       |
| Oriya (Odia) | `or`  | Yes       |
| Dutch        | `nl`  | Yes       |
| Bengali      | `bn`  | Yes       |
| Latvian      | `lv`  | Yes       |
| Estonian     | `et`  | Yes       |
| Romanian     | `ro`  | Yes       |
| Punjabi      | `pa`  | Yes       |
| Finnish      | `fi`  | Yes       |
| Swedish      | `sv`  | Yes       |
| Bulgarian    | `bg`  | Yes       |
| Tamil        | `ta`  | Yes       |
| Hungarian    | `hu`  | Yes       |
| Danish       | `da`  | Yes       |
| Lithuanian   | `lt`  | Yes       |
| Maltese      | `mt`  | Yes       |
| Japanese     | `ja`  | Yes       |
| Cantonese    | `yue` | Yes       |
| Mandarin     | `zh`  | Yes       |
| Korean       | `ko`  | Yes       |
| Tagalog      | `tl`  | Yes       |
| Indonesian   | `id`  | Yes       |
| Malay        | `ms`  | Yes       |

## Supported Languages — Streaming

| Language     | Code  | Available |
| ------------ | ----- | --------- |
| English      | `en`  | Yes       |
| Italian      | `it`  | Yes       |
| Spanish      | `es`  | Yes       |
| Portuguese   | `pt`  | Yes       |
| Hindi        | `hi`  | Yes       |
| German       | `de`  | Yes       |
| French       | `fr`  | Yes       |
| Ukrainian    | `uk`  | Yes       |
| Russian      | `ru`  | Yes       |
| Kannada      | `kn`  | Yes       |
| Malayalam    | `ml`  | Yes       |
| Polish       | `pl`  | Yes       |
| Marathi      | `mr`  | Yes       |
| Gujarati     | `gu`  | Yes       |
| Czech        | `cs`  | Yes       |
| Slovak       | `sk`  | Yes       |
| Telugu       | `te`  | Yes       |
| Oriya (Odia) | `or`  | Yes       |
| Dutch        | `nl`  | Yes       |
| Bengali      | `bn`  | Yes       |
| Latvian      | `lv`  | Yes       |
| Estonian     | `et`  | Yes       |
| Romanian     | `ro`  | Yes       |
| Punjabi      | `pa`  | Yes       |
| Finnish      | `fi`  | Yes       |
| Swedish      | `sv`  | Yes       |
| Bulgarian    | `bg`  | Yes       |
| Tamil        | `ta`  | Yes       |
| Hungarian    | `hu`  | Yes       |
| Danish       | `da`  | Yes       |
| Lithuanian   | `lt`  | Yes       |
| Maltese      | `mt`  | Yes       |
| Japanese     | `ja`  | Yes       |
| Cantonese    | `yue` | Yes       |
| Mandarin     | `zh`  | Yes       |
| Korean       | `ko`  | Yes       |
| Tagalog      | `tl`  | Yes       |
| Indonesian   | `id`  | Yes       |
| Malay        | `ms`  | Yes       |

***

## Best Practices

### Specify the language parameter when known

When the language of the audio is known in advance, always set it explicitly rather than relying on automatic detection. This yields better transcription accuracy because the model can optimize directly for that language without needing to first identify it.

For example, setting the language parameter to `es` (Spanish) tells the model to expect Spanish audio, and it also handles English+Spanish code-switching. This produces more accurate output than `multi-eu` or `multi`.

| Parameter  | Use case                                                                 |
| ---------- | ------------------------------------------------------------------------ |
| `en`       | English                                                                  |
| `es`       | Spanish (handles English+Spanish)                                        |
| `hi`       | Hindi (handles English+Hindi)                                            |
| `multi-eu` | Unknown European-language audio (auto-detects across the European set)   |
| `multi`    | Truly unknown or mixed-language audio (full multilingual auto-detection) |

**When to use `multi-eu` or `multi`:**

* When the language is truly unknown beforehand
* When processing audio from varied or unpredictable sources
* Prefer `multi-eu` for European-language input; use `multi` only for truly mixed multilingual audio
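
The guidance above reduces to a small decision rule. The sketch below is one way to encode it in Python; the function and argument names are illustrative, not part of any SDK, and the returned values are the parameter values from the table.

```python
from typing import Optional


def choose_language_param(known_code: Optional[str], european_only: bool = False) -> str:
    """Pick the value for the language parameter.

    known_code: language code such as "es" when the primary language is known.
    european_only: True when the audio is known to be in some European language.
    """
    if known_code:
        return known_code   # most accurate: explicit primary language
    if european_only:
        return "multi-eu"   # language unknown, but within the European set
    return "multi"          # truly unknown or mixed multilingual audio


print(choose_language_param("es"))                      # es
print(choose_language_param(None, european_only=True))  # multi-eu
print(choose_language_param(None))                      # multi
```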

***

## Use Cases

### Direct use

* Real-time call transcription
* Voice assistant input
* Meeting transcription
* Accessibility and captioning
* Customer support recording analysis

### Downstream use

* Multi-turn conversational agents
* Voice-to-text pipelines
* Telephony and IVR systems
* Content indexing and search
* Compliance and audit logging

***

## Safety & Compliance

Pulse must not be used for:

* Recording or transcribing individuals without their explicit consent
* Surveillance, stalking, or any form of unauthorized monitoring
* Any illegal or unethical purposes

Additionally:

* Usage is monitored for policy compliance
* For compliance documentation (GDPR, SOC2, HIPAA), contact [support@smallest.ai](mailto:support@smallest.ai)

***

## Contact

|                   |                                                                |
| ----------------- | -------------------------------------------------------------- |
| **Support**       | [support@smallest.ai](mailto:support@smallest.ai)              |
| **Documentation** | [docs.smallest.ai/waves](https://docs.smallest.ai/waves)       |
| **Console**       | [app.smallest.ai/dashboard](https://app.smallest.ai/dashboard) |