Evaluation Walkthrough

Our evaluation guide outlines a repeatable process: choose representative audio, generate transcripts, compute WER/CER/latency, and document findings. The ready-to-run snippets below call the Pulse pre-recorded REST endpoint directly so you can audit the methodology and adapt it to any HTTP client.

This guide uses the Pulse REST API directly (via requests) rather than the smallestai Python SDK. The SDK is a convenience wrapper; the REST endpoint is the contract. Use the same approach in any language with an HTTP client — the request and response shapes are identical.

1. Assemble your dataset matrix

  • Collect 50–200 files per use case (support calls, meetings, media, etc.).
  • Produce verified reference transcripts plus optional speaker labels and timestamps.
  • Track metadata for accent, language, and audio quality so you can pivot metrics later.

dataset = [
    {"audio": "samples/en_agent01.wav", "reference": "Thank you for calling.", "language": "en"},
    {"audio": "samples/es_call02.wav", "reference": "Hola, ¿en qué puedo ayudarte?", "language": "es"},
    # ... 50–200 entries
]
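
If references and metadata are kept in a spreadsheet, the list above can be built from a manifest file instead of being typed by hand. The following is a minimal sketch, assuming a hypothetical CSV whose columns are audio, reference, and language plus any metadata you want to pivot on; the accent and quality column names and the load_manifest helper are illustrative and not part of the Pulse API.

```python
import csv
import io

REQUIRED = ("audio", "reference", "language")

def load_manifest(fileobj) -> list[dict]:
    """Read a CSV manifest into the list-of-dicts shape used in this guide."""
    dataset = []
    # start=2 because line 1 of the file is the header row.
    for row_number, row in enumerate(csv.DictReader(fileobj), start=2):
        missing = [key for key in REQUIRED if not (row.get(key) or "").strip()]
        if missing:
            raise ValueError(f"row {row_number}: missing {missing}")
        # Extra columns (accent, quality, ...) are kept as metadata for later pivots.
        dataset.append(
            {key: (value or "").strip() for key, value in row.items() if key is not None}
        )
    return dataset

# Inline demo manifest (hypothetical). In practice, pass
# open("manifest.csv", newline="", encoding="utf-8") instead.
demo = io.StringIO(
    "audio,reference,language,accent,quality\n"
    "samples/en_agent01.wav,Thank you for calling.,en,us,clean\n"
    'samples/es_call02.wav,"Hola, ¿en qué puedo ayudarte?",es,mx,telephony\n'
)
dataset = load_manifest(demo)
```

Failing fast on a missing reference or language avoids discovering a malformed row halfway through a paid transcription run.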

2. Install the evaluation toolkit

$ pip install requests jiwer whisper-normalizer pandas

  • requests → HTTP client for the Pulse REST API
  • jiwer → WER/CER computation
  • whisper-normalizer → text normalization that matches official ASR evaluation guidance
  • pandas → aggregating results

The Python standard library (wave, time) covers audio-duration and latency measurement — no extra dependencies needed.

3. Transcribe + normalize

The Pulse pre-recorded endpoint accepts the raw audio bytes as application/octet-stream and returns JSON with transcription, words[], utterances[], and metadata. Latency is measured client-side; Real-Time Factor (RTF) is wall_time ÷ audio_duration — values below 1.0 indicate faster-than-real-time processing.

import os
import time
import wave
import requests
from jiwer import wer, cer
from whisper_normalizer.english import EnglishTextNormalizer

API_KEY = os.environ["SMALLEST_API_KEY"]
PULSE_URL = "https://api.smallest.ai/waves/v1/stt/?model=pulse"

normalizer = EnglishTextNormalizer()

def audio_duration_seconds(path: str) -> float:
    """Read the audio duration from a WAV header. Use librosa/soundfile for non-WAV inputs."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def run_sample(sample: dict) -> dict:
    """Transcribe one sample and compute per-sample metrics."""
    with open(sample["audio"], "rb") as f:
        audio_bytes = f.read()

    t_start = time.perf_counter()
    resp = requests.post(
        PULSE_URL,
        params={
            "language": sample["language"],
            "word_timestamps": str(sample.get("word_timestamps", True)).lower(),
            "diarize": str(sample.get("diarize", False)).lower(),
        },
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/octet-stream",
        },
        data=audio_bytes,
        timeout=60,
    )
    latency_ms = (time.perf_counter() - t_start) * 1000
    resp.raise_for_status()
    data = resp.json()

    duration_s = audio_duration_seconds(sample["audio"])
    rtf = (latency_ms / 1000) / duration_s  # processing time ÷ audio duration; lower is better

    ref = normalizer(sample["reference"])
    hyp = normalizer(data["transcription"])

    return {
        "path": sample["audio"],
        "wer": wer(ref, hyp),
        "cer": cer(ref, hyp),
        "latency_ms": latency_ms,
        "rtf": rtf,
        "duration_s": duration_s,
        "transcription": data["transcription"],
        "words": data.get("words", []),
        "utterances": data.get("utterances", []),
    }

For multilingual evaluation, swap the normalizer:

from whisper_normalizer.basic import BasicTextNormalizer
normalizer = BasicTextNormalizer()  # locale-agnostic; works for Hindi, Spanish, French, etc.

4. Batch evaluation + aggregation

import pandas as pd

results = [run_sample(s) for s in dataset]
df = pd.DataFrame(results)

summary = {
    "samples": len(df),
    "avg_wer": df["wer"].mean(),
    "p95_wer": df["wer"].quantile(0.95),
    "avg_latency_ms": df["latency_ms"].mean(),
    "p95_latency_ms": df["latency_ms"].quantile(0.95),
    "avg_rtf": df["rtf"].mean(),
}
print(summary)

Recommended metrics to report

  • WER / CER per use case and language.
  • End-to-end latency (p50 / p90 / p95) from latency_ms.
  • RTF from rtf — values below 1.0 mean Pulse processed audio faster than real-time. Pulse STT typically runs near 0.4 on clean inputs (see Metrics Overview).
  • Diarization coverage: share of utterances entries that include a speaker field.
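
The latency percentiles and diarization coverage listed above can be computed without pandas. The sketch below uses a nearest-rank percentile, which can differ slightly from the linear interpolation that pandas' quantile applies by default. The helper names and the sample values are illustrative; the only assumption about the response is that diarized utterances carry a speaker field, as described above.

```python
import math

def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def diarization_coverage(utterances: list[dict]) -> float:
    """Share of utterances that include a speaker field (0.0 when there are none)."""
    if not utterances:
        return 0.0
    labelled = sum(1 for u in utterances if u.get("speaker") is not None)
    return labelled / len(utterances)

# Illustrative latencies in milliseconds, not measured results.
latencies = [120.0, 95.0, 210.0, 130.0, 101.0, 99.0, 150.0, 480.0, 110.0, 125.0]
report = {f"p{p}_latency_ms": percentile(latencies, p) for p in (50, 90, 95)}

coverage = diarization_coverage([
    {"text": "hello", "speaker": 0},
    {"text": "hi there", "speaker": 1},
    {"text": "okay"},  # no speaker label
    {"text": "bye", "speaker": 0},
])
```

Reporting p50 next to p95 shows whether a high average comes from a few slow outliers or from uniformly slow requests.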

5. Error analysis

def breakdown(df):
    worst = df.sort_values("wer", ascending=False).head(5)
    return worst[["path", "wer", "transcription"]].to_dict(orient="records")

outliers = breakdown(df)

  • Classify errors into substitutions, deletions, insertions (jiwer.process_words(ref, hyp).alignments gives the alignment).
  • Highlight audio traits (noise, accent, telephony codec) that correlate with higher WER.
  • For proper-noun mistakes, queue them for a pronunciation dictionary if you also use Lightning TTS, or for the Pulse keyword boosting feature on the streaming endpoint.
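
jiwer reports substitution, deletion, and insertion counts directly. To show where that taxonomy comes from, the sketch below is a dependency-free stand-in that derives the same categories from a word-level edit-distance alignment. The edit_ops name is hypothetical, inputs are assumed to be already normalized, and when several optimal alignments exist the split between categories may differ from the one jiwer reports.

```python
def edit_ops(ref: str, hyp: str) -> dict:
    """Count word-level hits, substitutions, deletions, and insertions."""
    r, h = ref.split(), hyp.split()
    # dist[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dist = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dist[i][0] = i
    for j in range(len(h) + 1):
        dist[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # hit or substitution
                dist[i - 1][j] + 1,         # deletion (reference word dropped)
                dist[i][j - 1] + 1,         # insertion (extra hypothesis word)
            )

    # Walk back from the bottom-right corner to label each step.
    counts = {"hits": 0, "substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = len(r), len(h)
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and r[i - 1] == h[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            counts["hits" if same else "substitutions"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts

ops = edit_ops(
    "thank you for calling acme support",
    "thank you calling acne support today",
)
errors = ops["substitutions"] + ops["deletions"] + ops["insertions"]
sample_wer = errors / len("thank you for calling acme support".split())
```

In this made-up pair, "for" is deleted, "acme" is substituted with "acne", and "today" is inserted, so three errors over six reference words give a WER of 0.5. The substitution is the kind of proper-noun mistake the last bullet above suggests queueing for keyword boosting.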

6. Compare configurations

configs = [
    {"language": "en", "word_timestamps": True},
    {"language": "multi", "word_timestamps": True, "diarize": True},
]

def evaluate_config(config: dict, dataset: list) -> pd.DataFrame:
    forwardable = {"language", "word_timestamps", "diarize"}
    overridden = [{**s, **{k: v for k, v in config.items() if k in forwardable}} for s in dataset]
    return pd.DataFrame([run_sample(s) for s in overridden])

for config in configs:
    cfg_df = evaluate_config(config, dataset)
    print(config, cfg_df["wer"].mean(), cfg_df["latency_ms"].mean())

Use this comparison to decide whether to enable diarization, sentence-level timestamps, or enrichment features, and record the cost and latency impact of each option alongside its accuracy.
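
One way to make the trade-off explicit is to report each configuration as a delta against a baseline. This is a minimal sketch: compare_to_baseline and the configuration labels are hypothetical, and the numbers are placeholders rather than measured Pulse results.

```python
def compare_to_baseline(summaries: dict[str, dict], baseline: str) -> dict[str, dict]:
    """Return per-metric differences (candidate minus baseline) for every other config."""
    base = summaries[baseline]
    return {
        name: {metric: round(values[metric] - base[metric], 4) for metric in base}
        for name, values in summaries.items()
        if name != baseline
    }

# Placeholder aggregates; in practice fill these from each config's results.
summaries = {
    "en": {"avg_wer": 0.064, "avg_latency_ms": 310.0},
    "multi_diarize": {"avg_wer": 0.071, "avg_latency_ms": 355.0},
}
deltas = compare_to_baseline(summaries, baseline="en")
```

A positive delta means the candidate is worse on that metric, which makes it easy to state a result such as "diarization costs N ms of latency at M points of WER" in the report.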

7. Publish the report

Include:

  1. Dataset description + rationale
  2. Metrics table (WER / CER / latency / RTF, p50 / p90 / p95)
  3. Error taxonomy with audio snippets
  4. Configuration recommendation (e.g., language=multi, word_timestamps=true, diarize=true)
  5. Follow-up experiments or model versions to track

Example JSON summary

{
  "dataset": "contact-center-q1",
  "samples": 120,
  "average_wer": 0.064,
  "average_cer": 0.028,
  "average_latency_ms": 61.3,
  "p95_latency_ms": 88.2,
  "average_rtf": 0.41,
  "timestamp": "2026-05-22T10:00:00Z"
}
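
A summary with these keys can be assembled from the per-sample metrics using only the standard library. This is a sketch under stated assumptions: build_summary is a hypothetical helper, the input rows mimic the wer, cer, latency_ms, and rtf fields produced per sample earlier in this guide, the p95 uses the nearest-rank method, and the values are placeholders.

```python
import json
import math
from datetime import datetime, timezone
from statistics import mean

def build_summary(name: str, results: list[dict]) -> dict:
    """Aggregate per-sample metrics into a JSON-serializable report summary."""
    latencies = sorted(r["latency_ms"] for r in results)
    # Nearest-rank p95: smallest latency with at least 95% of samples at or below it.
    p95 = latencies[max(1, math.ceil(95 * len(latencies) / 100)) - 1]
    return {
        "dataset": name,
        "samples": len(results),
        "average_wer": round(mean(r["wer"] for r in results), 3),
        "average_cer": round(mean(r["cer"] for r in results), 3),
        "average_latency_ms": round(mean(latencies), 1),
        "p95_latency_ms": round(p95, 1),
        "average_rtf": round(mean(r["rtf"] for r in results), 2),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }

# Placeholder per-sample rows, not measured results.
rows = [
    {"wer": 0.0, "cer": 0.0, "latency_ms": 300.0, "rtf": 0.1},
    {"wer": 0.1, "cer": 0.04, "latency_ms": 420.0, "rtf": 0.2},
    {"wer": 0.05, "cer": 0.02, "latency_ms": 380.0, "rtf": 0.15},
    {"wer": 0.25, "cer": 0.1, "latency_ms": 900.0, "rtf": 0.35},
]
summary = build_summary("demo-set", rows)
summary_json = json.dumps(summary, indent=2)
```

Stamping the summary with a UTC time makes it possible to line reports up against model or configuration changes later.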

The latency_ms you measure with time.perf_counter() includes the network round trip from your client to api.smallest.ai, so numbers vary with client region. Pulse's published server-side TTFT of ~64 ms is measured at the inference pod (see the Pulse model card); your wall-clock measurement is typically higher because of network time. RTF stays meaningful either way: as computed in this guide it is wall-clock time ÷ audio duration, so the fixed per-request overhead shrinks relative to the total as the audio gets longer.
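
The dilution effect is easy to see with a little arithmetic. The sketch below assumes a fixed network overhead of 150 ms per request and a server-side RTF of 0.4; both figures are illustrative inputs chosen for the example, not measurements, and client_side_rtf is a hypothetical helper.

```python
def client_side_rtf(server_rtf: float, overhead_s: float, audio_s: float) -> float:
    """RTF as seen by the client: (server processing + fixed overhead) / audio duration."""
    server_processing_s = server_rtf * audio_s
    return (server_processing_s + overhead_s) / audio_s

# Same overhead, increasingly long audio clips (seconds).
observed = {
    audio_s: round(client_side_rtf(0.4, 0.15, audio_s), 4)
    for audio_s in (2, 10, 60)
}
```

Under these assumptions the 2-second clip shows a client-side RTF of about 0.475, while the 60-second clip sits at about 0.4025, close to the assumed server-side value. When comparing RTF across datasets, keep clip lengths similar or report RTF per duration bucket.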

This completes the evaluation walkthrough. With these steps, you can identify the strengths and weaknesses of Pulse STT, or compare it against any other ASR system that returns a similar response shape, on workloads representative of your production traffic.