Vercel AI SDK
Use Smallest AI as a speech and transcription provider in the Vercel AI SDK. Generate speech and transcribe audio with a few lines of code. The package also exposes streaming WebSocket transcription and voice cloning APIs that sit alongside the Vercel SpeechModelV2 / TranscriptionModelV2 interfaces.
Latest: smallestai-vercel-provider@0.6.2 — adds browser-native streaming (no proxy required), microphone capture hooks, auto-reconnect on socket drops (with counter reset across the session, so multi-hour streams survive sporadic blips), and a security-validated signedUrl flow for production browser apps. The SDK lazy-loads its ws dependency so browser-only consumers (using auth: 'query' or signedUrl) ship a smaller bundle and don’t need the bufferutil / serverExternalPackages setup that older versions required.
Installation
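Assuming the package name shown above (with the ai package as a peer dependency):

```bash
npm install smallestai-vercel-provider ai
```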
Setup
Get your API key from waves.smallest.ai and set it as an environment variable:
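For example, in .env.local (the variable name SMALLEST_API_KEY is an assumption; match whatever name the provider factory reads, or pass the key to the factory explicitly):

```bash
SMALLEST_API_KEY=your-api-key-here
```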
Text-to-Speech
Supported model: lightning-v3.1 — 44.1 kHz, natural expressive speech, voice cloning, ~100 ms latency, 22 languages plus auto for code-switching detection (see the Lightning v3.1 model card for the full list). The package also exports DEFAULT_LIGHTNING_MODEL so you don’t have to hard-code the id; bumping it on a new Lightning release flows through to every caller that imports the constant.
Pass outputFormat under providerOptions.smallestai.outputFormat, not as Vercel’s top-level outputFormat arg — the SDK rejects the top-level form with a warning. See TTS Options below.
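A minimal sketch using the AI SDK's experimental_generateSpeech. The smallestai provider import and the speech() factory shape are assumptions based on this page; outputFormat goes under providerOptions.smallestai as noted above:

```ts
import { experimental_generateSpeech as generateSpeech } from 'ai';
import { smallestai, DEFAULT_LIGHTNING_MODEL } from 'smallestai-vercel-provider';

const { audio } = await generateSpeech({
  model: smallestai.speech(DEFAULT_LIGHTNING_MODEL), // currently lightning-v3.1
  text: 'Hello from Smallest AI!',
  providerOptions: {
    smallestai: {
      outputFormat: 'mp3', // here, not as the top-level outputFormat arg
    },
  },
});

// audio holds the generated bytes (e.g. audio.uint8Array) plus content-type metadata
```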
Speech-to-Text (batch)
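A minimal batch sketch using the AI SDK's experimental_transcribe; the transcription() factory and the 'pulse' model id are assumptions:

```ts
import { readFile } from 'node:fs/promises';
import { experimental_transcribe as transcribe } from 'ai';
import { smallestai } from 'smallestai-vercel-provider';

const result = await transcribe({
  model: smallestai.transcription('pulse'), // model id assumed; check the provider's model list
  audio: await readFile('./meeting.wav'),
});

console.log(result.text);
```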
Next.js API Route Example
Create a TTS endpoint in your Next.js app:
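A sketch of a route handler, reusing the assumed provider import from above:

```ts
// app/api/tts/route.ts
import { experimental_generateSpeech as generateSpeech } from 'ai';
import { smallestai, DEFAULT_LIGHTNING_MODEL } from 'smallestai-vercel-provider';

export async function POST(req: Request) {
  const { text } = await req.json();

  const { audio } = await generateSpeech({
    model: smallestai.speech(DEFAULT_LIGHTNING_MODEL),
    text,
    providerOptions: { smallestai: { outputFormat: 'mp3' } },
  });

  return new Response(audio.uint8Array, {
    headers: { 'Content-Type': 'audio/mpeg' }, // matches the requested mp3 outputFormat
  });
}
```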
Play it in the browser:
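Plain browser APIs, nothing provider-specific:

```ts
const res = await fetch('/api/tts', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ text: 'Hello from the browser!' }),
});

const url = URL.createObjectURL(await res.blob());
await new Audio(url).play();
```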
Provider Options
TTS Options
Batch STT Options
ageDetection was removed from the API and emits a deprecation warning if set.
itnNormalize, sentenceTimestamps, finalizeOnWords, maxWords, eouTimeoutMs are accepted only on the streaming WebSocket — TS will error if you set them on transcribe(). Use smallestai.transcriptionStream(...) (below) for those.
Streaming Speech-to-Text (WebSocket)
For real-time transcription (TTFT ~64 ms server-side), the SDK exposes a WebSocket session that wraps wss://api.smallest.ai/waves/v1/pulse/get_text with the canonical Authorization: Bearer flow. WS-only flags like itnNormalize, sentenceTimestamps, finalizeOnWords, maxWords, and eouTimeoutMs only take effect on this path.
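A server-side sketch. transcriptionStream, connect(), and closeStream() are named on this page; the frame-event subscription, sendAudio(), the 'pulse' model id, and the text field on frames are assumptions:

```ts
import { createReadStream } from 'node:fs';
import { smallestai } from 'smallestai-vercel-provider';

// WS-only flags (itnNormalize, sentenceTimestamps, eouTimeoutMs, ...) are accepted here
const session = smallestai.transcriptionStream('pulse', {
  itnNormalize: true,
  sentenceTimestamps: true,
  eouTimeoutMs: 800,
});

session.on('frame', (frame) => {
  if (frame.is_final) console.log('final:', frame.text);
  if (frame.is_last) console.log('stream finished');
});

await session.connect();

// Feed PCM chunks as they arrive (e.g. linear16 @ 16 kHz mono)
for await (const chunk of createReadStream('./audio.raw')) {
  session.sendAudio(chunk as Buffer);
}

// Signal end of audio; the server replies with a final is_last frame
session.closeStream();
```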
One-shot helper for pre-recorded audio
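A sketch of the helper named at the end of this section; the exact call signature and return shape are assumptions:

```ts
import { readFile } from 'node:fs/promises';
import { smallestai } from 'smallestai-vercel-provider';

// Accepts the same WS-only options as transcriptionStream (signature assumed)
const transcript = await smallestai.transcribeOnce('pulse', {
  audio: await readFile('./podcast.wav'),
  itnNormalize: true,
});

// Returns the is_final frames already concatenated for you
console.log(transcript);
```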
Voice Cloning
The provider exposes the voice-cloning REST endpoints alongside TTS / STT. Defaults to lightning-v3.1; the legacy lightning-v2 model is rejected upstream.
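A sketch of the cloning flow. Only the endpoints' existence and the lightning-v3.1 default are stated above; the cloneVoice method name and its argument shape are hypothetical:

```ts
import { readFile } from 'node:fs/promises';
import { smallestai } from 'smallestai-vercel-provider';

// Hypothetical helper name and shape; consult the provider's voice-cloning reference
const voice = await smallestai.cloneVoice({
  name: 'support-agent',
  sample: await readFile('./reference.wav'), // a clean reference recording
  model: 'lightning-v3.1',                   // the legacy lightning-v2 id is rejected upstream
});

console.log(voice.id); // use this voice id in subsequent TTS calls
```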
Patterns & Caveats
Accumulate full_transcript client-side
The streaming API accepts fullTranscript: true, but the server-side full_transcript field is currently returned as an empty string on every frame. Concatenate is_final: true frames yourself instead:
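A sketch, reusing the session and frame shape from the streaming example above (the event API and the text field name are assumptions):

```ts
let fullTranscript = '';

session.on('frame', (frame) => {
  // full_transcript arrives empty on every frame, so accumulate finals yourself
  if (frame.is_final && frame.text) {
    fullTranscript += (fullTranscript ? ' ' : '') + frame.text;
  }
  if (frame.is_last) {
    console.log('complete transcript:', fullTranscript);
  }
});
```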
The transcribeOnce() helper does this for you — use it for the pre-recorded case.
Browser streaming — three options
The default transcriptionStream() uses an Authorization: Bearer header that native browser WebSocket can’t set. Three patterns for browser apps, in order of recommendation:
A. Proxy via your server (recommended for production)
The server holds the API key; the browser never sees it. The SDK ships a one-line helper that turns the stream into Server-Sent Events:
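A route-handler sketch. The page only states that a one-line SSE helper exists; the transcriptionStreamToSSE name and its signature here are hypothetical:

```ts
// app/api/transcribe-stream/route.ts
import { smallestai, transcriptionStreamToSSE } from 'smallestai-vercel-provider'; // helper name hypothetical

export async function POST(req: Request) {
  // The browser POSTs audio chunks as a streaming request body (see the mic hooks below)
  const session = smallestai.transcriptionStream('pulse', { itnNormalize: true });

  // Pipe incoming audio into the WS session and return a text/event-stream Response
  return transcriptionStreamToSSE(session, req.body!);
}
```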
The browser parses the SSE response with the matching helper:
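A client-side sketch; the parseTranscriptionSSE name is hypothetical, and duplex: 'half' is the standard fetch requirement for streaming request bodies:

```ts
import { parseTranscriptionSSE } from 'smallestai-vercel-provider'; // helper name hypothetical

declare const micChunks: ReadableStream<Uint8Array>; // produced by the mic hooks below

const res = await fetch('/api/transcribe-stream', {
  method: 'POST',
  body: micChunks,
  duplex: 'half', // required by fetch when the request body is a stream
} as RequestInit & { duplex: 'half' });

for await (const frame of parseTranscriptionSSE(res)) {
  if (frame.is_final) console.log(frame.text);
}
```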
Next.js setup, one-time — only required for server-side auth: 'header' (the default, used by the SSE proxy above). Add this to next.config.{js,mjs,ts}:
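Assuming the setup referenced above means externalizing ws and its native companions (the exact package list may differ):

```ts
// next.config.ts
import type { NextConfig } from 'next';

const nextConfig: NextConfig = {
  // Keep ws and its optional native addons out of Next.js bundling
  serverExternalPackages: ['ws', 'bufferutil', 'utf-8-validate'],
};

export default nextConfig;
```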
And install the optional native deps so ws masks frames at native speed:
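bufferutil is named above; utf-8-validate is ws's other optional native addon and is usually installed alongside it:

```bash
npm install bufferutil utf-8-validate
```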
Browser-only consumers using auth: 'query' (option C) or signedUrl (option B) don’t need this — the SDK lazy-loads ws only when the Authorization header path is reached, so browser bundles never pull in ws or its Node-only deps.
B. Browser-native via signed URL (also production-grade)
Your server mints a short-lived URL on demand; the browser opens the WebSocket directly with that URL. Same security profile as (A) but with one less hop:
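A client-side sketch using the connection-config slot described later in this section; the mint endpoint's response shape ({ url }) is an assumption:

```ts
import { smallestai } from 'smallestai-vercel-provider';

const session = smallestai.transcriptionStream(
  'pulse',
  { sentenceTimestamps: true },
  {
    // Called on every connect() and on every reconnect, so each session gets a fresh URL
    signedUrl: async () => {
      const res = await fetch('/api/get-stream-url', { method: 'POST' });
      const { url } = await res.json(); // response shape assumed
      return url as string;
    },
    allowedSignedHosts: ['api.smallest.ai'],
  },
);

await session.connect();
```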
The SDK calls signedUrl() on every connect() and on every reconnect, so each session uses a fresh URL. Your server endpoint (/api/get-stream-url) decides how to scope and time-bound those URLs.
C. Browser-native with API key in URL (dev / internal apps only)
The API key appears in the WebSocket URL — visible in browser devtools, history, server access logs, and any error reporting tool that captures URLs. The SDK emits a one-time console.warn when this mode is used so it can’t be deployed unnoticed. Use only for dev / internal apps with per-user-scoped keys; for end-user production, use option (A) or (B).
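A browser-only sketch of the query-auth mode; the createSmallestAI factory name is an assumption:

```ts
import { createSmallestAI } from 'smallestai-vercel-provider'; // factory name assumed

declare const DEV_SCOPED_KEY: string; // a per-user, dev-scoped key, never a production secret

const smallestai = createSmallestAI({ apiKey: DEV_SCOPED_KEY });

const session = smallestai.transcriptionStream(
  'pulse',
  { itnNormalize: true },
  {
    auth: 'query', // puts the key in the WS URL; triggers the one-time console.warn
    // suppressInsecureAuthWarning: true, // only if you have accepted the exposure
  },
);

await session.connect();
```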
Auto-reconnect on socket drops
Long-running sessions drop sockets for prosaic reasons (network blips, load-balancer recycling, idle timeouts). Pass autoReconnect: true and the SDK transparently re-opens with the same parameters and emits a synthetic { type: 'reconnected', attempt } frame so consumers can react:
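A sketch of the reconnect flow (frame-handling API assumed as before):

```ts
const session = smallestai.transcriptionStream('pulse', {
  autoReconnect: true,
  maxReconnectAttempts: 5, // consecutive failures; resets after every successful reconnect
});

session.on('frame', (frame) => {
  if (frame.type === 'reconnected') {
    console.log(`socket re-opened (attempt ${frame.attempt})`);
    return;
  }
  if (frame.is_final) console.log(frame.text);
});

await session.connect();
```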
Reconnect only fires on unexpected closes — is_last, an explicit closeStream(), and server-emitted error frames all terminate cleanly without retry.
maxReconnectAttempts counts consecutive failed attempts: the counter resets to zero after every successful reconnect, so a multi-hour stream that survives one blip per hour does not exhaust its retry budget across the whole session.
The optional 3rd argument to transcriptionStream(modelId, options, config) lets you override per-session connection config — the autoReconnect knobs above can also live there if you want them outside the WS-protocol options. The same slot accepts auth: 'query', signedUrl, signedUrlTimeoutMs, allowedSignedHosts, and suppressInsecureAuthWarning.
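For example, moving the reconnect knobs into the connection-config slot instead of the WS options:

```ts
const session = smallestai.transcriptionStream(
  'pulse',                 // modelId
  { itnNormalize: true },  // WS-protocol options
  {                        // per-session connection config
    auth: 'header',
    autoReconnect: true,
    maxReconnectAttempts: 5,
  },
);
```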
Microphone capture (browser)
Live captions and voice agents need raw mic data on the wire. The SDK ships two browser-side hooks for this:
useMicrophoneTranscription (high-level)
The all-in-one: captures the mic, streams chunks to your SSE proxy as a ReadableStream request body, exposes live transcript state.
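A React sketch; the import path, the hook's return shape (transcript, isRecording, start, stop), and its option names are assumptions:

```tsx
'use client';
import { useMicrophoneTranscription } from 'smallestai-vercel-provider'; // subpath may differ (e.g. a /react export)

export function LiveCaptions() {
  const { transcript, isRecording, start, stop } = useMicrophoneTranscription({
    endpoint: '/api/transcribe-stream', // your SSE proxy route from option (A)
  });

  return (
    <div>
      <button onClick={isRecording ? stop : start}>
        {isRecording ? 'Stop' : 'Start'} captions
      </button>
      <p>{transcript}</p>
    </div>
  );
}
```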
The hook captures via getUserMedia + AudioWorklet, downsamples to linear16 @ 16 kHz mono, batches into ~100 ms chunks, and POSTs them as a streaming request body. Drop-oldest backpressure means a slow network never balloons memory.
useMicrophonePCM (low-level)
If you want the raw mic stream and your own pipe (custom WS, WebRTC, etc.):
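A React sketch; the hook's option and return names are assumptions:

```tsx
'use client';
import { useEffect } from 'react';
import { useMicrophonePCM } from 'smallestai-vercel-provider'; // subpath may differ

declare const myWebSocket: WebSocket; // whatever transport you are piping into

export function RawMicPipe() {
  const { start, stop } = useMicrophonePCM({
    // linear16 @ 16 kHz mono chunks; forward them over your own transport
    onChunk: (pcm: Int16Array) => myWebSocket.send(pcm.buffer),
  });

  useEffect(() => {
    start();
    return stop;
  }, [start, stop]);

  return <p>Streaming raw PCM from the mic</p>;
}
```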
The AudioWorklet processor is inlined as a Blob URL — no separate worklet file to host.
Security notes for browser deployments
The SDK enforces guards on the new browser-native paths (host allow-listing of signed URLs via allowedSignedHosts, and the one-time insecure-auth warning for query mode) so you can't accidentally ship insecure code.
What stays your job:
- CSRF-protect your SSE proxy and signedUrl mint endpoints.
- Rate-limit the proxy per-user — a malicious client can otherwise spam your route to burn API budget.
- Pick short token TTLs for signedUrl (60 s is plenty — it only needs to live long enough for the browser to open the WS).
- Never include user-controlled hosts in allowedSignedHosts.
Available Voices
More than 80 voices are available across multiple languages.
Fetch the full voice list programmatically:
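A sketch; the voices() method name and the fields on the returned entries are hypothetical, so check the provider reference for the real call:

```ts
import { smallestai } from 'smallestai-vercel-provider';

// Hypothetical helper; field names on the returned entries are also assumed
const voices = await smallestai.voices();

for (const v of voices) {
  console.log(v.id, v.language);
}
```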

