Vercel AI SDK

Use Smallest AI as a speech and transcription provider in the Vercel AI SDK. Generate speech and transcribe audio with a few lines of code. The package also exposes streaming WebSocket transcription and voice cloning APIs that sit alongside the Vercel SpeechModelV2 / TranscriptionModelV2 interfaces.

Latest: smallestai-vercel-provider@0.6.2 — adds browser-native streaming (no proxy required), microphone capture hooks, auto-reconnect on socket drops (with counter reset across the session, so multi-hour streams survive sporadic blips), and a security-validated signedUrl flow for production browser apps. The SDK lazy-loads its ws dependency so browser-only consumers (using auth: 'query' or signedUrl) ship a smaller bundle and don’t need the bufferutil / serverExternalPackages setup that older versions required.

Installation

$ npm install smallestai-vercel-provider ai

Setup

Get your API key from waves.smallest.ai and set it as an environment variable:

$ export SMALLEST_API_KEY="your_key_here"

Text-to-Speech

Supported model: lightning-v3.1 — 44.1 kHz, natural expressive speech, voice cloning, ~100 ms latency, 22 languages plus auto for code-switching detection (see the Lightning v3.1 model card for the full list). The package also exports DEFAULT_LIGHTNING_MODEL so you don’t have to hard-code the id; bumping it on a new Lightning release flows through to every caller that imports the constant.

import { experimental_generateSpeech as generateSpeech } from 'ai';
import {
  smallestai,
  DEFAULT_LIGHTNING_MODEL,
} from 'smallestai-vercel-provider';

const { audio } = await generateSpeech({
  model: smallestai.speech(DEFAULT_LIGHTNING_MODEL),
  text: 'Hello from Smallest AI!',
  voice: 'sophia',
  language: 'auto', // 'en', 'hi', 'es', ... — defaults to 'auto'
  speed: 1.0,
});

// audio.uint8Array — raw WAV bytes
// audio.base64 — base64-encoded audio

Pass outputFormat under providerOptions.smallestai.outputFormat, not as Vercel’s top-level outputFormat arg — the SDK rejects the top-level form with a warning. See TTS Options below.
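
A minimal contrast (model and text as in the example above):

// Rejected with a warning: Vercel's top-level arg
await generateSpeech({ model, text, outputFormat: 'mp3' });

// Accepted: provider-scoped option
await generateSpeech({
  model,
  text,
  providerOptions: { smallestai: { outputFormat: 'mp3' } },
});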

Speech-to-Text (batch)

import { experimental_transcribe as transcribe } from 'ai';
import { smallestai } from 'smallestai-vercel-provider';
import { readFileSync } from 'fs';

const { text, segments, durationInSeconds } = await transcribe({
  model: smallestai.transcription('pulse'),
  audio: readFileSync('recording.wav'),
  mediaType: 'audio/wav',
});

console.log(text);
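
The result also exposes per-segment timing. A small sketch, assuming the AI SDK's experimental segment shape of { text, startSecond, endSecond } (verify against your installed ai version):

// Print each segment with its time range
for (const seg of segments) {
  console.log(`${seg.startSecond}s -> ${seg.endSecond}s: ${seg.text}`);
}
console.log(`audio duration: ${durationInSeconds}s`);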

Next.js API Route Example

Create a TTS endpoint in your Next.js app:

// app/api/speak/route.ts
import { experimental_generateSpeech as generateSpeech } from 'ai';
import { smallestai } from 'smallestai-vercel-provider';

export const runtime = 'nodejs';

export async function POST(req: Request) {
  const { text, voice } = await req.json();

  const { audio } = await generateSpeech({
    model: smallestai.speech('lightning-v3.1'),
    text,
    voice: voice || 'sophia',
  });

  return new Response(Buffer.from(audio.uint8Array), {
    headers: { 'Content-Type': 'audio/wav' },
  });
}

Play it in the browser:

const res = await fetch('/api/speak', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ text: 'Hello!', voice: 'sophia' }),
});
const blob = await res.blob();
new Audio(URL.createObjectURL(blob)).play();

Provider Options

TTS Options

const { audio } = await generateSpeech({
  model: smallestai.speech('lightning-v3.1'),
  text: 'Hello!',
  voice: 'robert',
  providerOptions: {
    smallestai: {
      sampleRate: 44100, // 8000 | 16000 | 24000 | 44100
      outputFormat: 'mp3', // pcm | mp3 | wav | ulaw | alaw
      similarity: 0.5, // voice similarity (0–1)
      enhancement: 1, // audio enhancement (0 | 1 | 2)
      addWavHeader: false,
      saveHistory: false,
      pronunciationDicts: ['<dict-id>'],
    },
  },
});

Batch STT Options

const result = await transcribe({
  model: smallestai.transcription('pulse'),
  audio: audioBuffer,
  mediaType: 'audio/wav',
  providerOptions: {
    smallestai: {
      language: 'multi', // 'en' | 'hi' | 'multi' | 'multi-eu' | 'multi-asian' | 'multi-indic' — see Pulse model card
      diarize: true,
      emotionDetection: true,
      genderDetection: true,
      wordTimestamps: true,

      // Privacy
      redactPii: true, // names, addresses → [FIRSTNAME_1] etc.
      redactPci: true, // card #s, CVV → [CREDITCARDCVV_1] etc.

      // Formatting
      numerals: 'auto', // 'true' | 'false' | 'auto'
      punctuate: true,
      capitalize: true,

      // Async webhook delivery
      webhookUrl: 'https://example.com/asr-webhook',
      webhookMethod: 'POST',
      webhookExtra: 'job_id:abc123',
    },
  },
});

ageDetection was removed from the API and emits a deprecation warning if set.

itnNormalize, sentenceTimestamps, finalizeOnWords, maxWords, eouTimeoutMs are accepted only on the streaming WebSocket — TS will error if you set them on transcribe(). Use smallestai.transcriptionStream(...) (below) for those.

Streaming Speech-to-Text (WebSocket)

For real-time transcription (TTFT ~64 ms server-side), the SDK exposes a WebSocket session that wraps wss://api.smallest.ai/waves/v1/pulse/get_text with the canonical Authorization: Bearer flow. WS-only flags like itnNormalize, sentenceTimestamps, finalizeOnWords, maxWords, and eouTimeoutMs only take effect on this path.

import { smallestai } from 'smallestai-vercel-provider';
import { readFileSync } from 'fs';

const stream = smallestai.transcriptionStream('pulse', {
  language: 'en',
  encoding: 'linear16',
  sampleRate: 16000,
  wordTimestamps: true,
  diarize: true,
  redactPii: true,
  redactPci: true,
  numerals: 'auto',
  itnNormalize: true,
  sentenceTimestamps: true,
  keywords: ['NVIDIA:5', 'Jensen'],
});

await stream.connect();

// Stream raw PCM s16le @ 16 kHz mono
const pcm = readFileSync('audio.s16le');
for (let i = 0; i < pcm.length; i += 32 * 1024) {
  stream.sendAudio(pcm.subarray(i, i + 32 * 1024));
}
stream.closeStream(); // server flushes, emits is_last: true, then closes

let fullTranscript = '';
for await (const msg of stream) {
  if (!msg.is_final) {
    console.log('partial:', msg.transcript);
  } else {
    console.log('final:', msg.transcript);
    fullTranscript += (fullTranscript ? ' ' : '') + (msg.transcript || '');
  }
  if (msg.is_last) break;
}
console.log('full transcript:', fullTranscript);

One-shot helper for pre-recorded audio

import {
  smallestai,
  SmallestAITranscriptionStream,
} from 'smallestai-vercel-provider';

const stream = smallestai.transcriptionStream('pulse', {
  language: 'en', encoding: 'linear16', sampleRate: 16000,
  wordTimestamps: true, sentenceTimestamps: true, itnNormalize: true,
});

const { transcript } = await SmallestAITranscriptionStream.transcribeOnce(
  stream,
  audioBytes,
);

Voice Cloning

The provider exposes the voice-cloning REST endpoints alongside TTS / STT. Defaults to lightning-v3.1; the legacy lightning-v2 model is rejected upstream.

import { experimental_generateSpeech as generateSpeech } from 'ai';
import { smallestai } from 'smallestai-vercel-provider';
import { readFileSync } from 'fs';

// Create an instant clone
const clone = await smallestai.voiceClone.create({
  file: readFileSync('my-voice.wav'),
  fileName: 'my-voice.wav',
  displayName: 'My voice',
  description: 'Warm narrator',
  language: 'en',
});
console.log(clone.voiceId); // → "voice_abc123"

// List all clones in your org
const all = await smallestai.voiceClone.list();

// Use the cloned voice in TTS
const { audio } = await generateSpeech({
  model: smallestai.speech('lightning-v3.1'),
  text: 'Hello in my own voice.',
  voice: clone.voiceId,
});

// Delete when done
await smallestai.voiceClone.delete(clone.voiceId);

Patterns & Caveats

Accumulate full_transcript client-side

The streaming API accepts fullTranscript: true, but the server-side full_transcript field is currently returned as an empty string on every frame. Concatenate is_final: true frames yourself instead:

let fullTranscript = '';
for await (const msg of stream) {
  if (msg.is_final && msg.transcript) {
    fullTranscript += (fullTranscript ? ' ' : '') + msg.transcript;
  }
  if (msg.is_last) break;
}

The transcribeOnce() helper does this for you — use it for the pre-recorded case.

Browser streaming — three options

The default transcriptionStream() uses an Authorization: Bearer header that native browser WebSocket can’t set. Three patterns for browser apps, in order of recommendation:

A. SSE proxy through your server (recommended)

Your server holds the API key; the browser never sees it. The SDK ships a one-line helper that turns the stream into Server-Sent Events:

// app/api/transcribe-stream/route.ts (Next.js, Node runtime)
import {
  smallestai,
  createTranscriptionStreamSSEResponse,
} from 'smallestai-vercel-provider';

export const runtime = 'nodejs';

export async function POST(req: Request) {
  const audio = new Uint8Array(await req.arrayBuffer());
  const stream = smallestai.transcriptionStream('pulse', {
    language: 'en',
    encoding: 'linear16',
    sampleRate: 16000,
    wordTimestamps: true,
    itnNormalize: true,
  });
  await stream.connect();
  for (let i = 0; i < audio.length; i += 32 * 1024) {
    stream.sendAudio(audio.subarray(i, i + 32 * 1024));
  }
  stream.closeStream();
  return createTranscriptionStreamSSEResponse(stream, { signal: req.signal });
}

The browser parses the SSE response with the matching helper:

import { parseTranscriptionStreamSSE } from 'smallestai-vercel-provider';

const res = await fetch('/api/transcribe-stream', { method: 'POST', body: audioBytes });
for await (const msg of parseTranscriptionStreamSSE(res)) {
  if (msg.is_final) console.log(msg.transcript);
  if (msg.is_last) break;
}

Next.js setup, one-time — only required for server-side auth: 'header' (the default, used by the SSE proxy above). Add this to next.config.{js,mjs,ts}:

/** @type {import('next').NextConfig} */
const nextConfig = {
  serverExternalPackages: ['smallestai-vercel-provider', 'ws'],
};
export default nextConfig;

And install the optional native deps so ws masks frames at native speed:

$ npm install bufferutil utf-8-validate

Browser-only consumers using auth: 'query' (option C) or signedUrl (option B) don’t need this — the SDK lazy-loads ws only when the Authorization header path is reached, so browser bundles never pull in ws or its Node-only deps.

B. Browser-native via signed URL (also production-grade)

Your server mints a short-lived URL on demand; the browser opens the WebSocket directly with that URL. Same security profile as (A) but with one less hop:

// Browser code:
const stream = smallestai.transcriptionStream('pulse', {
  language: 'en',
  encoding: 'linear16',
  sampleRate: 16000,
}, {
  signedUrl: async () => {
    const res = await fetch('/api/get-stream-url');
    return (await res.json()).url; // wss://api.smallest.ai/...
  },
});
await stream.connect();

The SDK calls signedUrl() on every connect() and on every reconnect, so each session uses a fresh URL. Your server endpoint (/api/get-stream-url) decides how to scope and time-bound those URLs.
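
The mint endpoint itself is yours to implement. A minimal Next.js sketch, where mintShortLivedPulseUrl is a hypothetical stand-in for however your infrastructure produces a scoped, short-TTL wss:// URL:

// app/api/get-stream-url/route.ts — a sketch; URL minting is not part of
// this SDK. mintShortLivedPulseUrl is a hypothetical placeholder.
declare function mintShortLivedPulseUrl(opts: { ttlSeconds: number }): Promise<string>;

export const runtime = 'nodejs';

export async function GET() {
  // Authenticate the caller (session cookie, JWT, ...) before minting.
  // Keep the TTL short: the URL only needs to outlive the WS handshake.
  const url = await mintShortLivedPulseUrl({ ttlSeconds: 60 });
  return Response.json({ url }); // { url: "wss://api.smallest.ai/..." }
}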

C. Browser-native with API key in URL (dev / internal apps only)

const stream = smallestai.transcriptionStream('pulse', {
  language: 'en', encoding: 'linear16', sampleRate: 16000,
}, {
  apiKey: 'sk_...',
  auth: 'query',
});

The API key appears in the WebSocket URL — visible in browser devtools, history, server access logs, and any error reporting tool that captures URLs. The SDK emits a one-time console.warn when this mode is used so it can’t be deployed unnoticed. Use only for dev / internal apps with per-user-scoped keys; for end-user production, use option (A) or (B).

Auto-reconnect on socket drops

Long-running sessions drop sockets for prosaic reasons (network blips, load-balancer recycling, idle timeouts). Pass autoReconnect: true and the SDK transparently re-opens with the same parameters and emits a synthetic { type: 'reconnected', attempt } frame so consumers can react:

const stream = smallestai.transcriptionStream('pulse', {
  language: 'en',
  encoding: 'linear16',
  sampleRate: 16000,
  autoReconnect: true,
  maxReconnectAttempts: 5, // default 5
  reconnectBackoffMs: 500, // exponential, capped at 30s
});

for await (const msg of stream) {
  if (msg.type === 'reconnected') {
    console.log(`recovered after ${msg.attempt} attempt(s)`);
    continue;
  }
  // ... normal transcript handling
}

Reconnect only fires on unexpected closes — is_last, an explicit closeStream(), and server-emitted error frames all terminate cleanly without retry.

maxReconnectAttempts counts consecutive failed attempts: the counter resets to zero after every successful reconnect, so a multi-hour stream that survives one blip per hour does not exhaust its retry budget across the whole session.

The optional 3rd argument to transcriptionStream(modelId, options, config) lets you override per-session connection config — the autoReconnect knobs above can also live there if you want them outside the WS-protocol options. The same slot accepts auth: 'query', signedUrl, signedUrlTimeoutMs, allowedSignedHosts, and suppressInsecureAuthWarning.
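
For instance, keeping WS-protocol options and connection config in separate slots (a sketch using only knobs named above):

const stream = smallestai.transcriptionStream(
  'pulse',
  // WS-protocol options
  { language: 'en', encoding: 'linear16', sampleRate: 16000 },
  // Per-session connection config
  { autoReconnect: true, maxReconnectAttempts: 5, reconnectBackoffMs: 500 },
);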

Microphone capture (browser)

Live captions and voice agents need raw mic data on the wire. The SDK ships two browser-side hooks for this:

useMicrophoneTranscription (high-level)

The all-in-one: captures the mic, streams chunks to your SSE proxy as a ReadableStream request body, exposes live transcript state.

'use client';
import { useMicrophoneTranscription } from 'smallestai-vercel-provider/react';

export function LiveCaptions() {
  const {
    transcript, partial,
    isCapturing, isStreaming,
    chunksDelivered, chunksDropped,
    start, stop, reset,
  } = useMicrophoneTranscription({ apiPath: '/api/transcribe-mic-stream' });

  return (
    <>
      <button onClick={isCapturing ? stop : () => start()}>
        {isCapturing ? 'Stop' : 'Start'}
      </button>
      <p>{transcript}{partial && <em> {partial}</em>}</p>
      {chunksDropped > 0 && <small>⚠ {chunksDropped} chunks dropped (lagging)</small>}
    </>
  );
}

The hook captures via getUserMedia + AudioWorklet, downsamples to linear16 @ 16 kHz mono, batches into ~100 ms chunks, and POSTs them as a streaming request body. Drop-oldest backpressure means a slow network never balloons memory.
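
The matching route at apiPath isn't shown above. Here is a sketch under two assumptions: the hook's chunks arrive as a raw streaming request body, and the SSE helper from option (A) can start responding while audio is still arriving — verify both against your SDK version:

// app/api/transcribe-mic-stream/route.ts — a sketch, not the canonical route.
import {
  smallestai,
  createTranscriptionStreamSSEResponse,
} from 'smallestai-vercel-provider';

export const runtime = 'nodejs';

export async function POST(req: Request) {
  const stream = smallestai.transcriptionStream('pulse', {
    language: 'en',
    encoding: 'linear16',
    sampleRate: 16000,
  });
  await stream.connect();

  // Pump mic chunks into the WebSocket as they arrive. Deliberately not
  // awaited, so the SSE response below starts flowing immediately.
  (async () => {
    const reader = req.body!.getReader();
    for (;;) {
      const { value, done } = await reader.read();
      if (done) break;
      stream.sendAudio(value);
    }
    stream.closeStream();
  })().catch(() => stream.closeStream());

  return createTranscriptionStreamSSEResponse(stream, { signal: req.signal });
}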

useMicrophonePCM (low-level)

If you want the raw mic stream and your own pipe (custom WS, WebRTC, etc.):

import { useMicrophonePCM } from 'smallestai-vercel-provider/react';

const { start, stop, isCapturing, chunksDropped } = useMicrophonePCM({
  sampleRate: 16000,
  batchMs: 100,
  maxQueuedChunks: 50,
  onChunk: (chunk) => myPipe.send(chunk),
});

The AudioWorklet processor is inlined as a Blob URL — no separate worklet file to host.

Security notes for browser deployments

The SDK enforces these guards on the new browser-native paths so you can’t accidentally ship insecure code:

| Guard | What it blocks |
| --- | --- |
| signedUrl() results must be wss: | TLS-stripping attacks. ws://localhost only works when you opt in via allowedSignedHosts: ['localhost']. |
| signedUrl() host must match baseURL host | A bug in your signing endpoint can't redirect audio to attacker.com. Add additional hosts via allowedSignedHosts. |
| signedUrlTimeoutMs (default 10 s) | A hung signing endpoint fast-fails instead of stalling forever. |
| auth: 'query' console warning | One-time warning makes URL-based auth visible in dev so it can't deploy unnoticed. Suppress via suppressInsecureAuthWarning: true after audit. |
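
For local development against a mock signing endpoint, the ws:// opt-in looks like this (a sketch; the localhost URL is a placeholder):

const stream = smallestai.transcriptionStream(
  'pulse',
  { language: 'en', encoding: 'linear16', sampleRate: 16000 },
  {
    // Dev only: plain ws:// is rejected unless the host is allow-listed.
    signedUrl: async () => 'ws://localhost:8080/pulse', // placeholder mock
    allowedSignedHosts: ['localhost'],
  },
);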

What stays your job:

  • CSRF-protect your SSE proxy and signedUrl mint endpoints.
  • Rate-limit the proxy per-user — a malicious client can otherwise spam your route to burn API budget.
  • Pick short token TTLs for signedUrl (60 s is plenty — it only needs to live long enough for the browser to open the WS).
  • Never include user-controlled hosts in allowedSignedHosts.

Available Voices

80+ voices across multiple languages. Popular voices:

| Voice | Gender | Accent | Best For |
| --- | --- | --- | --- |
| sophia | Female | American | General use (default) |
| robert | Male | American | Professional |
| advika | Female | Indian | Hindi, code-switching |
| vivaan | Male | Indian | Bilingual English/Hindi |
| camilla | Female | Mexican/Latin | Spanish |

Fetch the full voice list programmatically:

$ curl -s "https://api.smallest.ai/waves/v1/lightning-v3.1/get_voices" \
    -H "Authorization: Bearer $SMALLEST_API_KEY"