Cloud Deployment

This page lists the recommended cloud instance types for self-hosting Pulse and Pulse Pro. L4 is the recommended GPU class for STT on all three clouds, with T4 standing in on Azure until L4 is generally available there. Larger GPUs (L40S, A100, H100) deliver higher throughput when needed. The broader hardware requirements page lists every supported GPU class.

Pricing varies by region and purchase option (on-demand, 1-year reserved, 3-year reserved, spot), and capacity depends on the quota available to your account. Confirm rates and quota in your account before sizing a deployment.

AWS

| Tier | Instance type | GPU | vCPU | RAM | Notes |
| --- | --- | --- | --- | --- | --- |
| Recommended | g6.xlarge | 1× NVIDIA L4 (24 GB) | 4 | 16 GB | Production reference for both Pulse and Pulse Pro. Cost-efficient and broadly available across regions. |
| Higher throughput | g6e.xlarge | 1× NVIDIA L40S (48 GB) | 4 | 32 GB | Higher RTFx if you need it. Reference for internal benchmark numbers. |
| Budget | g4dn.xlarge | 1× NVIDIA T4 (16 GB) | 4 | 16 GB | Older T4 still supported; reduced throughput. |

Region availability: the L4 (G6), L40S (G6e), and T4 (G4dn) families are available in major regions such as us-east-1, us-west-2, eu-west-1, and ap-south-1. Check the EC2 instance availability matrix for your target region before deployment.

GCP

| Tier | Machine type | GPU | vCPU | RAM | Notes |
| --- | --- | --- | --- | --- | --- |
| Recommended | g2-standard-4 | 1× NVIDIA L4 (24 GB) | 4 | 16 GB | Production reference for both Pulse and Pulse Pro. |
| Higher throughput | a2-highgpu-1g | 1× NVIDIA A100 (40 GB) | 12 | 85 GB | Higher cost, materially higher throughput than L4. |
| Budget | n1-standard-4 + T4 | 1× NVIDIA T4 (16 GB) | 4 | 15 GB | T4 still supported; reduced throughput. |

Region availability: the G2 (L4) family is available in us-central1, us-east4, us-west4, europe-west4, and asia-southeast1, among others. Confirm on the GCP regions & zones page.

Azure

| Tier | VM size | GPU | vCPU | RAM | Notes |
| --- | --- | --- | --- | --- | --- |
| Recommended | Standard_NC4as_T4_v3 | 1× NVIDIA T4 (16 GB) | 4 | 28 GB | Closest stable equivalent to L4 on Azure today. L4 is not yet GA on Azure. |
| Higher throughput | Standard_NC24ads_A100_v4 | 1× NVIDIA A100 (80 GB) | 24 | 220 GB | Materially higher throughput than T4; pick when your SLOs require the extra speed. |

Region availability: NC A100 v4 is available in eastus2, southcentralus, westeurope, and southeastasia, among others. Confirm in the Azure GPU regions list.
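
For provisioning scripts, it can help to keep the three tables in one lookup structure. The following is a minimal sketch, assuming you want a plain Python mapping: the values are copied from the AWS, GCP, and Azure tables on this page, while `GpuHost`, `RECOMMENDED_HOSTS`, `pick_host`, and the tier keys are hypothetical names chosen for illustration and are not part of any Pulse API.

```python
from typing import NamedTuple


class GpuHost(NamedTuple):
    instance_type: str
    gpu: str
    gpu_memory_gb: int
    vcpu: int
    ram_gb: int


# Values copied from the tables on this page.
# The layout and tier keys are illustrative assumptions, not a Pulse API.
RECOMMENDED_HOSTS = {
    "aws": {
        "recommended": GpuHost("g6.xlarge", "L4", 24, 4, 16),
        "higher_throughput": GpuHost("g6e.xlarge", "L40S", 48, 4, 32),
        "budget": GpuHost("g4dn.xlarge", "T4", 16, 4, 16),
    },
    "gcp": {
        "recommended": GpuHost("g2-standard-4", "L4", 24, 4, 16),
        "higher_throughput": GpuHost("a2-highgpu-1g", "A100", 40, 12, 85),
        "budget": GpuHost("n1-standard-4 + T4", "T4", 16, 4, 15),
    },
    "azure": {
        # Azure lists no budget tier; T4 is already the recommended class there.
        "recommended": GpuHost("Standard_NC4as_T4_v3", "T4", 16, 4, 28),
        "higher_throughput": GpuHost("Standard_NC24ads_A100_v4", "A100", 80, 24, 220),
    },
}


def pick_host(cloud: str, tier: str = "recommended") -> GpuHost:
    """Return the table entry for a cloud and tier, or raise a clear error."""
    tiers = RECOMMENDED_HOSTS.get(cloud.lower())
    if tiers is None:
        raise ValueError(
            f"unknown cloud {cloud!r}; expected one of {sorted(RECOMMENDED_HOSTS)}"
        )
    if tier not in tiers:
        raise ValueError(
            f"no {tier!r} tier listed for {cloud}; available: {sorted(tiers)}"
        )
    return tiers[tier]


print(pick_host("aws").instance_type)                       # g6.xlarge
print(pick_host("Azure", "higher_throughput").gpu_memory_gb)  # 80
```

Raising on a missing tier, rather than silently falling back, keeps a script from provisioning a different GPU class than the one you sized for.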

Picking between on-demand, reserved, and spot

  • On-demand for proof-of-concept and bursty workloads.
  • 1-year or 3-year reserved for steady-state production. This materially reduces the hourly rate (commonly 30–60% vs on-demand).
  • Spot / preemptible for batch or overnight transcription. This is the cheapest option, but the worker can be preempted, so build your queue to tolerate restarts.
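
One way to tolerate restarts on spot capacity is to acknowledge a job only after it finishes, so that a preempted job goes back on the queue. The sketch below illustrates that pattern with an in-memory queue and a simulated preemption. It assumes nothing about the Pulse worker itself: `RestartTolerantQueue`, `Preempted`, and the `transcribe` callable are hypothetical names, and a real deployment would use a durable queue service instead of a Python `deque`.

```python
import collections


class Preempted(Exception):
    """Raised by the simulated worker when the spot instance is reclaimed."""


class RestartTolerantQueue:
    """In-memory stand-in for a durable queue with acknowledge-on-success."""

    def __init__(self, jobs, max_attempts=3):
        self._pending = collections.deque(jobs)
        self._attempts = collections.Counter()
        self.max_attempts = max_attempts
        self.done = []    # (job, result) pairs that completed
        self.failed = []  # jobs that exhausted their attempts

    def run(self, transcribe):
        while self._pending:
            job = self._pending.popleft()
            self._attempts[job] += 1
            try:
                result = transcribe(job)
            except Preempted:
                # The job was never acknowledged, so requeue it for a retry.
                if self._attempts[job] < self.max_attempts:
                    self._pending.append(job)
                else:
                    self.failed.append(job)
                continue
            # Acknowledge only after the work has finished.
            self.done.append((job, result))


def make_flaky_transcriber(preempt_once):
    """Build a fake transcriber that is preempted once for the given jobs."""
    already_preempted = set()

    def transcribe(job):
        if job in preempt_once and job not in already_preempted:
            already_preempted.add(job)
            raise Preempted(job)
        return f"transcript of {job}"

    return transcribe


queue = RestartTolerantQueue(["a.wav", "b.wav", "c.wav"])
queue.run(make_flaky_transcriber({"b.wav"}))
print([job for job, _ in queue.done])  # ['a.wav', 'c.wav', 'b.wav']
print(queue.failed)                    # []
```

The attempt cap keeps a job that always fails from cycling forever; such jobs land in `failed` for inspection instead.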

Container image

Once the GPU host is provisioned, follow the Quick Start to pull the Pulse / Pulse Pro container image and run the worker. The image runs identically across AWS, GCP, and Azure; the only difference is the GPU driver version on the host.
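
Because the driver version is the one thing that differs between hosts, it can be worth recording it before pulling the image. The following is a minimal sketch that shells out to `nvidia-smi` and reports the GPU name, driver version, and total memory. This page does not state a required driver version, so the sketch only reports what it finds; `parse_gpu_rows` and `detect_gpus` are hypothetical helper names.

```python
import shutil
import subprocess

# Standard nvidia-smi query for name, driver version, and total memory.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_rows(csv_text):
    """Turn nvidia-smi CSV output into a list of dicts, one per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma in the GPU name cannot shift fields.
        name, driver, memory = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append(
            {"name": name, "driver_version": driver, "memory_total": memory}
        )
    return gpus


def detect_gpus():
    """Return GPU info for this host, or [] if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    proc = subprocess.run(QUERY, capture_output=True, text=True)
    if proc.returncode != 0:
        return []
    return parse_gpu_rows(proc.stdout)


gpus = detect_gpus()
if not gpus:
    print("No NVIDIA GPU detected; check the driver installation on the host.")
for gpu in gpus:
    print(f"{gpu['name']}: driver {gpu['driver_version']}, {gpu['memory_total']}")
```

Running this on each cloud's host before deployment gives you a record of the driver versions in use if behaviour differs between environments.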

L4 is the recommended GPU class for self-hosting Pulse and Pulse Pro across all three clouds. The Azure recommendation falls back to T4 because L4 is not yet GA on Azure. If your workload is latency-sensitive or you have committed capacity for a different class, contact support@smallest.ai for sizing guidance.

Next steps

  • Hardware requirements: minimum and recommended GPU specs.
  • Parallelism and latency: RTFx, RPS, and latency by mode.
  • Quick Start: deploy the STT worker.