Voice Cloning Best Practices

High-quality reference audio is the single most important factor in clone quality. These guidelines cover recording environment, speaking style, multi-lingual cloning, and expressive control.

Try Voice Cloning

Clone a voice directly in the console. 5-15 seconds of audio, no code required.


Recording Reference Audio

Environment

  • Record in a quiet room with minimal background noise. Ambient noise, hiss, or rumble will be captured in the clone.
  • Use a dedicated microphone when possible. MacBook and mobile device microphones are acceptable if positioned at an appropriate distance to avoid distortion.
  • Avoid reverberant or uncontrolled spaces, such as large empty rooms and outdoor areas. Small, acoustically treated rooms produce the best results.
  • After recording, listen back to the audio before uploading. Verify it is free of interruptions, clipping, or background interference.
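
Listening back can be supplemented with a quick automated check for clipping. The following is a minimal sketch, not part of any SDK: it assumes the recording is a 16-bit PCM WAV file, and the function name `clipping_ratio` and the 0.999 full-scale threshold are illustrative choices.

```python
import array
import sys
import wave


def clipping_ratio(wav_file, threshold=0.999):
    """Return the fraction of samples at or above `threshold` of full scale.

    Assumes 16-bit PCM WAV input. `wav_file` is a path or a binary file object.
    """
    with wave.open(wav_file, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    if not samples:
        return 0.0

    limit = int(32767 * threshold)
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)
```

A ratio well above zero suggests the input gain was too high or the microphone was too close, and the recording is worth redoing before upload.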

Speaking Style

  • Speak naturally in your normal conversational voice. The model captures timbre, accent, emotional tone, rhythm, and pacing automatically.
  • Maintain a consistent pace throughout the recording. Avoid long pauses, as they can degrade clone quality.
  • Do not exaggerate emotion unless a specific tone is the intended output (see Expressive Cloning below).
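
To spot long pauses before uploading, the longest quiet stretch in a recording can be measured. This is a rough sketch under stated assumptions: the samples are 16-bit PCM values for a single channel, and the 20 ms frame size and -40 dBFS silence threshold are hypothetical defaults rather than values from this guide.

```python
import math


def longest_pause_seconds(samples, sample_rate, frame_ms=20, silence_dbfs=-40.0):
    """Return the longest run of consecutive quiet frames, in seconds.

    A frame counts as quiet when its RMS level falls below `silence_dbfs`
    (an assumed threshold, relative to 16-bit full scale).
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    threshold = 32768 * 10 ** (silence_dbfs / 20)  # linear RMS threshold
    longest = current = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        if rms < threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest * frame_len / sample_rate
```

What counts as "too long" is left to the caller, since this guide does not state a numeric limit.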

Audio Length

  • Provide 5 to 15 seconds of clean, continuous speech.
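
The length guideline is easy to verify programmatically. The sketch below assumes the reference is a WAV file and uses only the Python standard library; the helper names are illustrative.

```python
import wave

MIN_SECONDS = 5.0   # lower bound from the guideline above
MAX_SECONDS = 15.0  # upper bound from the guideline above


def wav_duration_seconds(wav_file):
    """Return the duration of a WAV file (path or binary file object) in seconds."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_recommended_length(seconds):
    """True when the duration falls inside the recommended 5 to 15 second range."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS
```

Note that this measures total file length only; it does not confirm that the speech itself is clean or continuous.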

Multi-Lingual Voice Cloning

Language Matching

For best results, record reference audio in the same language as your intended output. The model supports cross-lingual cloning (e.g., English reference audio used for Spanish output), but a language-matched reference will always produce higher fidelity.

| Scenario | Expected Quality |
| --- | --- |
| Reference and output in the same language | Best results. Highest phonetic accuracy. |
| Reference in a different language than output | Functional. Voice characteristics transfer, but the source accent is retained. |

Accent Retention

When synthesizing in a different language than the reference audio, the original accent is preserved. A clone from a South Indian English speaker will retain that accent when generating Hindi or Tamil output. This is by design: the clone reproduces your voice, including accent characteristics.

If accent-neutral output is required for a specific language, provide reference audio recorded by a native speaker of that language.
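
When several reference recordings are available, the language-matching guidance can be applied with a small selection step. This is a hypothetical sketch: the `references` mapping, the language codes, and the function name are assumptions for illustration and do not correspond to any API described here.

```python
def pick_reference(references, output_language):
    """Choose a reference recording for the requested output language.

    `references` maps a language code to a reference audio path (hypothetical
    structure). Returns (path, language_matched). When no matching language
    exists, the first available reference is used and `language_matched` is
    False, signalling that the source accent will carry into the output.
    """
    if not references:
        raise ValueError("at least one reference recording is required")
    if output_language in references:
        return references[output_language], True
    fallback_language = next(iter(references))
    return references[fallback_language], False
```

For example, `pick_reference({"en": "en_ref.wav"}, "es")` falls back to the English recording and reports that the languages do not match.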

Language Group Constraints

Cloned voices follow the same language group routing rules as standard synthesis. See Code-Switching for details on Indic and Global group restrictions.


Expressive Cloning

The model captures emotional and prosodic characteristics from the reference audio. The tone, pace, and volume of the reference directly influence the synthesized output.

Emotional Control

The emotion conveyed in the reference audio (e.g., calm, happy, angry) is reflected in the generated speech. To produce an angry-sounding clone, provide an angry reference. To produce a neutral clone, provide a neutral reference.

Speed Control

The pace of the reference audio determines the output speed. A fast-paced reference produces faster delivery; a slower reference produces more measured output.

Volume Control

The volume level in the reference audio carries over to the output. A soft-spoken reference produces quieter output; a louder, more energetic recording produces bolder output.
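
Because the reference level carries over, it can help to measure it before uploading so that different takes can be compared. A minimal sketch, assuming 16-bit PCM sample values and reporting RMS level in dBFS; the function name is illustrative.

```python
import math


def rms_dbfs(samples, full_scale=32768):
    """Return the RMS level of integer PCM samples in dB relative to full scale.

    Returns negative infinity for empty or completely silent input.
    """
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    return 20 * math.log10(rms / full_scale)
```

A soft-spoken take will report a lower (more negative) value than an energetic one recorded at the same microphone distance and gain.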


Reference Audio Examples

Audio samples are embedded as video due to platform constraints.

Good Reference Audio

Clear, consistent tone with no background noise.

Bad Reference Audio

  • Background noise present.
  • Inconsistent speaking style.
  • Overlapping voices.


Expressive Audio Examples

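Angry Tone

Reference: [embedded audio sample]

Output: [embedded audio sample]

Whisper Tone

Reference: [embedded audio sample]

Output: [embedded audio sample]

Fast-Paced Tone

Reference: [embedded audio sample]

Output: [embedded audio sample]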