High-quality reference audio is the single most important factor in clone quality. These guidelines cover recording environment, speaking style, multilingual cloning, and expressive control.
For best results, record reference audio in the same language as your intended output. The model supports cross-lingual cloning (e.g., English reference audio used for Spanish output), but a language-matched reference will always produce higher fidelity.
When the output language differs from the language of the reference audio, the speaker's original accent is preserved. A clone built from a South Indian English speaker retains the South Indian accent when generating Hindi or Tamil output. This is by design: the clone reproduces the speaker's voice, including accent characteristics.
If accent-neutral output is required for a specific language, provide reference audio recorded by a native speaker of that language.
Cloned voices follow the same language group routing rules as standard synthesis. See Code-Switching for details on Indic and Global group restrictions.
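The language-matching advice above can be sketched as a small selection helper. This is only an illustration: `ReferenceClip`, `pick_reference`, the file paths, and the language tags are hypothetical names chosen for the example and are not part of any cloning API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ReferenceClip:
    path: str      # hypothetical local file path
    language: str  # language spoken in the clip, e.g. "en" or "hi" (assumed tags)


def pick_reference(
    clips: Sequence[ReferenceClip], target_language: str
) -> Optional[ReferenceClip]:
    """Prefer a clip recorded in the target language; otherwise use the first clip."""
    for clip in clips:
        if clip.language == target_language:
            return clip
    # No language match: cross-lingual cloning is still possible,
    # but the accent of the reference carries over to the output.
    return clips[0] if clips else None


clips = [ReferenceClip("ref_en.wav", "en"), ReferenceClip("ref_hi.wav", "hi")]
print(pick_reference(clips, "hi"))  # language-matched clip
print(pick_reference(clips, "ta"))  # falls back to the first clip
```

Falling back to the first clip is an arbitrary choice for the sketch; a real workflow might instead warn the user that the output will carry the reference accent.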
The model captures emotional and prosodic characteristics from the reference audio. The tone, pace, and volume of the reference directly influence the synthesized output.
The emotion conveyed in the reference audio (e.g., calm, happy, angry) is reflected in the generated speech. To produce an angry-sounding clone, provide an angry reference. To produce a neutral clone, provide a neutral reference.
The pace of the reference audio determines the output speed. A fast-paced reference produces faster delivery; a slower reference produces more measured output.
The volume level in the reference audio carries over to the output. A soft-spoken reference produces quieter output; a louder, more energetic recording produces bolder output.
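Because pace and volume carry over from the reference, it can help to measure both before choosing between candidate recordings. The sketch below uses only the Python standard library and assumes a 16-bit PCM WAV file plus a transcript of what was said; `describe_reference` is a hypothetical helper, and the numbers it reports are rough descriptive measures, not values the model is known to use.

```python
import array
import math
import sys
import wave


def describe_reference(path: str, transcript: str) -> dict:
    """Report duration, RMS level (dBFS), and speaking rate for a 16-bit PCM WAV."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        frame_rate = wav.getframerate()
        n_frames = wav.getnframes()
        raw = wav.readframes(n_frames)

    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    if n_frames == 0 or len(samples) == 0:
        raise ValueError("empty audio file")

    duration_s = n_frames / frame_rate
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # 0 dBFS corresponds to full scale for 16-bit audio (32768).
    rms_dbfs = 20 * math.log10(rms / 32768) if rms > 0 else float("-inf")
    # Words per minute from the transcript: a rough proxy for pace.
    words_per_minute = len(transcript.split()) / duration_s * 60

    return {
        "duration_s": duration_s,
        "rms_dbfs": rms_dbfs,
        "words_per_minute": words_per_minute,
    }
```

Running this over two candidate recordings of the same script gives a quick comparison: the one with the higher `words_per_minute` is the faster-paced reference, and the one with the higher `rms_dbfs` is the louder one.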
Audio samples are embedded as video due to platform constraints.
A good reference has a clear, consistent tone with no background noise.
Avoid references that contain background noise, an inconsistent speaking style, or overlapping voices.
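Background noise can be screened for before a recording is used as a reference. The following is a crude heuristic, not a method described by this guide: it compares the loudest and quietest short windows of the recording and treats the gap as a rough signal-to-noise estimate. The function names, the 20 ms window, and the 10% cut-off are all assumptions made for the sketch, and it cannot detect overlapping voices or an inconsistent speaking style.

```python
import math
from typing import List, Sequence


def window_levels_db(
    samples: Sequence[int], frame_rate: int, window_s: float = 0.02
) -> List[float]:
    """RMS level in dBFS for each consecutive window of 16-bit mono samples."""
    size = max(1, int(frame_rate * window_s))
    levels = []
    for start in range(0, len(samples) - size + 1, size):
        window = samples[start:start + size]
        rms = math.sqrt(sum(s * s for s in window) / size)
        # Clamp to 1.0 so digital silence does not produce log10(0).
        levels.append(20 * math.log10(max(rms, 1.0) / 32768))
    return levels


def estimate_snr_db(samples: Sequence[int], frame_rate: int) -> float:
    """Gap between the loudest and quietest 10% of windows (a rough SNR proxy)."""
    levels = sorted(window_levels_db(samples, frame_rate))
    if not levels:
        raise ValueError("audio is shorter than one analysis window")
    k = max(1, len(levels) // 10)
    noise_floor = sum(levels[:k]) / k      # quietest windows: pauses plus noise
    speech_level = sum(levels[-k:]) / k    # loudest windows: active speech
    return speech_level - noise_floor
```

A larger gap suggests quieter pauses relative to speech. The estimate only makes sense when the recording contains some pauses, and any pass/fail threshold would need to be chosen by listening to real examples.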