Voice Cloning API

The most realistic voice cloning

Clone any voice from 5 seconds of audio. Instant cloning for rapid deployment, professional cloning for maximum fidelity. Every cloned voice runs on Inworld TTS with sub-200ms latency, emotion control, and 15-language support, ready for realtime applications at scale.

Clone any voice, three ways

Instant cloning for rapid deployment and user-generated voices. The Voice Cloning API for automated voice creation. Professional cloning for maximum fidelity where it matters most.

Instant voice cloning

Create a custom voice from 5 to 15 seconds of audio. Zero-shot, available to all users in the Portal. Upload or record a sample and start generating speech in minutes.

Voice Cloning API

Automate voice creation programmatically. Let your users clone their own voices during onboarding, or batch-create voices for content workflows. JavaScript and Python examples on GitHub.
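A batch workflow of this kind might start by assembling one request body per audio sample. The sketch below is illustrative only: the helpers `build_clone_request` and `build_batch` and the field names `displayName`, `langCode`, and `audioBase64` are hypothetical placeholders, not taken from the Inworld API reference, which remains the source for the real schema and endpoint.

```python
import base64
from pathlib import Path


def build_clone_request(display_name: str, audio: bytes, lang_code: str = "en") -> dict:
    """Assemble a JSON-serializable body for one voice-clone request.

    All field names here are hypothetical, not the documented schema.
    """
    if not audio:
        raise ValueError("audio sample is empty")
    return {
        "displayName": display_name,
        "langCode": lang_code,
        # Binary audio is base64-encoded so it can travel inside JSON.
        "audioBase64": base64.b64encode(audio).decode("ascii"),
    }


def build_batch(sample_dir: str) -> list[dict]:
    """Build one request body per .wav file, naming each voice after the file stem."""
    return [
        build_clone_request(path.stem, path.read_bytes())
        for path in sorted(Path(sample_dir).glob("*.wav"))
    ]
```

Keeping payload construction separate from the network call makes the batch step easy to test before any real request is sent.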

Professional voice cloning

Fine-tuned from 30+ minutes of clean audio. Recommended for uncommon voice types: children's voices, unique accents, or brand voices where instant cloning may not capture the full characteristics. Available by contacting the Inworld team.

Built for realtime applications

Every cloned voice runs on Inworld TTS, with the same production-grade latency, expressiveness, and multilingual support.

Production-grade latency

<200ms median first chunk for TTS-1.5 Max. <100ms for TTS-1.5 Mini. Streaming-native over WebSocket. No buffering delay.
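To check a median first-chunk figure like this in your own environment, one option is to time how long a streaming call takes to yield its first chunk and take the median over several runs. The sketch below is client-agnostic: `open_stream` is any callable returning an iterator of audio chunks, and `fake_stream` is a stand-in used here instead of a real Inworld client, which is not shown.

```python
import statistics
import time
from typing import Callable, Iterable


def time_to_first_chunk(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Return seconds from opening the stream until its first chunk arrives."""
    start = time.perf_counter()
    for _chunk in open_stream():
        return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")


def median_first_chunk_ms(open_stream: Callable[[], Iterable[bytes]], runs: int = 5) -> float:
    """Median time to first chunk, in milliseconds, over several runs."""
    samples = [time_to_first_chunk(open_stream) for _ in range(runs)]
    return statistics.median(samples) * 1000.0


def fake_stream():
    """Stand-in for a streaming TTS call: waits 20 ms, then yields two chunks."""
    time.sleep(0.02)
    yield b"\x00" * 320
    yield b"\x00" * 320


if __name__ == "__main__":
    print(f"median first chunk: {median_first_chunk_ms(fake_stream):.1f} ms")
```

Measured from a client, the number includes network round-trip time, so it will differ from a server-side figure.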

Expression and emotion control

Audio markups for emotion ([happy], [sad], [angry]), delivery style ([whispering], [laughing]), and non-verbal vocalizations ([breathe], [cough], [sigh]). SSML break tags for precise silence placement.
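As a rough illustration, the markups listed above can be composed into input text with a small helper that rejects tags not named on this page. The helper names `tag` and `pause` are made up for this sketch. The break element uses standard SSML syntax; where exactly each markup may appear in the text, and whether a wrapping `<speak>` element is needed, are assumptions to verify against the TTS documentation.

```python
# Markup names copied from the list above; grouping into sets is for validation only.
EMOTIONS = {"happy", "sad", "angry"}
STYLES = {"whispering", "laughing"}
VOCALIZATIONS = {"breathe", "cough", "sigh"}
KNOWN_MARKUPS = EMOTIONS | STYLES | VOCALIZATIONS


def tag(name: str) -> str:
    """Return a bracketed audio markup, rejecting names not listed above."""
    if name not in KNOWN_MARKUPS:
        raise ValueError(f"unknown markup: {name!r}")
    return f"[{name}]"


def pause(milliseconds: int) -> str:
    """Return a standard SSML break element for the given silence length."""
    if milliseconds < 0:
        raise ValueError("pause length must be non-negative")
    return f'<break time="{milliseconds}ms"/>'


line = f"{tag('happy')} We shipped it! {pause(400)} {tag('sigh')} Finally."
print(line)  # [happy] We shipped it! <break time="400ms"/> [sigh] Finally.
```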

15 languages

English, Spanish, French, Korean, Chinese, Hindi, Japanese, German, Italian, Dutch, Polish, Portuguese, Russian, Arabic, and Hebrew. Native-speaker quality in every language.

Half a cent per minute

$5/1M characters with TTS-1.5 Mini. $10/1M characters with TTS-1.5 Max. On-prem deployment for enterprises with data sovereignty requirements.
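The per-minute figure follows from the per-character price once a speaking rate is fixed. The sketch below assumes roughly 1,000 characters per minute of speech; that rate is an assumption for illustration, not a number from this page, and actual rates vary with language and delivery.

```python
def cost_per_minute(price_per_million_chars: float, chars_per_minute: float = 1000.0) -> float:
    """Dollars per minute of speech, given a price per 1M characters.

    chars_per_minute defaults to an assumed speaking rate of about 1,000.
    """
    return price_per_million_chars * chars_per_minute / 1_000_000


mini = cost_per_minute(5.0)    # TTS-1.5 Mini at $5 per 1M characters
maxi = cost_per_minute(10.0)   # TTS-1.5 Max at $10 per 1M characters
print(f"Mini: ${mini:.3f}/min, Max: ${maxi:.3f}/min")  # Mini: $0.005/min, Max: $0.010/min
```

At that assumed rate, Mini works out to half a cent per minute and Max to one cent per minute.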

Full breakdown

| Feature | Instant Voice Cloning | Professional Voice Cloning |
| --- | --- | --- |
| Audio required | 5 to 15 seconds | 30+ minutes (5 min minimum, 20+ recommended) |
| Availability | All users via Portal and API | Contact sales |
| Best for | Most applications, rapid prototyping, user-generated cloning | Uncommon voice types, brand voices, maximum fidelity |
| Supported formats | WAV, MP3, WebM (max 4MB) | Contact sales |
| Languages | 15 languages | 15 languages |
| Emotion and audio markups | Supported | Supported |
| Timestamp alignment | Word, character, phoneme, viseme | Word, character, phoneme, viseme |
| On-premise deployment | Available (H100/B200) | Available (H100/B200) |
| Zero data retention | Available | Available |
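The instant-cloning limits in this breakdown (WAV, MP3, or WebM; max 4MB; 5 to 15 seconds) can be checked on the client before upload. The function below is a hypothetical pre-flight check, not part of any Inworld SDK. It treats "4MB" as 4 MiB and the 5 to 15 second range as hard bounds, both of which are assumptions; the service may apply different rules.

```python
from pathlib import PurePath

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".webm"}
MAX_BYTES = 4 * 1024 * 1024           # assumption: "4MB" read as 4 MiB
MIN_SECONDS, MAX_SECONDS = 5.0, 15.0  # assumption: range treated as hard limits


def check_instant_sample(filename: str, size_bytes: int, duration_seconds: float) -> list[str]:
    """Return a list of problems with a sample; an empty list means it passes."""
    problems = []
    if PurePath(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append("unsupported format (expected WAV, MP3, or WebM)")
    if size_bytes > MAX_BYTES:
        problems.append("file larger than 4 MB")
    if not MIN_SECONDS <= duration_seconds <= MAX_SECONDS:
        problems.append("duration outside 5 to 15 seconds")
    return problems


print(check_instant_sample("sample.wav", 900_000, 9.5))    # []
print(check_instant_sample("sample.flac", 6_000_000, 3.0))  # three problems listed
```

Returning every problem at once, rather than raising on the first, lets an onboarding flow show users everything to fix in a single pass.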

Use cases

Social & Tech

AI Companions and Social Apps

Persistent voice identity across sessions. Your companion sounds the same every time a user comes back.

Education

Language Learning

Clone instructor voices for consistent tutoring experiences across 15 languages.

Creative

Content Production

Produce podcasts, narration, and video voiceovers in your own voice without being in the booth.

Interactive

Gaming and Interactive Media

Clone character voices for dynamic in-game dialogue. Scale voice production without booking studio time for every line.

Impact

Accessibility

Preserve a person's voice for text-to-speech interfaces. Clone from a short sample and give users a voice that sounds like them.

How Inworld compares

| | Inworld | ElevenLabs | Cartesia |
| --- | --- | --- | --- |
| Min audio to clone | 5 seconds | ~60 seconds | 3 seconds |
| Latency (first audio) | <200ms (Max), <100ms (Mini) | ~300–400ms | ~40ms (base model) |
| Cost per 1M characters | $5 (Mini) / $10 (Max) | $11–$99 | $15 |
| Languages | 15 | 29 | 17 |
| Quality ranking | #1 Artificial Analysis TTS Arena | | |
| On-premise deployment | ✓ H100/B200 | | |
Zero data retentionEnterprise only
Emotion & audio markupsPartial

Latency measured as median time-to-first-audio. Cost as of March 2026. Quality ranking from Artificial Analysis TTS Arena independent blind evaluation.

FAQ

What is AI voice cloning?

AI voice cloning creates a digital replica of a specific voice from audio samples. The cloned voice can then generate new speech that sounds like the original speaker. Inworld offers instant voice cloning from as little as 5 seconds of audio, and professional voice cloning from 30+ minutes for maximum fidelity.

How do I clone a voice?

Open the TTS Playground in the Inworld Portal. Click Create Voice, then Clone. Upload or record 5 to 15 seconds of audio. Your cloned voice is ready to use in the Playground or via API within minutes. For automation, use the Voice Cloning API with JavaScript or Python.

How much audio do I need?

Instant cloning: 5 to 15 seconds. Professional cloning: 5 minutes minimum, 20 to 30+ minutes recommended. For instant cloning, record in a quiet environment and speak with varied emotion to capture the full range of the voice.

Do I need permission to clone a voice?

Yes. You must confirm you have the rights to clone the voice during the creation process.

Which languages do cloned voices support?

All 15 supported languages: English, Spanish, French, Korean, Chinese, Hindi, Japanese, German, Italian, Dutch, Polish, Portuguese, Russian, Arabic, and Hebrew. Voices perform best when the synthesized text matches the language of the original sample.

What is the difference between instant and professional cloning?

Instant cloning (zero-shot) creates a usable voice from 5 to 15 seconds of audio and is available to all users. Professional cloning fine-tunes a model with 30+ minutes of audio for maximum fidelity. It is recommended for uncommon voice types where instant cloning may not capture the full characteristics.

Is there an API for voice cloning?

Yes. You can automate voice creation programmatically, which is useful for onboarding flows where users clone their own voice and for batch workflows. Code examples in JavaScript and Python are on GitHub.

How much does voice cloning cost?

Cloning itself is free. You pay for speech synthesis: $10/1M characters with TTS-1.5 Max, $5/1M characters with TTS-1.5 Mini.

Can cloned voices express emotion?

Yes. Audio markups cover emotion ([happy], [sad], [angry]), delivery style ([whispering], [laughing]), and non-verbal vocalizations ([breathe], [cough], [sigh]). SSML break tags give precise silence placement. Emotion markups are currently experimental and English only.

Start building

Join millions of developers building the next wave of AI applications.
Copyright © 2021-2026 Inworld AI