Voice Cloning

The most realistic voice cloning

Clone any voice from 5 to 15 seconds of audio. Instant cloning for rapid deployment, professional cloning for maximum fidelity. Every cloned voice runs on Realtime TTS with sub-200ms latency, emotion control, and support for over 100 languages, ready for realtime applications at scale.

Cloning is free. You pay only for the speech you generate.

Clone any voice, three ways

Instant cloning for rapid deployment and user-generated voices. The Voice Cloning API for automated voice creation. Professional cloning for maximum fidelity where it matters most.

Instant voice cloning

Create a custom voice from 5 to 15 seconds of audio. Zero-shot, available to all users in the Portal. Upload or record a sample and start generating speech in minutes.

Voice Cloning API

Automate voice creation programmatically. Let your users clone their own voices during onboarding, or batch-create voices for content workflows. JavaScript and Python examples on GitHub.
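
A batch workflow of this kind could be sketched roughly as follows. The `create_voice` callable is a hypothetical stand-in for whatever call the Voice Cloning API actually exposes. Its name, arguments, and return value are assumptions made for illustration, not the documented API, so swap in the real client call from the official examples.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple


def batch_create_voices(
    samples: Iterable[Path],
    create_voice: Callable[[str, bytes], str],
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Create one voice per audio sample.

    create_voice is a hypothetical wrapper around the real API call: it
    takes a display name and the raw audio bytes and returns a voice ID.
    Returns (created, failed), where created maps name -> voice ID and
    failed lists (name, error message) pairs.
    """
    created: Dict[str, str] = {}
    failed: List[Tuple[str, str]] = []
    for path in samples:
        name = path.stem  # file name without extension becomes the voice name
        try:
            created[name] = create_voice(name, path.read_bytes())
        except Exception as exc:
            # Record the failure and continue so one bad sample
            # does not stop the rest of the batch.
            failed.append((name, str(exc)))
    return created, failed
```

Passing the API call in as a parameter keeps the batch logic testable without network access and leaves the real request details to the official client code.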

Professional voice cloning

Fine-tuned from 30+ minutes of clean audio. Recommended for uncommon voice types: children's voices, unique accents, or brand voices where instant cloning may not capture the full characteristics. Available by contacting the Inworld team.

Built for realtime applications

Every cloned voice runs on Realtime TTS with the same production-grade latency, expressiveness, and multilingual support, and drops straight into the Realtime API alongside speech-to-text and LLM routing on one bill.

Production-grade latency

<200ms median first chunk for Realtime TTS 1.5 Max. <100ms for Realtime TTS 1.5 Mini. Streaming-native over WebSocket. No buffering delay.

Expression and emotion control

Audio markups for emotion ([happy], [sad], [angry]), delivery style ([whispering], [laughing]), and non-verbal vocalizations ([breathe], [cough], [sigh]). SSML break tags for precise silence placement.
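
As a rough illustration, text with markups could be assembled with small helpers like the ones below. The tag names come from the list above. Placing a markup directly before the text it affects, and the `time="...ms"` attribute on the break tag (standard SSML syntax), are assumptions here, so check the product documentation for the exact accepted forms.

```python
# Tags named in the text above; the service may accept others.
EMOTION_TAGS = {"happy", "sad", "angry"}
STYLE_TAGS = {"whispering", "laughing"}
NONVERBAL_TAGS = {"breathe", "cough", "sigh"}
KNOWN_TAGS = EMOTION_TAGS | STYLE_TAGS | NONVERBAL_TAGS


def markup(tag: str) -> str:
    """Return the bracketed form of a known audio markup, e.g. "[happy]"."""
    if tag not in KNOWN_TAGS:
        raise ValueError(f"unknown audio markup: {tag!r}")
    return f"[{tag}]"


def with_markup(tag: str, text: str) -> str:
    """Prefix text with a markup (prefix placement is an assumption)."""
    return f"{markup(tag)} {text}"


def ssml_break(milliseconds: int) -> str:
    """Return an SSML break tag for a pause of the given length."""
    if milliseconds < 0:
        raise ValueError("pause length must not be negative")
    return f'<break time="{milliseconds}ms"/>'


line = with_markup("whispering", "I have a secret.") + " " + ssml_break(500)
```

Validating tags before sending text catches typos such as `[hapy]` locally instead of letting them be read aloud or silently ignored.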

Over 100 languages

English, Spanish, French, Korean, Chinese, Hindi, Japanese, German, and more. Native-speaker quality in every language, with cross-lingual cloning on Realtime TTS-2.

Voice AI down to $0.005/min

Down to $5 per 1M characters with Realtime TTS 1.5 Mini and down to $10 per 1M characters with Realtime TTS-2. New pricing cuts rates in half or more for most developers, and your rate keeps falling as you scale. On-prem deployment is available for enterprises with data sovereignty requirements.
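
The per-character rates translate into a simple estimate. The sketch below uses the two "down to" rates quoted above, so actual cost depends on your pricing tier. The figure of roughly 1,000 characters per minute of speech is our inference from the $0.005/min headline and the $5 per 1M rate, not a published conversion.

```python
# Lowest quoted rates, in USD per 1M characters (tier-dependent).
RATE_TTS_15_MINI = 5.0
RATE_TTS_2 = 10.0

# Assumed speaking pace used to relate characters to minutes of audio.
ASSUMED_CHARS_PER_MINUTE = 1_000


def synthesis_cost_usd(characters: int, rate_per_million: float) -> float:
    """Estimated synthesis cost for a number of characters at a given rate."""
    return characters * rate_per_million / 1_000_000


def cost_per_minute_usd(rate_per_million: float) -> float:
    """Estimated cost of one minute of speech at the assumed pace."""
    return synthesis_cost_usd(ASSUMED_CHARS_PER_MINUTE, rate_per_million)


monthly_chars = 250_000
estimate = synthesis_cost_usd(monthly_chars, RATE_TTS_2)  # 2.5 USD
```

Cloning itself does not appear in the estimate because only generated speech is billed.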

Full breakdown

| Feature | Instant Voice Cloning | Professional Voice Cloning |
| Audio required | 5 to 15 seconds | 30+ minutes (5 min minimum, 20+ recommended) |
| Availability | All users via Portal and API | Contact sales |
| Best for | Most applications, rapid prototyping, user-generated cloning | Uncommon voice types, brand voices, maximum fidelity |
| Supported formats | WAV, MP3, WebM (max 4MB) | Contact sales |
| Languages | Over 100 languages | Over 100 languages |
| Emotion and audio markups | Supported | Supported |
| Timestamp alignment | Word, character, phoneme, viseme | Word, character, phoneme, viseme |
| On-premise deployment | Available (H100/B200); talk to our team | Available (H100/B200); talk to our team |
| Zero data retention | Available | Available |
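
The instant-cloning limits in the table above can be checked locally before uploading a sample. This is a minimal sketch using only the Python standard library. Reading "4MB" as 4 MiB and treating 5 to 15 seconds as a hard range are assumptions, and the duration check covers only uncompressed PCM WAV files, since MP3 and WebM would need a third-party decoder.

```python
import wave
from pathlib import Path
from typing import List

MAX_BYTES = 4 * 1024 * 1024  # "max 4MB", interpreted here as 4 MiB
ALLOWED_SUFFIXES = {".wav", ".mp3", ".webm"}
MIN_SECONDS, MAX_SECONDS = 5.0, 15.0


def check_sample(path: Path) -> List[str]:
    """Return a list of problems found; an empty list means the sample looks OK."""
    problems: List[str] = []
    suffix = path.suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported format: {suffix or '(none)'}")
    if path.stat().st_size > MAX_BYTES:
        problems.append("file is larger than 4MB")
    if suffix == ".wav":
        # Duration = frame count / sample rate (PCM WAV only).
        with wave.open(str(path), "rb") as wav:
            seconds = wav.getnframes() / wav.getframerate()
        if not MIN_SECONDS <= seconds <= MAX_SECONDS:
            problems.append(f"duration {seconds:.1f}s is outside 5 to 15 seconds")
    return problems
```

Returning every problem at once, rather than raising on the first, lets an onboarding flow show the user everything that needs fixing in a single pass.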

Use cases

Social & Tech

AI Companions and Social Apps

Persistent voice identity across sessions. Your companion sounds the same every time a user comes back.

Education

Language Learning

Clone instructor voices for consistent tutoring experiences in over 100 languages.

Creative

Content Production

Produce podcasts, narration, and video voiceovers in your own voice without being in the booth.

Interactive

Interactive Media

Clone talent and narrator voices for dynamic interactive media. Scale voice production without booking studio time for every line.

Impact

Accessibility

Preserve a person's voice for text-to-speech interfaces. Clone from a short sample and give users a voice that sounds like them.

How Inworld compares

| Feature | Inworld | ElevenLabs | Cartesia |
| Min audio to clone | 5 seconds | ~60 seconds | 3 seconds |
| Latency (first audio) | <200ms (Max), <100ms (Mini) | ~300–400ms | ~40ms (base model) |
| Cost per 1M characters | Down to $5 (Mini) / $10 (TTS-2) | $100 (standard) | ~$35† |
| Languages | Over 100 (Realtime TTS-2) | 70+ (Eleven v3) | 17 |
| On-premise deployment | ✓ H100/B200 | | |
Zero data retentionEnterprise only
Emotion & audio markupsPartial

Latency is measured as median time-to-first-audio. Cost per 1M characters is taken from provider pricing pages, June 2026. †The Cartesia Sonic 3.5 figure is an estimated effective rate derived from published tier pricing.
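
To reproduce a median time-to-first-audio figure for your own setup, something along these lines may help. It is a generic sketch, not the benchmark method used for the numbers above: `fake_stream` is a hypothetical stand-in for a real streaming response, and in real use the timer should start when the request is sent.

```python
import statistics
import time
from typing import Iterable, Iterator, List


def time_to_first_chunk_ms(stream: Iterable[bytes]) -> float:
    """Return milliseconds elapsed until the stream yields its first chunk."""
    start = time.perf_counter()
    iterator: Iterator[bytes] = iter(stream)
    next(iterator)  # blocks until the first audio chunk arrives
    return (time.perf_counter() - start) * 1000.0


def median_first_chunk_ms(samples_ms: List[float]) -> float:
    """Median of several time-to-first-chunk measurements."""
    return statistics.median(samples_ms)


def fake_stream(delay_s: float) -> Iterator[bytes]:
    """Stand-in for a streaming TTS response: waits, then yields audio chunks."""
    time.sleep(delay_s)
    yield b"\x00" * 320
    yield b"\x00" * 320


runs = [time_to_first_chunk_ms(fake_stream(0.05)) for _ in range(5)]
median_ms = median_first_chunk_ms(runs)
```

The median is less sensitive than the mean to an occasional slow request, which is why it is the usual summary for first-chunk latency.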

FAQ

What is AI voice cloning?

AI voice cloning creates a digital replica of a specific voice from audio samples. The cloned voice can then generate new speech that sounds like the original speaker. Inworld offers instant voice cloning from 5 to 15 seconds of audio, and professional voice cloning from 30+ minutes for maximum fidelity.

How do I clone a voice with Inworld?

Open the TTS Playground in the Inworld Portal. Click Create Voice, then Clone. Upload or record 5 to 15 seconds of audio. Your cloned voice is ready to use in the Playground or via API within minutes. For automation, use the Voice Cloning API with JavaScript or Python.

How much audio do I need to clone a voice?

Instant cloning: 5 to 15 seconds. Professional cloning: 5 minutes minimum, 20 to 30+ minutes recommended. For instant cloning, record in a quiet environment and speak with varied emotion to capture the full range of the voice.

Do I need permission to clone a voice?

Yes. You must confirm you have the rights to clone the voice during the creation process.

Which languages do cloned voices support?

Cloned voices speak over 100 languages with Realtime TTS-2, with cross-lingual synthesis that keeps the same voice identity across languages. On Realtime TTS 1.5, cloning supports 15 languages: English, Spanish, French, Korean, Chinese, Hindi, Japanese, German, Italian, Dutch, Polish, Portuguese, Russian, Arabic, and Hebrew. Voices perform best when the synthesized text matches the language of the original sample.

What is the difference between instant and professional voice cloning?

Instant cloning (zero-shot) creates a usable voice from 5 to 15 seconds of audio and is available to all users. Professional cloning fine-tunes a model with 30+ minutes of audio for maximum fidelity. It is recommended for uncommon voice types where instant cloning may not capture the full characteristics.

Is there an API for voice cloning?

Yes. The Voice Cloning API automates voice creation programmatically. It is useful for onboarding flows where users clone their own voice, and for batch workflows. Code examples in JavaScript and Python are on GitHub.

How much does voice cloning cost?

Cloning itself is free. You pay only for speech synthesis: down to $10 per 1M characters with Realtime TTS-2 and down to $5 per 1M characters with Realtime TTS 1.5 Mini. New pricing cut rates in half or more for most developers, and rates fall further as your volume grows. Full tiers are at inworld.ai/pricing.

Can cloned voices express emotion?

Yes. Audio markups cover emotion ([happy], [sad], [angry]), delivery style ([whispering], [laughing]), and non-verbal vocalizations ([breathe], [cough], [sigh]). SSML break tags give precise silence placement. Emotion markups are currently experimental and English only.

Start building

Join millions of developers building the next wave of AI applications.
Copyright © 2021-2026 Inworld AI