Voice Cloning

The most realistic voice cloning

Q: What is AI voice cloning?

AI voice cloning creates a digital replica of a specific voice from audio samples. The cloned voice can then generate new speech that sounds like the original speaker. Inworld offers instant voice cloning from 5 to 15 seconds of audio, and professional voice cloning from 30+ minutes for maximum fidelity.

Q: How do I clone a voice with Inworld?

Open the TTS Playground in the Inworld Portal. Click Create Voice, then Clone. Upload or record 5 to 15 seconds of audio. Your cloned voice is ready to use in the Playground or via API within minutes. For automation, use the Voice Cloning API with JavaScript or Python.

Q: How much audio do I need?

Instant cloning: 5 to 15 seconds. Professional cloning: 5 minutes minimum, 20 to 30+ minutes recommended. For instant cloning, record in a quiet environment and speak with varied emotion to capture the full range of the voice.
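
The duration thresholds above can be turned into a quick pre-upload check. This is a minimal sketch, not part of any Inworld SDK: the function names are hypothetical, it reads only uncompressed WAV files through Python's standard wave module, and the advice to trim samples between 15 seconds and 5 minutes is an assumption rather than documented behavior.

```python
import wave

INSTANT_MIN_S = 5.0          # instant cloning: 5 to 15 seconds
INSTANT_MAX_S = 15.0
PROFESSIONAL_MIN_S = 5 * 60  # professional cloning: 5 minutes minimum


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV file given a path or binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def cloning_option(duration_s: float) -> str:
    """Map a sample duration to the cloning option it qualifies for."""
    if duration_s < INSTANT_MIN_S:
        return "too short"
    if duration_s <= INSTANT_MAX_S:
        return "instant"
    if duration_s >= PROFESSIONAL_MIN_S:
        return "professional"
    # Between 15 s and 5 min (assumption): trim to 5-15 s for instant cloning.
    return "trim for instant"
```

For example, cloning_option(wav_duration_seconds("sample.wav")) reports which option a local recording qualifies for before you upload it.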

Q: Can I use cloned voices commercially?

Yes. You must confirm you have the rights to clone the voice during the creation process.

Q: Which languages support voice cloning?

Cloned voices speak over 100 languages with Realtime TTS-2, with cross-lingual synthesis that keeps the same voice identity across languages. On Realtime TTS 1.5, cloning supports 15 languages: English, Spanish, French, Korean, Chinese, Hindi, Japanese, German, Italian, Dutch, Polish, Portuguese, Russian, Arabic, and Hebrew. Voices perform best when the synthesized text matches the language of the original sample.
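
The language split between the two model families can be expressed as a small lookup. This is a sketch under stated assumptions: the helper name is hypothetical, the strings are the product names used on this page rather than API model identifiers, and the requested language is assumed to be among the 100+ that Realtime TTS-2 covers, since that list is not enumerated here.

```python
# Languages listed above for cloning on Realtime TTS 1.5.
TTS_1_5_LANGUAGES = frozenset({
    "English", "Spanish", "French", "Korean", "Chinese", "Hindi",
    "Japanese", "German", "Italian", "Dutch", "Polish", "Portuguese",
    "Russian", "Arabic", "Hebrew",
})


def models_for_language(language: str) -> list[str]:
    """Return the model families whose cloned voices can speak `language`."""
    # Assumption: the language is one of the 100+ covered by Realtime TTS-2.
    models = ["Realtime TTS-2"]
    if language.strip().title() in TTS_1_5_LANGUAGES:
        models.append("Realtime TTS 1.5")
    return models
```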

Q: What is the difference between instant and professional voice cloning?

Instant cloning (zero-shot) creates a usable voice from 5 to 15 seconds of audio and is available to all users. Professional cloning fine-tunes a model with 30+ minutes of audio for maximum fidelity and is recommended for uncommon voice types where instant cloning may not capture the full characteristics.

Q: Is there a voice cloning API?

Yes. The Voice Cloning API lets you create voices programmatically, which is useful for onboarding flows where users clone their own voice and for batch workflows. Code examples in JavaScript and Python are available on GitHub.
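
To give a feel for what an automated onboarding step might look like, here is an illustrative sketch that builds, but does not send, a voice-creation request. Everything specific in it is an assumption and not the documented Voice Cloning API: the endpoint is a placeholder, the JSON field names and the Bearer authorization scheme are hypothetical, and the function name is invented. Consult the official examples on GitHub for the real request shape.

```python
import base64
import json
import urllib.request

# Placeholder only: NOT a real Inworld endpoint. Replace with the documented URL.
CLONE_ENDPOINT = "https://api.example.invalid/voices:clone"


def build_clone_request(audio: bytes, display_name: str, api_key: str,
                        rights_confirmed: bool) -> urllib.request.Request:
    """Build (without sending) a request that uploads a voice sample.

    The JSON field names and auth scheme are hypothetical illustrations.
    """
    if not rights_confirmed:
        # Mirrors the rights confirmation required during voice creation.
        raise PermissionError("confirm you have the rights to clone this voice")
    body = json.dumps({
        "displayName": display_name,
        "audioBase64": base64.b64encode(audio).decode("ascii"),
    }).encode("utf-8")
    return urllib.request.Request(
        CLONE_ENDPOINT,
        data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
```

Keeping request construction separate from sending makes the rights check and payload easy to unit-test before any audio leaves the user's device.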

Q: How much does voice cloning cost?

Cloning itself is free. You pay only for speech synthesis: down to $10 per 1M characters with Realtime TTS-2 and down to $5 per 1M characters with Realtime TTS 1.5 Mini. New pricing cuts rates in half or more for most developers, and rates fall further as your volume grows. Full tiers are at inworld.ai/pricing.
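
As a rough budgeting aid, the per-character rates translate into a one-line estimate. This sketch assumes the lowest ("down to") rates quoted above apply. Actual rates depend on your volume tier, so treat the result as a lower bound. The function and constant names are hypothetical.

```python
from decimal import Decimal

# Lowest rates quoted above, in USD per 1M characters (floor, not list price).
FLOOR_RATE_PER_MILLION = {
    "Realtime TTS-2": Decimal("10"),
    "Realtime TTS 1.5 Mini": Decimal("5"),
}


def synthesis_cost_usd(characters: int, model: str) -> Decimal:
    """Estimate synthesis cost at the floor rate; cloning itself adds nothing."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return FLOOR_RATE_PER_MILLION[model] * characters / Decimal(1_000_000)
```

For instance, 250,000 characters on Realtime TTS-2 comes to $2.50 at the floor rate, and the same text on Realtime TTS 1.5 Mini comes to $1.25.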

Q: Can I control emotion and delivery?

Yes. Use audio markups for emotion ([happy], [sad], [angry]), delivery style ([whispering], [laughing]), and non-verbal vocalizations ([breathe], [cough], [sigh]), plus SSML break tags for precise silence. Emotion markups are currently experimental and English only.
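
A small helper can keep markup names consistent when assembling text for synthesis. This is a sketch, not an official utility: the helper names are hypothetical, placing a markup directly before the text it affects is an assumption, and the break tag uses the standard SSML form with a millisecond duration, which is assumed to be accepted here.

```python
EMOTIONS = {"happy", "sad", "angry"}
DELIVERY = {"whispering", "laughing"}
NON_VERBAL = {"breathe", "cough", "sigh"}
KNOWN_MARKUPS = EMOTIONS | DELIVERY | NON_VERBAL


def tag(name: str) -> str:
    """Return a bracketed audio markup, rejecting names not listed above."""
    if name not in KNOWN_MARKUPS:
        raise ValueError(f"unknown markup: {name!r}")
    return f"[{name}]"


def pause(milliseconds: int) -> str:
    """Return an SSML break tag for a silence of the given length."""
    if milliseconds <= 0:
        raise ValueError("pause must be positive")
    return f'<break time="{milliseconds}ms"/>'


# Assumed placement: each markup precedes the text it applies to.
line = f"{tag('happy')} We did it! {pause(500)} {tag('sigh')} Finally."
```

Validating names up front catches typos such as [hapy] before they reach the synthesizer, where an unrecognized markup might be read aloud as text.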

Clone any voice from 5 to 15 seconds of audio. Instant cloning for rapid deployment, professional cloning for maximum fidelity. Every cloned voice runs on Realtime TTS with sub-200ms latency, emotion control, and support for over 100 languages, ready for realtime applications at scale.

Cloning is free. You pay only for the speech you generate.

Original voice

5 sec sample

Cloned voice

Ready in minutes, sub-200ms latency

Clone any voice, three ways

Instant cloning for rapid deployment and user-generated voices. Professional cloning for maximum fidelity where it matters most.

Instant voice cloning

Create a custom voice from 5 to 15 seconds of audio. Zero-shot, available to all users in the Portal. Upload or record a sample and start generating speech in minutes.

Voice Cloning API

Automate voice creation programmatically. Let your users clone their own voices during onboarding, or batch-create voices for content workflows. JavaScript and Python examples on GitHub.

Professional voice cloning

Fine-tuned from 30+ minutes of clean audio. Recommended for uncommon voice types: children's voices, unique accents, or brand voices where instant cloning may not capture the full characteristics. Available by contacting the Inworld team.

Built for realtime applications

Every cloned voice runs on Realtime TTS with the same production-grade latency, expressiveness, and multilingual support, and drops straight into the Realtime API alongside speech-to-text and LLM routing on one bill.

Production-grade latency

Median time to first audio chunk is under 200ms for Realtime TTS 1.5 Max and under 100ms for Realtime TTS 1.5 Mini. Streaming-native over WebSocket, with no buffering delay.

Expression and emotion control

Audio markups for emotion ([happy], [sad], [angry]), delivery style ([whispering], [laughing]), and non-verbal vocalizations ([breathe], [cough], [sigh]). SSML break tags for precise silence placement.

Over 100 languages

English, Spanish, French, Korean, Chinese, Hindi, Japanese, German, and more. Native-speaker quality in every language, with cross-lingual cloning on Realtime TTS-2.

Voice AI down to $0.005/min

Down to $5 per 1M characters with Realtime TTS 1.5 Mini and down to $10 per 1M with Realtime TTS-2. New pricing cuts rates in half or more for most developers, and your rate keeps falling as you scale. On-prem deployment for enterprises with data sovereignty requirements.
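
The per-minute headline and the per-character rate can be reconciled with a little arithmetic. This sketch assumes the $0.005/min figure is derived from the $5 per 1M characters Mini rate. The page does not state the conversion, so the implied pace of 1,000 characters per minute of audio is an inference, and the names below are hypothetical.

```python
from fractions import Fraction

MINI_RATE_PER_CHAR = Fraction(5, 1_000_000)   # $5 per 1M characters
HEADLINE_RATE_PER_MINUTE = Fraction(5, 1000)  # $0.005 per minute

# Characters of text per minute of audio implied by the two figures.
IMPLIED_CHARS_PER_MINUTE = HEADLINE_RATE_PER_MINUTE / MINI_RATE_PER_CHAR


def usd_per_minute(usd_per_million_chars: int,
                   chars_per_minute: Fraction = IMPLIED_CHARS_PER_MINUTE) -> Fraction:
    """Convert a per-character rate to a per-minute rate at a given pace."""
    return Fraction(usd_per_million_chars, 1_000_000) * chars_per_minute
```

At the same assumed pace, the $10 per 1M characters rate works out to $0.01 per minute of generated audio.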

Full breakdown

Feature	Instant Voice Cloning	Professional Voice Cloning
Audio required	5 to 15 seconds	5 minutes minimum; 20 to 30+ minutes recommended
Availability	All users via Portal and API	Contact sales
Best for	Most applications, rapid prototyping, user-generated cloning	Uncommon voice types, brand voices, maximum fidelity
Supported formats	WAV, MP3, WebM (max 4MB)	Contact sales
Languages	Over 100 languages	Over 100 languages
Emotion and audio markups	Supported	Supported
Timestamp alignment	Word, character, phoneme, viseme	Word, character, phoneme, viseme
On-premise deployment	Available (H100/B200)	Available (H100/B200)
Zero data retention	Available	Available
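
The format and size limits for instant cloning in the table above lend themselves to a client-side check before upload. This is a minimal sketch with hypothetical names. It inspects only the file extension rather than the actual audio encoding, and it reads "4MB" as 4,000,000 bytes, which is an assumption because the page does not say whether the limit is decimal or binary.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".webm"}
MAX_BYTES = 4 * 1000 * 1000  # "4MB" read as decimal megabytes (assumption)


def check_sample(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append("unsupported format")
    if size_bytes > MAX_BYTES:
        problems.append("file larger than 4 MB")
    return problems
```

Returning every problem at once, rather than raising on the first one, lets an upload form show the user everything that needs fixing in a single pass.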

Use cases

Social & Tech

AI Companions and Social Apps

Persistent voice identity across sessions. Your companion sounds the same every time a user comes back.

Education

Language Learning

Clone instructor voices for consistent tutoring experiences in over 100 languages.

Creative

Content Production

Produce podcasts, narration, and video voiceovers in your own voice without being in the booth.

Interactive

Interactive Media

Clone talent and narrator voices for dynamic interactive media. Scale voice production without booking studio time for every line.

Impact

Accessibility

Preserve a person's voice for text-to-speech interfaces. Clone from a short sample and give users a voice that sounds like them.

How Inworld compares

	Inworld	ElevenLabs	Cartesia
Min audio to clone	5 seconds	~60 seconds	3 seconds
Latency (first audio)	<200ms (Max), <100ms (Mini)	~300–400ms	~40ms (base model)
Cost per 1M characters	Down to $5 (Mini) / $10 (TTS-2)	$100 (standard)	~$35†
Languages	over 100 (Realtime TTS-2)	70+ (Eleven v3)	17
On-premise deployment	✓ H100/B200	✗	✗
Zero data retention	✓	Enterprise only	✗
Emotion & audio markups	✓	Partial	✗

Latency is measured as median time-to-first-audio. Cost per 1M characters is taken from provider pricing pages, June 2026. † The Cartesia figure (Sonic 3.5) is an estimated effective rate derived from published tier pricing.

FAQ