Clone any voice from 5 to 15 seconds of audio. Use instant cloning for rapid deployment or professional cloning for maximum fidelity. Every cloned voice runs on Realtime TTS with sub-200ms latency, emotion control, and support for over 100 languages, ready for realtime applications at scale.
Cloning is free. You pay only for the speech you generate.
| Feature | Instant Voice Cloning | Professional Voice Cloning |
|---|---|---|
| Audio required | 5 to 15 seconds | 30+ minutes (5 min minimum, 20+ recommended) |
| Availability | All users via Portal and API | Contact sales |
| Best for | Most applications, rapid prototyping, user-generated cloning | Uncommon voice types, brand voices, maximum fidelity |
| Supported formats | WAV, MP3, WebM (max 4MB) | Contact sales |
| Languages | over 100 languages | over 100 languages |
| Emotion and audio markups | Supported | Supported |
| Timestamp alignment | Word, character, phoneme, viseme | Word, character, phoneme, viseme |
| On-premise deployment | Available (H100/B200); talk to our team | Available (H100/B200); talk to our team |
| Zero data retention | Available | Available |
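
The instant-cloning limits in the table (WAV, MP3, or WebM; at most 4MB; 5 to 15 seconds) can be checked locally before a sample is uploaded. The following is a minimal sketch, not part of any official SDK: `check_sample` and its constants are hypothetical names, "4MB" is assumed to mean 4 × 1024 × 1024 bytes, the 5 to 15 second range is assumed to be a hard window, and duration is measured only for WAV files because Python's standard library cannot decode MP3 or WebM.

```python
import wave
from pathlib import Path

# Assumptions for illustration only (not taken from an official spec):
MAX_BYTES = 4 * 1024 * 1024              # "4MB" interpreted as 4 MiB
ALLOWED_SUFFIXES = {".wav", ".mp3", ".webm"}
MIN_SECONDS, MAX_SECONDS = 5.0, 15.0     # range treated as a hard window


def check_sample(path):
    """Return a list of problems found; an empty list means the sample looks usable."""
    p = Path(path)
    problems = []

    # Format check, based on the file extension only.
    suffix = p.suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported format: {suffix or '(none)'}")

    # Size check against the assumed 4 MiB ceiling.
    size = p.stat().st_size
    if size > MAX_BYTES:
        problems.append(f"file too large: {size} bytes (limit {MAX_BYTES})")

    # Duration check, WAV only: frames divided by frames-per-second.
    if suffix == ".wav":
        try:
            with wave.open(str(p), "rb") as w:
                duration = w.getnframes() / w.getframerate()
        except wave.Error as exc:
            problems.append(f"unreadable WAV: {exc}")
        else:
            if not MIN_SECONDS <= duration <= MAX_SECONDS:
                problems.append(
                    f"duration {duration:.1f}s is outside "
                    f"{MIN_SECONDS:.0f} to {MAX_SECONDS:.0f}s"
                )

    return problems
```

Calling `check_sample("sample.wav")` returns a list of human-readable problems, so an application can show all of them at once instead of failing on the first one.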
Persistent voice identity across sessions. Your companion sounds the same every time a user comes back.
Clone instructor voices for consistent tutoring experiences in over 100 languages.
Produce podcasts, narration, and video voiceovers in your own voice without being in the booth.
Clone talent and narrator voices for dynamic interactive media. Scale voice production without booking studio time for every line.
Preserve a person's voice for text-to-speech interfaces. Clone from a short sample and give users a voice that sounds like them.
| | Inworld | ElevenLabs | Cartesia |
|---|---|---|---|
| Min audio to clone | 5 seconds | ~60 seconds | 3 seconds |
| Latency (first audio) | <200ms (Max), <100ms (Mini) | ~300–400ms | ~40ms (base model) |
| Cost per 1M characters | Down to $5 (Mini) / $10 (TTS-2) | $100 (standard) | ~$35† |
| Languages | over 100 (Realtime TTS-2) | 70+ (Eleven v3) | 17 |
| On-premise deployment | ✓ H100/B200 | ✗ | ✗ |
| Zero data retention | ✓ | Enterprise only | ✗ |
| Emotion & audio markups | ✓ | Partial | ✗ |
Latency is measured as median time-to-first-audio. Cost per 1M characters is taken from provider pricing pages, June 2026. † The Cartesia figure (Sonic 3.5) is an estimated effective rate derived from published tier pricing.
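
To make the per-character pricing concrete, the rates in the comparison table can be turned into a cost estimate for a given volume of generated speech. This is a rough sketch under stated assumptions: the rates are copied from the table as listed (the Inworld figures are "down to" prices and the Cartesia figure is an estimate), and `estimate_cost` and `RATES_PER_MILLION` are hypothetical names used only for illustration. Actual billing may differ by plan and tier.

```python
# USD per 1M characters, copied from the comparison table.
# The Inworld rates are "down to" prices; the Cartesia rate is an estimate.
RATES_PER_MILLION = {
    "Inworld (Mini)": 5.0,
    "Inworld (TTS-2)": 10.0,
    "ElevenLabs (standard)": 100.0,
    "Cartesia (estimated)": 35.0,
}


def estimate_cost(characters, rate_per_million):
    """Return the estimated USD cost of synthesizing `characters` characters."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return characters / 1_000_000 * rate_per_million


if __name__ == "__main__":
    # Example volume: 250,000 characters of generated speech.
    for provider, rate in RATES_PER_MILLION.items():
        print(f"{provider}: ${estimate_cost(250_000, rate):.2f}")
```

For 250,000 characters, this works out to $1.25 and $2.50 at the two Inworld rates, $25.00 at the ElevenLabs rate, and $8.75 at the estimated Cartesia rate.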
