Question 1

How do I use text-to-speech?

Accepted Answer

Getting started is simple. You can try Inworld TTS instantly in the TTS Playground, where you can test voices, adjust settings, and experiment with features like instant voice cloning. When you're ready to integrate TTS into your application, you can follow the Developer Quickstart to make your first API request in minutes. Just create an API key in the Inworld Portal, then synthesize speech with a single POST request. Inworld supports multiple output formats, including MP3, Linear PCM (WAV), and Opus, making it easy to integrate with almost any system.
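As a rough illustration of the "single POST request" step, the sketch below builds such a request with Python's standard library without sending it. The endpoint URL, the Basic authorization scheme, the body field names (text, voiceId, modelId), and the example voice and model identifiers are all assumptions made for illustration; take the real values from the Developer Quickstart.

```python
import json
import urllib.request

# Assumed endpoint; confirm against the Developer Quickstart before use.
TTS_URL = "https://api.inworld.ai/tts/v1/voice"


def build_tts_request(api_key, text, voice_id, model_id):
    """Build (but do not send) a JSON POST request for speech synthesis.

    The header scheme and field names below are assumptions, not confirmed API.
    """
    body = json.dumps({"text": text, "voiceId": voice_id, "modelId": model_id})
    return urllib.request.Request(
        TTS_URL,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Basic {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_tts_request(
    api_key="YOUR_API_KEY",        # created in the Inworld Portal
    text="Hello from Inworld TTS.",
    voice_id="example-voice",      # hypothetical voiceId
    model_id="example-model",      # hypothetical model identifier
)
# To send it: urllib.request.urlopen(request) returns the HTTP response.
```

Building the request separately from sending it makes it easy to inspect the payload before spending any characters against your quota.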

Question 2

How is Inworld TTS quality evaluated?

Accepted Answer

Inworld TTS is evaluated through blind listening tests by thousands of real users. TTS-1.5 Max delivers over 30% more expressiveness than its predecessor, with optimized stability and natural conversational delivery.

Question 3

Which TTS-1.5 model should I use?

Accepted Answer

For most applications: TTS-1.5 Max (~250ms latency, $30/1M characters)

TTS-1.5 Max offers the best balance of quality and speed. The enhanced stability means fewer edge cases, better voice cloning fidelity, and more consistent output across languages.

For latency-critical applications: TTS-1.5 Mini (~130ms latency, $15/1M characters)

Choose TTS-1.5 Mini only if minimal latency is your absolute top priority, for example real-time gaming or ultra-responsive voice agents where every millisecond matters.
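To compare the two models on price, the per-character rates quoted above can be turned into a quick estimate. This is a minimal sketch that assumes simple linear pricing with no tiers, minimums, or discounts; the function name and the monthly volume are illustrative only.

```python
# Prices per 1M characters as quoted above (USD).
PRICE_PER_MILLION = {"TTS-1.5 Max": 30.0, "TTS-1.5 Mini": 15.0}


def estimate_cost(model, characters):
    """Return the estimated USD cost of synthesizing `characters` characters."""
    return PRICE_PER_MILLION[model] * characters / 1_000_000


# Example: 2.5 million characters per month on each model.
max_cost = estimate_cost("TTS-1.5 Max", 2_500_000)    # 75.0
mini_cost = estimate_cost("TTS-1.5 Mini", 2_500_000)  # 37.5
```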

Question 4

What is the latency and time-to-first-byte (TTFB) of Inworld TTS?

Accepted Answer

TTS-1.5 Mini achieves ~130ms first-chunk latency. TTS-1.5 Max delivers ~250ms with enhanced stability and quality. Both support real-time streaming via WebSocket. For most applications, we recommend TTS-1.5 Max — the quality improvement is worth the marginal latency tradeoff.
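If you want to check first-chunk latency from your own client, one approach is to time how long a stream takes to yield its first chunk. The sketch below is generic: it works on any iterable of chunks, and the fake_stream generator is a hypothetical stand-in for a real streaming response, not part of any Inworld SDK.

```python
import time


def time_to_first_chunk(chunks):
    """Return (seconds until the first chunk arrived, that chunk).

    `chunks` is any iterable that yields audio chunks as they arrive.
    """
    start = time.perf_counter()
    first = next(iter(chunks))
    return time.perf_counter() - start, first


def fake_stream():
    """Stand-in for a real streaming response (hypothetical)."""
    time.sleep(0.05)  # simulated 50 ms wait before the first chunk
    yield b"\x00\x01"
    yield b"\x02\x03"


latency_s, first_chunk = time_to_first_chunk(fake_stream())
```

A measurement taken this way includes network round-trip time, so it will usually be higher than the model-side figures quoted above.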

Question 5

Does Inworld offer voice cloning?

Accepted Answer

Yes. Inworld provides two types of voice cloning:

Instant (zero-shot) voice cloning
- Available to all users in the Portal
- Creates a custom voice from just 15 seconds of audio
- Ready to use in minutes

Professional voice cloning
- Fine-tuned using 30+ minutes of clean audio (minimum ~5 minutes, 20+ minutes recommended for best results)
- Recommended for uncommon voice types such as children's voices or unique accents, where instant cloning may not perform well
- Currently available by contacting the Inworld sales team

Both methods allow you to use your cloned voice directly in the Playground or via API using its unique voiceId.

Question 6

Which languages does Inworld TTS support?

Accepted Answer

Inworld TTS-1.5 supports 15 languages: English, Spanish, French, Korean, Dutch, Chinese, German, Italian, Japanese, Polish, Portuguese, Russian, Hindi, Arabic, and Hebrew. For multilingual applications, we recommend TTS-1.5 Max for the best pronunciation, intonation, and naturalness across all supported languages.

Question 7

Can I control emotion, speed, and other voice characteristics?

Accepted Answer

Absolutely. Inworld TTS provides several ways to customize how the speech sounds:

Voice parameters
- Temperature: Controls expressiveness and randomness
- Talking speed: 0.5× to 1.5× of the native speaking rate
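Because talking speed is limited to the 0.5× to 1.5× range, it can help to clamp user-supplied values before building a request. The sketch below shows one way to do that; the dictionary key names and the temperature value are hypothetical, since the exact request fields are not given here.

```python
MIN_SPEED, MAX_SPEED = 0.5, 1.5  # range stated above, relative to native rate


def clamp_speed(speed):
    """Clamp a requested talking speed into the supported 0.5x-1.5x range."""
    return max(MIN_SPEED, min(MAX_SPEED, speed))


def voice_settings(temperature, speed):
    """Bundle voice parameters; the key names here are hypothetical."""
    return {"temperature": temperature, "speakingRate": clamp_speed(speed)}


# A request for 2.0x is out of range, so it is clamped to 1.5x.
settings = voice_settings(temperature=0.8, speed=2.0)
```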

Question 8

Does Inworld support lipsync, word highlighting, or timestamp alignment?

Accepted Answer

Yes. Inworld TTS supports timestamp alignment for word-, character-, phoneme-, and viseme-level synchronization. This can be helpful for subtitles, captions, lipsync, and more. You can enable it in your API request by setting timestampType to WORD or CHARACTER.

The API response includes:
- word or character tokens
- start and end timestamps (in seconds)
- structured alignment data matching the generated audio
- phoneme-level timing and viseme symbols for lip-sync (TTS-1.5 models only)

Timestamp alignment currently supports English, with other languages available experimentally. Note that enabling timestamps currently adds roughly 100 ms of additional latency.
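As an example of what word-level timestamps make possible, the sketch below groups word timings into short caption cues. The input shape (a list of dictionaries with word, start, and end keys) is an assumption for illustration and may not match the actual response layout; the sample data is made up.

```python
def words_to_cues(words, max_words=4):
    """Group word timings into caption cues of at most `max_words` words.

    Each input item is assumed to look like
    {"word": "Hello", "start": 0.0, "end": 0.4}, with times in seconds.
    """
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(w["word"] for w in group)
        cues.append((group[0]["start"], group[-1]["end"], text))
    return cues


alignment = [  # made-up sample data, not real API output
    {"word": "Welcome", "start": 0.00, "end": 0.42},
    {"word": "to", "start": 0.42, "end": 0.55},
    {"word": "the", "start": 0.55, "end": 0.68},
    {"word": "demo", "start": 0.68, "end": 1.10},
    {"word": "today", "start": 1.10, "end": 1.60},
]
cues = words_to_cues(alignment)
```

Each cue is a (start, end, text) tuple, which maps directly onto subtitle formats or a word-highlighting UI.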

Question 9

What's new in TTS-1.5?

Accepted Answer

TTS-1.5 is a major update delivering improvements across speed, quality, and accessibility:

- The Fastest: TTS-1.5 Mini reaches ~130ms first-chunk latency, the fastest realtime TTS available. TTS-1.5 Max delivers ~250ms with enhanced quality.
- The Highest Quality: Optimized stability to minimize hallucinations, cutoffs, and artifacts. Over 30% more expressive than TTS-1.
- The Most Accessible: 15 languages (including Hindi), enhanced voice cloning, on-premise H100/B200 deployment, and 25x lower cost than alternatives.

Which model should I use? For most applications, we recommend TTS-1.5 Max. Use TTS-1.5 Mini only when minimal latency is the top priority.

Question 10

Can I migrate my voices from ElevenLabs to Inworld?

Accepted Answer

Yes. Inworld provides a free, open-source ElevenLabs Migration Tool that lets you batch-transfer your custom voice clones from ElevenLabs to Inworld. The tool automatically downloads your ElevenLabs voice samples, handles audio processing (format conversion, padding, and trimming), and re-clones them in Inworld. The migration runs entirely on your local machine with direct API communication — no data is proxied through any intermediary servers. You can also preview your migrated voices with Inworld TTS before finalizing.

#1 ranked, most natural voice AI.

The top ranked TTS in the world. Proven by real users.

Instant custom voice creation.

Realtime latency. Feels instant.

15+ languages. Native-speaker quality.

Starting at $15 per million characters.
