Inworld TTS

State-of-the-art voice AI at a radically accessible price. Instant voice cloning, rich multilingual support, real-time streaming, and emotion plus non-verbal control, all for just $5 per million characters.

Features

Radically accessible pricing
$5 per million characters—just 5% of competitors’ pricing—with no compromises on quality.
Multilingual
Multiple languages including English, Spanish, French, Korean, and Chinese, all with native-speaker quality.
State-of-the-art quality
Launched at #1 on the Hugging Face TTS Arena, with clearer speech, lower word error rate (WER), and higher speaker similarity (SIM) than leading systems.
Blazingly fast
Sub-250 ms latency optimized for real-time conversational AI with streaming support.
Voice cloning
Create custom voices instantly from 2–15 seconds of audio, or fine-tune a professionally cloned voice.
Voice tags
Add emotion, delivery style, and non-verbal sounds to make speech more expressive and natural.

Full breakdown

Version: Inworld-TTS-1 | Inworld-TTS-1-max
Radically accessible pricing: $5/1M characters (≈ $0.25 per audio-hour) | $10/1M characters (≈ $0.50 per audio-hour)
Power
State-of-the-art quality
(WER & similarity)
Real-time latency
Soon
Multilingual
Free zero-shot voice cloning
Professional voice cloning
(custom fine-tuning)
Audio markups
(emotion/style/non-verbals)
Timestamp alignment
Custom pronunciation
Embedded safeguards
SOC2 Type II
GDPR
On-Premise deployments
Open-source training & modeling code
Cross-lingual
(same voice, language switch)
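
The pricing figures in the breakdown above can be turned into a quick budget estimate. The sketch below is illustrative only: the function names are hypothetical, and the characters-per-hour figure is derived from the listed prices ($5 per 1M characters at about $0.25 per audio-hour implies roughly 50,000 characters per hour), so treat the results as approximations.

```python
# Prices per 1M characters, taken from the breakdown above.
PRICE_PER_MILLION_USD = {
    "Inworld-TTS-1": 5.00,
    "Inworld-TTS-1-max": 10.00,
}

# Derived, not quoted: $5 per 1M characters at about $0.25 per audio-hour
# means 1M characters cover about 20 audio-hours, i.e. ~50,000 characters/hour.
CHARS_PER_AUDIO_HOUR = 50_000


def estimate_cost_usd(num_chars: int, version: str) -> float:
    """Approximate synthesis cost in USD for num_chars characters."""
    return num_chars / 1_000_000 * PRICE_PER_MILLION_USD[version]


def estimate_audio_hours(num_chars: int) -> float:
    """Approximate audio duration in hours for num_chars characters."""
    return num_chars / CHARS_PER_AUDIO_HOUR


if __name__ == "__main__":
    chars = 2_000_000  # e.g. a long audiobook script
    for version in PRICE_PER_MILLION_USD:
        print(version, estimate_cost_usd(chars, version), "USD")
    print("approx. audio hours:", estimate_audio_hours(chars))
```

Actual audio length per character varies with language, voice, and talking speed, so the per-hour figure is a rule of thumb rather than a billing unit.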

Research

Cutting-edge research
Publications
Explore our latest research advancing the state of the art in speech synthesis, voice cloning, and real-time TTS
Training code available
Open source
We’ve open-sourced the full training framework behind Inworld TTS-1, from codec to SpeechLM fine-tuning, so you can build your own high-quality TTS models faster.

Integrations

Try Inworld TTS now

Test out zero-shot voice cloning, audio markups, and much more in our TTS Playground.

FAQ

How do I use text-to-speech?
Getting started is simple. You can try Inworld TTS instantly in the TTS Playground, where you can test voices, adjust settings, and experiment with features like audio markups and instant voice cloning.
When you’re ready to integrate TTS into your application, you can follow the Developer Quickstart to make your first API request in minutes. Just create an API key in the Inworld Portal, then synthesize speech with a single POST request. Inworld supports multiple output formats, including MP3, Linear PCM (WAV), and Opus, making it easy to integrate with almost any system.
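
As a rough illustration of the "single POST request" described above, the following sketch only builds the request object and does not send it. The endpoint URL is a placeholder, and the `text` field name and the `Bearer` authorization scheme are assumptions made for this example; take the real endpoint, field names, and auth scheme from the Developer Quickstart. Only `voiceId` is an identifier named elsewhere in this FAQ.

```python
import json
import urllib.request

# Placeholder only: replace with the endpoint given in the Developer Quickstart.
PLACEHOLDER_ENDPOINT = "https://example.invalid/tts"


def build_tts_request(api_key: str, text: str, voice_id: str,
                      endpoint: str = PLACEHOLDER_ENDPOINT) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for speech synthesis."""
    payload = {
        "text": text,         # assumed field name
        "voiceId": voice_id,  # identifier named in this FAQ
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_tts_request("YOUR_API_KEY", "Hello from Inworld TTS.", "my-voice")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Once the endpoint and credentials are real, the request could be sent with `urllib.request.urlopen(req)`; how the audio is returned (for example, MP3, Linear PCM, or Opus) depends on the API and the output format you request.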
How do Inworld's models perform on public benchmarks?
Inworld TTS Max launched as the #1 model on the Artificial Analysis TTS Arena and is also consistently ranked #1 on the Hugging Face TTS Arena.
What is the latency and time-to-first-byte (TTFB) of Inworld TTS?
Inworld TTS is optimized for real-time conversational AI and typically achieves sub-250 ms latency. For the fastest real-time performance, we recommend using the Inworld TTS model over WebSockets.
Does Inworld offer voice cloning?
Yes. Inworld provides two types of voice cloning:

Instant (zero-shot) voice cloning

  • Available to all users in the Portal
  • Creates a custom voice from just 5-15 seconds of audio
  • Ready to use in minutes

Professional voice cloning

  • Fine-tuned using 30+ minutes of clean audio (minimum ~5 minutes, 20+ minutes recommended for best results)
  • Recommended for uncommon voice types such as children's voices or unique accents, where instant cloning may not perform well
  • Currently available by contacting the Inworld sales team
Both methods let you use your cloned voice directly in the Playground or via the API using its unique voiceId.
Which languages does Inworld TTS support?
Inworld’s TTS models currently support 12 languages: English (en), Spanish (es), French (fr), Korean (ko), Dutch (nl), Chinese (zh), German (de), Italian (it), Japanese (ja), Polish (pl), Portuguese (pt), and Russian (ru). Hindi, Arabic, and Hebrew are coming soon.
For multilingual applications, we recommend using Inworld TTS Max for improved pronunciation, intonation, and naturalness across languages.
Can I control emotion, speed, and other voice characteristics?
Absolutely. Inworld TTS provides several ways to customize how the speech sounds:

Voice parameters

  • Temperature: Controls expressiveness and randomness
  • Talking speed: 0.5× to 1.5× of the native speaking rate
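
A small client-side check can keep the talking speed inside the documented 0.5× to 1.5× range before a request is sent. This is a minimal sketch with a hypothetical helper name; the source does not say whether the API rejects or clamps out-of-range values, so clamping here is just one possible choice.

```python
# Documented range for talking speed, relative to the native speaking rate.
MIN_SPEED = 0.5
MAX_SPEED = 1.5


def clamp_talking_speed(speed: float) -> float:
    """Return speed limited to the documented 0.5x to 1.5x range."""
    return max(MIN_SPEED, min(MAX_SPEED, speed))


if __name__ == "__main__":
    for requested in (0.25, 1.0, 2.0):
        print(requested, "->", clamp_talking_speed(requested))
```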

Audio Markups (Experimental)

Supported in English only:
  • Emotions: [happy], [sad], [angry], [surprised], [fearful], [disgusted]
  • Delivery styles: [laughing], [whispering]
  • Non-verbal sounds: [breathe], [clear_throat], [cough], [laugh], [sigh], [yawn]
We recommend placing audio markups at the beginning of the text and using only one per request for the most consistent output.
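
The recommendation above (one markup, placed at the start of the text) can be enforced with a small helper. This is a sketch under stated assumptions: the helper name is hypothetical, the tag lists are copied from this FAQ, and the check for an already-present tag is a simple pattern match rather than anything the API itself performs.

```python
import re

# Tag names copied from the lists above (English only, experimental).
EMOTIONS = {"happy", "sad", "angry", "surprised", "fearful", "disgusted"}
DELIVERY_STYLES = {"laughing", "whispering"}
NON_VERBALS = {"breathe", "clear_throat", "cough", "laugh", "sigh", "yawn"}
ALL_MARKUPS = EMOTIONS | DELIVERY_STYLES | NON_VERBALS

_TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def with_markup(text: str, markup: str) -> str:
    """Prefix text with a single audio markup, following the one-per-request advice."""
    if markup not in ALL_MARKUPS:
        raise ValueError(f"unknown audio markup: {markup!r}")
    # Refuse to stack markups: the text must not already contain a known tag.
    if any(tag in ALL_MARKUPS for tag in _TAG_PATTERN.findall(text)):
        raise ValueError("text already contains an audio markup")
    return f"[{markup}] {text.strip()}"


if __name__ == "__main__":
    print(with_markup("I can't believe it!", "surprised"))
```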

Does Inworld support lipsync, word highlighting, or timestamp alignment?
Yes. Inworld TTS supports timestamp alignment for both word-level and character-level synchronization. This can be helpful for subtitles, captions, lipsync, and more.
You can enable it in your API request by setting timestampType to WORD or CHARACTER.
The API response includes:
  • word or character tokens
  • start and end timestamps (in seconds)
  • structured alignment data matching the generated audio
Timestamp alignment currently supports English, with other languages available experimentally. Note that enabling timestamps currently adds roughly 100 ms of additional latency.
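
To show how word-level timestamps might feed subtitles or captions, the sketch below turns a list of timed words into SRT cues. The input shape (a list of `(word, start_seconds, end_seconds)` tuples) and the function names are assumptions for illustration; the actual structure of the alignment data in the API response may differ, so adapt the parsing step accordingly.

```python
from typing import List, Tuple

# Assumed shape for illustration: (word, start_seconds, end_seconds).
WordTiming = Tuple[str, float, float]


def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[WordTiming], max_words: int = 7) -> str:
    """Group timed words into caption cues of at most max_words words each."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(word for word, _, _ in chunk)
        cues.append(
            f"{len(cues) + 1}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{text}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    demo = [("Hello", 0.0, 0.4), ("world", 0.5, 0.9)]
    print(words_to_srt(demo))
```

The same timed words could drive word highlighting by scheduling each word at its start time instead of grouping words into cues.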
Copyright © 2021-2025 Inworld AI