































































Stream audio chunks back as the model generates them. Sub-200ms first-chunk latency keeps the conversation feeling natural.
curl -X POST https://api.inworld.ai/tts/v1/voice:stream \
  -H "Authorization: Basic $INWORLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hi! What can I help you with today?",
    "voice_id": "Clive",
    "model_id": "inworld-tts-2",
    "audio_config": {
      "audio_encoding": "OGG_OPUS",
      "sample_rate_hertz": 16000
    }
  }'
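The same request can be sketched in Python with only the standard library. The URL, headers, and JSON fields are copied from the curl example. The helper names `build_stream_request` and `iter_response_bytes` are illustrative, not part of any Inworld SDK. The page does not show how the streamed response is framed, so this sketch only yields raw bytes and leaves decoding to the caller.

```python
import json
import urllib.request

STREAM_URL = "https://api.inworld.ai/tts/v1/voice:stream"


def build_stream_request(text, api_key, voice_id="Clive", model_id="inworld-tts-2"):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "text": text,
        "voice_id": voice_id,
        "model_id": model_id,
        "audio_config": {
            "audio_encoding": "OGG_OPUS",
            "sample_rate_hertz": 16000,
        },
    }
    return urllib.request.Request(
        STREAM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def iter_response_bytes(request, chunk_size=4096):
    """Send the request and yield raw response bytes as they arrive.

    Assumption: the response framing is not documented on this page,
    so the bytes are passed through without parsing.
    """
    with urllib.request.urlopen(request) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

To try it, pass the value of `INWORLD_API_KEY` to `build_stream_request` and loop over `iter_response_bytes(request)`. Reading in small chunks lets playback or decoding start before the full response has arrived.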
Three of the top five models on Artificial Analysis are Inworld's, ranked by blind tests from thousands of real users rather than internal evals. Realtime TTS 1.5 Max delivers over 30% more expressiveness than previous models, with optimized stability to eliminate hallucinations and artifacts.


Create a custom voice from 15 seconds of audio, then localize it to speak over 100 languages as a native speaker — same identity, no accent carryover. Production-ready voices you can use in the Playground or via API.
Skip recording entirely. Describe accent, age, tone, and energy in natural language, and Inworld renders a production-ready voice on the fly. Pick a preset on the card to hear how a single sentence becomes a finished voice.


Built for realtime from the ground up: audio streams over WebSocket the instant it's synthesized, with no buffering delay. Latency is comparable to competitors at a fraction of the cost.
Add bracketed instructions anywhere in your text and Realtime TTS-2 adjusts the utterance. Pair with various non-verbals and adjustable pauses for delivery that matches the moment, not just the words.


English, Spanish, French, Korean, Chinese, Hindi, Japanese, German, and more. Native-speaker quality in every language with cross-lingual cloning. Deploy globally without separate pipelines.
Realtime TTS 1.5 Mini at $15/million characters. Realtime TTS 1.5 Max and Realtime TTS-2 at $25/million. Comparable providers charge $120/million ($0.12/min) — Inworld is up to 87% cheaper at scale.
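The savings figure can be checked with a short calculation. This is a sketch that uses only the per-million-character prices quoted above; the $120/million baseline is the page's own comparison figure, not an independently verified one, and the function names are illustrative.

```python
# Prices in USD per million characters, as quoted on this page.
PRICE_PER_MILLION_USD = {
    "Realtime TTS 1.5 Mini": 15.0,
    "Realtime TTS 1.5 Max": 25.0,
    "Realtime TTS-2": 25.0,
}
COMPARABLE_PRICE_USD = 120.0  # baseline quoted for comparable providers


def cost_usd(characters, price_per_million):
    """Cost of synthesizing a given number of characters."""
    return characters / 1_000_000 * price_per_million


def savings_percent(price, baseline=COMPARABLE_PRICE_USD):
    """Percentage by which a price undercuts the baseline."""
    return (baseline - price) / baseline * 100


for model, price in PRICE_PER_MILLION_USD.items():
    print(f"{model}: {savings_percent(price):.1f}% below ${COMPARABLE_PRICE_USD:.0f}/million")
```

With these inputs the Mini tier comes out 87.5% below the baseline, which matches the "up to 87%" claim, while the $25 tiers come out roughly 79% below it.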
