Inworld TTS API

The #1 ranked TTS API. Sub-200ms latency. Ship voice features in minutes, not weeks. From prototype to production on one API.

Realtime API response times

Audio chunks arrive before users notice a delay. REST, streaming, and WebSocket endpoints built for speed.
First-chunk delivery
Max returns first audio in <250ms at P90, Mini in <130ms, a 4x improvement over TTS-1. Your slowest API calls still feel instant to end users.
Median latency
<200ms median first-chunk latency for Max, <100ms for Mini, roughly 4x faster than TTS-1. Voice agents respond without a perceptible pause.
WebSocket streaming
Persistent bidirectional connections for realtime synthesis. Audio streams as it's generated — no buffering, no polling. Ideal for LLM-powered voice agents.
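As a rough illustration of the flow, here is a minimal Python sketch of streaming LLM output into speech over a WebSocket. The endpoint URL, auth header, and message fields below are placeholders rather than Inworld's documented protocol; copy the real values from the WebSocket docs.

```python
# Hypothetical sketch of LLM-to-voice streaming over a WebSocket. The URL,
# auth header, and message fields are placeholders, not Inworld's documented
# schema -- see the WebSocket docs for the real protocol.
import asyncio
import base64
import json

import websockets  # pip install websockets

WS_URL = "wss://api.example.com/tts/v1/stream"      # placeholder endpoint
API_KEY = "YOUR_API_KEY"

async def synthesize(text_chunks):
    audio = bytearray()
    async with websockets.connect(
        WS_URL,
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
        # on websockets < 14, pass extra_headers= instead
    ) as ws:
        # Send text incrementally, e.g. as an LLM emits tokens.
        for chunk in text_chunks:
            await ws.send(json.dumps({"text": chunk, "voiceId": "example-voice"}))
        await ws.send(json.dumps({"flush": True}))   # placeholder end-of-input signal

        # Audio chunks arrive as they are generated -- no polling, no buffering.
        async for message in ws:
            event = json.loads(message)
            if "audio" in event:
                audio.extend(base64.b64decode(event["audio"]))
            if event.get("final"):
                break
    return bytes(audio)

audio = asyncio.run(synthesize(["Hello there, ", "how can I help today?"]))
```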

Top-ranked voice quality

Expressive, stable output that keeps users listening, validated by thousands of blind tests.
#1 on public benchmarks. Truly expressive.
#1 on the Artificial Analysis TTS Arena, scored by thousands of listeners in blind comparisons. TTS-1.5 delivers 30%+ more expressiveness than the previous generation.
Low error rate
40% fewer word errors than TTS-1. Fewer hallucinations, fewer cutoffs, fewer audio artifacts in production. Less post-processing, fewer edge cases to handle in your code.
Voice cloning via API
Clone any voice with a single API call. Pass 15 seconds of reference audio to get a unique voiceId, then use it in any TTS request. Professional fine-tuning available for maximum fidelity.
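A minimal sketch of the instant-cloning call in Python, assuming a hypothetical voices endpoint and field names (the real URL and payload live in the API reference):

```python
# Hypothetical sketch of instant voice cloning. The URL and field names are
# placeholders, not Inworld's documented schema -- see the API reference.
import requests

API_KEY = "YOUR_API_KEY"

# Upload ~15 seconds of reference audio; the response carries a voiceId.
with open("reference.wav", "rb") as f:
    resp = requests.post(
        "https://api.example.com/tts/v1/voices",    # placeholder endpoint
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"audio": f},
        data={"displayName": "My cloned voice"},
    )
resp.raise_for_status()
voice_id = resp.json()["voiceId"]

# The returned voiceId can now be passed in any TTS request,
# exactly like a stock voice.
print(voice_id)
```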

Built for massive scale

Half a cent per minute of interaction. Enhanced multilingual support. On-prem deployment options.
Enhanced multilingual
English, Spanish, French, Korean, Chinese, Hindi, Japanese, German, and more. Native-quality output in every language. Deploy globally without separate pipelines.
[Image: multilingual support UI with a dropdown to select languages such as Korean, Spanish, and German]
On-prem deployment
Run high-quality text-to-speech models locally — without sending text or audio data to the cloud. Built for enterprises that require strict data control, low latency, and compliance with internal or regulatory standards.
Half a cent per minute
$5/million characters for Mini, $10 for Max. Compared to $120+ elsewhere. At half a cent per minute, realtime voice can be always-on, not rate-limited by cost.
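A quick back-of-envelope check of those per-minute figures, assuming roughly 1,000 characters of text per minute of generated speech (an approximation of typical speaking rate, not a published number):

```python
# Back-of-envelope cost check, assuming roughly 1,000 characters of text
# per minute of generated speech (an approximation, not a published figure).
CHARS_PER_MINUTE = 1_000
PRICE_PER_MILLION = {"TTS-1.5 Mini": 5.00, "TTS-1.5 Max": 10.00}  # USD

for model, price in PRICE_PER_MILLION.items():
    cost_per_minute = price / 1_000_000 * CHARS_PER_MINUTE
    print(f"{model}: ${cost_per_minute:.3f} per minute of audio")
# TTS-1.5 Mini: $0.005 per minute of audio  (half a cent)
# TTS-1.5 Max:  $0.010 per minute of audio  (one cent)
```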

Full breakdown

Feature | TTS-1.5 Max | TTS-1.5 Mini
Best for | Most applications | Latency-critical applications
Pricing | 1¢ per min ($10/million char) | 0.5¢ per min ($5/million char)
P90 latency | <250ms | <130ms
Quality | #1 ranked, maximum stability | #1 ranked
Multilingual | 15 languages | 15 languages
Voice cloning | Professional voice cloning |
Character, word, viseme, and phoneme timestamps | |
Custom pronunciation | |
On-premise | |
We recommend TTS-1.5 Max for most use cases. The enhanced stability and quality are worth the marginal latency tradeoff for the vast majority of applications.

Try Inworld TTS now

Get started with TTS-1.5 Max, the best balance of quality and speed for most applications.

FAQs

What is a TTS API?
A TTS (text-to-speech) API converts written text into spoken audio via HTTP requests. You send text to an endpoint and receive audio back as MP3, WAV, or streaming chunks. TTS APIs power voice assistants, audiobook generation, accessibility features, video narration, and AI voice agents. Modern APIs like Inworld offer realtime streaming, instant voice cloning, and support for 15 languages.
What is the best TTS API?
Inworld TTS is ranked #1 on the Artificial Analysis TTS Arena, a public leaderboard based on blind listening tests by thousands of real users. It combines top-ranked voice quality with sub-200ms latency and pricing at $5-10 per million characters. For production applications, TTS-1.5 Max offers the best balance of quality and speed.
What is the best alternative to ElevenLabs API?
Inworld TTS is ranked #1 on Artificial Analysis, ahead of ElevenLabs. TTS-1.5 Mini delivers <130ms P90 latency at $5/million characters, while TTS-1.5 Max offers <250ms at $10/million characters. Inworld also provides WebSocket streaming for realtime voice agents, on-premise deployment, and instant voice cloning from 5-15 seconds of audio.
How do I use a TTS API / integrate it into my app?
Make a POST request to the TTS endpoint with your text, voice ID, and model. You'll receive audio back as base64-encoded data. For realtime apps, use WebSocket streaming to receive audio chunks as text is generated. Works with Python, Node.js, curl, or any HTTP client. Generate an API key and check the Developer Quickstart for code examples.
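For example, a basic synthesis call in Python might look like the sketch below; the endpoint, model name, and response field are placeholders to be swapped for the values in the Developer Quickstart.

```python
# Hypothetical sketch of a basic synthesis request. The endpoint, header,
# and field names are placeholders -- copy the real ones from the Developer
# Quickstart after generating an API key.
import base64
import requests

API_KEY = "YOUR_API_KEY"

resp = requests.post(
    "https://api.example.com/tts/v1/synthesize",    # placeholder endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "text": "Ship voice features in minutes.",
        "voiceId": "example-voice",                 # placeholder voice ID
        "modelId": "inworld-tts-1.5-max",           # placeholder model name
    },
    timeout=30,
)
resp.raise_for_status()

# The audio comes back base64-encoded; decode and write it to a file.
audio_bytes = base64.b64decode(resp.json()["audioContent"])
with open("speech.mp3", "wb") as out:
    out.write(audio_bytes)
```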
What is the best low-latency TTS API?
Inworld TTS-1.5 Mini delivers <130ms P90 first-chunk latency. TTS-1.5 Max offers <250ms with enhanced stability. Both support native WebSocket streaming where audio generates instantly with no buffering delay. This makes Inworld ideal for voice agents, gaming, and any app where response time is critical.
Which TTS API has the most natural-sounding voices?
Inworld TTS-1.5 is ranked #1 for voice quality on Artificial Analysis, based on blind tests by thousands of real users. TTS-1.5 delivers 30%+ more expressiveness and 40% lower word error rate compared to TTS-1. It's optimized to minimize hallucinations, word cutoffs, and audio artifacts. Try it in the Playground.
Can I clone voices via a TTS API?
Yes. Inworld offers instant voice cloning: upload 5-15 seconds of audio and get a unique voice ID to use in TTS requests immediately. For maximum fidelity, professional voice cloning uses 30+ minutes of clean audio (minimum 5 minutes, 20+ minutes recommended). Contact sales for professional voice cloning.
What languages does the TTS API support?
Inworld TTS supports 15 languages: English, Spanish, French, Korean, Dutch, Chinese, German, Italian, Japanese, Polish, Portuguese, Russian, Hindi, Arabic, and Hebrew. Use the List Voices endpoint to filter available voices by language. TTS-1.5 Max is recommended for multilingual applications due to superior pronunciation and intonation.
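A hedged sketch of filtering voices by language with the List Voices endpoint; the URL, query parameter, and response fields are assumptions, not the documented schema:

```python
# Hypothetical sketch of filtering voices by language. The endpoint, query
# parameter, and response fields are placeholders for the real List Voices API.
import requests

API_KEY = "YOUR_API_KEY"

resp = requests.get(
    "https://api.example.com/tts/v1/voices",        # placeholder endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"language": "ko-KR"},                   # e.g. Korean voices only
    timeout=30,
)
resp.raise_for_status()

for voice in resp.json().get("voices", []):
    print(voice["voiceId"], voice.get("language"))
```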
What's the difference between REST, streaming, and WebSocket TTS APIs?
REST returns complete audio after all text is processed, which is best for batch jobs and short text. HTTP streaming returns audio chunks progressively for faster playback start. WebSocket maintains a persistent bidirectional connection, ideal for voice agents where text is generated incrementally by an LLM. Inworld supports all three from a single API.
Why is my latency higher than sub-200ms?
Our published latency figures are P90 on-server inference time, which measures how long our models take to generate the first audio chunk once the request reaches our servers:
  • TTS-1.5 Max: <250ms P90, ~200ms median
  • TTS-1.5 Mini: <130ms P90, ~100ms median
Your end-to-end latency also includes network round-trip time between your application and our servers (see the measurement sketch below). To reduce it:
  • Use WebSocket streaming to avoid repeated connection overhead (Python, JavaScript)
  • Choose the server region closest to your users
  • Consider on-premise deployment for latency-critical applications
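If you want to see where the time goes, a rough Python sketch like the one below measures end-to-end time to first audio chunk from your own network location (the streaming endpoint and payload fields are placeholders):

```python
# Rough sketch for measuring end-to-end time to first audio chunk from your
# own network location. The streaming endpoint and payload are placeholders.
import time
import requests

API_KEY = "YOUR_API_KEY"

start = time.perf_counter()
with requests.post(
    "https://api.example.com/tts/v1/synthesize-stream",  # placeholder URL
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": "Latency check.", "voiceId": "example-voice"},
    stream=True,
    timeout=30,
) as resp:
    resp.raise_for_status()
    first_chunk = next(resp.iter_content(chunk_size=4096))
elapsed_ms = (time.perf_counter() - start) * 1000

# Anything above the published P90 figures is network + TLS overhead on top
# of on-server inference time.
print(f"Time to first audio chunk: {elapsed_ms:.0f} ms ({len(first_chunk)} bytes)")
```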