Realtime TTS API

The #1 ranked TTS API. Sub-200ms latency. Ship voice features in minutes, not weeks. From prototype to production on one API.

Realtime API response times

Audio chunks arrive before users notice a delay. REST, streaming, and WebSocket endpoints built for speed.
First-chunk delivery
Max returns first audio in <250ms at P90, Mini in <130ms. 4x improvement over Realtime TTS 1. Your slowest API calls still feel instant to end users.
Median Latency
<200ms median first chunk latency for Max, <100ms for Mini. ~4x faster than Realtime TTS 1. Voice agents respond before users notice a delay.
WebSocket streaming
Persistent bidirectional connections for realtime synthesis. Audio streams as it's generated — no buffering, no polling. Ideal for LLM-powered voice agents.

Top-ranked voice quality

Expressive, stable output that keeps users listening, validated by thousands of blind tests.
#1 on public benchmarks. Truly expressive.
#1 on the Artificial Analysis TTS Arena, scored by thousands of listeners in blind comparisons. Realtime TTS 1.5 delivers 30%+ more expressiveness than the previous generation.
Low error rate
40% fewer word errors than Realtime TTS 1. Fewer hallucinations, fewer cutoffs, fewer audio artifacts in production. Less post-processing, fewer edge cases to handle in your code.
Voice cloning via API
Clone any voice with a single API call. Pass 15 seconds of reference audio to get a unique voiceId, then use it in any TTS request. Professional fine-tuning available for maximum fidelity.

Built for massive scale

Voice AI at $0.01/min. Enhanced multilingual support. On-prem deployment options.
Enhanced multilingual
English, Spanish, French, Korean, Chinese, Hindi, Japanese, German, and more. Native-quality output in every language. Deploy globally without separate pipelines.
On-prem deployment
Run high-quality text-to-speech models locally — without sending text or audio data to the cloud. Built for enterprises that require strict data control, low latency, and compliance with internal or regulatory standards.
Voice AI at $0.01/min
$15/1M characters for Mini, $25 for Max. 87% cheaper than other providers. At $0.01/min, realtime voice can be always-on, not rate-limited by cost.
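The per-character prices above lend themselves to a quick cost estimate. The following is a minimal sketch that uses only the $15 and $25 per 1M characters figures quoted on this page; the function name and the "mini"/"max" keys are illustrative labels, not API identifiers.

```python
# Per-1M-character prices as quoted on this page (USD).
# The "mini" / "max" keys are illustrative labels, not official model IDs.
PRICE_PER_MILLION_CHARS_USD = {"mini": 15.0, "max": 25.0}


def estimate_cost_usd(num_characters: int, model: str) -> float:
    """Estimate synthesis cost in USD for a given number of input characters."""
    if num_characters < 0:
        raise ValueError("num_characters must be non-negative")
    if model not in PRICE_PER_MILLION_CHARS_USD:
        raise ValueError(f"unknown model label: {model!r}")
    # Multiply first, then divide, to keep round numbers exact in floating point.
    return num_characters * PRICE_PER_MILLION_CHARS_USD[model] / 1_000_000


if __name__ == "__main__":
    for label in ("mini", "max"):
        print(label, estimate_cost_usd(200_000, label))
```

Actual billing may round or meter differently, so treat the result as a rough planning number.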

Full breakdown

Best for
  Realtime TTS 1.5 Max: Most applications
  Realtime TTS 1.5 Mini: Latency-critical applications
Pricing
  Realtime TTS 1.5 Max: $0.025/min ($25/1M characters)
  Realtime TTS 1.5 Mini: $0.01/min ($15/1M characters)
P90 Latency
  Realtime TTS 1.5 Max: <250ms
  Realtime TTS 1.5 Mini: <130ms
Quality
  Realtime TTS 1.5 Max: #1 ranked, maximum stability
  Realtime TTS 1.5 Mini: #1 ranked
Multilingual
  Realtime TTS 1.5 Max: 15 languages
  Realtime TTS 1.5 Mini: 15 languages
Listed for both Realtime TTS 1.5 Max and Realtime TTS 1.5 Mini
  Voice cloning
  Professional voice cloning
  Character, word, viseme and phoneme timestamps
  Custom pronunciation
  On-Premise
We recommend Realtime TTS 1.5 Max for most use cases. The enhanced stability and quality are worth the marginal latency tradeoff for the vast majority of applications.

Try Realtime TTS now

Get started with Realtime TTS 1.5 Max, the best balance of quality and speed for most applications.

FAQs

A TTS (text-to-speech) API converts written text into spoken audio via HTTP requests. You send text to an endpoint and receive audio back as MP3, WAV, or streaming chunks. TTS APIs power voice assistants, audiobook generation, accessibility features, video narration, and AI voice agents. Modern APIs like Inworld offer realtime streaming, instant voice cloning, and support for over 100 languages with Realtime TTS-2.
Realtime TTS is ranked #1 on the Artificial Analysis TTS Arena, a public leaderboard based on blind listening tests by thousands of real users. It combines top-ranked voice quality with sub-200ms latency and pricing at $15-25 per million characters. For production applications, Realtime TTS 1.5 Max offers the best balance of quality and speed.
Realtime TTS is ranked #1 on Artificial Analysis, ahead of ElevenLabs. Realtime TTS 1.5 Mini delivers <130ms P90 latency at $15/1M characters, while Realtime TTS 1.5 Max offers <250ms at $25/1M characters. Inworld also provides WebSocket streaming for realtime voice agents, on-premise deployment, and instant voice cloning from 5-15 seconds of audio.
Make a POST request to the TTS endpoint with your text, voice ID, and model. You'll receive audio back as base64-encoded data. For realtime apps, use WebSocket streaming to receive audio chunks as text is generated. Works with Python, Node.js, curl, or any HTTP client. Generate an API key and check the Developer Quickstart for code examples.
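As a rough illustration of that request/response shape, the sketch below builds a JSON body and decodes a base64 audio field. It runs offline against a fake response. The field names "text", "modelId", and "audioContent" are assumptions made for illustration ("voiceId" is the only name taken from this page), and no endpoint URL or authentication scheme is shown; the Developer Quickstart remains the source for the real schema.

```python
import base64
import json


def build_tts_payload(text: str, voice_id: str, model_id: str) -> str:
    """Serialize a request body. Field names other than voiceId are hypothetical."""
    if not text:
        raise ValueError("text must not be empty")
    return json.dumps({"text": text, "voiceId": voice_id, "modelId": model_id})


def decode_audio(response_body: str, audio_field: str = "audioContent") -> bytes:
    """Extract and decode base64 audio from a JSON response body.

    audio_field is an assumed name; pass the real one once you know it.
    """
    data = json.loads(response_body)
    return base64.b64decode(data[audio_field])


if __name__ == "__main__":
    # Offline demonstration: a stand-in response instead of a real API call.
    fake_audio = b"RIFF\x00\x00\x00\x00WAVE"
    fake_response = json.dumps(
        {"audioContent": base64.b64encode(fake_audio).decode("ascii")}
    )
    print(build_tts_payload("Hello there.", "example-voice", "example-model"))
    print(len(decode_audio(fake_response)), "bytes decoded")
```

In a real integration the payload would be sent with any HTTP client and the decoded bytes written to a file or an audio player.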
Inworld Realtime TTS 1.5 Mini delivers <130ms P90 first-chunk latency. Realtime TTS 1.5 Max offers <250ms with enhanced stability. Both support native WebSocket streaming where audio generates instantly with no buffering delay. This makes Inworld ideal for voice agents, gaming, and any app where response time is critical.
Realtime TTS 1.5 is ranked #1 for voice quality on Artificial Analysis, based on blind tests by thousands of real users. Realtime TTS 1.5 delivers 30%+ more expressiveness and 40% lower word error rate compared to Realtime TTS 1. It's optimized to minimize hallucinations, word cutoffs, and audio artifacts. Try it in the Playground.
Yes. Inworld offers instant voice cloning: upload 5-15 seconds of audio and get a unique voice ID to use in TTS requests immediately. For maximum fidelity, professional voice cloning uses 30+ minutes of clean audio (minimum 5 minutes, 20+ minutes recommended). Contact sales for professional voice cloning.
Realtime TTS-2 supports over 100 languages with one voice identity preserved across every language and mid-utterance switching inside a single generation. Realtime TTS 1.5 supports 15 languages: English, Spanish, French, Korean, Dutch, Chinese, German, Italian, Japanese, Polish, Portuguese, Russian, Hindi, Arabic, and Hebrew. Use the List Voices endpoint to filter available voices by language.
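If you have already fetched a voice list, narrowing it by language on the client can look like the sketch below. The response shape (a list of dicts with "voiceId" and "languages" keys) is an assumption for illustration and may not match what the List Voices endpoint actually returns.

```python
from typing import Dict, List


def voices_for_language(voices: List[Dict], language_code: str) -> List[Dict]:
    """Return voices whose (assumed) "languages" list contains language_code.

    Matching is case-insensitive; voices without a "languages" key are skipped.
    """
    wanted = language_code.lower()
    return [
        voice
        for voice in voices
        if wanted in (code.lower() for code in voice.get("languages", []))
    ]


if __name__ == "__main__":
    # Made-up sample data, not real voices.
    sample = [
        {"voiceId": "voice-a", "languages": ["en", "es"]},
        {"voiceId": "voice-b", "languages": ["ko"]},
        {"voiceId": "voice-c"},
    ]
    print([v["voiceId"] for v in voices_for_language(sample, "EN")])
```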
REST returns complete audio after all text is processed, which is best for batch jobs and short text. HTTP streaming returns audio chunks progressively for faster playback start. WebSocket maintains a persistent bidirectional connection, ideal for voice agents where text is generated incrementally by an LLM. Inworld supports all three from a single API.
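When text arrives incrementally from an LLM, one common client-side pattern is to group tokens into sentence-sized segments before forwarding them over a persistent connection. The sketch below shows only that buffering step. It is a generic technique, not a documented Inworld requirement, and the punctuation-based split rule is a deliberately simple assumption.

```python
from typing import Iterable, Iterator

# Simplistic sentence boundary rule; real text may need smarter handling
# (abbreviations, decimals, other scripts).
SENTENCE_ENDINGS = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed tokens and yield text each time a sentence ends."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that never reached a sentence boundary.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    llm_tokens = ["Hello", " there", ".", " How", " are", " you", "?", " Bye"]
    for segment in sentences_from_tokens(llm_tokens):
        print(segment)
```

Each yielded segment would then be handed to whichever transport you use; smaller segments start playback sooner, while larger ones give the model more context per request.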
Our published latency figures are P90 on-server inference time, which measures how long our models take to generate the first audio chunk once the request reaches our servers:
  • Realtime TTS 1.5 Max: <250ms P90, ~200ms median
  • Realtime TTS 1.5 Mini: <130ms P90, ~100ms median
Your end-to-end latency also includes network round-trip time between your application and our servers. To reduce it:
  • Use WebSocket streaming to avoid repeated connection overhead (examples are available for Python and JavaScript)
  • Choose the server region closest to your users
  • Consider on-premise deployment for latency-critical applications
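Because the published figures cover on-server time only, it can help to measure first-chunk latency from your own client. The following is a minimal sketch that times any iterable of audio chunks; the fake_stream generator is a stand-in for a real streaming response and simulates the delay with a sleep.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[Optional[float], int]:
    """Consume a chunk stream; return (seconds until first chunk, total bytes).

    The first element is None if the stream yields nothing.
    """
    start = time.perf_counter()
    first_chunk_seconds: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_seconds is None:
            first_chunk_seconds = time.perf_counter() - start
        total_bytes += len(chunk)
    return first_chunk_seconds, total_bytes


def fake_stream(delay_seconds: float = 0.05) -> Iterator[bytes]:
    """Stand-in for a real audio stream: waits, then yields two silent chunks."""
    time.sleep(delay_seconds)
    yield b"\x00" * 320
    yield b"\x00" * 320


if __name__ == "__main__":
    latency, size = time_to_first_chunk(fake_stream())
    print(f"first chunk after {latency * 1000:.0f} ms, {size} bytes total")
```

Start the timer when the request is sent rather than when the connection opens if you want the number to include connection setup.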
Copyright © 2021-2026 Inworld AI