Research preview · May 5, 2026
A new frontier voice model that feels as human as it sounds.
Realtime TTS-2 from Inworld AI is a new generation of voice model built for realtime conversation. It hears the full audio of the exchange, picks up the user's tone, pacing and emotional state, then takes voice direction in plain English the way developers prompt an LLM. It holds one voice identity across over 100 languages. Available today via the Inworld API and the Inworld Realtime API as a research preview.
What launch partners and customers are saying

“Inworld's TTS-2 marks a real step forward in emotionally expressive voice synthesis. When combined with the conversational intelligence of LiveKit agents, it enables interactions that feel genuinely human — responsive, nuanced, and alive in ways that feel natural.”
David Zhao · Co-Founder & CTO, LiveKit

“I've never seen steering work like this before TTS-2. The output is extremely natural and faithful to the steering prompt, even when it's hyper-specific. The biggest battle you fight with TTS is feeling bland, stale, and robotic — this level of steering unlocks a whole new axis to keep the experience fresh.”
Creston Brooks · Co-founder & CTO, Luvu

“We've always believed language learning should have no borders. TTS 2.0 just made that a lot more real.”
Dimitri Dekanozishvili · Co-founder, Talkpal

“We've been chasing the uncanny valley of voice AI for years — Inworld is finally closing the gap between 'impressive' and 'actually believable' with TTS 2.0. When your character speaks and you forget it's AI, that's when the story becomes real.”
Louis Muk · CEO, Isekai Zero

“We've had early access to Inworld TTS-2 for a few days and we're all blown away. The expressiveness, language steering and multi-lingual support are genuinely impressive. The subtle details like natural pausing make it hard to differentiate between AI and human.”
Nash Ramdial · Developer Relations, Stream

“Inworld just made voice AI feel genuinely human across 100+ languages. Partnering with them means we can help bring that experience to kids around the world, safely and compliantly.”
Kieran Donovan · CEO, k-ID

“AI Native games need characters you can deeply connect with. Voice models that offer full control and emotional complexity to make characters feel real is one of the biggest pieces missing. TTS 2 is a significant advance in helping make that future a reality.”
Nick Walton · CEO, Latitude

“Inworld was already at the top of the Artificial Analysis TTS Arena and Realtime TTS-2 pushes further on a dimension VoiceRun customers care about: directability. Style, pacing, emphasis, emotion, and delivery can be shaped in ways that matter for real enterprise deployments.”
VoiceRun team
Realtime TTS 1.5 already ranks #1 on the Artificial Analysis Speech Arena, ahead of Google and ElevenLabs. Quality is solved. So we asked the next question: what does voice AI sound like when it is built for the way humans actually talk to each other? Realtime, mutual, alive to the moment.
Voice AI was shaped by the static stuff: audiobooks, narration, voiceover. A sentence in, audio out, the model never hearing the person on the other end.
Realtime TTS-2 is built from the ground up for realtime conversation. It listens to the prior turns of the exchange, so your tone and pacing carry forward. It takes voice direction in plain English, so you steer the read the way a director would. It holds one voice identity across over 100 languages, so the speaker stays the same person mid-switch. And Advanced Voice Design lets you build a saved voice from prose. Four capabilities that work together, in one model, on the same realtime connection.
Hear it now · 4 scenes
Tired user · 11pm
A quieter, slower delivery for someone winding down at the end of the day.
Frustrated caller
Softer pace, careful phrasing. The model hears the upset and lowers the energy.
Crosslingual · EN → ES → JA
Three languages inside one generation. Same speaker, same person on the other end.
Voice direction · whisper
One prose direction reshapes the read. [whispering]
Capability 01
Available via REST + Realtime API
What it is. A natural-language description of how a line should be delivered, passed inline at the start of your text. Not a fixed list of preset emotions. Not a slider. Write the prompt the way you'd write a stage direction.
What it means for you. You can steer the voice the way a director would steer a voice actor. Same voice, same words, different read. Best practice: long, descriptive prompts beat short labels — [speak sadly, as if something bad just happened] directs the model far better than [sad].
How it works. Drop a bracket tag at the start of your text. The model picks up the delivery cue and shapes the read accordingly. Inline non-verbals like [sigh], [breathe], [laugh] go anywhere in the text.
[speak tired but warm, like she just got home from a long day]I missed you. How was today?
End-of-day affection. Lower energy, gentle smile.
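For developers wiring this up over REST, here is a minimal sketch of passing an inline direction with the text. The endpoint path, payload field names, voice ID, and model identifier are placeholders rather than confirmed API details; only the bracketed direction itself comes from the example above.

```python
import requests

API_KEY = "YOUR_INWORLD_API_KEY"

# The voice direction rides inline at the start of the text, written like a
# stage direction. Long, descriptive prompts work better than short labels
# such as [sad].
payload = {
    "model": "realtime-tts-2",   # placeholder model identifier
    "voice": "ashley",           # placeholder voice ID
    "text": (
        "[speak tired but warm, like she just got home from a long day]"
        "I missed you. How was today?"
    ),
}

resp = requests.post(
    "https://api.inworld.ai/tts/v1/voice",  # placeholder path; see docs.inworld.ai
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()

# Assuming the response body is raw audio; the real response shape may differ.
with open("line.wav", "wb") as f:
    f.write(resp.content)
```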
Capability 02
What it is. The model takes the actual audio of the prior turns of the exchange as input, not just a transcript. It hears how the user actually sounded.
What it means for you. The same line lands differently after a joke than after bad news. The model knows the difference because it heard the prior turn. Tone, pacing, and emotional state carry forward automatically.
How it works. Audio context flows automatically across turns inside a Realtime session. Each user turn becomes part of the model's input. No explicit prior_audio field, no extra plumbing.
Prior turn. Positive
Context: a joke just landed
"Okay, so what do you want to do next?"
Light smile carries through. Brighter pitch.
Prior turn. Negative
Context: bad news, hesitation
"Okay, so what do you want to do next?"
Softer pace. Lower pitch. Careful.
Same exact text. Two different rooms. The model heard the difference.
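A sketch of how this looks over the Realtime API: two user turns ride the same session, and nothing about the first turn has to be passed back in by hand. The event names below follow the OpenAI Realtime protocol the API is described as speaking later in this post, and should be read as assumptions, not confirmed Inworld specifics.

```python
import asyncio
import base64
import json

import websockets  # third-party `websockets` package

# Endpoint given in the developer FAQ below.
URL = "wss://api.inworld.ai/api/v1/realtime/session"

async def two_turns(turn1_pcm: bytes, turn2_pcm: bytes) -> None:
    # Auth headers omitted for brevity; see docs.inworld.ai.
    async with websockets.connect(URL) as ws:
        for pcm in (turn1_pcm, turn2_pcm):
            # Send one user turn of raw PCM16 audio. These event names are
            # taken from the OpenAI Realtime protocol and are assumptions.
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(pcm).decode(),
            }))
            await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
            await ws.send(json.dumps({"type": "response.create"}))

            # No prior_audio field anywhere: the reply to the second turn is
            # conditioned on the first turn's audio automatically, because
            # both turns ride the same session.
            while True:
                event = json.loads(await ws.recv())
                if event.get("type") == "response.done":
                    break

# asyncio.run(two_turns(first_turn_bytes, second_turn_bytes))
```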
Capability 03
What it is. One voice identity preserved across over 100 languages, including mid-utterance language switches inside a single generation.
What it means for you. Your user's teacher, support agent, or companion is the same person whether they speak in English, Spanish, Japanese, or switch between them mid-sentence. No per-language voice library to manage.
How it works. No language flag needed. The model handles language transitions automatically and keeps timbre, pitch, and character constant across the switch.
EN: I missed you. How was today?
ES: Te extrañé. ¿Cómo estuvo hoy?
I missed you. How was today?
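Because no language flag is involved, a crosslingual request looks like any other TTS request; only the text changes. The payload below is a sketch that reuses the placeholder request shape from the voice-direction example above, with an illustrative mid-generation switch.

```python
# Plugs into the same placeholder request as the voice-direction sketch above.
# No language flag anywhere: the model detects the switches and keeps the same
# voice identity across them.
payload = {
    "model": "realtime-tts-2",   # placeholder model identifier
    "voice": "ashley",           # placeholder voice ID
    "text": (
        "I missed you. How was today? "
        "Te extrañé. ¿Cómo estuvo hoy? "
        "また明日ね。"            # illustrative switch into Japanese
    ),
}
```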
Capability 04
What it is. A new voice generated from a written prompt. Describe a person in prose, save the result as a reusable voice, then call it like any other voice in your app.
What it means for you. No reference audio required. No casting calls. You can prototype a voice in seconds, iterate on the description, and lock it once you find the right one.
“Pull up a chair. I want to tell you something I've been thinking about all week.”
prompt: warm low-pitch female with slight rasp, late-30s, intimate radio-host quality
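A sketch of the design-then-reuse flow. The endpoint path, field names, and response shape below are hypothetical, modeled loosely on the documented cloning path, and are shown only to illustrate describing a voice in prose, saving it, and referencing it later.

```python
import requests

API_KEY = "YOUR_INWORLD_API_KEY"

# Hypothetical voice-design call: describe a person in prose, get back a
# reusable voice. Path and field names are placeholders.
design = requests.post(
    "https://api.inworld.ai/voices/v1/voices:design",  # placeholder path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "description": (
            "warm low-pitch female with slight rasp, late-30s, "
            "intimate radio-host quality"
        ),
    },
    timeout=30,
).json()

voice_id = design["voiceId"]  # placeholder response field
# voice_id can now be passed as the voice in any ordinary TTS request.
```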
The conversational layer
Beyond the four capabilities above, a few smaller tools push the speech further into "person paying attention" territory: inline non-verbal markers (laughs, sighs, breaths), the disfluencies that make recall feel real, voice cloning when you want to bring an existing voice in, and stability modes for dialing expressiveness up or down.
01 · Non-verbal markers
Drop inline tags inside the text at the exact moment a [laugh], [sigh], or [breathe] should occur. The model places them as audio events, not pronounced words.
Wait, you actually did that? [laugh] That's wild.
02 · Disfluencies
Self-correction, mid-noun-phrase pauses, and trailing thoughts that signal warmth and recall instead of malfunction. Different speakers cluster fillers differently and the model follows the rhythm.
Casual phone call, mid-pivot
Hey, yeah, so I was, uh, I was just thinking, we should probably grab dinner before Friday.
Self-correction with a course change reads as warmth, not error.
Interview, recalling a name
Um, that's a good question. I think... the honest answer is, we didn't really know at first.
Real recall pauses cluster mid-noun-phrase, not at sentence boundaries.
Late-night reflection
[sigh] I don't know. It's, uh, it's been one of those weeks where you just kind of... lose the thread.
Sigh plus filler plus trailing thought reads as fatigue, not malfunction.
Telling a story, fast pace
And then, okay, you have to picture this, he just, like, walks in totally calm.
Filler-as-energy stacks differently than filler-as-hesitation. Same model, different rhythm.
03 · Voice cloning
A two-step API. Upload a reference sample, get a voice ID, use it like any other voice. Clone from your original recording for the highest fidelity.
Audio sample: 5–15 seconds, clean, single speaker.
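A minimal sketch of the two steps, using the /voices/v1/voices:clone path given in the developer FAQ below; the multipart field name and response shape are assumptions.

```python
import requests

API_KEY = "YOUR_INWORLD_API_KEY"

# Step 1: upload a clean 5-15 second, single-speaker sample and get a voice ID.
with open("reference.wav", "rb") as sample:
    clone = requests.post(
        "https://api.inworld.ai/voices/v1/voices:clone",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"audio": sample},  # assumed multipart field name
        timeout=60,
    ).json()

voice_id = clone["voiceId"]  # assumed response field

# Step 2: pass voice_id as the voice in TTS calls, like any other voice.
```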
04 · Stability modes
Three trade-off settings on the same model. Pick the one that matches the deployment, not the demo.
Expressive
Most creative
Live consumer conversation, companions, characters. Range matters more than consistency.
Balanced
Default
Pick this when in doubt. Good for most agent workloads, support, productivity tools.
Stable
Most consistent
Professional deployments, IVR, long narrations. Pitch drift is unacceptable.
Built for natural realtime conversation
A real conversation isn't just words. It's the tone someone uses, the pause before they answer, the energy they carry into a sentence. Most voice agents stitch a pipeline together from four vendors and lose all of that signal at every handoff. We built each layer ourselves and pass the full audio context, the user's state, and the conversation history through one persistent connection — so the system can decide not just what to say, but how to say it.
Stage 01
Realtime STT transcribes and profiles the speaker in one pass. Age, accent, pitch, vocal style, emotional tone, and pacing become structured signals on the same connection. The rest of the pipeline knows who is talking and how they feel, not just what they said.
Stage 02
Realtime Router takes the user's state and the conversation context and selects the right model, prompt, and tools for the moment. Same request, different model for a tired late-night chat versus a complex support escalation. Reasoning, retrieval, and tool calls all happen on the same persistent connection.
Stage 03
Realtime TTS-2 takes the prior audio, the user's emotional state, the conversation history, and the developer's natural-language direction and decides how to deliver the line. Same words, different read for the moment. Sub-200ms first chunk, identity-preserved across over 100 languages.
All three stages on the same persistent connection. The output of each is the input to the next. The conversation is the input.
“Most TTS models generate speech in isolation from the conversation around them. TTS-2 is trained to use audio context from the full multi-turn exchange, and take voice direction so how the model speaks adjusts to how it was spoken to.”
Igor Poletaev · Chief Science Officer, Inworld AI
Live demo · realtime.ai
realtime.ai is the Inworld pipeline running live in your browser. Speak into your mic and the system hears you, profiles your voice, picks a model, and answers in Realtime TTS-2, all inside the Realtime API, all in one persistent connection.
For developers
Available across the platforms you already build on, with first-party SDKs in Node and Python and direct REST + Realtime API access.
The capabilities that change what you can build, not feature counts. Quality rankings come from the live Artificial Analysis Speech Arena rather than hardcoded values.
| Capability | Inworld | Google | ElevenLabs | Cartesia | OpenAI | Hume |
|---|---|---|---|---|---|---|
| Voice quality (Artificial Analysis Speech Arena) | #1 | #2 | #3 | Not stated | #5 | Not stated |
| Natural conversational delivery | Yes | Yes | Not stated | Not stated | Yes | Not stated |
| Realtime latency | Yes | Not stated | Not stated | Yes | Not stated | Not stated |
| Multi-turn aware speech synthesis | Yes | Not stated | Not stated | Not stated | Yes | Not stated |
| Simple voice direction (inline tags) | Yes | Yes | Yes | Yes | Yes | Yes |
| Advanced voice direction (free-form descriptions) | Yes | Not stated | Not stated | Not stated | Yes | Not stated |
| Voice cloning | Yes | Not stated | Yes | Yes | Not stated | Yes |
| Voice design | Yes | Not stated | Yes | Not stated | Not stated | Yes |
| Crosslingual (single voice, 100+ languages) | Yes | Not stated | Yes | Not stated | Not stated | Not stated |
| Voice profiling (understand user context) | Yes | Not stated | Not stated | Not stated | Not stated | Not stated |
| Single customizable speech-to-speech API | Yes | Not stated | Not stated | Not stated | Not stated | Not stated |
| User-aware LLM routing | Yes | Not stated | Not stated | Not stated | Not stated | Not stated |
| Optimized alphanumeric support | Yes | Not stated | Yes | Not stated | Not stated | Not stated |
"We are obsessed with how Voice AI feels, not just how it sounds."
Kylan Gibbs · CEO, Inworld AI
Realtime TTS-2 ships through the Inworld API and the Inworld Realtime API. Customers on Realtime TTS 1.5 upgrade by changing the model identifier, with no other code changes. Code samples at docs.inworld.ai. Pricing at inworld.ai/pricing.
Realtime TTS-2 is a new generation of voice model from Inworld AI built for realtime conversation. It hears the full audio context of the exchange and the user's emotional state, tone, and pacing, then takes natural-language voice direction the way developers prompt an LLM. It speaks across over 100 languages with on-the-fly switching while preserving one voice identity. Available today via the Inworld API and the Inworld Realtime API as a research preview.
Four things. The model now conditions on prior multi-turn audio, not just the current sentence, so it adapts to how the user actually sounds. Voice direction is a natural-language string instead of a fixed emotion enum. Crosslingual switching preserves one voice identity across over 100 languages inside a single generation. Advanced Voice Design lets you create a saved voice persona from prose and pick a stability mode (Expressive, Balanced, or Stable). Customers on TTS 1.5 upgrade by switching the model identifier, with no other code changes.
Most teams swap the endpoint, change the model identifier, and reclone any voices from their original reference audio rather than from a previous model's output, which preserves more fidelity. The Realtime API speaks the OpenAI Realtime protocol with Inworld extensions, so existing OpenAI Realtime clients connect with one URL change. Reference docs at docs.inworld.ai.
Realtime TTS-2 is expanding to over 100 languages with on-the-fly switching inside a single generation, preserving the speaker's voice identity across every language. The top tier ships at native-speaker quality; the long tail is experimental during the launch window while the model is in research preview.
Voice direction is a natural-language string on the request, the same way you prompt an LLM. Pass a description like tired but warm, like she just got home or frantic, breathless, urgent. The model layers that delivery on top of whichever voice you have chosen. Inline non-verbal markers like [laugh], [sigh], [breathe], [clear_throat], and [cough] go inside the text where the moment should occur.
Sub-200ms median time-to-first-audio for the TTS layer alone. End-to-end through the Realtime API depends on what reasoning has to do, but the pipeline is designed to stay alive while reasoning runs: backchannel fillers stream in parallel, anticipatory generation begins before reasoning finishes, and partial responses reach the user before the full sentence is composed.
Open a WebSocket to wss://api.inworld.ai/api/v1/realtime/session, send a session.update event with session.audio.output.voice and session.audio.output.model, and stream audio back via response.audio.delta events. Audio is PCM16, 24kHz mono, base64. Audio context flows automatically across turns. First-party SDKs ship for Node and Python.
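Putting those details together, a minimal Python sketch of one generation might look like the following. The connection URL, session.update shape, response.audio.delta handling, and audio format come from the paragraph above; authentication is omitted, and the voice ID, model identifier, input events, and the delta field name are assumptions based on the OpenAI Realtime protocol the API is described as speaking.

```python
import asyncio
import base64
import json

import websockets  # third-party `websockets` package

URL = "wss://api.inworld.ai/api/v1/realtime/session"

async def speak(text: str) -> bytes:
    """One generation over the Realtime API; auth omitted, names are placeholders."""
    async with websockets.connect(URL) as ws:
        # Configure the output voice and model for the session.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"audio": {"output": {
                "voice": "ashley",          # placeholder voice ID
                "model": "realtime-tts-2",  # placeholder model identifier
            }}},
        }))

        # Request a spoken response. These input events follow the OpenAI
        # Realtime protocol and are assumptions, not confirmed specifics.
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {"type": "message", "role": "user",
                     "content": [{"type": "input_text", "text": text}]},
        }))
        await ws.send(json.dumps({"type": "response.create"}))

        pcm = bytearray()  # PCM16, 24 kHz, mono
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "response.audio.delta":
                pcm.extend(base64.b64decode(event["delta"]))  # assumed field name
            elif event.get("type") == "response.done":
                break
        return bytes(pcm)

# audio = asyncio.run(speak("I missed you. How was today?"))
```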
Pricing is metered by audio time and lives at inworld.ai/pricing. Pay-as-you-go with volume tiers. Realtime TTS-2 ships under the same metering as Realtime TTS 1.5, so customers upgrading do not see a model-side pricing change.
Yes. Voice cloning is a two-step API call: upload a reference audio sample (5-15 seconds, clean, single speaker) to /voices/v1/voices:clone, then use the returned voice ID like any other voice in TTS calls. Cloning from your original reference audio preserves more fidelity than cloning from another model's output. There is also Advanced Voice Design, which generates a saved voice from a written prompt without any reference audio.
Advanced Voice Design ships with three stability modes. Expressive is the most creative and best for live consumer conversation, companions, and characters. Balanced is the default and the right choice when in doubt. Stable is the most consistent across long generations and best for professional deployments, IVR, and long narrations where pitch drift would be unacceptable. The mode is a parameter on the voice design request.
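A sketch of attaching a mode to a voice-design request, reusing the hypothetical endpoint from the Advanced Voice Design example earlier in this post; the stability field name is a placeholder, while the three accepted modes come from the answer above.

```python
import requests

API_KEY = "YOUR_INWORLD_API_KEY"

# Hypothetical voice-design request with a stability mode attached.
# Endpoint path and field names are placeholders.
resp = requests.post(
    "https://api.inworld.ai/voices/v1/voices:design",  # placeholder path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "description": "measured, mid-40s male narrator with a steady cadence",
        "stability": "stable",  # placeholder field; use "balanced" when in doubt
    },
    timeout=30,
)
resp.raise_for_status()
```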