Research preview · May 5, 2026
A new frontier voice model that feels as human as it sounds.
Realtime TTS-2 from Inworld AI is a new generation of voice model built for realtime conversation. It hears the full audio of the exchange, picks up the user's tone, pacing and emotional state, then takes voice direction in plain English the way developers prompt an LLM. It holds one voice identity across over 100 languages. Available today via the Inworld API and the Inworld Realtime API as a research preview.
What launch partners and customers are saying

“Inworld's TTS-2 marks a real step forward in emotionally expressive voice synthesis. When combined with the conversational intelligence of LiveKit agents, it enables interactions that feel genuinely human — responsive, nuanced, and alive in ways that feel natural.”
David Zhao · Co-Founder & CTO, LiveKit

“I've never seen steering work like this before TTS-2. The output is extremely natural and faithful to the steering prompt, even when it's hyper-specific. The biggest battle you fight with TTS is feeling bland, stale, and robotic — this level of steering unlocks a whole new axis to keep the experience fresh.”
Creston Brooks · Co-founder & CTO, Luvu

“We've always believed language learning should have no borders. TTS 2.0 just made that a lot more real.”
Dimitri Dekanozishvili · Co-founder, Talkpal

“We've been chasing the uncanny valley of voice AI for years — Inworld is finally closing the gap between 'impressive' and 'actually believable' with TTS 2.0. When your character speaks and you forget it's AI, that's when the story becomes real.”
Louis Muk · CEO, Isekai Zero

“We've had early access to Inworld TTS-2 for a few days and we're all blown away. The expressiveness, language steering and multi-lingual support are genuinely impressive. The subtle details like natural pausing make it hard to differentiate between AI and human.”
Nash Ramdial · Developer Relations, Stream

“Inworld just made voice AI feel genuinely human across 100+ languages. Partnering with them means we can help bring that experience to kids around the world, safely and compliantly.”
Kieran Donovan · CEO, k-ID

“AI Native games need characters you can deeply connect with. Voice models that offer full control and emotional complexity to make characters feel real is one of the biggest pieces missing. TTS 2 is a significant advance in helping make that future a reality.”
Nick Walton · CEO, Latitude

“Inworld was already at the top of the Artificial Analysis TTS Arena and Realtime TTS-2 pushes further on a dimension VoiceRun customers care about: directability. Style, pacing, emphasis, emotion, and delivery can be shaped in ways that matter for real enterprise deployments.”
VoiceRun team
Realtime TTS 1.5 already ranks #1 on the Artificial Analysis Speech Arena, ahead of Google and ElevenLabs. Quality is solved. So we asked the next question: what does voice AI sound like when it is built for the way humans actually talk to each other? Realtime, mutual, alive to the moment.
Voice AI was shaped by the static stuff: audiobooks, narration, voiceover. A sentence in, audio out, the model never hearing the person on the other end.
Realtime TTS-2 is built from the ground up for realtime conversation. It listens to the prior turns of the exchange, so your tone and pacing carry forward. It takes voice direction in plain English, so you steer the read the way a director would. It holds one voice identity across over 100 languages, so the speaker stays the same person mid-switch. And Advanced Voice Design lets you build a saved voice from prose. Four capabilities that work together, in one model, on the same realtime connection.
Hear it now · 4 scenes
Tired user · 11pm
A quieter, slower delivery for someone winding down at the end of the day.
Frustrated caller
Softer pace, careful phrasing. The model hears the upset and lowers the energy.
Crosslingual · EN → ES → JA
Three languages inside one generation. Same speaker, same person on the other end.
Voice direction · whisper
One prose direction reshapes the read. [whispering]
Capability 01
Available via REST + Realtime API
What it is. A natural-language description of how a line should be delivered, passed inline at the start of your text. Not a fixed list of preset emotions. Not a slider. Write the prompt the way you'd write a stage direction.
What it means for you. You can steer the voice the way a director would steer a voice actor. Same voice, same words, different read. Best practice: long, descriptive prompts beat short labels — [speak sadly, as if something bad just happened] directs the model far better than [sad].
How it works. Drop a bracket tag at the start of your text. The model picks up the delivery cue and shapes the read accordingly. Inline non-verbals like [sigh], [breathe], [laugh] go anywhere in the text.
[speak tired but warm, like she just got home from a long day]I missed you. How was today?
End-of-day affection. Lower energy, gentle smile.
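For developers wiring this up over REST, here is a minimal sketch of passing an inline direction with the text. The endpoint path, payload field names, voice ID, and model identifier are placeholders rather than confirmed API details; only the bracketed direction itself comes from the example above.

```python
import requests

API_KEY = "YOUR_INWORLD_API_KEY"

# The voice direction rides inline at the start of the text, written like a
# stage direction. Long, descriptive prompts work better than short labels
# such as [sad].
payload = {
    "model": "realtime-tts-2",   # placeholder model identifier
    "voice": "ashley",           # placeholder voice ID
    "text": (
        "[speak tired but warm, like she just got home from a long day]"
        "I missed you. How was today?"
    ),
}

resp = requests.post(
    "https://api.inworld.ai/tts/v1/voice",  # placeholder path; see docs.inworld.ai
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()

# Assuming the response body is raw audio; the real response shape may differ.
with open("line.wav", "wb") as f:
    f.write(resp.content)
```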
Capability 02
What it is. The model takes the actual audio of the prior turns of the exchange as input, not just a transcript. It hears how the user actually sounded.
What it means for you. The same line lands differently after a joke than after bad news. The model knows the difference because it heard the prior turn. Tone, pacing, and emotional state carry forward automatically.
How it works. Audio context flows automatically across turns inside a Realtime session. Each user turn becomes part of the model's input. No explicit prior_audio field, no extra plumbing.
Prior turn. Positive
Context: a joke just landed
"Okay, so what do you want to do next?"
Light smile carries through. Brighter pitch.
Prior turn. Negative
Context: bad news, hesitation
"Okay, so what do you want to do next?"
Softer pace. Lower pitch. Careful.
Same exact text. Two different rooms. The model heard the difference.
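A sketch of how this looks over the Realtime API: two user turns ride the same session, and nothing about the first turn has to be passed back in by hand. The event names below follow the OpenAI Realtime protocol the API is described as speaking later in this post, and should be read as assumptions, not confirmed Inworld specifics.

```python
import asyncio
import base64
import json

import websockets  # third-party `websockets` package

# Endpoint given in the developer FAQ below.
URL = "wss://api.inworld.ai/api/v1/realtime/session"

async def two_turns(turn1_pcm: bytes, turn2_pcm: bytes) -> None:
    # Auth headers omitted for brevity; see docs.inworld.ai.
    async with websockets.connect(URL) as ws:
        for pcm in (turn1_pcm, turn2_pcm):
            # Send one user turn of raw PCM16 audio. These event names are
            # taken from the OpenAI Realtime protocol and are assumptions.
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(pcm).decode(),
            }))
            await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
            await ws.send(json.dumps({"type": "response.create"}))

            # No prior_audio field anywhere: the reply to the second turn is
            # conditioned on the first turn's audio automatically, because
            # both turns ride the same session.
            while True:
                event = json.loads(await ws.recv())
                if event.get("type") == "response.done":
                    break

# asyncio.run(two_turns(first_turn_bytes, second_turn_bytes))
```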
Capability 03
What it is. One voice identity preserved across over 100 languages, including mid-utterance language switches inside a single generation.
What it means for you. Your user's teacher, support agent, or companion is the same person whether they speak in English, Spanish, Japanese, or switch between them mid-sentence. No per-language voice library to manage.
How it works. No language flag needed. The model handles language transitions automatically and keeps timbre, pitch, and character constant across the switch.
EN: I missed you. How was today?
ES: Te extrañé. ¿Cómo estuvo hoy?
I missed you. How was today?
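Because no language flag is involved, a crosslingual request looks like any other TTS request; only the text changes. The payload below is a sketch that reuses the placeholder request shape from the voice-direction example above, with an illustrative mid-generation switch.

```python
# Plugs into the same placeholder request as the voice-direction sketch above.
# No language flag anywhere: the model detects the switches and keeps the same
# voice identity across them.
payload = {
    "model": "realtime-tts-2",   # placeholder model identifier
    "voice": "ashley",           # placeholder voice ID
    "text": (
        "I missed you. How was today? "
        "Te extrañé. ¿Cómo estuvo hoy? "
        "また明日ね。"            # illustrative switch into Japanese
    ),
}
```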
Capability 04
What it is. A new voice generated from a written prompt. Describe a person in prose, save the result as a reusable voice, then call it like any other voice in your app.
What it means for you. No reference audio required. No casting calls. You can prototype a voice in seconds, iterate on the description, and lock it once you find the right one.
“Pull up a chair. I want to tell you something I've been thinking about all week.”
prompt: warm low-pitch female with slight rasp, late-30s, intimate radio-host quality
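A sketch of the design-then-reuse flow. The endpoint path, field names, and response shape below are hypothetical, modeled loosely on the documented cloning path, and are shown only to illustrate describing a voice in prose, saving it, and referencing it later.

```python
import requests

API_KEY = "YOUR_INWORLD_API_KEY"

# Hypothetical voice-design call: describe a person in prose, get back a
# reusable voice. Path and field names are placeholders.
design = requests.post(
    "https://api.inworld.ai/voices/v1/voices:design",  # placeholder path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "description": (
            "warm low-pitch female with slight rasp, late-30s, "
            "intimate radio-host quality"
        ),
    },
    timeout=30,
).json()

voice_id = design["voiceId"]  # placeholder response field
# voice_id can now be passed as the voice in any ordinary TTS request.
```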
The conversational layer
Beyond the four capabilities above, a few smaller tools push the speech further into "person paying attention" territory: inline non-verbal markers (laughs, sighs, breaths), the disfluencies that make recall feel real, voice cloning when you want to bring an existing voice in, and stability modes for dialing expressiveness up or down.
01 · Non-verbal markers
Drop inline tags inside the text at the exact moment a [laugh], [sigh], or [breathe] should occur. The model places them as audio events, not pronounced words.
Wait, you actually did that? [laugh] That's wild.
02 · Disfluencies
Self-correction, mid-noun-phrase pauses, and trailing thoughts that signal warmth and recall instead of malfunction. Different speakers cluster fillers differently and the model follows the rhythm.
Casual phone call, mid-pivot
Hey, yeah, so I was, uh, I was just thinking, we should probably grab dinner before Friday.
Self-correction with a course change reads as warmth, not error.
Interview, recalling a name
Um, that's a good question. I think... the honest answer is, we didn't really know at first.
Real recall pauses cluster mid-noun-phrase, not at sentence boundaries.
Late-night reflection
[sigh] I don't know. It's, uh, it's been one of those weeks where you just kind of... lose the thread.
Sigh plus filler plus trailing thought reads as fatigue, not malfunction.
Telling a story, fast pace
And then, okay, you have to picture this, he just, like, walks in totally calm.
Filler-as-energy stacks differently than filler-as-hesitation. Same model, different rhythm.
03 · Voice cloning
A two-step API. Upload a reference sample, get a voice ID, use it like any other voice. Clone from your original recording for the highest fidelity.
Audio sample: 5–15 seconds, clean, single speaker.
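A minimal sketch of the two steps, using the /voices/v1/voices:clone path given in the developer FAQ below; the multipart field name and response shape are assumptions.

```python
import requests

API_KEY = "YOUR_INWORLD_API_KEY"

# Step 1: upload a clean 5-15 second, single-speaker sample and get a voice ID.
with open("reference.wav", "rb") as sample:
    clone = requests.post(
        "https://api.inworld.ai/voices/v1/voices:clone",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"audio": sample},  # assumed multipart field name
        timeout=60,
    ).json()

voice_id = clone["voiceId"]  # assumed response field

# Step 2: pass voice_id as the voice in TTS calls, like any other voice.
```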
04 · Stability modes
Three trade-off settings on the same model. Pick the one that matches the deployment, not the demo.
Expressive
Most creative
Live consumer conversation, companions, characters. Range matters more than consistency.
Balanced
Default
Pick this when in doubt. Good for most agent workloads, support, productivity tools.
Stable
Most consistent
Professional deployments, IVR, long narrations. Pitch drift is unacceptable.
Built for natural realtime conversation
A real conversation isn't just words. It's the tone someone uses, the pause before they answer, the energy they carry into a sentence. Most voice agents stitch a pipeline together from four vendors and lose all of that signal at every handoff. We built each layer ourselves and pass the full audio context, the user's state, and the conversation history through one persistent connection — so the system can decide not just what to say, but how to say it.
Stage 01
Realtime STT transcribes and profiles the speaker in one pass. Age, accent, pitch, vocal style, emotional tone, and pacing become structured signals on the same connection. The rest of the pipeline knows who is talking and how they feel, not just what they said.
Stage 02
Realtime Router takes the user's state and the conversation context and selects the right model, prompt, and tools for the moment. Same request, different model for a tired late-night chat versus a complex support escalation. Reasoning, retrieval, and tool calls all happen on the same persistent connection.
Stage 03
Realtime TTS-2 takes the prior audio, the user's emotional state, the conversation history, and the developer's natural-language direction and decides how to deliver the line. Same words, different read for the moment. Sub-200ms first chunk, identity-preserved across over 100 languages.
All three stages on the same persistent connection. The output of each is the input to the next. The conversation is the input.
“Most TTS models generate speech in isolation from the conversation around them. TTS-2 is trained to use audio context from the full multi-turn exchange, and take voice direction so how the model speaks adjusts to how it was spoken to.”
Igor Poletaev · Chief Science Officer, Inworld AI
Live demo · realtime.ai
realtime.ai is the Inworld pipeline running live in your browser. Speak into your mic and the system hears you, profiles your voice, picks a model, and answers in Realtime TTS-2, all inside the Realtime API, all in one persistent connection.
For developers
Available across the platforms you already build on, with first-party SDKs in Node and Python and direct REST + Realtime API access.
The capabilities that change what you can build, not feature counts. Quality rankings come from the live Artificial Analysis Speech Arena rather than hardcoded values.
| Capability | Inworld | Google | ElevenLabs | Cartesia | OpenAI | Hume |
|---|---|---|---|---|---|---|
| Voice quality (Artificial Analysis Speech Arena) | #1 | #2 | #3 | Not stated | #5 | Not stated |
| Natural conversational delivery | Yes | Yes | Not stated | Not stated | Yes | Not stated |
| Realtime latency | Yes | Not stated | Not stated | Yes | Not stated | Not stated |
| Multi-turn aware speech synthesis | Yes | Not stated | Not stated | Not stated | Yes | Not stated |
| Simple voice direction (inline tags) | Yes | Yes | Yes | Yes | Yes | Yes |
| Advanced voice direction (free-form descriptions) | Yes | Not stated | Not stated | Not stated | Yes | Not stated |
| Voice cloning | Yes | Not stated | Yes | Yes | Not stated | Yes |
| Voice design | Yes | Not stated | Yes | Not stated | Not stated | Yes |
| Crosslingual (single voice, 100+ languages) | Yes | Not stated | Yes | Not stated | Not stated | Not stated |
| Voice profiling (understand user context) | Yes | Not stated | Not stated | Not stated | Not stated | Not stated |
| Single customizable speech-to-speech API | Yes | Not stated | Not stated | Not stated | Not stated | Not stated |
| User-aware LLM routing | Yes | Not stated | Not stated | Not stated | Not stated | Not stated |
| Optimized alphanumeric support | Yes | Not stated | Yes | Not stated | Not stated | Not stated |
"We are obsessed with how Voice AI feels, not just how it sounds."
Kylan Gibbs · CEO, Inworld AI
Realtime TTS-2 ships through the Inworld API and the Inworld Realtime API. Customers on Realtime TTS 1.5 upgrade by changing the model identifier, with no other code changes. Code samples at docs.inworld.ai. Pricing at inworld.ai/pricing.
Realtime TTS-2 is a new generation of voice model from Inworld AI built for realtime conversation. It hears the full audio context of the exchange and the user's emotional state, tone, and pacing, then takes natural-language voice direction the way developers prompt an LLM. It speaks across over 100 languages with on-the-fly switching while preserving one voice identity. Available today via the Inworld API and the Inworld Realtime API as a research preview.
Four things. The model now conditions on prior multi-turn audio, not just the current sentence, so it adapts to how the user actually sounds. Voice direction is a natural-language string instead of a fixed emotion enum. Crosslingual switching preserves one voice identity across over 100 languages inside a single generation. Advanced Voice Design lets you create a saved voice persona from prose and pick a stability mode (Expressive, Balanced, or Stable). Customers on TTS 1.5 upgrade by switching the model identifier, with no other code changes.
Most teams swap the endpoint, change the model identifier, and reclone any voices from their original reference audio rather than from a previous model's output, which preserves more fidelity. The Realtime API speaks the OpenAI Realtime protocol with Inworld extensions, so existing OpenAI Realtime clients connect with one URL change. Reference docs at docs.inworld.ai.
Realtime TTS-2 is expanding to over 100 languages with on-the-fly switching inside a single generation, preserving the speaker's voice identity across every language. The top tier ships at native-speaker quality; the long tail is experimental during the launch window while the model is in research preview.
Voice direction is a natural-language string on the request, the same way you prompt an LLM. Pass a description like tired but warm, like she just got home or frantic, breathless, urgent. The model layers that delivery on top of whichever voice you have chosen. Inline non-verbal markers like [laugh], [sigh], [breathe], [clear_throat], and [cough] go inside the text where the moment should occur.
Sub-200ms median time-to-first-audio for the TTS layer alone. End-to-end through the Realtime API depends on what reasoning has to do, but the pipeline is designed to stay alive while reasoning runs: backchannel fillers stream in parallel, anticipatory generation begins before reasoning finishes, and partial responses reach the user before the full sentence is composed.
Open a WebSocket to wss://api.inworld.ai/api/v1/realtime/session, send a session.update event with session.audio.output.voice and session.audio.output.model, and stream audio back via response.audio.delta events. Audio is PCM16, 24kHz mono, base64. Audio context flows automatically across turns. First-party SDKs ship for Node and Python.
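Putting those details together, a minimal Python sketch of one generation might look like the following. The connection URL, session.update shape, response.audio.delta handling, and audio format come from the paragraph above; authentication is omitted, and the voice ID, model identifier, input events, and the delta field name are assumptions based on the OpenAI Realtime protocol the API is described as speaking.

```python
import asyncio
import base64
import json

import websockets  # third-party `websockets` package

URL = "wss://api.inworld.ai/api/v1/realtime/session"

async def speak(text: str) -> bytes:
    """One generation over the Realtime API; auth omitted, names are placeholders."""
    async with websockets.connect(URL) as ws:
        # Configure the output voice and model for the session.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"audio": {"output": {
                "voice": "ashley",          # placeholder voice ID
                "model": "realtime-tts-2",  # placeholder model identifier
            }}},
        }))

        # Request a spoken response. These input events follow the OpenAI
        # Realtime protocol and are assumptions, not confirmed specifics.
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {"type": "message", "role": "user",
                     "content": [{"type": "input_text", "text": text}]},
        }))
        await ws.send(json.dumps({"type": "response.create"}))

        pcm = bytearray()  # PCM16, 24 kHz, mono
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "response.audio.delta":
                pcm.extend(base64.b64decode(event["delta"]))  # assumed field name
            elif event.get("type") == "response.done":
                break
        return bytes(pcm)

# audio = asyncio.run(speak("I missed you. How was today?"))
```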
Pricing is metered by audio time and lives at inworld.ai/pricing. Pay-as-you-go with volume tiers. Realtime TTS-2 ships under the same metering as Realtime TTS 1.5, so customers upgrading do not see a model-side pricing change.
Yes. Voice cloning is a two-step API call: upload a reference audio sample (5-15 seconds, clean, single speaker) to /voices/v1/voices:clone, then use the returned voice ID like any other voice in TTS calls. Cloning from your original reference audio preserves more fidelity than cloning from another model's output. There is also Advanced Voice Design, which generates a saved voice from a written prompt without any reference audio.
Advanced Voice Design ships with three stability modes. Expressive is the most creative and best for live consumer conversation, companions, and characters. Balanced is the default and the right choice when in doubt. Stable is the most consistent across long generations and best for professional deployments, IVR, and long narrations where pitch drift would be unacceptable. The mode is a parameter on the voice design request.
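A sketch of attaching a mode to a voice-design request, reusing the hypothetical endpoint from the Advanced Voice Design example earlier in this post; the stability field name is a placeholder, while the three accepted modes come from the answer above.

```python
import requests

API_KEY = "YOUR_INWORLD_API_KEY"

# Hypothetical voice-design request with a stability mode attached.
# Endpoint path and field names are placeholders.
resp = requests.post(
    "https://api.inworld.ai/voices/v1/voices:design",  # placeholder path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "description": "measured, mid-40s male narrator with a steady cadence",
        "stability": "stable",  # placeholder field; use "balanced" when in doubt
    },
    timeout=30,
)
resp.raise_for_status()
```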