Conversational AI

Build AI that holds a real conversation

Speech-to-text, any LLM, and the #1-ranked text-to-speech behind a single API. Voice profiling flows through to the model, so the AI responds to how your users sound, not just what they say.
Live pipeline
User
stt: inworld/inworld-stt-1 · voice_profile.emotion: stressed

I'm so behind on everything and I don't even know where to start.

Agent
session.model: anthropic/claude-sonnet-4-6 · voice: Sarah · tts.steering: calm, slower pace

That's completely valid. Let's take a breath first, then we'll pick one thing together.

Every layer, best-in-class, one contract.

Your agent hears the tone, reasons on the model you pick, and answers in the voice users vote best.
One pipeline, not a stitched stack
Works with
Realtime API

Stitched is someone else's problem. Unified is yours to ship.

Four vendor contracts and a month of glue code, replaced by one persistent connection. STT, Router, and TTS under one API. Swap any layer, never rewrite the stack.
How conversational AI is built
Stitched stack
Vendor A (STT)
Vendor B (LLM)
Vendor C (TTS)
+ glue code
four contracts
One pipeline
Inworld Realtime API
STT • Router • TTS
one endpoint, one key
one contract
Best-in-class at every layer
Works with
Realtime API · TTS

Best-in-class at every layer, without the integration tax.

Inworld TTS ranks #1 on Artificial Analysis with three of the top five voices. Router carries hundreds of LLMs. STT hears who's speaking and how they feel. No other pipeline stacks best-in-class at every layer.
Voice · #1 on Artificial Analysis
#1
Inworld TTS 1.5 Max
#2
Next best
#3
Inworld TTS 1 Max
#5
Inworld TTS 1.5 Mini
3 of top 5 · voice profile · cascade + speech-to-LLM
Voice profile flows through

Context that carries across every layer.

STT captures emotion, age, and vocal style. Router injects that context into the LLM. TTS adapts tone to match. Stitched stacks lose the signal at every handoff.
Voice profile flows through
STT
profile detected
emotion: stressed · age: 30s
Router
context injected
"user sounds stressed"
TTS
voice adapts
softer, slower, [sigh]
Cross-layer context carries through the pipeline. Stitched stacks can't do this.
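The adaptation step can be pictured as a small reduction from profile to steering hint. A minimal sketch, assuming a hypothetical profile shape (the real wire format may differ; the fields follow the ones named on this page: emotion, age, vocal style, with confidence scores):

```javascript
// Turn a detected voice profile into a TTS steering hint.
// The profile shape below is illustrative, not the documented format.
function steeringFromProfile(profile, threshold = 0.7) {
  const hints = [];
  const emotion = profile.emotion;
  if (emotion && emotion.confidence >= threshold && emotion.value === 'stressed') {
    hints.push('calm, slower pace'); // mirror the steering shown in the demo
  }
  return hints.join('; ');
}
```

In the unified pipeline this mapping happens inside the Router; the sketch only illustrates the kind of signal that crosses layers, and why it is lost at a vendor handoff.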
Pick every layer

Best component, every layer, any time.

Route across Inworld STT, Whisper, or AssemblyAI. Reach GPT, Claude, Gemini, Llama, Mistral, and hundreds more through one endpoint. Pair with Inworld TTS 1.5 Max or Mini. One field changes the stack.
Explore the Router
Pick every layer
session.update
STT
Inworld STT-1
Whisper
AssemblyAI
Router / LLM
GPT
Claude
Gemini
Llama
Mistral
… hundreds more
TTS
Inworld 1.5 Max
Inworld 1.5 Mini
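Swapping a layer is a one-field change on the session. A minimal sketch of the message, assuming the provider/model naming used in the code sample on this page (the exact model string is illustrative):

```javascript
// Build a session.update that swaps only the LLM; other layers keep
// their current configuration. Model strings are illustrative.
function buildModelSwap(model) {
  return {
    type: 'session.update',
    session: { type: 'realtime', model },
  };
}

// Send over the already-open WebSocket:
// ws.send(JSON.stringify(buildModelSwap('openai/gpt-4o')));
```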
Sub-second end-to-end
Works with
Realtime API

Speech in, speech out, under a second.

Full-duplex streaming across the whole pipeline. STT detects the pause, Router routes the thought, TTS starts speaking before the sentence finishes forming. Human response time, end to end.
End-to-end · full pipeline
<1s
Speech in, speech out
STT → Router → TTS on one persistent connection
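One way to check the sub-second claim against your own LLM choice is a client-side probe: stamp the clock when the user's turn ends, and again on the first audio delta. A minimal sketch with an injectable clock (so it can be tested without a live session):

```javascript
// Measure time-to-first-audio: call onTurnEnd() when the user's turn
// is detected, onFirstAudio() on the first audio delta. Returns ms
// elapsed, or null if no turn has been marked yet.
function makeLatencyProbe(now = Date.now) {
  let start = null;
  return {
    onTurnEnd() { start = now(); },
    onFirstAudio() { return start === null ? null : now() - start; },
  };
}
```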
Deploy anywhere
Works with
Realtime API

Every surface users call from.

WebSocket for servers, WebRTC for browsers, SIP for call centers, on-premise for regulated data. Same API, same voice, same LLM choice. GDPR and SOC 2 Type II compliant.
Deploy anywhere
WebSocket
servers + back end
WebRTC
browser, no codec plumbing
SIP
phone + call center
On-prem
your data center
GDPR and SOC 2 Type II compliant. Zero data retention on TTS.

The whole pipeline, in forty lines

Connect once, configure every layer, stream audio. The handoffs disappear so your AI just listens and answers.
import WebSocket from 'ws';

const ws = new WebSocket(
  `wss://api.inworld.ai/api/v1/realtime/session?key=voice-${Date.now()}&protocol=realtime`,
  { headers: { Authorization: `Basic ${process.env.INWORLD_API_KEY}` } }
);

ws.on('message', (data) => {
  const msg = JSON.parse(data.toString());

  if (msg.type === 'session.created') {
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        type: 'realtime',
        model: 'anthropic/claude-sonnet-4-6', // any Router model
        instructions: 'You are a warm conversational assistant.',
        output_modalities: ['audio', 'text'],
        audio: {
          input: {
            transcription: { model: 'inworld/inworld-stt-1' },
            turn_detection: {
              type: 'semantic_vad',
              eagerness: 'medium',
              create_response: true,
              interrupt_response: true,
            },
          },
          output: {
            model: 'inworld-tts-1.5-max',
            voice: 'Sarah',
          },
        },
      },
    }));
  }

  if (msg.type === 'response.output_audio.delta') {
    audioQueue.push(base64ToPcm16(msg.delta));
    if (!isPlaying) playNext();
  }
});

mic.on('data', (chunk) => {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: chunk.toString('base64'),
  }));
});
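The snippet leans on a few playback helpers it leaves undefined (`audioQueue`, `isPlaying`, `playNext`, `base64ToPcm16`, `mic`). A minimal sketch of the decode step, assuming the TTS stream is 16-bit little-endian PCM (check your session's output audio format before relying on this):

```javascript
// Decode a base64 audio delta into an Int16Array of PCM samples.
// Assumes 16-bit little-endian PCM frames from the server.
function base64ToPcm16(b64) {
  const buf = Buffer.from(b64, 'base64');
  const samples = new Int16Array(buf.length / 2);
  for (let i = 0; i < samples.length; i++) {
    samples[i] = buf.readInt16LE(i * 2);
  }
  return samples;
}
```

The queue and play loop are yours to own; the only contract with the API is that each `response.output_audio.delta` carries one base64 chunk.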

Hear the pipeline before you wire it.

Open the Realtime Playground, pick STT, pick the LLM, pick a voice, paste instructions, and click Connect. Speak to your pipeline end to end, hear the voice adapt to your tone, swap any layer mid-session. Copy the config into your code when it sounds right.
Open the playground

FAQ

Does this replace my separate STT, LLM, and TTS vendors?
Yes. The Realtime API exposes STT, Router, and TTS as a single connection, so you don't have to stitch together three vendor contracts or manage the glue code between them. Swap any layer with a one-field change in the session configuration.
Which LLMs can I use?
Any LLM through the Inworld Router, including OpenAI GPT, Anthropic Claude, Google Gemini, Meta Llama, Mistral, and hundreds more. Switch by changing one field on the session configuration. See pricing for current rates.
What is a voice profile, and how does it flow through the pipeline?
Inworld STT-1 produces a voice profile (emotion, age, accent, pitch, and vocal style with confidence scores) that flows into the Router context automatically. The LLM reasons with the signal, and Router can emit TTS steering instructions that adapt the voice output to the user's emotional state. This cross-layer context is the reason to unify the pipeline.
What latency should I expect?
Median first audio in under a second end-to-end (STT → LLM → TTS). TTS alone hits under 200ms; the remainder is your chosen LLM's reasoning time. Router lets you swap to a faster LLM when latency matters more than reasoning depth.
Which transports and deployment options are supported?
WebSocket, WebRTC, and SIP on the same endpoint, with on-premise deployment available for regulated environments. GDPR and SOC 2 Type II compliant. Zero data retention on TTS. Contact sales for on-premise or high-volume deployments.
Can users interrupt the AI mid-response?
Yes. Semantic VAD detects when the user starts speaking and cancels the in-flight response. The audio connection stays open, so the model can resume or pivot without closing the session. Barge-in works out of the box, no custom logic required.
Is it compatible with the OpenAI Realtime API?
Yes. The Realtime API follows the OpenAI Realtime protocol: same event names, same message flow. If you built on gpt-realtime, swap the base URL and keep your code. See the migration guide.
What is the availability status?
The Realtime API is in research preview. WebSocket is generally available; WebRTC and SIP are in early access. Reach out for early-access or on-premise deployment.
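The interruption flow described in the FAQ can be sketched on the client side: when the server signals that the user has started speaking, flush any locally queued audio so playback stops at once. The event name here follows the OpenAI Realtime protocol the API mirrors; treat the exact string and the `playback` shape as assumptions:

```javascript
// Flush local playback on barge-in. `playback` is a hypothetical
// { queue, isPlaying } object owned by your audio play loop.
function handleBargeIn(msg, playback) {
  if (msg.type === 'input_audio_buffer.speech_started') {
    playback.queue.length = 0;  // drop pending audio chunks
    playback.isPlaying = false; // let the play loop go idle
  }
  return playback;
}
```

The server cancels the in-flight response on its side; this sketch only covers the audio you have already buffered locally.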

One pipeline. One endpoint. Every layer yours.

The top-ranked voice, hundreds of models, and the industry's most aware STT, on one connection.
Copyright © 2021-2026 Inworld AI