Get started
Speech-to-Speech

Speech-to-speech. Full duplex. Sub-second. Your LLM.

Stream audio in, get audio out, over one open session. Built on OpenAI Realtime? Change the base URL and keep your code. Sub-second end-to-end, and every layer (STT model, LLM, voice) is your choice, not ours.
S2S session
Audio in
turn_detection: semantic_vad · stt.model: inworld/inworld-stt-1

PCM16 24kHz · 100ms chunks

Audio out
session.model: anthropic/claude-sonnet-4-6 · voice: Clive

response.output_audio.delta · streaming

Full-duplex speech-to-speech, without the black box.

Your agent hears, thinks, and speaks in under a second, with any LLM and any voice you pick.
Full-duplex audio

One connection. Audio in and out at the same time.

Bidirectional streaming over a single WebSocket session. No polling, no REST round-trips, no gap between user silence and agent voice. Responses form in real time.
One WebSocket · full duplex
input_audio_buffer.append
PCM16 · 24kHz · 100ms chunks
response.output_audio.delta
streaming · base64 PCM
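On the wire, full duplex reduces to the two event types above crossing the same socket at once: input_audio_buffer.append going up and response.output_audio.delta coming down. A minimal Node sketch of the payload shapes (transport wiring and playback omitted):

```javascript
// Build the event that streams one 100ms PCM16 chunk up to the session.
function appendAudioEvent(pcm16Chunk) {
  return JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: pcm16Chunk.toString('base64'),
  });
}

// Extract raw PCM16 bytes from a streamed output delta; null for other events.
function audioFromEvent(raw) {
  const event = JSON.parse(raw);
  if (event.type !== 'response.output_audio.delta') return null;
  return Buffer.from(event.delta, 'base64');
}
```

Because both directions are plain events on one socket, the uplink keeps appending while the downlink keeps decoding; neither waits for the other.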
OpenAI Realtime compatible

Same event names, same flow, same SDKs.

Every event in the OpenAI Realtime protocol works identically. Built on gpt-realtime? Change the base URL, keep the code. No rewrite, no new abstraction.
Protocol you already know
OpenAI Realtime compatible
session.update
input_audio_buffer.append
conversation.item.create
response.create
response.output_audio.delta
response.function_call_arguments.done
Same event names, same shape. Change the base URL, keep the code.
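If your client already speaks OpenAI Realtime over a raw WebSocket, the migration can be as small as the endpoint string. A sketch (the OpenAI URL is illustrative, and the key query parameter follows the example later on this page; auth details differ per provider):

```javascript
// Before: an OpenAI Realtime endpoint (illustrative).
const OPENAI_URL = 'wss://api.openai.com/v1/realtime';

// After: same protocol, different host. Event names, payload shapes,
// and message flow stay exactly as they were.
const INWORLD_URL = 'wss://api.inworld.ai/api/v1/realtime/session';

function sessionUrl(baseUrl, sessionId) {
  return `${baseUrl}?key=${encodeURIComponent(sessionId)}`;
}
```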
Sub-second end-to-end

Speech in, speech out, under a second.

The median first audio chunk lands in under a second, including STT, reasoning, and TTS. The wait is a beat your users barely notice.
Speech-to-speech · end-to-end
<1s
Median speech in, speech out
STT → reasoning → TTS. All on one connection.
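One way to check the sub-second claim against your own stack is to timestamp the end of the user's turn and the first response.output_audio.delta that comes back. A measurement sketch (local wall clock; the event hooks are yours to wire up):

```javascript
// Measure ms from end-of-user-speech to first audio back, per response.
function makeLatencyMeter(now = () => Date.now()) {
  let turnEndedAt = null;
  return {
    // Call when the user's turn is detected as finished.
    onTurnEnd() { turnEndedAt = now(); },
    // Call on the first response.output_audio.delta; returns ms elapsed.
    onFirstAudio() {
      if (turnEndedAt === null) return null;
      const ms = now() - turnEndedAt;
      turnEndedAt = null; // arm for the next turn
      return ms;
    },
  };
}
```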
Cascade by design
Works with
STT · Router · TTS

Swap the voice, the model, or the STT without a rewrite.

Unified S2S models hide the pipeline in a black box. Cascade exposes every layer so you can debug, tune, and change your mind without retraining.
Cascade by design
Layer | Unified S2S           | Inworld cascade
STT   | gpt-realtime internal | inworld/inworld-stt-1, Whisper, AssemblyAI
LLM   | gpt-realtime internal | claude-sonnet-4-6, gpt-5.4, Llama 4, +200
Voice | preset Alloy/Echo     | inworld-tts-1.5-max, custom cloned
Unified models hide each layer. Cascade exposes every knob so you can debug, swap, tune.
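Because each layer is addressed by its own field on session.update, swapping one is a config change, not a rewrite. A sketch of one payload touching all three layers (model and voice names are the ones used elsewhere on this page):

```javascript
// One session.update payload configures all three cascade layers independently.
function buildSessionUpdate({ sttModel, llm, ttsModel, voice }) {
  return {
    type: 'session.update',
    session: {
      type: 'realtime',
      model: llm, // Router layer: the LLM behind the agent
      audio: {
        input: { transcription: { model: sttModel } }, // STT layer
        output: { model: ttsModel, voice },            // TTS layer
      },
    },
  };
}

// Swap the LLM mid-project without touching STT or the voice.
const update = buildSessionUpdate({
  sttModel: 'inworld/inworld-stt-1',
  llm: 'anthropic/claude-sonnet-4-6',
  ttsModel: 'inworld-tts-1.5-max',
  voice: 'Clive',
});
```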
Semantic turn-taking

Reads the pause like a person does.

Semantic VAD tells a thinking pause apart from a finished turn, so agents never cut in mid-thought or wait out real silence. Barge-in works. Eagerness is configurable per session.
Semantic VAD
User says
"I was thinking..."
(0.8s pause)
Amplitude VAD
Cuts in
Silence triggers response
Semantic VAD
Waits
Trailing phrase is mid-thought
Reads grammar, not just audio amplitude. Configurable eagerness per session.
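Eagerness lives on the turn_detection block of the session config. A sketch, assuming the value set low / medium / high (only medium appears in this page's example; treat the other two as assumptions):

```javascript
// Turn-detection config for a session: semantic VAD with tunable eagerness.
// 'medium' is the value used in this page's example; 'low' and 'high' are assumed.
function turnDetectionConfig(eagerness = 'medium') {
  const allowed = ['low', 'medium', 'high'];
  if (!allowed.includes(eagerness)) {
    throw new Error(`unknown eagerness: ${eagerness}`);
  }
  return { type: 'semantic_vad', eagerness };
}
```

A lower eagerness gives trailing phrases like "I was thinking..." more room; a higher one answers faster at the cost of occasionally cutting in.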
Three transports, one API

Ship the same agent to the browser, the server, and the phone line.

WebSocket for servers. WebRTC for browsers and native apps. SIP for phone systems and call centers with G.711. One session config covers all three.
Three transports, one API
one session config
WebSocket
GA
Servers, back-end clients
WebRTC
Early access
Browsers, native clients
SIP
Early access
Phone, call center

Stream in forty lines

Open the session, pipe the mic in, play the audio out. Your agent sounds alive on the first packet.
const ws = new WebSocket(
  'wss://api.inworld.ai/api/v1/realtime/session?key=' + sessionId,
  ['realtime']
);

ws.on('message', (raw) => {
  const event = JSON.parse(raw.toString());

  if (event.type === 'session.created') {
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        type: 'realtime',
        model: 'anthropic/claude-sonnet-4-6',
        output_modalities: ['audio', 'text'],
        audio: {
          input: {
            transcription: { model: 'inworld/inworld-stt-1' },
            turn_detection: { type: 'semantic_vad', eagerness: 'medium' },
          },
          output: { model: 'inworld-tts-1.5-max', voice: 'Clive' },
        },
      },
    }));
  }

  if (event.type === 'response.output_audio.delta') {
    audioQueue.push(base64ToPcm16(event.delta));
  }
});

mic.on('data', (chunk) => {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: chunk.toString('base64'),
  }));
});
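The example above leaves base64ToPcm16 undefined. A minimal Node implementation, assuming little-endian samples to match the PCM16 24kHz format the session streams:

```javascript
// Decode a base64 audio delta into 16-bit signed samples (little-endian PCM16).
function base64ToPcm16(b64) {
  const buf = Buffer.from(b64, 'base64');
  const samples = new Int16Array(buf.length / 2);
  for (let i = 0; i < samples.length; i++) {
    samples[i] = buf.readInt16LE(i * 2);
  }
  return samples;
}
```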

FAQ

Is this compatible with OpenAI Realtime?
Yes. The speech-to-speech API follows the OpenAI Realtime protocol — same event names, same shape, same message flow. Change the base URL, keep your code. See the migration guide.

How fast is it?
Median first audio chunk under a second end-to-end (STT → reasoning → TTS). TTS alone hits under 200ms; the rest depends on your chosen LLM.

Which LLMs can I use?
Any LLM through the Inworld Router — OpenAI, Anthropic, Google, Meta, Mistral, xAI, Groq, and hundreds more. One field on session.update.

Which transports are supported?
WebSocket is GA. WebRTC and SIP are in early access — reach out for access.

What does "cascade by design" mean?
Cascade means STT, Router/LLM, and TTS are separate, swappable layers inside the session. Unified S2S models hide them in a black box. Cascade gives you control, debuggability, and per-layer tuning that end-to-end models can't match. It's a deliberate architectural choice.

Can users interrupt the agent mid-response?
Yes. Semantic VAD detects when the user starts speaking and cancels the in-flight response. The audio connection stays open.

Does it support tool calling?
Yes. Declare tools at session.update. When the model calls a function, the audio connection stays open while your server returns the result.
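Under the OpenAI Realtime shape, that looks roughly like declaring a function in session.update and answering response.function_call_arguments.done with a conversation.item.create. A sketch (the get_weather tool and its schema are illustrative):

```javascript
// Declare a tool at session.update time (tool name and schema are illustrative).
const toolDeclaration = {
  type: 'session.update',
  session: {
    type: 'realtime',
    tools: [{
      type: 'function',
      name: 'get_weather',
      description: 'Look up current weather for a city',
      parameters: {
        type: 'object',
        properties: { city: { type: 'string' } },
        required: ['city'],
      },
    }],
  },
};

// When the model finishes emitting arguments, return your server's result
// as a function_call_output item; audio keeps streaming on the same socket.
function toolResultEvent(callId, result) {
  return {
    type: 'conversation.item.create',
    item: {
      type: 'function_call_output',
      call_id: callId,
      output: JSON.stringify(result),
    },
  };
}
```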

What's the release status?
Research preview. WebSocket is GA; WebRTC and SIP are in early access.

Full-duplex speech-to-speech. Same session. Every model.

OpenAI Realtime compatible. Sub-second end-to-end. WebSocket, WebRTC, and SIP.
Copyright © 2021-2026 Inworld AI