Inworld Realtime API

Controllable speech-to-speech that understands, reasons, and interacts

Low-latency, natural speech-to-speech experiences via one API. Built-in multimodal capabilities, function calling, and turn-taking.
Live conversation
User
turn_detection: semantic_vad · eagerness: medium · emotion: stressed · age: young

I've been feeling really overwhelmed lately. I don't even know where to start.

Companion
voice: Luna · output_modalities: text, audio

That's completely valid. Let's take a breath together first, then we can talk through what's on your mind — one thing at a time.

Everything you need for realtime voice AI

The Inworld Realtime API keeps a persistent connection open so you can stream audio and receive responses the moment they're generated.

Full duplex, low-latency streaming

Full-duplex audio streaming over a single WebSocket or WebRTC connection.

Intelligent turn taking

Context-aware turn detection with adjustable eagerness.
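
Eagerness is set as part of the session's audio input configuration. A minimal sketch, assuming the OpenAI-compatible `session.update` event shape shown in the example further down this page:

```javascript
// Build a session.update that tunes how quickly the agent takes its turn.
// 'low' waits longer for the user to finish; 'high' jumps in sooner.
function setEagerness(eagerness) {
  return {
    type: 'session.update',
    session: {
      type: 'realtime',
      audio: {
        input: {
          turn_detection: {
            type: 'semantic_vad',
            eagerness, // 'low' | 'medium' | 'high'
            create_response: true,
            interrupt_response: true
          }
        }
      }
    }
  };
}

// ws.send(JSON.stringify(setEagerness('high')));
```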

Function calling

Register tools mid-session. The assistant calls your functions without breaking audio.
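
A sketch of what mid-session tool registration and result delivery could look like, assuming the OpenAI-compatible event shapes (`session.update` with a `tools` array, a `function_call_output` conversation item); the `get_weather` tool is purely illustrative:

```javascript
// Hypothetical tool definition — name and schema are illustrative.
const registerTools = {
  type: 'session.update',
  session: {
    type: 'realtime',
    tools: [{
      type: 'function',
      name: 'get_weather',
      description: 'Look up current weather for a city.',
      parameters: {
        type: 'object',
        properties: { city: { type: 'string' } },
        required: ['city']
      }
    }]
  }
};

// When the model emits a function call, return the result as a
// conversation item, then request a new response so audio continues.
function toolResult(callId, result) {
  return {
    type: 'conversation.item.create',
    item: {
      type: 'function_call_output',
      call_id: callId,
      output: JSON.stringify(result)
    }
  };
}

// ws.send(JSON.stringify(registerTools));
// ws.send(JSON.stringify(toolResult('call_123', { tempC: 18 })));
// ws.send(JSON.stringify({ type: 'response.create' }));
```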

Provider agnostic

Route to the model that fits your latency, cost, or quality requirements, and swap it out at any time.

Dynamic context management

Create, retrieve, delete, or truncate conversation items mid-session to control context length and token cost.
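
A sketch of what the delete and truncate operations could look like on the wire, assuming the OpenAI-compatible `conversation.item.delete` and `conversation.item.truncate` event shapes; the item IDs are illustrative:

```javascript
// Drop an old item entirely to shrink the context window.
function deleteItem(itemId) {
  return { type: 'conversation.item.delete', item_id: itemId };
}

// Truncate an assistant audio item to what the user actually heard
// (e.g. after an interruption), so unheard audio doesn't stay in context.
function truncateItem(itemId, audioEndMs) {
  return {
    type: 'conversation.item.truncate',
    item_id: itemId,
    content_index: 0,
    audio_end_ms: audioEndMs
  };
}

// ws.send(JSON.stringify(deleteItem('item_abc')));
// ws.send(JSON.stringify(truncateItem('item_def', 1500)));
```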

Conversational intelligence

Use acoustic and metadata signals to condition what is said, when it is said, and how it is expressed.

Use cases

Common patterns for building with the Realtime API. Each maps to a real configuration you can copy and adapt.
```javascript
// 1. Connect and configure
ws.on('message', (buffer) => {
  const event = JSON.parse(buffer.toString());

  if (event.type === 'session.created') {
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        type: 'realtime',
        modelId: 'groq/gpt-oss-120b',
        instructions: 'You are a helpful voice agent.',
        output_modalities: ['audio', 'text'],
        audio: {
          input: {
            turn_detection: {
              type: 'semantic_vad',
              eagerness: 'medium',
              create_response: true,
              interrupt_response: true
            }
          },
          output: { model: 'inworld-tts-1.5-max', voice: 'Liam' }
        }
      }
    }));
  }

  // 2. Queue and play audio chunks as they arrive
  if (event.type === 'response.output_audio.delta') {
    queue.push(base64ToArrayBuffer(event.delta));
    if (!isPlaying) playNextChunk();
  }
});

// 3. Continuously stream mic audio (VAD handles turn detection)
micStream.on('data', (pcmChunk) => {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: toBase64(pcmChunk)
  }));
});
```

FAQ

Is the Realtime API compatible with the OpenAI Realtime API?
Yes. The Inworld Realtime API is fully compatible with the OpenAI Realtime API, so you can migrate by swapping the endpoint and auth credentials. A full migration guide is available here.
How is usage priced?
When using the Realtime API, you only pay for the underlying model usage. Rates for all models are available here. Inworld also gives you built-in tools to manage costs, such as capping response length, canceling responses early, and trimming conversation history, so you stay in full control of your spend.
Which languages are supported?
The Realtime API supports the languages available through the underlying models you select.
What are the default rate limits?
By default, you can run up to 20 concurrent conversations, with up to 1,000 requests per second shared across them. Need more? Contact our team to discuss higher limits for your use case.
Which models can I use?
The Realtime API gives you access to hundreds of models from leading providers, including OpenAI, Anthropic, Google, Mistral, and xAI. You can pick the best model for your application without being locked into a single provider.
Which transports are supported?
WebSocket is publicly available, with WebRTC and SIP support in early access. Please reach out to our team if you'd like access.
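
The cost controls mentioned in the pricing answer above (capping response length, canceling responses early) could look like this on the wire. A minimal sketch, assuming the OpenAI-compatible `response.create` and `response.cancel` event shapes:

```javascript
// Cap the length of a single response at creation time.
function cappedResponse(maxOutputTokens) {
  return {
    type: 'response.create',
    response: { max_output_tokens: maxOutputTokens }
  };
}

// Cut off an in-progress response, e.g. when the user interrupts.
const cancelResponse = { type: 'response.cancel' };

// ws.send(JSON.stringify(cappedResponse(256)));
// ws.send(JSON.stringify(cancelResponse));
```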

Start building in minutes

Get an API key, open a WebSocket, stream audio.
Copyright © 2021-2026 Inworld AI