Realtime API

Realtime Speech-to-Speech.

Low-latency, natural speech-to-speech experiences through a single API, with built-in multimodal capabilities, function calling, and turn-taking.

Everything you need for real-time voice AI

The Inworld Realtime API keeps a persistent connection open so you can stream audio and receive responses the moment they're generated.

Low-latency audio streaming

Full-duplex audio streaming over a single WebSocket or WebRTC connection. First audio plays back before generation completes.

Intelligent turn taking

Context-aware turn detection with adjustable eagerness.

Function calling

Mid-session tool registration. Function calls execute and return without interrupting the audio stream.
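A sketch of what mid-session tool registration can look like. The event shapes follow the OpenAI Realtime schema, which this API is documented as compatible with; the `get_weather` tool itself is a hypothetical example.

```javascript
// Payload to register (or re-register) tools without restarting the session.
// Event names follow the OpenAI Realtime schema; the tool is hypothetical.
function buildToolRegistration() {
  return {
    type: 'session.update',
    session: {
      type: 'realtime',
      tools: [{
        type: 'function',
        name: 'get_weather',                       // hypothetical example tool
        description: 'Look up current weather for a city.',
        parameters: {
          type: 'object',
          properties: { city: { type: 'string' } },
          required: ['city'],
        },
      }],
    },
  };
}

// Payload returning a tool result to the model; because results arrive as
// conversation items, the audio stream is not interrupted while this runs.
function buildToolResult(callId, result) {
  return {
    type: 'conversation.item.create',
    item: {
      type: 'function_call_output',
      call_id: callId,
      output: JSON.stringify(result),
    },
  };
}
```

Your backend would send `buildToolResult(...)` over the WebSocket after executing the function named in the model's function-call event, then request a new response.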

Dynamic context management

Create, retrieve, delete, or truncate conversation items mid-session to control context length and token cost.
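A minimal sketch of the pruning events, again assuming the OpenAI Realtime schema the API is documented as compatible with. Item IDs come from earlier `conversation.item.created` server events; the IDs shown in usage are placeholders.

```javascript
// Drop an item from the conversation so it no longer counts toward context.
function deleteItem(itemId) {
  return { type: 'conversation.item.delete', item_id: itemId };
}

// Fetch an item's full contents from the server for inspection.
function retrieveItem(itemId) {
  return { type: 'conversation.item.retrieve', item_id: itemId };
}

// Cut off the tail of an assistant audio item, e.g. after a user
// interruption, so the unheard portion is removed from context.
function truncateItem(itemId, audioEndMs) {
  return {
    type: 'conversation.item.truncate',
    item_id: itemId,
    content_index: 0,
    audio_end_ms: audioEndMs,
  };
}
```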

Provider agnostic

Route to the model that fits your latency, cost, or quality requirements, and swap it out at any time.
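Swapping models is a single `session.update`, sketched below. The `modelId` value `groq/gpt-oss-120b` appears in the example in this page; any other ID you pass would come from the model list in your dashboard.

```javascript
// Re-point the session at a different underlying model mid-conversation.
function swapModel(modelId) {
  return {
    type: 'session.update',
    session: { type: 'realtime', modelId },
  };
}

// Route to whichever provider fits the current latency/cost target, e.g.:
const update = swapModel('groq/gpt-oss-120b');
```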

Full server-side control

Every state change emits a structured event. Gate responses, moderate context, orchestrate tools, and monitor rate limits from your backend.
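A sketch of a server-side event router. `session.created` and `response.output_audio.delta` appear in the example on this page; `rate_limits.updated` and `error` are assumed from the OpenAI Realtime schema the API is documented as compatible with.

```javascript
// Map each structured server event to a backend action.
function routeEvent(event) {
  switch (event.type) {
    case 'session.created':
      return { action: 'configure' };              // send session.update
    case 'response.output_audio.delta':
      return { action: 'play', chunk: event.delta };
    case 'rate_limits.updated':                    // assumed OpenAI-schema name
      return { action: 'record_limits', limits: event.rate_limits };
    case 'error':
      return { action: 'alert', error: event.error };
    default:
      return { action: 'ignore' };
  }
}
```

Because every state change arrives as one of these events, gating a response or moderating context is just another branch in this dispatcher.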

Use cases

Common patterns for building with the Realtime API. Each maps to a real configuration you can copy and adapt.

Voice agent with turn-taking

Build a full-duplex voice agent that captures mic audio, detects speech boundaries automatically, and plays back responses as they stream. Semantic VAD handles turn-taking.
```javascript
// 1. Connect and configure
ws.on('message', (buffer) => {
  const event = JSON.parse(buffer.toString());

  if (event.type === 'session.created') {
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        type: 'realtime',
        modelId: 'groq/gpt-oss-120b',
        instructions: 'You are a helpful voice agent.',
        output_modalities: ['audio', 'text'],
        audio: {
          input: {
            turn_detection: {
              type: 'semantic_vad',
              eagerness: 'medium',
              create_response: true,
              interrupt_response: true,
            },
          },
          output: {
            model: 'inworld-tts-1.5-max',
            voice: 'Liam',
          },
        },
      },
    }));
  }

  // 2. Queue and play audio chunks as they arrive
  if (event.type === 'response.output_audio.delta') {
    queue.push(base64ToArrayBuffer(event.delta));
    if (!isPlaying) playNextChunk();
  }
});

// 3. Continuously stream mic audio (VAD handles turn detection)
micStream.on('data', (pcmChunk) => {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: toBase64(pcmChunk),
  }));
});
```
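The encoding helpers referenced in the snippet above are left undefined; in Node they can be sketched as follows (`Buffer` is a Node global, so no imports are needed):

```javascript
// Encode a raw PCM chunk (Buffer or Uint8Array) for input_audio_buffer.append.
function toBase64(pcmChunk) {
  return Buffer.from(pcmChunk).toString('base64');
}

// Decode a base64 audio delta into an ArrayBuffer for the playback queue.
function base64ToArrayBuffer(b64) {
  const buf = Buffer.from(b64, 'base64');
  return buf.buffer.slice(buf.byteOffset, buf.byteOffset + buf.byteLength);
}
```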

FAQ

Is the Realtime API compatible with the OpenAI Realtime API?
Yes. The Inworld Realtime API is fully compatible with the OpenAI Realtime API, so you can migrate by simply swapping the endpoint and auth credentials. A full migration guide is available here.

How is the Realtime API priced?
When using the Realtime API, you pay only for the underlying model usage. Rates for all models are available here. Inworld gives you built-in tools to manage costs, like capping response length, canceling responses early, and trimming conversation history, so you stay in full control of your spend.

Which languages are supported?
The Realtime API supports the languages available through the underlying models you select.

What are the rate limits?
By default, you can run up to 20 concurrent conversations, with up to 1,000 requests per second shared across them. Need more? Contact our team to discuss higher limits for your use case.

Which models are available?
The Realtime API gives you access to hundreds of models from leading providers, including OpenAI, Anthropic, Google, Mistral, xAI, and more. You can pick the best model for your application without being locked into a single provider.

Which connection protocols are supported?
WebSocket is currently publicly available, with WebRTC and SIP support in early access. Please reach out to our team if you'd like access.

Start building in minutes

Get an API key, open a WebSocket, stream audio.
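A minimal quick-start sketch. The endpoint URL below is a placeholder, and the bearer-auth header shape is an assumption; take the real endpoint and auth scheme from your Inworld dashboard.

```javascript
// Build connection options for the Realtime WebSocket. The URL is a
// placeholder and the Authorization header shape is an assumption.
function buildConnectionOptions(apiKey) {
  return {
    url: 'wss://example.invalid/v1/realtime',      // placeholder endpoint
    headers: { Authorization: `Bearer ${apiKey}` },
  };
}

// Usage (requires `npm install ws`):
// const WebSocket = require('ws');
// const { url, headers } = buildConnectionOptions(process.env.INWORLD_API_KEY);
// const ws = new WebSocket(url, { headers });
// ws.on('open', () => { /* send session.update, then stream audio */ });
```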

Copyright © 2021-2026 Inworld AI