Realtime API

Controllable speech-to-speech that understands, reasons, and interacts

Low-latency, natural speech-to-speech experiences via one API. Built-in multimodal capabilities, function calling, and turn-taking.
<1s latency · Hundreds of models · #1 ranked quality

Every configuration, one session

Pick any LLM for the conversation engine. Swap providers without changing your integration.

// Configure your realtime session
ws.send(JSON.stringify({
  "type": "session.update",
  "session": {
    "type": "realtime",
    "modelId": "anthropic/claude-sonnet-4-6",
    "instructions": "You are a helpful voice agent.",
    "output_modalities": ["audio", "text"],
    "audio": {
      "output": { "model": "inworld-tts-1.5-max", "voice": "Sarah" }
    }
  }
}));
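Because only the model string changes, swapping providers reduces to a one-line change. A minimal sketch, with the helper name and the second model ID as illustrative assumptions:

```javascript
// Hypothetical helper: build a session.update payload for any model ID.
// Everything except modelId stays identical when you swap providers.
function buildSessionUpdate(modelId, voice = "Sarah") {
  return {
    type: "session.update",
    session: {
      type: "realtime",
      modelId,
      output_modalities: ["audio", "text"],
      audio: { output: { model: "inworld-tts-1.5-max", voice } },
    },
  };
}

// Swapping the conversation engine is then a one-line change:
const claude = buildSessionUpdate("anthropic/claude-sonnet-4-6");
const gpt = buildSessionUpdate("openai/gpt-4o"); // assumed model ID
```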

Sub-second response time

Optimized data flow delivers end-to-end speech-to-speech latency under one second. Voice agents respond with human-level cadence.

  • Optimized STT, LLM, and TTS pipeline for the best latency and quality
  • Full-duplex audio streaming over WebSocket or WebRTC
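The streaming half of that pipeline can be sketched client-side. This assumes the OpenAI-compatible `input_audio_buffer.append` event (the API is described as drop-in compatible) and 16 kHz, 16-bit mono PCM; the frame size is illustrative:

```javascript
// Wrap one PCM chunk as a base64 append event (assumed event name,
// following OpenAI Realtime conventions). Buffer is a Node.js global.
function chunkToAppendEvent(pcmChunk) {
  return {
    type: "input_audio_buffer.append",
    audio: Buffer.from(pcmChunk).toString("base64"),
  };
}

// Split a capture buffer into 20 ms frames (16 kHz * 16-bit mono
// = 640 bytes per frame) and build one event per frame.
function frameEvents(captureBuffer, frameBytes = 640) {
  const events = [];
  for (let off = 0; off < captureBuffer.length; off += frameBytes) {
    events.push(chunkToAppendEvent(captureBuffer.subarray(off, off + frameBytes)));
  }
  return events;
}
```

Sending each event as soon as its frame is captured (rather than batching) is what keeps the upstream leg of the full-duplex link low-latency.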
Experience it live
Speech-to-speech latency: STT 200ms + LLM 400ms + TTS 180ms = under 1s end to end.

Intelligent turn taking

Context-aware semantic VAD with adjustable eagerness. The agent knows when to listen, when to speak, and when a user is interrupting.

  • Semantic VAD detects intent boundaries, not just silence
  • Adjustable eagerness from cautious to aggressive
  • Graceful barge-in handling — no awkward overlaps or cut-offs
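Eagerness can be adjusted mid-session with a `session.update`. A sketch: the `semantic_vad`/`eagerness` field names follow OpenAI Realtime conventions, which this API is described as compatible with, so treat the exact shape as an assumption:

```javascript
// Build a session.update that tunes turn-detection eagerness from
// cautious ("low") to aggressive ("high").
function setEagerness(eagerness) {
  const allowed = ["low", "medium", "high", "auto"];
  if (!allowed.includes(eagerness)) {
    throw new Error(`eagerness must be one of: ${allowed.join(", ")}`);
  }
  return {
    type: "session.update",
    session: {
      audio: {
        input: {
          turn_detection: { type: "semantic_vad", eagerness },
        },
      },
    },
  };
}
```

A support bot that should never talk over callers might start at "low", while a fast-paced ordering agent might run at "high".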
Try it in Playground
Live call example:
  User:  "Hi I'd like to order 12 iced teas…"
  User:  "… I mean two taro bobas"
  Agent: "Two taro bubble teas coming up!"

Conversational intelligence

Inworld's STT generates voice personas — emotion, age, accent, and speaking rate — alongside transcriptions. These signals are automatically used by the LLM Router and TTS to improve generation quality.

  • Realtime STT extracts emotion, age, accent, and speaking rate from audio
  • Voice persona signals flow into the LLM Router for context-aware responses
  • TTS adapts tone and pacing based on detected conversational signals
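As an illustration of how a client might consume those signals, here is a hypothetical handler; the `persona.update` event shape and its field names are invented for this sketch and are not a documented schema:

```javascript
// Summarize per-chunk persona signals, keeping only those the model is
// reasonably confident about. Event shape is hypothetical.
function summarizePersona(event) {
  const threshold = 0.8;
  return Object.entries(event.signals || {})
    .filter(([, s]) => s.confidence >= threshold)
    .map(([name, s]) => `${name}=${s.value}`)
    .join(", ");
}

const example = {
  type: "persona.update", // hypothetical event type
  signals: {
    emotion: { value: "frustrated", confidence: 0.92 },
    accent:  { value: "British",    confidence: 0.94 },
    rate:    { value: "fast",       confidence: 0.89 },
    age:     { value: "25-34",      confidence: 0.65 }, // below threshold, dropped
  },
};
console.log(summarizePersona(example));
// → "emotion=frustrated, accent=British, rate=fast"
```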
Hear the difference
Per audio chunk, the STT reports Emotion (Frustrated, 92%), Age (25–34, 87%), Accent (British, 94%), and Rate (Fast, 89%), all injected into LLM and TTS context.

Provider agnostic, full control

Route to hundreds of LLMs, choose your STT engine, and access custom Inworld voices — all from a single session. Swap any component at any time.

  • OpenAI, Anthropic, Google, Groq, Mistral, xAI, and more
  • Choose STT provider independently of LLM
  • Access to all Inworld built-in voices as well as your cloned and custom voices
Configure and try
session.json
{
  "type": "realtime",
  "modelId": "anthropic/claude-sonnet-4-6",
  "stt": { "model": "inworld/stt-1" },
  "audio": { "voice": "Sarah" }
}
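A small sketch of composing that configuration programmatically, so STT, LLM, and voice can each be swapped independently (the helper name is illustrative):

```javascript
// Compose a session config from independently chosen components.
// Model, STT, and voice identifiers are the ones from the sample above.
function composeSession({ llm, stt, voice }) {
  return {
    type: "realtime",
    modelId: llm,
    stt: { model: stt },
    audio: { voice },
  };
}

const session = composeSession({
  llm: "anthropic/claude-sonnet-4-6",
  stt: "inworld/stt-1",
  voice: "Sarah",
});
```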

Fluent tool calling for agents

Register tools at session start or add them on the fly. The assistant calls your functions mid-conversation without breaking the audio stream.

  • Declare tools once — the agent invokes them when needed
  • Use the built-in web search as well as any custom tool you define
  • Audio stays open while tools execute and results stream back
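A sketch of both halves, declaring a tool at session start and streaming a result back, assuming the OpenAI-compatible function-tool schema and `conversation.item.create` event (the API is described as drop-in compatible). `get_booking` is the example tool from the flow shown below:

```javascript
// Declare a tool once at session start (OpenAI-compatible schema).
const sessionUpdate = {
  type: "session.update",
  session: {
    tools: [
      {
        type: "function",
        name: "get_booking",
        description: "Look up a booking by confirmation code.",
        parameters: {
          type: "object",
          properties: { code: { type: "string" } },
          required: ["code"],
        },
      },
    ],
  },
};

// When the agent calls the tool, run it and stream the result back as a
// function_call_output item; the audio stream stays open throughout.
function toolResultEvent(callId, result) {
  return {
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id: callId,
      output: JSON.stringify(result),
    },
  };
}
```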
Build with it
Voice Stream ⇄ Realtime API ⇄ Your Tools
  get_booking()   → result streamed back
  update_crm()    → confirmed
  check_weather() → result streamed back
Audio stays open throughout.

Use cases

The Realtime API powers any application where voice is the primary interface.

Inworld Realtime API vs OpenAI Realtime API

Drop-in compatible with the OpenAI Realtime API. More flexible, more models, better pricing.
| Capability | Realtime API | OpenAI Realtime |
| --- | --- | --- |
| OpenAI SDK compatible | ✓ | ✓ |
| Sub-second latency | ✓ | ✓ |
| LLM choice | Hundreds of models | GPT-4o only |
| TTS quality | #1 ranked TTS on Artificial Analysis | Built-in only |
| Custom voices | Built-in + cloned + custom | 6 preset voices |
| Function calling | ✓ | ✓ |
| Semantic turn detection | ✓ | ✓ |
| Conversational intelligence | Emotion, age, accent | — |
| Transport options | WebSocket, WebRTC | WebSocket, WebRTC |
| Pricing (per minute) | From $0.015/min | From $0.06/min |
| Provider lock-in | None, swap models anytime | OpenAI only |

FAQ

Is the Realtime API compatible with the OpenAI Realtime API?
Yes. The Realtime API is fully compatible with the OpenAI Realtime API, so you can migrate by simply swapping the endpoint and auth credentials. A full migration guide is available here.
How does pricing work?
When using the Realtime API, you only pay for the underlying model usage. Rates for all models are available here. Inworld gives you built-in tools to manage costs, like capping response length, canceling responses early, and trimming conversation history, so you stay in full control of your spend.
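Two of those cost controls can be expressed as session events. A sketch, assuming OpenAI-compatible field and event names (`max_output_tokens`, `response.cancel`), since the API is described as drop-in compatible:

```javascript
// Cap how long each response can get (assumed field name, following
// OpenAI Realtime conventions).
const capLength = {
  type: "session.update",
  session: { max_output_tokens: 300 },
};

// Cancel an in-flight response early, e.g. when the user barges in
// (assumed event name, same convention).
const cancelEarly = { type: "response.cancel" };
```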
Which languages are supported?
The Realtime API supports the languages available through the underlying models you select.
What are the rate limits?
By default, you can run up to 20 concurrent conversations, with up to 1,000 requests per second shared across them. Need more? Contact our team to discuss higher limits for your use case.
Which models can I use?
The Realtime API gives you access to hundreds of models from leading providers, such as OpenAI, Anthropic, Google, Mistral, xAI, and more. You can pick the best model for your application without being locked into a single provider.
Which transports are supported?
WebSocket is currently publicly available, with WebRTC and SIP support in early access. Please reach out to our team if you’d like access.

Start building in minutes

Get an API key, open a WebSocket, stream audio.
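Those three steps can be sketched as a startup message plan; the endpoint URL and auth header come from your dashboard, and the event names follow the OpenAI-compatible conventions described above:

```javascript
// Build the first messages to send once the WebSocket is open:
// configure the session, then start streaming base64 audio chunks.
function startupMessages(modelId, firstChunkBase64) {
  return [
    { type: "session.update", session: { type: "realtime", modelId } },
    { type: "input_audio_buffer.append", audio: firstChunkBase64 },
  ];
}

// Usage once connected (ws is your open WebSocket):
//   startupMessages("anthropic/claude-sonnet-4-6", chunk)
//     .forEach((m) => ws.send(JSON.stringify(m)));
```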
Copyright © 2021-2026 Inworld AI