































































Pick any LLM for the conversation engine. Swap providers without changing your integration.
```javascript
// Configure your realtime session
ws.send(JSON.stringify({
  "type": "session.update",
  "session": {
    "type": "realtime",
    "modelId": "anthropic/claude-sonnet-4-6",
    "instructions": "You are a helpful voice agent.",
    "output_modalities": ["audio", "text"],
    "audio": {
      "output": {
        "model": "inworld-tts-1.5-max",
        "voice": "Sarah"
      }
    }
  }
}));
```
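Swapping providers can be sketched as a one-field change to the payload above. The helper function and the alternate model id below are illustrative assumptions, not part of the documented API:

```javascript
// Build a session.update message for a given LLM. The payload shape
// mirrors the session.update example shown above.
function buildSessionUpdate(modelId) {
  return {
    type: "session.update",
    session: {
      type: "realtime",
      modelId, // e.g. "anthropic/claude-sonnet-4-6"
      instructions: "You are a helpful voice agent.",
      output_modalities: ["audio", "text"],
      audio: { output: { model: "inworld-tts-1.5-max", voice: "Sarah" } },
    },
  };
}

// Swapping providers is the same message with a different modelId
// (the id below is a hypothetical example, not from this page).
const msg = buildSessionUpdate("openai/gpt-4o");
// ws.send(JSON.stringify(msg));
```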
Optimized data flow delivers end-to-end speech-to-speech latency under one second. Voice agents respond with human-level cadence.
Context-aware semantic VAD with adjustable eagerness. The agent knows when to listen, when to speak, and when a user is interrupting.


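If the session follows OpenAI-style turn-detection conventions (an assumption; this page does not document the field names), tuning eagerness might look like the sketch below:

```javascript
// Hypothetical sketch: enable semantic VAD with a tuned eagerness.
// "turn_detection" and "eagerness" follow OpenAI Realtime conventions
// and are assumptions here, not fields documented on this page.
const vadUpdate = {
  type: "session.update",
  session: {
    type: "realtime",
    audio: {
      input: {
        turn_detection: {
          type: "semantic_vad",
          eagerness: "low", // wait longer before treating a pause as end of turn
        },
      },
    },
  },
};
// ws.send(JSON.stringify(vadUpdate));
```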
Inworld's STT generates voice personas — emotion, age, accent, and speaking rate — alongside transcriptions. These signals are automatically used by the LLM Router and TTS to improve generation quality.
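A consumer of these persona signals might look like the sketch below. The event shape (a `persona` object with `emotion`, `age`, `accent`, and `speakingRate`) is a hypothetical illustration, since this page does not specify the payload:

```javascript
// Hypothetical handler: read persona signals alongside a transcription.
// The field names here are illustrative assumptions, not documented.
function handleTranscription(event) {
  const { transcript, persona = {} } = event;
  const { emotion, speakingRate } = persona;
  // An application could adapt its own behavior from these signals,
  // e.g. use a calmer tone for a frustrated caller and match their pace.
  return {
    transcript,
    calming: emotion === "frustrated",
    matchRate: typeof speakingRate === "number" ? speakingRate : 1.0,
  };
}
```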
Route to hundreds of LLMs, choose your STT engine, and access custom Inworld voices — all from a single session. Swap any component at any time.

```json
{
  "type": "realtime",
  "modelId": "anthropic/claude-sonnet-4-6",
  "stt": { "model": "inworld/stt-1" },
  "audio": { "voice": "Sarah" }
}
```

Register tools at session start or add them on the fly. The assistant calls your functions mid-conversation without breaking the audio stream.
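Tool registration isn't shown on this page. If it follows OpenAI Realtime conventions for function tools (an assumption), registering one at session start might look like this; the tool name and schema are hypothetical:

```javascript
// Hypothetical sketch following OpenAI-style function-tool conventions;
// the exact field names are assumptions, not documented here.
const toolUpdate = {
  type: "session.update",
  session: {
    type: "realtime",
    tools: [
      {
        type: "function",
        name: "lookup_order",
        description: "Fetch order status by order id.",
        parameters: {
          type: "object",
          properties: { orderId: { type: "string" } },
          required: ["orderId"],
        },
      },
    ],
  },
};
// ws.send(JSON.stringify(toolUpdate));
```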

| Capability | Realtime API | OpenAI Realtime |
|---|---|---|
| OpenAI SDK compatible | | |
| Sub-second latency | | |
| LLM choice | Hundreds of models | GPT-4o only |
| TTS quality | #1 ranked TTS on Artificial Analysis | Built-in only |
| Custom voices | Built-in + cloned + custom | 6 preset voices |
| Function calling | | |
| Semantic turn detection | | |
| Conversational intelligence | Emotion, age, accent | |
| Transport options | WebSocket, WebRTC | WebSocket, WebRTC |
| Pricing (per minute) | From $0.015/min | From $0.06/min |
| Provider lock-in | None — swap models anytime | OpenAI only |
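Using the listed starting prices from the table, the cost difference is simple per-minute arithmetic:

```javascript
// Starting per-minute prices from the comparison table above.
const REALTIME_PER_MIN = 0.015; // $/min
const OPENAI_PER_MIN = 0.06;    // $/min

function monthlyCost(perMin, minutesPerMonth) {
  return perMin * minutesPerMonth;
}

// For 10,000 voice-agent minutes per month, roughly $150 vs $600
// (subject to floating-point rounding and the "From" pricing caveat).
const realtime = monthlyCost(REALTIME_PER_MIN, 10000);
const openai = monthlyCost(OPENAI_PER_MIN, 10000);
```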
