What is a voice agent API, and how do you pick the right one?
A voice agent API handles the full voice loop: audio in, language model processing, audio out. Inworld AI's Realtime API runs this entire pipeline over a single WebSocket connection. It pairs Realtime STT, Realtime Router (routing to hundreds of LLMs), and Realtime TTS (ranked #1 on the Artificial Analysis Speech Arena, ELO ~1,208) into one session with built-in VAD, turn detection, barge-in, and function calling. This guide breaks down the architecture decisions behind choosing a voice agent API for production.
This is not a tutorial. If you want working code in 2 minutes, read How to Build an AI Voice Agent. If you want a step-by-step walkthrough, read Build a Voice Agent in 30 Minutes. This page is for engineers evaluating which stack to bet on.
How does a voice agent API work under the hood?
Every voice agent runs the same core loop:
- Capture raw audio from the user (microphone, telephony, WebRTC)
- Transcribe speech to text (STT)
- Reason over the transcript with a language model (LLM)
- Synthesize the response back to speech (TTS)
- Stream audio back to the user while handling interruptions
The difference between voice agent APIs comes down to how they compose these stages and how much control you get over each one.
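As a rough illustration of the loop above, the sketch below composes the three middle stages as plain Python callables. The class name `VoicePipeline` and the stage functions (`fake_stt`, `fake_llm`, `fake_tts`) are hypothetical stand-ins invented for this example, not real service calls. The point is only that each stage is a separate, swappable step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Composes STT -> LLM -> TTS; each stage can be swapped independently."""
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]

    def run_turn(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)   # speech to text
        reply = self.llm(transcript)      # reason over the transcript
        return self.tts(reply)            # text back to speech


# Hypothetical stand-ins: a real stage would call a speech or LLM service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply_audio = pipeline.run_turn(b"hello")
```

A production pipeline would stream partial results between stages and handle interruptions rather than run one blocking turn at a time.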
There are three architectural approaches shipping today:
What makes the Inworld Realtime API different from GPT-Realtime-2?
OpenAI shipped GPT-Realtime-2 on May 7, 2026 with native audio understanding, 128K context, and GPT-5-level reasoning. It is a strong product for teams that want GPT-5 specifically and do not need to swap models.
The Inworld Realtime API takes a different architectural bet: a model-agnostic pipeline. Realtime Router sits between STT and TTS, routing to hundreds of LLMs from OpenAI, Anthropic, Google, Groq, and Fireworks. You can swap the LLM mid-session without reconnecting. You can run A/B tests across models with sticky user routing. And you get Realtime TTS for voice output, which ranks #1 on the Artificial Analysis Speech Arena.
The key question is whether you want a vertically integrated model or a composable pipeline.
Tool calling and instruction following. GPT-Realtime-2 handles audio natively, which is impressive for conversational fluency, but tool calling accuracy and instruction following are significantly weaker than text-mode GPT-5. The native audio pipeline trades precision for naturalness. With the Inworld Realtime API, reasoning runs through text-mode LLMs (via Router) where tool calling and instruction following are mature and reliable, while TTS handles voice quality separately. You get the best of both: reliable function execution and #1 ranked voice quality.
Neither approach is universally better. If your use case is simple voice conversation without complex tool calls, GPT-Realtime-2 is simpler. If you need reliable tool calling, instruction following, model flexibility, or top-ranked TTS quality, the Realtime API gives you that.
How does the Realtime API session work?
The connection flow is straightforward. Connect to the WebSocket, receive session.created, send your configuration via session.update, then stream audio bidirectionally.
// Node.js: the 'ws' package is required here, since the browser WebSocket
// constructor cannot set custom headers and has no .on() method.
import WebSocket from 'ws';

const ws = new WebSocket(
  'wss://api.inworld.ai/api/v1/realtime/session?key=my-session&protocol=realtime',
  { headers: { Authorization: `Basic ${process.env.INWORLD_API_KEY}` } }
);

ws.on('open', () => {
  console.log('Connected to Realtime API');
});

ws.on('message', (raw) => {
  const event = JSON.parse(raw.toString());

  if (event.type === 'session.created') {
    // Configure the voice agent
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        type: 'realtime',
        instructions: 'You are a helpful support agent.',
        model: 'gpt-5.4',
        output_modalities: ['audio', 'text'],
        temperature: 0.7,
        audio: {
          output: {
            voice: 'Sarah',
            model: 'inworld-tts-1.5-max',
            speed: 1.0,
          },
        },
      },
    }));
  }

  if (event.type === 'session.updated') {
    console.log('Session configured. Ready to stream audio.');
  }

  if (event.type === 'response.output_audio.delta') {
    // Decode and play base64 PCM16 audio chunk
    const audio = Buffer.from(event.delta, 'base64');
    // ... write to speaker or forward to client
  }

  if (event.type === 'input_audio_buffer.speech_started') {
    // User is speaking - stop playback for barge-in
  }

  if (event.type === 'response.function_call_arguments.done') {
    // Handle tool call, send result back
    const args = JSON.parse(event.arguments); // pass args to your tool here
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id: event.call_id,
        output: JSON.stringify({ result: 'your tool output here' }),
      },
    }));
    // Ask the model to continue now that the tool result is available
    ws.send(JSON.stringify({ type: 'response.create' }));
  }
});
Key things to notice in this configuration:
- model: 'gpt-5.4' sets the LLM via Realtime Router. Change this to anthropic/claude-opus-4-7, auto, or any model the Router supports. The TTS and STT layers stay the same.
- voice: 'Sarah' and model: 'inworld-tts-1.5-max' inside audio.output configure TTS independently from the LLM. These are Realtime WebSocket fields (voice/model), not REST TTS fields (voiceId/modelId).
- output_modalities: ['audio', 'text'] returns both spoken audio and a text transcript. Drop 'text' if you only need audio.
- Function calling works across any LLM the Router supports. Register tools in the session.update payload and handle response.function_call_arguments.done events.
For the full protocol reference including all event types, see the Inworld Realtime WebSocket docs.
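To make the function calling flow more concrete, here is a minimal sketch of the two payloads involved, written in Python since the messages are plain JSON. The placement of `tools` inside `session` and the tool schema shape are assumptions borrowed from the OpenAI Realtime protocol, which this API is described as following. Confirm both against the protocol docs. The `lookup_order` tool and the helper names are hypothetical.

```python
import json


def build_session_update(instructions, model, tools):
    """Build a session.update payload that registers tools (assumed field layout)."""
    return {
        "type": "session.update",
        "session": {
            "type": "realtime",
            "instructions": instructions,
            "model": model,
            "tools": tools,  # assumption: tools sit directly under session
        },
    }


# Hypothetical tool definition, using a JSON Schema for its parameters.
lookup_order_tool = {
    "type": "function",
    "name": "lookup_order",
    "description": "Look up an order by its ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}


def build_function_call_output(event, tool_fn):
    """Run the tool for a response.function_call_arguments.done event
    and build the conversation.item.create message that returns its result."""
    args = json.loads(event["arguments"])
    result = tool_fn(**args)
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": event["call_id"],
            "output": json.dumps(result),
        },
    }
```

After sending the output item, a `response.create` message asks the model to continue, as in the JavaScript handler above.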
Can you swap the LLM mid-session?
Yes. Send a new session.update with a different model value. The WebSocket stays open, the TTS voice stays the same, and the next response uses the new LLM.
// Switch to Claude mid-session without reconnecting
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    model: 'anthropic/claude-opus-4-7',
  },
}));

// Or use Realtime Router for automatic selection
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    model: 'auto',
  },
}));
This is the core architectural difference. In a single-model system like GPT-Realtime-2, switching models means switching providers. In the Realtime API, the LLM is a configuration parameter.
Practical scenarios where this matters:
- A/B testing models in production. Realtime Router supports variant weights and sticky user routing. Route 50% of sessions to GPT-5.4 and 50% to Claude Opus 4.7, measure which performs better for your use case.
- Cost optimization. Route simple queries to a fast, cheap model. Route complex queries to a premium model. The Router handles this automatically when you set model: 'auto' with sort strategies.
- Failover. If one provider has an outage, Router falls back to the next model in the chain without dropping the session.
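For intuition about what variant weights with sticky user routing mean, the sketch below shows one common way to get that behavior: hash the user ID to a stable point and map it onto weighted buckets. This is an illustration only, not how Realtime Router is implemented. The function name `pick_variant` and the weight format are hypothetical.

```python
import hashlib


def pick_variant(user_id: str, variants: dict[str, float]) -> str:
    """Deterministically map a user to a variant, in proportion to the weights.

    The same user_id always lands on the same variant (sticky routing),
    as long as the variant names and weights do not change.
    """
    if not variants:
        raise ValueError("at least one variant is required")
    total = sum(variants.values())
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a point in [0, total).
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    chosen = None
    for name, weight in variants.items():
        chosen = name
        cumulative += weight
        if point < cumulative:
            break
    return chosen  # falls back to the last variant on rounding edge cases


# A 50/50 split between the two models mentioned above.
weights = {"gpt-5.4": 0.5, "anthropic/claude-opus-4-7": 0.5}
model_for_user = pick_variant("user-123", weights)
```

Because the assignment depends only on the user ID and the weights, a returning user keeps the same model across sessions, which keeps A/B measurements clean.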
How does Inworld fit under voice agent frameworks?
If you are using a framework like Vapi, Pipecat, or LiveKit, you do not have to choose between the framework and Inworld. Inworld products work as infrastructure components underneath these frameworks.
Example: using Realtime TTS as a streaming provider in a Python-based pipeline:
# Inworld Realtime TTS as a provider in Pipecat
# Works with any framework that accepts an OpenAI-compatible TTS endpoint
import base64
import json
import os

import aiohttp

# Read the API key from the environment rather than hard-coding it
INWORLD_API_KEY = os.environ['INWORLD_API_KEY']


async def inworld_tts_stream(text: str, voice: str = 'Sarah'):
    """Stream TTS audio from Inworld REST API."""
    async with aiohttp.ClientSession() as session:
        async with session.post(
            'https://api.inworld.ai/tts/v1/voice:stream',
            headers={
                'Authorization': f'Basic {INWORLD_API_KEY}',
                'Content-Type': 'application/json',
            },
            json={
                'text': text,
                'voiceId': voice,  # use the caller's voice, not a fixed one
                'modelId': 'inworld-tts-1.5-max',
                'audioConfig': {
                    'audioEncoding': 'PCM',
                    'sampleRateHertz': 24000,
                },
            },
        ) as resp:
            resp.raise_for_status()
            # The response body is NDJSON: one JSON object per line
            async for line in resp.content:
                if line.strip():
                    chunk = json.loads(line)
                    audio = base64.b64decode(
                        chunk['result']['audioContent']
                    )
                    yield audio
Note the field names: the REST TTS API uses voiceId and modelId (not voice and model, which are WebSocket fields). The streaming endpoint returns NDJSON where each line contains result.audioContent as base64. Decode before writing to disk or forwarding to a client.
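The decode step can be sketched offline, without calling the API. The example below parses NDJSON lines of the shape described above and wraps the audio in a WAV container using the standard library. It assumes the decoded chunks are raw, headerless 16-bit mono PCM at 24 kHz. If the service already includes a WAV header in its output, skip the wrapping step. The helper names are hypothetical.

```python
import base64
import io
import json
import wave


def ndjson_to_pcm(lines):
    """Concatenate decoded audio from NDJSON lines shaped like
    {"result": {"audioContent": "<base64>"}}; blank lines are skipped."""
    pcm = bytearray()
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        pcm.extend(base64.b64decode(chunk["result"]["audioContent"]))
    return bytes(pcm)


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw PCM in a WAV container so standard players can open it."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono audio
        wav.setsampwidth(2)   # assumption: 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

In a streaming pipeline you would usually forward each decoded chunk as it arrives. Buffering the whole response, as here, only makes sense when writing a file.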
What does semantic VAD actually do?
Turn detection is what separates a voice agent from a voice chatbot. Basic VAD (voice activity detection) listens for silence. If the user pauses for 500ms, it assumes they are done talking. This cuts people off mid-thought constantly.
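To see why a fixed silence threshold cuts people off, consider this toy silence-based detector. It is a simplified sketch, not any vendor's VAD. The frame size, energy threshold, and function name are illustrative assumptions.

```python
def silence_vad_end_of_turn(frame_energies, frame_ms=20,
                            energy_threshold=0.01, silence_ms=500):
    """Return the index of the frame where a fixed-silence VAD declares
    the turn over, or None if it never does."""
    frames_needed = silence_ms // frame_ms  # 25 frames at 20 ms each
    silent_run = 0
    heard_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0          # speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return i            # 500 ms of silence: assume the user is done
    return None


# The user speaks, pauses for 600 ms to think, then keeps talking.
speech, quiet = 0.2, 0.0
frames = [speech] * 10 + [quiet] * 30 + [speech] * 10
cutoff = silence_vad_end_of_turn(frames)
```

Here the detector ends the turn at frame 34, before the user resumes at frame 40, so the agent would start replying in the middle of the user's thought.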
Semantic VAD uses conversational context to determine when a user has finished a complete thought. The Realtime API supports both modes:
- Semantic VAD (semantic_vad): waits for a semantically complete pause. Configurable eagerness: low, medium, high, auto. Low eagerness waits longer, which works better for complex queries. High eagerness responds faster, which suits quick-fire interactions.
- Server VAD (server_vad): amplitude-based threshold detection. Lower latency, but more likely to interrupt mid-sentence.
When create_response is set to true in the turn detection config, the API automatically triggers response generation when it detects the user has finished speaking. Combined with interrupt_response: true, the agent stops speaking when the user barges in. This two-way interruption handling is built into the protocol.
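A turn detection config along these lines might look like the sketch below. The option names (`semantic_vad`, `server_vad`, `eagerness`, `create_response`, `interrupt_response`) come from the text above. The nesting under `session.audio.input.turn_detection` and the `type` key are assumptions modeled on the OpenAI Realtime protocol, so verify the exact shape in the session configuration docs. The builder function itself is hypothetical.

```python
ALLOWED_MODES = {"semantic_vad", "server_vad"}
ALLOWED_EAGERNESS = {"low", "medium", "high", "auto"}


def build_turn_detection_update(mode="semantic_vad", eagerness="auto",
                                create_response=True, interrupt_response=True):
    """Build a session.update payload that configures turn detection."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown turn detection mode: {mode}")
    turn_detection = {
        "type": mode,
        "create_response": create_response,        # auto-reply when the turn ends
        "interrupt_response": interrupt_response,  # stop speaking on barge-in
    }
    if mode == "semantic_vad":
        if eagerness not in ALLOWED_EAGERNESS:
            raise ValueError(f"unknown eagerness: {eagerness}")
        turn_detection["eagerness"] = eagerness
    return {
        "type": "session.update",
        "session": {
            "type": "realtime",
            # Assumed location of the turn detection block.
            "audio": {"input": {"turn_detection": turn_detection}},
        },
    }


# Patient turn-taking for complex queries.
payload = build_turn_detection_update(eagerness="low")
```

Serialize the result with `json.dumps` and send it over the open WebSocket like any other `session.update`.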
What should you evaluate when choosing a voice agent API?
If you are comparing voice agent APIs for a production deployment, here is a framework for the decision.
Voice quality. Your users hear TTS output on every interaction. Quality directly affects trust and engagement. Check the Artificial Analysis Speech Arena leaderboard for current rankings. As of this writing, Realtime TTS 1.5 Max holds the #1 position (ELO ~1,208).
Latency budget. End-to-end latency is STT + LLM inference + TTS first byte. A unified pipeline eliminates the network hops between services. A DIY stack adds one round trip per stage. Realtime TTS 1.5 Mini delivers sub-130ms P90 TTS latency, which leaves more of the latency budget for LLM inference.
Model flexibility. Will you always use the same LLM? If yes, a single-model solution is simpler. If you need to experiment, optimize costs, or avoid provider lock-in, a model-agnostic pipeline matters.
Turn detection quality. Ask for a demo with real conversational audio, not scripted examples. Semantic VAD versus fixed-threshold VAD is the difference between a natural conversation and an agent that interrupts you.
Protocol support. WebSocket covers most server-side integrations. WebRTC matters for browser-native applications with strict latency requirements. SIP matters for telephony. Check which protocols each API supports.
Function calling. Voice agents that can take actions (book appointments, query databases, trigger workflows) are significantly more useful. Verify that tool calling works mid-session without breaking the audio stream.
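The latency budget criterion above can be made concrete with simple arithmetic. In the sketch below, only the 130 ms TTS figure comes from this article. The STT time, LLM time, and per-hop round trip are made-up placeholder numbers, so substitute your own measurements.

```python
def end_to_end_latency_ms(stt_ms, llm_first_token_ms, tts_first_byte_ms,
                          inter_service_hops=0, hop_rtt_ms=0.0):
    """Estimate voice-in to voice-out latency as the sum of the stage
    latencies plus one network round trip per inter-service hop."""
    return (stt_ms + llm_first_token_ms + tts_first_byte_ms
            + inter_service_hops * hop_rtt_ms)


# Placeholder stage timings (hypothetical, apart from the TTS figure).
stages = dict(stt_ms=150, llm_first_token_ms=400, tts_first_byte_ms=130)

# Unified pipeline: stages run behind one connection, no extra hops.
unified = end_to_end_latency_ms(**stages)

# DIY stack: one extra round trip per stage, assumed to cost 40 ms each.
diy = end_to_end_latency_ms(**stages, inter_service_hops=3, hop_rtt_ms=40)
```

With these placeholder numbers the unified pipeline comes to 680 ms and the DIY stack to 800 ms. The gap grows with the round-trip time between your services.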
How do you get started?
- Sign up at platform.inworld.ai and generate an API key.
- Try the Realtime API with the 2-minute quickstart or the 30-minute tutorial.
- Read the protocol docs for the full event reference: WebSocket guide, session configuration.
- Use Realtime TTS or Router independently if you already have a framework. Drop them in as providers without changing your architecture.
Frequently Asked Questions
What is a voice agent API?
A voice agent API handles the full loop of voice interaction: speech recognition, language model processing, and speech synthesis. Instead of stitching together separate STT, LLM, and TTS services, a voice agent API manages the pipeline over a single connection with built-in turn detection and interruption handling.
How does the Inworld Realtime API work?
The Realtime API accepts streaming audio over a persistent WebSocket connection and returns synthesized audio. Internally it runs Realtime STT, routes to any LLM via Realtime Router, and generates speech with Realtime TTS. VAD, turn detection, barge-in, and function calling are all handled server-side.
What is the difference between Inworld Realtime API and OpenAI GPT-Realtime-2?
Inworld Realtime API is model-agnostic. It routes to hundreds of LLMs via Realtime Router, uses Realtime TTS (ranked #1 on the Artificial Analysis Speech Arena), and lets you swap any component independently. OpenAI GPT-Realtime-2 is a single integrated model that handles audio natively with 128K context and GPT-5 reasoning, but locks you into one provider for all three stages.
Can I use Inworld Realtime API with voice agent frameworks like Vapi and Pipecat?
Yes. Inworld products work as infrastructure under frameworks. Use Realtime TTS as the TTS provider in Pipecat or LiveKit, Realtime Router as the LLM endpoint in any OpenAI-compatible framework, or the full Realtime API as a drop-in speech-to-speech backend.
How do I build a voice agent with the Inworld Realtime API?
Open a WebSocket to wss://api.inworld.ai/api/v1/realtime/session with your API key, send a session.update event with your system prompt, TTS voice, and LLM model, then stream PCM16 audio in and receive synthesized audio back. The API handles STT, routing, TTS, VAD, and interruption logic. See our quickstart guide for runnable code.
What latency should I expect from a voice agent API?
With Realtime TTS 1.5 Mini, expect sub-130ms P90 TTS latency. End-to-end voice-in to voice-out depends on LLM inference time and network conditions, but the single-connection architecture avoids the inter-service overhead of a DIY stack.
Is the Inworld Realtime API compatible with OpenAI's realtime protocol?
Yes. The event system follows the OpenAI Realtime protocol. Event types, session configuration shapes, and message structures are consistent. Teams using OpenAI Realtime can migrate with minimal code changes.
What is semantic VAD and why does it matter for voice agents?
Semantic VAD uses conversational context to detect when a user has finished speaking, rather than relying on a fixed silence threshold. This prevents the agent from cutting in during natural pauses and produces more natural turn-taking. The Realtime API supports semantic VAD with configurable eagerness levels (low, medium, high, auto).