Voice Agents

Build voice agents that sound like real people

The #1 ranked voice on Artificial Analysis wired into a full-duplex Realtime API. OpenAI-compatible, under a second end-to-end, any LLM behind it.
Live voice agent
Caller
turn_detection: semantic_vad · stt: inworld/inworld-stt-1

I need to move my dinner reservation to 8pm.

Agent
session.model: openai/gpt-5.4 · voice: Clive · tool: updateReservation

Done. Dinner for four moved to 8pm. I also bumped your table to the window.

Trusted by
Status · Talkpal · Bible Chat · Death by AI

Voice quality, on every turn.

#1 ranked voice, sub-second latency, any LLM behind one session.
#1 ranked most natural
Works with
Realtime API · TTS

The top-ranked voice, now for agents.

Inworld TTS ranks #1 on the Artificial Analysis Speech Arena, the vote-driven voice AI benchmark. Three of the top five voices are ours. Every agent inherits it on the first call.
Artificial Analysis · Speech Arena
#1
Inworld TTS 1.5 Max
#2
ElevenLabs v3
#3
Inworld TTS 1 Max
#5
Inworld TTS 1.5 Mini
3 of the top 5 are Inworld
Conversational, not scripted
Works with
Realtime API · TTS

It doesn't read the words. It talks to them.

Semantic turn-taking reads the pause the way a person does. Barge-in lets callers interrupt without breaking flow. Tuned for live conversation, not narration.
Live call · semantic VAD
User
So I was hoping to move my dinner to eight…
agent paused · user still speaking
User
…actually nine. Can we do nine?
Agent
Nine it is. Same table, window side.
Under a second, end-to-end

A conversation pace, not a buffering pace.

Full-duplex streaming over WebSocket, WebRTC, or SIP. First audio lands in under a second end-to-end. The pause your users hear is a beat, not a gap.
Voice agent · first audio chunk
<1s
Median end-to-end
STT → Router → TTS. Conversation, not latency.
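If you want to verify the sub-second claim against your own stack, the measurement is the gap between end-of-speech and the first audio delta back. A minimal sketch, assuming the protocol's `speech_stopped` and `output_audio.delta` event names; the meter class itself is illustrative:

```typescript
// Measures end-to-end turn latency: time from the caller going quiet to the
// first audio chunk of the agent's reply. Only the first delta of each turn
// is counted. The injectable clock makes the logic testable.
class TurnLatencyMeter {
  private speechEndedAt: number | null = null;
  samples: number[] = [];

  constructor(private now: () => number = () => Date.now()) {}

  onEvent(type: string) {
    if (type === 'input_audio_buffer.speech_stopped') {
      this.speechEndedAt = this.now();
    } else if (type === 'response.output_audio.delta' && this.speechEndedAt !== null) {
      this.samples.push(this.now() - this.speechEndedAt);
      this.speechEndedAt = null;   // ignore later deltas of the same turn
    }
  }

  median(): number {
    const s = [...this.samples].sort((a, b) => a - b);
    return s.length ? s[Math.floor(s.length / 2)] : NaN;
  }
}
```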
Any model, one session

Swap brains without rewiring the agent.

Route to GPT, Claude, Gemini, Llama, or any of hundreds of models through a single session endpoint. Switch providers with one field. Reasoning is never locked to a vendor.
Explore the Router
Pick any model. Swap any time.
session.model
OpenAI
gpt-5.4
Anthropic
claude-sonnet-4-6
Google
gemini-3.1-pro
Meta
llama-4-maverick
Mistral
medium-2508
xAI
grok-4.20
Groq
gpt-oss-120b
Fireworks
deepseek-v3-2
Hundreds more through the same session endpoint.
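Concretely, the one-field swap looks like this. The payload shape mirrors the session config shown on this page; the `sessionUpdate` helper is illustrative, not an SDK function.

```typescript
// Swapping the model behind a voice agent is a one-field change on the
// session.update message; everything else in the session stays identical.
function sessionUpdate(model: string) {
  return {
    type: 'session.update',
    session: {
      type: 'realtime',
      model,                               // the only field that changes
      instructions: 'You are a helpful voice agent.',
      output_modalities: ['audio', 'text'],
    },
  };
}

// Same session, different brain:
const onGpt = sessionUpdate('openai/gpt-5.4');
const onClaude = sessionUpdate('anthropic/claude-sonnet-4-6');
```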
Tools, no handoff

Function calls that never break the audio.

Your agent can pull live data mid-call and keep talking. The line stays open while your server does the work, and the answer arrives in the same breath.
agent.ts · function call
```typescript
// tools declared at session.update
if (event.type === "response.function_call") {
  const result = await getBooking(args);
  ws.send({ type: "conversation.item.create", ... });
}
// audio stream never closes during the tool call.
```
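A fuller sketch of that round trip: receive the function-call event, run the tool, and return the result as a conversation item. Event and field names follow the snippet above; `call_id`, `getBooking`, and the argument shape are illustrative assumptions, so check them against the protocol reference before relying on them.

```typescript
// Handles a tool call without dropping audio: the WebSocket stays open while
// the server does the work, and the result is posted back into the
// conversation so the agent can speak it.
async function onFunctionCall(
  send: (msg: object) => void,
  event: { type: string; call_id: string; name: string; arguments: string },
) {
  if (event.type !== 'response.function_call') return;
  const args = JSON.parse(event.arguments);   // e.g. { reservationId: '42' }
  const result = await getBooking(args);      // your server-side lookup
  send({
    type: 'conversation.item.create',
    item: {
      type: 'function_call_output',
      call_id: event.call_id,                 // ties the result to the call
      output: JSON.stringify(result),
    },
  });
}

// Hypothetical tool implementation, stubbed for illustration:
async function getBooking(args: { reservationId: string }) {
  return { reservationId: args.reservationId, time: '20:00', seats: 4 };
}
```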
Phone, browser, server

Deploy on every channel users call from.

The same agent runs in the browser, on your server, and over the phone. Bring your own carrier or plug into published Twilio and Telnyx integrations.
One agent. Every channel.
Phone
SIP · Twilio / Telnyx
Browser
WebRTC
Server
WebSocket
Voice agent
Your LLM + Inworld voice. Same session config.
Bring your own carrier. Twilio and Telnyx integrations published.

A voice agent, in forty lines

Connect, configure, stream. Your LLM, our voice, human turn-taking out of the box.
```typescript
// 1. Connect (same endpoint for browser, server, SIP)
const ws = new WebSocket(
  'wss://api.inworld.ai/api/v1/realtime/session?key=' + sessionId,
  ['realtime']
);

ws.addEventListener('open', () => {});

ws.addEventListener('message', async (event) => {
  const msg = JSON.parse(event.data);

  // 2. Configure on session.created
  if (msg.type === 'session.created') {
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        type: 'realtime',
        model: 'openai/gpt-5.4', // any LLM
        instructions: 'You are a helpful voice agent.',
        output_modalities: ['audio', 'text'],
        audio: {
          input: {
            turn_detection: {
              type: 'semantic_vad',
              eagerness: 'medium',
              create_response: true,
              interrupt_response: true,
            },
          },
          output: {
            model: 'inworld-tts-1.5-max', // top-ranked voice
            voice: 'Clive',
          },
        },
      },
    }));
  }

  // 3. Play audio deltas as they stream
  if (msg.type === 'response.output_audio.delta') {
    audioQueue.push(base64ToPcm16(msg.delta));
    if (!isPlaying) playNext();
  }
});

// Stream mic audio in (semantic VAD handles turn detection)
mic.on('data', (chunk) => {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: chunk.toString('base64'),
  }));
});
```
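The snippet above leaves `base64ToPcm16` to you. A minimal Node sketch of that decode step, assuming the deltas are base64-encoded little-endian PCM16 (a browser version would use `atob` and a `DataView` instead of `Buffer`):

```typescript
// Decodes a base64 audio delta into 16-bit PCM samples.
// PCM16 is little-endian: two bytes per sample, low byte first.
function base64ToPcm16(b64: string): Int16Array {
  const buf = Buffer.from(b64, 'base64');
  const samples = new Int16Array(buf.length / 2);
  for (let i = 0; i < samples.length; i++) {
    samples[i] = buf.readInt16LE(i * 2);
  }
  return samples;
}
```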

Hear it before you build it.

Open the Realtime Playground, pick a voice, pick an LLM, paste instructions, and click Connect. Speak to your agent, hear the response land in under a second, swap the voice mid-call. When it sounds right, copy the session config into your code.
Open the playground

FAQ

Why does the agent's voice sound so natural?

Inworld TTS is independently ranked at the top of the Artificial Analysis Speech Arena, a human-vote leaderboard where Inworld voices hold three of the top five spots. Voice Agents uses the same TTS engine, so your agent inherits that quality on the first call with no separate integration.
Which LLMs can I use?

Any LLM available through the Inworld Router, including OpenAI GPT, Anthropic Claude, Google Gemini, Meta Llama, Mistral, xAI Grok, Groq, and Fireworks. Switch between them by changing one field on the session configuration. See the pricing page for current model rates.
How fast is it?

First audio lands in under a second from the moment the caller stops speaking. The voice layer alone responds in under 200ms; the rest is your LLM thinking. Swap to a faster model through the Router when you need speed over depth.
Can I deploy over the phone?

Yes. WebSocket, WebRTC, and SIP are all supported on the same endpoint. Bring your own carrier or use the published Twilio and Telnyx integration examples. G.711 μ-law and A-law audio formats are supported for call-center infrastructure.
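For context on what G.711 support means on the wire, here is the standard CCITT μ-law expansion from one telephony byte to a 16-bit PCM sample. This is the well-known reference algorithm, shown for illustration; the platform's SIP path handles this conversion for you.

```typescript
// Standard G.711 μ-law decode: 8-bit companded byte -> 16-bit linear PCM.
// μ-law bytes are stored bit-inverted; 0x84 (132) is the μ-law bias.
function muLawDecode(byte: number): number {
  const u = ~byte & 0xff;
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  let sample = ((mantissa << 3) + 0x84) << exponent;
  sample -= 0x84;
  return sign ? -sample : sample;
}
```

The asymmetry is the whole point of the format: 8 bits per sample cover roughly the dynamic range of 14-bit linear audio, which is why phone trunks still use it.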
Can the agent call tools mid-conversation?

Yes. Declare tools in the session.update message. When the agent calls a function, the audio connection stays open, your server returns the result via conversation.item.create, and the agent speaks the response with no handoff gap.
How does barge-in work?

Semantic VAD detects when the user starts speaking and automatically cancels the in-flight response. You configure eagerness (low / medium / high / auto) on turn_detection. Barge-in works out of the box; no custom logic needed.
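As a session fragment, that configuration looks like the block below. Field names match this page's own session config; the comments on each flag are a plain-English reading of what the page describes.

```typescript
// turn_detection fragment for session.update — controls barge-in behavior.
const turnDetection = {
  type: 'semantic_vad' as const,
  // 'low' waits longer for the caller to finish; 'high' responds sooner.
  eagerness: 'medium' as 'low' | 'medium' | 'high' | 'auto',
  create_response: true,      // auto-respond when the turn ends
  interrupt_response: true,   // cancel in-flight audio when the caller barges in
};
```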
What are the rate limits?

Default limits are 20 concurrent sessions per account and 1,000 packets per second shared across them. Need more for production? Contact the team to discuss higher limits.
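If you run many sessions behind one account, a client-side token bucket keeps you under the shared packets-per-second ceiling. The 1,000/s figure is the documented default; the pacing logic itself is a generic sketch, not part of the API.

```typescript
// Token bucket for client-side packet pacing. Call trySend() before each
// WebSocket send; when it returns false, hold the packet until the bucket
// refills. The injectable clock makes the refill math testable.
class PacketBucket {
  private tokens: number;
  private last: number;

  constructor(
    private ratePerSec: number,
    private now: () => number = () => Date.now(),
  ) {
    this.tokens = ratePerSec;   // start full
    this.last = now();
  }

  trySend(): boolean {
    const t = this.now();
    // Refill proportionally to elapsed time, capped at one second's budget.
    this.tokens = Math.min(
      this.ratePerSec,
      this.tokens + ((t - this.last) / 1000) * this.ratePerSec,
    );
    this.last = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```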
What is the availability status?

The Realtime API is in research preview. WebSocket is generally available; WebRTC and SIP are in early access, so reach out for access.
Is it compatible with the OpenAI Realtime API?

Yes. The Realtime API follows the OpenAI Realtime protocol: same event names, same message flow, same SDK patterns. If you built on gpt-realtime, swap the base URL and your existing code keeps working. See the migration guide for details.

Build an agent that sounds like a person.

The top-ranked voice. Sub-second latency. Any LLM through the Router. Ship the call today.
Copyright © 2021-2026 Inworld AI