Voice Agents

Build voice agents that sound like real people

The #1 ranked voice on Artificial Analysis wired into a full-duplex Realtime API. OpenAI-compatible, under a second end-to-end, any LLM behind it.
Live voice agent
Caller
turn_detection: semantic_vad · stt: inworld/inworld-stt-1

I need to move my dinner reservation to 8pm.

Agent
session.model: openai/gpt-5.4 · voice: Clive · tool: updateReservation

Done. Dinner for four moved to 8pm. I also bumped your table to the window.

Trusted by
Status · Talkpal · Bible Chat · Death by AI

Voice quality, on every turn.

#1 ranked voice, sub-second latency, any LLM behind one session.
#1 ranked most natural
Works with
Realtime API · TTS

The top-ranked voice, now for agents.

Inworld TTS ranks #1 on the Artificial Analysis Speech Arena, the vote-driven voice AI benchmark. Three of the top five voices are ours. Every agent inherits it on the first call.
Artificial Analysis · Speech Arena
#1
Inworld TTS 1.5 Max
#2
ElevenLabs v3
#3
Inworld TTS 1 Max
#5
Inworld TTS 1.5 Mini
3 of the top 5 are Inworld
Conversational, not scripted
Works with
Realtime API · TTS

It doesn't read the words. It talks to them.

Semantic turn-taking reads the pause the way a person does. Barge-in lets callers interrupt without breaking flow. Tuned for live conversation, not narration.
Live call · semantic VAD
User
So I was hoping to move my dinner to eight…
agent paused · user still speaking
User
…actually nine. Can we do nine?
Agent
Nine it is. Same table, window side.
Under a second, end-to-end

A conversation pace, not a buffering pace.

Full-duplex streaming over WebSocket, WebRTC, or SIP. First audio lands in under a second end-to-end. The pause your users hear is a beat, not a gap.
Voice agent · first audio chunk
<1s
Median end-to-end
STT → Router → TTS. Conversation, not latency.
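If you want to verify the sub-second claim against your own stack, the measurement is the gap between end-of-speech and the first audio delta back. A minimal sketch, assuming the protocol's `speech_stopped` and `output_audio.delta` event names; the meter class itself is illustrative:

```typescript
// Measures end-to-end turn latency: time from the caller going quiet to the
// first audio chunk of the agent's reply. Only the first delta of each turn
// is counted. The injectable clock makes the logic testable.
class TurnLatencyMeter {
  private speechEndedAt: number | null = null;
  samples: number[] = [];

  constructor(private now: () => number = () => Date.now()) {}

  onEvent(type: string) {
    if (type === 'input_audio_buffer.speech_stopped') {
      this.speechEndedAt = this.now();
    } else if (type === 'response.output_audio.delta' && this.speechEndedAt !== null) {
      this.samples.push(this.now() - this.speechEndedAt);
      this.speechEndedAt = null;   // ignore later deltas of the same turn
    }
  }

  median(): number {
    const s = [...this.samples].sort((a, b) => a - b);
    return s.length ? s[Math.floor(s.length / 2)] : NaN;
  }
}
```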
Any model, one session

Swap brains without rewiring the agent.

Route to GPT, Claude, Gemini, Llama, or any of hundreds of models through a single session endpoint. Switch providers with one field. Reasoning is never locked to a vendor.
Explore the Router
Pick any model. Swap any time.
session.model
OpenAI
gpt-5.4
Anthropic
claude-sonnet-4-6
Google
gemini-3.1-pro
Meta
llama-4-maverick
Mistral
medium-2508
xAI
grok-4.20
Groq
gpt-oss-120b
Fireworks
deepseek-v3-2
Hundreds more through the same session endpoint.
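Concretely, the one-field swap looks like this. The payload shape mirrors the session config shown on this page; the `sessionUpdate` helper is illustrative, not an SDK function.

```typescript
// Swapping the model behind a voice agent is a one-field change on the
// session.update message; everything else in the session stays identical.
function sessionUpdate(model: string) {
  return {
    type: 'session.update',
    session: {
      type: 'realtime',
      model,                               // the only field that changes
      instructions: 'You are a helpful voice agent.',
      output_modalities: ['audio', 'text'],
    },
  };
}

// Same session, different brain:
const onGpt = sessionUpdate('openai/gpt-5.4');
const onClaude = sessionUpdate('anthropic/claude-sonnet-4-6');
```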
Tools, no handoff

Function calls that never break the audio.

Your agent can pull live data mid-call and keep talking. The line stays open while your server does the work, and the answer arrives in the same breath.
agent.ts · function call
```typescript
// tools declared at session.update
if (event.type === "response.function_call") {
  const result = await getBooking(args);
  ws.send({ type: "conversation.item.create", ... });
}
// audio stream never closes during the tool call.
```
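A fuller sketch of that round trip: receive the function-call event, run the tool, and return the result as a conversation item. Event and field names follow the snippet above; `call_id`, `getBooking`, and the argument shape are illustrative assumptions, so check them against the protocol reference before relying on them.

```typescript
// Handles a tool call without dropping audio: the WebSocket stays open while
// the server does the work, and the result is posted back into the
// conversation so the agent can speak it.
async function onFunctionCall(
  send: (msg: object) => void,
  event: { type: string; call_id: string; name: string; arguments: string },
) {
  if (event.type !== 'response.function_call') return;
  const args = JSON.parse(event.arguments);   // e.g. { reservationId: '42' }
  const result = await getBooking(args);      // your server-side lookup
  send({
    type: 'conversation.item.create',
    item: {
      type: 'function_call_output',
      call_id: event.call_id,                 // ties the result to the call
      output: JSON.stringify(result),
    },
  });
}

// Hypothetical tool implementation, stubbed for illustration:
async function getBooking(args: { reservationId: string }) {
  return { reservationId: args.reservationId, time: '20:00', seats: 4 };
}
```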
Phone, browser, server

Deploy on every channel users call from.

The same agent runs in the browser, on your server, and over the phone. Bring your own carrier or plug into published Twilio and Telnyx integrations.
One agent. Every channel.
Phone
SIP · Twilio / Telnyx
Browser
WebRTC
Server
WebSocket
Voice agent
Your LLM + Inworld voice. Same session config.
Bring your own carrier. Twilio and Telnyx integrations published.

A voice agent, in forty lines

Connect, configure, stream. Your LLM, our voice, human turn-taking out of the box.
```typescript
// 1. Connect (same endpoint for browser, server, SIP)
const ws = new WebSocket(
  'wss://api.inworld.ai/api/v1/realtime/session?key=' + sessionId,
  ['realtime']
);

ws.addEventListener('open', () => {});

ws.addEventListener('message', async (event) => {
  const msg = JSON.parse(event.data);

  // 2. Configure on session.created
  if (msg.type === 'session.created') {
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        type: 'realtime',
        model: 'openai/gpt-5.4', // any LLM
        instructions: 'You are a helpful voice agent.',
        output_modalities: ['audio', 'text'],
        audio: {
          input: {
            turn_detection: {
              type: 'semantic_vad',
              eagerness: 'medium',
              create_response: true,
              interrupt_response: true,
            },
          },
          output: {
            model: 'inworld-tts-1.5-max', // top-ranked voice
            voice: 'Clive',
          },
        },
      },
    }));
  }

  // 3. Play audio deltas as they stream
  if (msg.type === 'response.output_audio.delta') {
    audioQueue.push(base64ToPcm16(msg.delta));
    if (!isPlaying) playNext();
  }
});

// Stream mic audio in (semantic VAD handles turn detection)
mic.on('data', (chunk) => {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: chunk.toString('base64'),
  }));
});
```
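The snippet above leaves `base64ToPcm16` to you. A minimal Node sketch of that decode step, assuming the deltas are base64-encoded little-endian PCM16 (a browser version would use `atob` and a `DataView` instead of `Buffer`):

```typescript
// Decodes a base64 audio delta into 16-bit PCM samples.
// PCM16 is little-endian: two bytes per sample, low byte first.
function base64ToPcm16(b64: string): Int16Array {
  const buf = Buffer.from(b64, 'base64');
  const samples = new Int16Array(buf.length / 2);
  for (let i = 0; i < samples.length; i++) {
    samples[i] = buf.readInt16LE(i * 2);
  }
  return samples;
}
```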

Hear it before you build it.

Open the Realtime Playground, pick a voice, pick an LLM, paste instructions, and click Connect. Speak to your agent, hear the response land in under a second, swap the voice mid-call. When it sounds right, copy the session config into your code.
Open the playground

FAQ

Why does the agent's voice sound so natural?

Inworld TTS is independently ranked at the top of the Artificial Analysis Speech Arena, a human-vote leaderboard where Inworld voices hold three of the top five spots. Voice Agents uses the same TTS engine, so your agent inherits that quality on the first call with no separate integration.
Which LLMs can I use?

Any LLM available through the Inworld Router, including OpenAI GPT, Anthropic Claude, Google Gemini, Meta Llama, Mistral, xAI Grok, Groq, and Fireworks. Switch between them by changing one field on the session configuration. See the pricing page for current model rates.
How fast is it?

First audio lands in under a second from the moment the caller stops speaking. The voice layer alone responds in under 200ms; the rest is your LLM thinking. Swap to a faster model through the Router when you need speed over depth.
Can I deploy over the phone?

Yes. WebSocket, WebRTC, and SIP are all supported on the same endpoint. Bring your own carrier or use the published Twilio and Telnyx integration examples. G.711 μ-law and A-law audio formats are supported for call-center infrastructure.
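For context on what G.711 support means on the wire, here is the standard CCITT μ-law expansion from one telephony byte to a 16-bit PCM sample. This is the well-known reference algorithm, shown for illustration; the platform's SIP path handles this conversion for you.

```typescript
// Standard G.711 μ-law decode: 8-bit companded byte -> 16-bit linear PCM.
// μ-law bytes are stored bit-inverted; 0x84 (132) is the μ-law bias.
function muLawDecode(byte: number): number {
  const u = ~byte & 0xff;
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  let sample = ((mantissa << 3) + 0x84) << exponent;
  sample -= 0x84;
  return sign ? -sample : sample;
}
```

The asymmetry is the whole point of the format: 8 bits per sample cover roughly the dynamic range of 14-bit linear audio, which is why phone trunks still use it.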
Can the agent call tools mid-conversation?

Yes. Declare tools in the session.update message. When the agent calls a function, the audio connection stays open, your server returns the result via conversation.item.create, and the agent speaks the response with no handoff gap.
How does barge-in work?

Semantic VAD detects when the user starts speaking and automatically cancels the in-flight response. You configure eagerness (low / medium / high / auto) on turn_detection. Barge-in works out of the box; no custom logic needed.
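As a session fragment, that configuration looks like the block below. Field names match this page's own session config; the comments on each flag are a plain-English reading of what the page describes.

```typescript
// turn_detection fragment for session.update — controls barge-in behavior.
const turnDetection = {
  type: 'semantic_vad' as const,
  // 'low' waits longer for the caller to finish; 'high' responds sooner.
  eagerness: 'medium' as 'low' | 'medium' | 'high' | 'auto',
  create_response: true,      // auto-respond when the turn ends
  interrupt_response: true,   // cancel in-flight audio when the caller barges in
};
```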
What are the rate limits?

Default limits are 20 concurrent sessions per account and 1,000 packets per second shared across them. Need more for production? Contact the team to discuss higher limits.
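If you run many sessions behind one account, a client-side token bucket keeps you under the shared packets-per-second ceiling. The 1,000/s figure is the documented default; the pacing logic itself is a generic sketch, not part of the API.

```typescript
// Token bucket for client-side packet pacing. Call trySend() before each
// WebSocket send; when it returns false, hold the packet until the bucket
// refills. The injectable clock makes the refill math testable.
class PacketBucket {
  private tokens: number;
  private last: number;

  constructor(
    private ratePerSec: number,
    private now: () => number = () => Date.now(),
  ) {
    this.tokens = ratePerSec;   // start full
    this.last = now();
  }

  trySend(): boolean {
    const t = this.now();
    // Refill proportionally to elapsed time, capped at one second's budget.
    this.tokens = Math.min(
      this.ratePerSec,
      this.tokens + ((t - this.last) / 1000) * this.ratePerSec,
    );
    this.last = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```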
What is the availability status?

The Realtime API is in research preview. WebSocket is generally available; WebRTC and SIP are in early access, so reach out for access.
Is it compatible with the OpenAI Realtime API?

Yes. The Realtime API follows the OpenAI Realtime protocol: same event names, same message flow, same SDK patterns. If you built on gpt-realtime, swap the base URL and your existing code keeps working. See the migration guide for details.

Build an agent that sounds like a person.

The top-ranked voice. Sub-second latency. Any LLM through the Router. Ship the call today.
Copyright © 2021-2026 Inworld AI