Build a Realtime AI Voice Agent Without Stitching APIs

Q: How do I reduce latency for my voice agent?

Start by using a low-latency TTS model such as inworld-tts-1.5-mini, keeping audio chunks around 100-200ms, avoiding heavy client-side processing, and deploying your proxy close to Inworld's API region.

Q: How does barge-in (interrupting the agent) work?

When the API detects user speech, it emits input_audio_buffer.speech_started. In this demo the client stops playback immediately to prevent talking over the user. In production you typically combine this with server-side interruption handling so agent generation halts promptly.

Q: Can I change the voice?

Yes. Set audio.output.voice in the session.update payload. Inworld offers multiple built-in voices and also supports custom voices. In this example, we used Clive, but you can try any of the other voices we offer.

Q: Can I change the underlying LLM?

Yes. Set session.model in the session configuration to specify a different language model. You can also use Inworld's Router for routing, cohort handling, and experiments.

Q: Can I add tools or function calling?

Yes. Register tools in the tools array in your session.update payload, then handle tool calls when the API sends completed tool arguments events.

Building voice AI agents used to be cumbersome, with developers needing to juggle speech-to-text, LLM text processing, and text-to-speech. Each of these pieces would need to handle streaming, WebSockets, and interruptions in order to feel natural for users.

At Inworld, we just launched our Realtime API to make building AI voice agents much easier. The Realtime API is a speech-to-speech API that takes speech as an input and returns speech as an output. Now, creating a voice agent with the leading models is as easy as an API call.

In this guide, we'll walk through an example of setting up a realtime voice agent with the Inworld Realtime API, overview some of the customization options, and get a working prototype in 2 minutes of coding. You can follow along with the article step by step, or just feed Claude this article and it will one shot the solution.

What This Guide Builds

This guide builds a very simple voice agent UX with a button to start a conversation.

This is a very lightweight implementation with two files:

server.js is a lightweight Node.js WebSocket proxy that authenticates with the Inworld Realtime API and keeps credentials server-side.
index.html is a minimal frontend that captures microphone audio and plays streamed agent audio back to the user.

On first connection, the agent will deliver a voice greeting. That greeting proves the full voice loop is functional: text instruction in, synthesized audio out, streamed to the browser in realtime. You can also ask follow up questions to the agent for it to continue the conversation.

Setup

You'll need Node.js installed, an Inworld account, and an API key generated in the Inworld Portal.

Step 1: Implement the Voice Agent with Inworld

First, we'll create a server for setting up the Realtime API. The server will:

Setup a WebSocketServer
Configure our Voice Agent
Add an instructional message to our agent
Send voice messages back and forth to Inworld's Realtime API via websockets

Set Up the WebSocket Server

Start with the imports and basic HTTP/WebSocket scaffolding:

import { readFileSync } from 'fs';
import { createServer } from 'http';
import { WebSocketServer, WebSocket } from 'ws';

const html = readFileSync('index.html');

const INWORLD_API_KEY = process.env.INWORLD_API_KEY;
if (!INWORLD_API_KEY) {
  console.error('Missing INWORLD_API_KEY environment variable.');
  process.exit(1);
}

const server = createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/html' });
  res.end(html);
});

const wss = new WebSocketServer({ server, path: '/ws' });

We read index.html (the frontend we'll setup later) into memory once at startup and serve it on every HTTP request. The WebSocketServer listens on the /ws path, which is where the frontend will connect.

Configure the Voice Agent's Personality

Next, we define the session configuration that tells Inworld how the agent should behave:

const SESSION_CFG = JSON.stringify({
  type: "session.update",
  session: {
    type: "realtime",
    instructions:
      "You are an assistant from Inworld AI, the voice AI company. You are excited to talk about inworld ai.",
    output_modalities: ["audio", "text"],
    temperature: 0.8,
    audio: {
      output: {
        voice: "Clive",
        model: "inworld-tts-1.5-max",
        speed: 1.0,
      },
    },
  },
});

Here we configured the default settings of our agent. I'll walk through each argument:

type: "realtime" sets the session mode explicitly. This tells Inworld you want a persistent, bidirectional audio session rather than a one-shot request/response interaction.
instructions is the system prompt. It shapes the agent's personality and knowledge. For this demo, we're making the agent an enthusiastic Inworld representative, but in production you'd replace this with your own product context, guardrails, or persona definition.
output_modalities: ["audio", "text"] means the agent produces both spoken audio and a text transcript. Including "text" is useful for debugging and for displaying captions in the frontend. If you only need audio, you can drop "text" to reduce payload size slightly.
temperature: 0.8 nudges the model toward more varied, conversational responses. Lower values produce more deterministic output. For a voice agent that should feel natural in conversation, 0.8 is a reasonable starting point.
voice: "Clive" selects one of Inworld's built-in voices. Inworld offers a large selection of voices out of the box with the ability to create your own as well. In this example, I chose Clive.
model: "inworld-tts-1.5-max" selects the TTS model. The max variant is optimized for quality, which makes it a good fit for interactive conversations. If you need lower latency, swap to our mini model with inworld-tts-1.5-mini.
speed: 1.0 controls playback speed. You can adjust this if your use case calls for faster or slower delivery.

There are many additional ways to customize the responses that you get from this step. Be sure to check out our docs to dive deeper.

Add the Initial Greeting

Now that the agent is setup, we'll prompt it to greet the user on start up. This GREET constant represents the first message we send to the agent.

const GREET = JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [{ type: 'input_text', text: 'You are an assistant from Inworld AI, the voice AI company. Briefly introduce yourself to the user' }]
  }
});

This sends a text message into the conversation as if a user typed it. The agent treats it as a prompt and responds accordingly. We send it as a conversation.item.create rather than putting it in the session instructions because it triggers a distinct response, while instructions shape ongoing behavior without producing an immediate reply.

Connect the Browser to Inworld

The final piece wires everything together. Here we start a WebSocket server to communicate between our browser and Inworlds servers.

After initiating the server, we send a message to Inworld to create the session with api.send(SESSION_CFG). This creates our voice agent.

Then, we send our initial greeting to prompt the voice agent to respond with api.send(GREET);

After sending GREET, we follow with response.create to tell Inworld to generate the agent's response. Without that explicit trigger, the conversation item would sit in context but the agent wouldn't speak. Together, these two events produce the voice greeting!

wss.on('connection', (browser) => {
  let setup = 0;

  const api = new WebSocket(
    `wss://api.inworld.ai/api/v1/realtime/session?key=voice-${Date.now()}&protocol=realtime`,
    { headers: { Authorization: `Basic ${process.env.INWORLD_API_KEY}` } }
  );

  api.on('message', (raw) => {
    const msg = JSON.parse(raw.toString());

    if (setup < 2) {
      if (msg.type === 'session.created') {
        api.send(SESSION_CFG);
        setup = 1;
      } else if (msg.type === 'session.updated' && setup === 1) {
        api.send(GREET);
        api.send(JSON.stringify({ type: 'response.create' }));
        setup = 2;
      }
    }

    if (browser.readyState === WebSocket.OPEN) {
      browser.send(raw.toString());
    }
  });

  browser.on('message', (msg) => {
    if (api.readyState === WebSocket.OPEN) api.send(msg.toString());
  });

  browser.on('close', () => api.close());
  api.on('close', () => {
    if (browser.readyState === WebSocket.OPEN) browser.close();
  });

  api.on('error', (e) => console.error('API error:', e.message));
});

server.listen(3000, () => console.log('Voice agent server ready on port 3000.'));

Step 2: Create the Frontend to Interact with the Voice Agent

Now that the backend is setup, we have to create a browser frontend to interact with it. The frontend is intentionally minimal. It's just a mock UI for testing the agent loop. Its responsibilities are:

Connect to the local WebSocket server at /ws
Request microphone access and capture audio using the Web Audio API
Convert Float32 samples to PCM16, 24 kHz mono, base64 encoded, and send each chunk as an input_audio_buffer.append event. The recommended chunk size is 100-200ms
Listen for response.output_audio.delta events, decode the base64 PCM16 payload, convert it back to Float32, and queue it for playback through an AudioContext
Handle input_audio_buffer.speech_started events to stop agent playback for clean barge-in behavior

<!doctype html>
<html>
  <head>
    <meta charset="utf-8" />
    <title>Voice Agent</title>
  </head>
  <body
    style="
      display: flex;
      align-items: center;
      justify-content: center;
      height: 100vh;
      margin: 0;
    "
  >
    <button id="btn" onclick="go()">Start Conversation</button>
    <script>
      const btn = document.getElementById("btn");
      let ws,
        ctx,
        src,
        proc,
        source,
        stream,
        active = false,
        playing = false,
        nextPlayTime = 0;
      const queue = [];

      async function go() {
        if (active) {
          ws.close();
          return;
        }
        btn.disabled = true;
        btn.textContent = "Connecting…";
        ctx = new AudioContext({ sampleRate: 24000 });
        stream = await navigator.mediaDevices.getUserMedia({
          audio: {
            sampleRate: 24000,
            channelCount: 1,
            echoCancellation: true,
            noiseSuppression: true,
          },
        });
        ws = new WebSocket(`ws://${location.host}/ws`);
        ws.onopen = () => {
          active = true;
          source = ctx.createMediaStreamSource(stream);
          proc = ctx.createScriptProcessor(2048, 1, 1);
          proc.onaudioprocess = ({ inputBuffer }) => {
            if (ws.readyState !== WebSocket.OPEN) return;
            const f = inputBuffer.getChannelData(0);
            const pcm = new Int16Array(f.length);
            for (let i = 0; i < f.length; i++)
              pcm[i] = Math.max(-32768, Math.min(32767, f[i] * 32768));
            ws.send(
              JSON.stringify({
                type: "input_audio_buffer.append",
                audio: b64(pcm.buffer),
              }),
            );
          };
          source.connect(proc);
          proc.connect(ctx.destination);
        };
        ws.onmessage = ({ data }) => {
          const e = JSON.parse(data);
          if (e.type === "response.output_audio.delta") {
            if (btn.disabled) {
              btn.textContent = "Stop Conversation";
              btn.disabled = false;
            }
            queue.push(
              Uint8Array.from(atob(e.delta), (c) => c.charCodeAt(0)).buffer,
            );
            if (!playing) playNext();
          } else if (e.type === "input_audio_buffer.speech_started") {
            stopAudio();
          }
        };
        ws.onclose = () => {
          active = false;
          stopAudio();
          proc?.disconnect();
          source?.disconnect();
          stream?.getTracks().forEach((t) => t.stop());
          btn.textContent = "Start Conversation";
          btn.disabled = false;
        };
      }

      function playNext() {
        if (!queue.length) {
          playing = false;
          return;
        }
        playing = true;
        const pcm16 = new Int16Array(queue.shift()),
          len = pcm16.length,
          fade = 48;
        const f32 = new Float32Array(len);
        for (let i = 0; i < len; i++) f32[i] = pcm16[i] / 32768;
        for (let i = 0; i < fade; i++) {
          f32[i] *= i / fade;
          f32[len - 1 - i] *= i / fade;
        }
        const buf = ctx.createBuffer(1, len, 24000);
        buf.getChannelData(0).set(f32);
        src = ctx.createBufferSource();
        src.buffer = buf;
        src.connect(ctx.destination);
        const t = Math.max(ctx.currentTime, nextPlayTime);
        nextPlayTime = t + buf.duration;
        src.onended = playNext;
        src.start(t);
      }

      function stopAudio() {
        queue.length = 0;
        playing = false;
        nextPlayTime = 0;
        try {
          src?.stop();
        } catch {}
        src = null;
      }

      function b64(buf) {
        const b = new Uint8Array(buf);
        let s = "";
        for (let i = 0; i < b.length; i++) s += String.fromCharCode(b[i]);
        return btoa(s);
      }
    </script>
  </body>
</html>

Step 3: Install and Run

Run the following command in your terminal to install WebSockets and start the Node server.

npm install ws
node server.js

Once the server is up, open your browser to the port the server logged (3000 by default). Click Start Conversation and allow microphone access. The agent should greet you with spoken audio within a few seconds. Speak back to confirm the full voice agent loop is working.

How Our Voice Agent Works

Our example app architecture has three layers.

The browser captures microphone audio and plays agent audio.
The Node.js server proxies WebSocket events and keeps the API key server-side.
The Inworld Realtime API handles speech recognition, model inference, TTS synthesis, turn detection, and interruption logic in one persistent session.

Key Events in the Flow

Event	Direction	Purpose
`session.created`	Server ← API	Confirms the WebSocket session is ready
`session.updated`	Server ← API	Confirms session config was applied
`input_audio_buffer.append`	Browser → API	Streams microphone audio chunks
`input_audio_buffer.speech_started`	Server ← API	VAD detected the user started speaking
`conversation.item.create`	Server → API	Sends a text message used for the greeting prompt
`response.create`	Server → API	Triggers model inference and audio generation
`response.output_audio.delta`	Server ← API	Streams synthesized audio chunks
`response.done`	Server ← API	Signals the response is complete
`response.function_call_arguments.done`	Server ← API	Delivers completed function call arguments for tool use

The semantic_vad turn detection mode is what makes the conversation feel natural. It waits for a semantically complete pause rather than a fixed silence threshold, then automatically triggers response.create when create_response is set to true. Combined with interrupt_response: true, the agent stops speaking when the user barges in.

Realtime TTS-2 (research preview) and Realtime TTS 1.5 Max produce expressive, steerable speech with realtime time-to-first-audio, which is what makes the agent's spoken replies feel natural and responsive.

Production Note: Authentication

This example uses Authorization: Basic on the server-side WebSocket connection. This approach is fine for local development and server-to-server communication where the API key never reaches the browser.

For browser-based production applications, the recommended pattern is Authorization: Bearer <jwt-token>, where the JWT is minted on your backend and passed to the client with a limited lifetime. Long-lived API credentials should never be exposed in client-side code.

What to Customize Next

Now that we have a working prototype, feel free to play around with the implementation to customize your voice AI app and explore the power of the Realtime API.

Change the Voice

Set audio.output.voice in the session.update payload. Inworld provides multiple built-in voices, including Clive, Olivia, and the current default Sarah.

Change the TTS Model

Swap audio.output.model between inworld-tts-1.5-mini for lower latency and inworld-tts-1.5-max for higher audio quality.

Change the underlying LLM

Set session.model in the session configuration to specify a different language model. Router support is also available for model routing, cohort handling, and A/B testing in production scenarios.

Add Tools and Function Calling

Register tools in the session.update payload using the tools array. This allows the LLM to execute tools when processing speech to add more capabilities to your voice agent.

Changing the voice, models, and tools allows for more customization of your agent. Inworld is great for production agents in consumer spaces like AI companions, education, and social apps, but as a developer it's also fun to play around with the voices on offer.

Why Inworld for Realtime Voice

Inworld AI builds leading models for text-to-speech. Realtime TTS-2 (research preview) is the #1 realtime TTS; it and Realtime TTS 1.5 Max deliver expressive, steerable speech with realtime latency, which is what makes agent replies sound natural. The Inworld stack also includes a model-agnostic Realtime API and Router across 220+ LLMs, so developers can compose the full voice pipeline with their LLM of choice.

The Realtime API example we walked through today took two minutes to setup and handles the complex steps of streaming responses between speech-to-text models, LLMs, and text-to-speech. The Realtime API operates across the full voice pipeline rather than handling one isolated step.

OpenAI has previously been the standard for realtime voice agents given its ease of use. The Inworld Realtime API event system is compatible with OpenAI-style realtime flows. This allows teams using the OpenAI Realtime API to migrate with minimal code changes, since event types, session configuration shapes, and client/server message structures remain consistent.

To get started with Inworld and unlock the frontier of voice AI, sign up for a free account here: https://platform.inworld.ai/login

Frequently Asked Questions

What is Inworld Realtime API?

Inworld Realtime API is a speech-to-speech API that takes streaming audio in and returns streaming audio out over a persistent WebSocket session. It handles turn taking, interruption, and voice output so you can build a natural voice agent without stitching together separate STT, LLM, and TTS services.

How easy is it to build a production voice agent?

Building a voice agent only takes a couple of minutes using Inworld's Realtime API. The Realtime API handles voice input, LLM processing, and voice output in one simple API. Time to production is incredibly fast with minimal orchestration required.

How do I reduce latency for my voice agent?

Start by using a low-latency TTS model such as inworld-tts-1.5-mini, keeping audio chunks around 100-200ms, avoiding heavy client-side processing, and deploying your proxy close to Inworld's API region.

How does barge-in (interrupting the agent) work?

When the API detects user speech, it emits input_audio_buffer.speech_started. In this demo the client stops playback immediately to prevent "talking over" the user. In production you typically combine this with server-side interruption handling so agent generation halts promptly.

Can I change the voice?

Yes. Set audio.output.voice in the session.update payload. Inworld offers multiple built-in voices and also supports custom voices. In this example, we used Clive, but you can try any of the other voices we offer.

Can I change the underlying LLM?

Yes. Set session.model in the session configuration to specify a different language model. You can also use Inworld's Router for routing, cohort handling, and experiments.

Can I add tools or function calling?

Yes. Register tools in the tools array in your session.update payload, then handle tool calls when the API sends completed tool arguments events.

Is this code production-ready?

It's intended as a minimal, copy-pasteable starting point. For production, you should add a JWT auth flow, robust reconnect and error handling, rate limiting and abuse protection, logging with metrics and tracing, and a UI for transcript, states, and failures.

Where can I learn more about event types and advanced configuration?

Start with the Inworld Realtime docs, including the WebSocket guide and the "Using realtime models" section for session configuration, tools, and streaming event details.

How to Build an AI Voice Agent: 2-Minute Example Using Inworld AI