Published 04.02.2026

Build a Voice Agent in 30 Minutes with Inworld AI

Last updated: April 5, 2026
How do you build a voice agent? A voice agent takes audio input from a microphone, converts speech to text, sends that text to a language model for processing, converts the model's response back to speech, and plays it to the user. Inworld AI's Realtime API collapses this entire pipeline into a single WebSocket connection. You send raw audio in, you get synthesized audio back. This tutorial walks through building a working voice agent from scratch in 30 minutes using Node.js and a browser.
The Realtime API handles STT, LLM routing, and TTS over a single WebSocket connection. Instead of stitching together a speech-to-text service, a language model, and a text-to-speech service with your own orchestration layer, you open one connection and the API manages the full voice pipeline. It is model-agnostic by design: you can route to any supported LLM and use any TTS voice.
This tutorial builds a complete voice agent with two files: a Node.js server that proxies WebSocket events (keeping your API key server-side), and a browser client that captures microphone audio and plays the agent's spoken responses. By the end, you will have a working agent you can talk to.

What do you need before starting?

Three things:
  1. Node.js 18+ installed on your machine
  2. An Inworld AI account at platform.inworld.ai
  3. An API key from the Inworld Portal
Set your API key as an environment variable. Every code example in this tutorial reads from it:
export INWORLD_API_KEY=your_key_here

How does the Realtime API compare to building it yourself?

If you build a voice agent from separate services, you need to orchestrate three API calls per turn, manage audio format conversions between them, build your own turn detection, and handle interruptions. The Realtime API removes all of that.
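
To make the comparison concrete, here is a minimal sketch of the do-it-yourself pipeline, with stub functions standing in for real STT, LLM, and TTS clients. The names and return values are illustrative, not any particular vendor's API:

```javascript
// Hypothetical DIY pipeline: each turn costs three sequential network
// round trips plus the glue code between them. Stubs stand in for real
// STT, LLM, and TTS service clients.
async function transcribe(audioBuf) {           // STT call (stub)
  return 'what is the weather in Austin';
}
async function complete(text) {                 // LLM call (stub)
  return `Asking about: ${text}`;
}
async function synthesize(text) {               // TTS call (stub)
  return Buffer.from(text, 'utf8');             // pretend this is audio
}

async function handleTurn(audioIn) {
  const transcript = await transcribe(audioIn); // round trip 1
  const reply = await complete(transcript);     // round trip 2
  return synthesize(reply);                     // round trip 3
}
```

With the Realtime API, all three stages and the orchestration between them collapse into the single WebSocket connection you open in the next section.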

What are we building?

The voice agent has three layers:
  1. Browser client captures microphone audio at 24kHz mono PCM16, streams it to the server, and plays back audio chunks from the agent in real time.
  2. Node.js server acts as a WebSocket proxy. It authenticates with the Inworld Realtime API using your API key and relays messages between the browser and the API. The API key never reaches the client.
  3. Inworld Realtime API receives the raw audio, runs speech recognition, sends the transcript to the configured LLM, generates a spoken response using the configured TTS voice, and streams audio chunks back. It also handles turn detection and interruption logic.
Create a project directory and initialize it:
mkdir voice-agent && cd voice-agent

How do you set up the Node.js server?

Start with the imports and HTTP scaffolding. The server does two things: serves the HTML client on HTTP requests, and opens a WebSocket path at /ws for the voice connection.
import { readFileSync } from 'fs';
import { createServer } from 'http';
import { WebSocketServer, WebSocket } from 'ws';

const PORT = parseInt(process.env.PORT || '3000', 10);
const INWORLD_API_KEY = process.env.INWORLD_API_KEY;

if (!INWORLD_API_KEY) {
  console.error('Missing INWORLD_API_KEY environment variable.');
  process.exit(1);
}

const html = readFileSync('index.html');

const server = createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/html' });
  res.end(html);
});

const wss = new WebSocketServer({ server, path: '/ws' });
The port comes from the PORT environment variable with a fallback to 3000. The server reads index.html into memory once at startup and serves it for every HTTP request. The WebSocket server listens on /ws, which is where the browser client will connect.

How do you configure the voice agent?

The session configuration tells the Realtime API how the agent should behave. This is where you set the system prompt, the TTS voice, and the TTS model.
const SESSION_CONFIG = JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    instructions:
      'You are a helpful voice assistant. You speak clearly and concisely. When the user asks a question, give a direct answer first, then offer to go deeper if they want.',
    output_modalities: ['audio', 'text'],
    temperature: 0.7,
    audio: {
      output: {
        voice: 'Sarah',
        model: 'inworld-tts-1.5-max',
        speed: 1.0,
      },
    },
  },
});
An important note on the Realtime API vs the REST API: the Realtime WebSocket API uses voice and model inside session.audio.output, while the REST TTS API uses voiceId and modelId. They are different parameters for different protocols. Do not mix them.
Here is what each field does:
  • type: 'realtime' tells Inworld you want a persistent bidirectional audio session, not a one-shot REST call.
  • instructions is the system prompt. It shapes personality, constraints, and behavior. Replace it with whatever fits your use case.
  • output_modalities: ['audio', 'text'] means the agent produces both spoken audio and a text transcript. The text is useful for displaying captions or debugging. Drop 'text' if you only need audio.
  • temperature: 0.7 controls response variation. Lower values produce more deterministic output.
  • voice: 'Sarah' selects the TTS voice. The voice library includes built-in and custom cloned voices.
  • model: 'inworld-tts-1.5-max' selects the TTS model. The max variant optimizes for quality. For lower latency, use inworld-tts-1.5-mini.
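
To keep the Realtime-vs-REST parameter distinction straight, here are the same voice settings expressed for each protocol side by side. The exact REST request shape beyond voiceId and modelId is an assumption for illustration; check the REST TTS docs for the full body:

```javascript
// Realtime WebSocket session: voice/model nested under session.audio.output.
const realtimeSessionAudio = {
  output: { voice: 'Sarah', model: 'inworld-tts-1.5-max', speed: 1.0 },
};

// REST TTS endpoint: flat voiceId/modelId fields instead.
// (Body shape beyond these two fields is illustrative.)
const restTtsRequest = {
  text: 'Hello there.',
  voiceId: 'Sarah',
  modelId: 'inworld-tts-1.5-max',
};
```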

How do you trigger the first greeting?

When a user connects, you want the agent to speak first. Send a text message into the conversation as if a user typed it, then tell the API to generate a response.
const GREETING = JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [
      {
        type: 'input_text',
        text: 'Greet the user briefly. Tell them you are ready to help with whatever they need.',
      },
    ],
  },
});
This uses conversation.item.create to add a message to the conversation context. The agent treats it as a prompt. We send it separately from the session instructions because it triggers an immediate response, while instructions shape ongoing behavior without producing a reply on their own.

How do you wire the server to the Realtime API?

This is the core of the server. When a browser connects to /ws, the server opens a WebSocket to the Inworld Realtime API and relays messages in both directions.
wss.on('connection', (browser) => {
  let phase = 0;

  const api = new WebSocket(
    `wss://api.inworld.ai/api/v1/realtime/session?key=voice-${Date.now()}&protocol=realtime`,
    { headers: { Authorization: `Basic ${INWORLD_API_KEY}` } }
  );

  api.on('open', () => {
    console.log('Connected to Inworld Realtime API');
  });

  api.on('message', (raw) => {
    const msg = JSON.parse(raw.toString());

    if (phase === 0 && msg.type === 'session.created') {
      api.send(SESSION_CONFIG);
      phase = 1;
    } else if (phase === 1 && msg.type === 'session.updated') {
      api.send(GREETING);
      api.send(JSON.stringify({ type: 'response.create' }));
      phase = 2;
    }

    // Forward every API message to the browser
    if (browser.readyState === WebSocket.OPEN) {
      browser.send(raw.toString());
    }
  });

  // Forward every browser message to the API
  browser.on('message', (msg) => {
    if (api.readyState === WebSocket.OPEN) api.send(msg.toString());
  });

  browser.on('close', () => api.close());
  api.on('close', () => {
    if (browser.readyState === WebSocket.OPEN) browser.close();
  });
  api.on('error', (e) => console.error('Inworld API error:', e.message));
});

server.listen(PORT, () =>
  console.log(`Voice agent running at http://localhost:${PORT}`)
);
The connection flow works in three phases:
  1. Phase 0: The server connects to wss://api.inworld.ai/api/v1/realtime/session?protocol=realtime with Basic auth. When the API responds with session.created, the server sends the session configuration.
  2. Phase 1: After the API confirms the configuration with session.updated, the server sends the greeting message and triggers response.create to start audio generation.
  3. Phase 2: Steady state. Every message from the API is forwarded to the browser. Every message from the browser is forwarded to the API. The server is a transparent proxy.
The response.create event is required. Without it, the conversation item sits in context but the agent does not generate a response. Think of conversation.item.create as adding to the context, and response.create as saying "now respond to everything in context."

How do you build the browser client?

The browser client handles three jobs: capturing microphone audio, streaming it to the server, and playing back the agent's audio response. It also handles barge-in by stopping playback when the user starts speaking.
<!doctype html>
<html>
<head>
  <meta charset="utf-8" />
  <title>Voice Agent</title>
  <style>
    * { margin: 0; box-sizing: border-box; }
    body {
      font-family: system-ui, -apple-system, sans-serif;
      display: flex; flex-direction: column;
      align-items: center; justify-content: center;
      height: 100vh; background: #0a0a0a; color: #e5e5e5;
    }
    #status { margin-bottom: 24px; font-size: 14px; color: #888; }
    #btn {
      padding: 16px 32px; font-size: 16px; font-weight: 600;
      border: 1px solid #333; border-radius: 8px;
      background: #1a1a1a; color: #e5e5e5; cursor: pointer;
      transition: background 0.15s;
    }
    #btn:hover { background: #262626; }
    #btn:disabled { opacity: 0.4; cursor: default; }
    #transcript {
      margin-top: 32px; max-width: 480px; width: 100%;
      font-size: 14px; color: #aaa; text-align: center;
      min-height: 48px;
    }
  </style>
</head>
<body>
  <div id="status">Ready</div>
  <button id="btn" onclick="toggle()">Start Conversation</button>
  <div id="transcript"></div>

  <script>
    const btn = document.getElementById('btn');
    const status = document.getElementById('status');
    const transcript = document.getElementById('transcript');

    let ws, ctx, proc, source, stream, src;
    let active = false, playing = false, nextPlayTime = 0;
    const queue = [];
    let partialText = '';

    async function toggle() {
      if (active) { ws.close(); return; }

      btn.disabled = true;
      status.textContent = 'Connecting...';

      ctx = new AudioContext({ sampleRate: 24000 });
      stream = await navigator.mediaDevices.getUserMedia({
        audio: {
          sampleRate: 24000,
          channelCount: 1,
          echoCancellation: true,
          noiseSuppression: true,
        },
      });

      ws = new WebSocket(`ws://${location.host}/ws`);

      ws.onopen = () => {
        active = true;
        status.textContent = 'Connected. Waiting for agent...';

        source = ctx.createMediaStreamSource(stream);
        proc = ctx.createScriptProcessor(2048, 1, 1);

        proc.onaudioprocess = ({ inputBuffer }) => {
          if (ws.readyState !== WebSocket.OPEN) return;
          const f = inputBuffer.getChannelData(0);
          const pcm = new Int16Array(f.length);
          for (let i = 0; i < f.length; i++)
            pcm[i] = Math.max(-32768, Math.min(32767, f[i] * 32768));
          ws.send(JSON.stringify({
            type: 'input_audio_buffer.append',
            audio: b64(pcm.buffer),
          }));
        };

        source.connect(proc);
        proc.connect(ctx.destination);
      };

      ws.onmessage = ({ data }) => {
        const e = JSON.parse(data);

        if (e.type === 'response.output_audio.delta') {
          if (btn.disabled) {
            btn.textContent = 'Stop Conversation';
            btn.disabled = false;
            status.textContent = 'Agent speaking...';
          }
          queue.push(
            Uint8Array.from(atob(e.delta), c => c.charCodeAt(0)).buffer
          );
          if (!playing) playNext();
        }

        else if (e.type === 'response.output_text.delta') {
          partialText += e.delta;
          transcript.textContent = partialText;
        }

        else if (e.type === 'response.output_text.done') {
          partialText = '';
        }

        else if (e.type === 'input_audio_buffer.speech_started') {
          stopAudio();
          status.textContent = 'Listening...';
        }

        else if (e.type === 'response.done') {
          status.textContent = 'Your turn. Speak anytime.';
        }
      };

      ws.onclose = () => {
        active = false;
        stopAudio();
        proc?.disconnect();
        source?.disconnect();
        stream?.getTracks().forEach(t => t.stop());
        btn.textContent = 'Start Conversation';
        btn.disabled = false;
        status.textContent = 'Disconnected.';
        transcript.textContent = '';
      };
    }

    function playNext() {
      if (!queue.length) { playing = false; return; }
      playing = true;
      const pcm16 = new Int16Array(queue.shift());
      const len = pcm16.length, fade = 48;
      const f32 = new Float32Array(len);
      for (let i = 0; i < len; i++) f32[i] = pcm16[i] / 32768;
      for (let i = 0; i < fade; i++) {
        f32[i] *= i / fade;
        f32[len - 1 - i] *= i / fade;
      }
      const buf = ctx.createBuffer(1, len, 24000);
      buf.getChannelData(0).set(f32);
      src = ctx.createBufferSource();
      src.buffer = buf;
      src.connect(ctx.destination);
      const t = Math.max(ctx.currentTime, nextPlayTime);
      nextPlayTime = t + buf.duration;
      src.onended = playNext;
      src.start(t);
    }

    function stopAudio() {
      queue.length = 0;
      playing = false;
      nextPlayTime = 0;
      try { src?.stop(); } catch {}
      src = null;
    }

    function b64(buf) {
      const b = new Uint8Array(buf);
      let s = '';
      for (let i = 0; i < b.length; i++) s += String.fromCharCode(b[i]);
      return btoa(s);
    }
  </script>
</body>
</html>
The audio pipeline works like this:
  1. Capture: The browser requests microphone access at 24kHz, mono, with echo cancellation and noise suppression enabled. A ScriptProcessor node reads the raw Float32 samples.
  2. Encode and send: Each audio frame is converted from Float32 to PCM16 (Int16Array), base64-encoded, and sent as an input_audio_buffer.append event over the WebSocket.
  3. Receive and decode: When the server forwards a response.output_audio.delta event, the client decodes the base64 payload back to PCM16, converts to Float32, and pushes it into a playback queue.
  4. Playback: The playNext function pulls from the queue, creates an AudioBuffer, applies a short fade on chunk boundaries to prevent clicks, and schedules playback using AudioContext timing.
  5. Barge-in: When an input_audio_buffer.speech_started event arrives, the client immediately stops playback, clears the queue, and resets timing. This prevents the agent from talking over the user.
  6. Transcript: The client also listens for response.output_text.delta events and renders partial text as it arrives, giving the user a live transcript.
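
Steps 2 and 3 can be sketched as a standalone Node round trip, mirroring the conversion code in the client above:

```javascript
// Float32 mic samples -> PCM16 -> base64 on the way out (step 2),
// base64 -> PCM16 -> Float32 on the way back (step 3).
function encodeChunk(f32) {
  const pcm = new Int16Array(f32.length);
  for (let i = 0; i < f32.length; i++)
    pcm[i] = Math.max(-32768, Math.min(32767, f32[i] * 32768));
  return Buffer.from(pcm.buffer).toString('base64');
}

function decodeChunk(b64) {
  const buf = Buffer.from(b64, 'base64');
  // Respect byteOffset: small Node Buffers share a pooled ArrayBuffer.
  const pcm = new Int16Array(buf.buffer, buf.byteOffset, buf.length / 2);
  const f32 = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) f32[i] = pcm[i] / 32768;
  return f32;
}
```

The round trip is lossless for in-range samples; values at exactly 1.0 clamp to 32767, which is why the decoded value comes back a hair under 1.0.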

How do you create the package.json?

You need one dependency: the ws WebSocket library.
{
  "name": "voice-agent",
  "type": "module",
  "scripts": {
    "start": "node server.js"
  },
  "dependencies": {
    "ws": "^8.0.0"
  }
}

How do you run it?

Install the dependency, set your API key, and start the server:
npm install
export INWORLD_API_KEY=your_key_here
npm start
Open the URL printed in the terminal in your browser. Click Start Conversation and allow microphone access. The agent will greet you with spoken audio within a few seconds. Speak back to confirm the full loop is working.
If you set a custom port with export PORT=8080, the server will start on that port instead.

What does the complete server file look like?

For reference, here is the full server.js in one block. This is the same code from the steps above, combined into a single file you can copy directly.
import { readFileSync } from 'fs';
import { createServer } from 'http';
import { WebSocketServer, WebSocket } from 'ws';

const PORT = parseInt(process.env.PORT || '3000', 10);
const INWORLD_API_KEY = process.env.INWORLD_API_KEY;

if (!INWORLD_API_KEY) {
  console.error('Missing INWORLD_API_KEY environment variable.');
  process.exit(1);
}

const html = readFileSync('index.html');

const server = createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/html' });
  res.end(html);
});

const wss = new WebSocketServer({ server, path: '/ws' });

const SESSION_CONFIG = JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    instructions:
      'You are a helpful voice assistant. You speak clearly and concisely. When the user asks a question, give a direct answer first, then offer to go deeper if they want.',
    output_modalities: ['audio', 'text'],
    temperature: 0.7,
    audio: {
      output: {
        voice: 'Sarah',
        model: 'inworld-tts-1.5-max',
        speed: 1.0,
      },
    },
  },
});

const GREETING = JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [
      {
        type: 'input_text',
        text: 'Greet the user briefly. Tell them you are ready to help with whatever they need.',
      },
    ],
  },
});

wss.on('connection', (browser) => {
  let phase = 0;

  const api = new WebSocket(
    `wss://api.inworld.ai/api/v1/realtime/session?key=voice-${Date.now()}&protocol=realtime`,
    { headers: { Authorization: `Basic ${INWORLD_API_KEY}` } }
  );

  api.on('open', () => {
    console.log('Connected to Inworld Realtime API');
  });

  api.on('message', (raw) => {
    const msg = JSON.parse(raw.toString());

    if (phase === 0 && msg.type === 'session.created') {
      api.send(SESSION_CONFIG);
      phase = 1;
    } else if (phase === 1 && msg.type === 'session.updated') {
      api.send(GREETING);
      api.send(JSON.stringify({ type: 'response.create' }));
      phase = 2;
    }

    if (browser.readyState === WebSocket.OPEN) {
      browser.send(raw.toString());
    }
  });

  browser.on('message', (msg) => {
    if (api.readyState === WebSocket.OPEN) api.send(msg.toString());
  });

  browser.on('close', () => api.close());
  api.on('close', () => {
    if (browser.readyState === WebSocket.OPEN) browser.close();
  });
  api.on('error', (e) => console.error('Inworld API error:', e.message));
});

server.listen(PORT, () =>
  console.log(`Voice agent running at http://localhost:${PORT}`)
);

How do you change the LLM?

Set session.model in the session configuration to specify a different language model. The Realtime API routes to the model you choose while keeping TTS and STT unchanged.
const SESSION_CONFIG = JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    instructions: 'You are a helpful assistant.',
    model: 'gpt-5.4',
    output_modalities: ['audio', 'text'],
    audio: {
      output: {
        voice: 'Sarah',
        model: 'inworld-tts-1.5-max',
      },
    },
  },
});
You can also use Inworld's LLM Router for automatic model selection, fallback routing, and A/B testing across 200+ models.

How do you add tools and function calling?

Voice agents become much more useful when they can take actions. Register tools in the tools array inside session.update, and the LLM will call them when relevant.
const SESSION_WITH_TOOLS = JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    instructions: 'You are a weather assistant. Use the get_weather tool when the user asks about weather.',
    output_modalities: ['audio', 'text'],
    temperature: 0.7,
    audio: {
      output: {
        voice: 'Sarah',
        model: 'inworld-tts-1.5-max',
      },
    },
    tools: [
      {
        type: 'function',
        name: 'get_weather',
        description: 'Get the current weather for a given city',
        parameters: {
          type: 'object',
          properties: {
            city: { type: 'string', description: 'City name' },
          },
          required: ['city'],
        },
      },
    ],
  },
});
When the LLM decides to call a tool, the API sends a response.function_call_arguments.done event. Handle it on the server, execute the tool, and send the result back:
api.on('message', (raw) => {
  const msg = JSON.parse(raw.toString());

  // ... existing session setup logic ...

  if (msg.type === 'response.function_call_arguments.done') {
    const args = JSON.parse(msg.arguments);
    console.log(`Tool call: ${msg.name}`, args);

    // Execute the tool and send the result back
    const result = { temperature: 72, condition: 'sunny' }; // Your real logic here

    api.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id: msg.call_id,
        output: JSON.stringify(result),
      },
    }));

    // Trigger the agent to respond with the tool result
    api.send(JSON.stringify({ type: 'response.create' }));
  }

  if (browser.readyState === WebSocket.OPEN) {
    browser.send(raw.toString());
  }
});
The agent will speak the tool result to the user as part of its response. This pattern works for any external API: weather, databases, booking systems, CRM lookups, or anything else you can call from Node.js.
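
One simple way to keep tool execution organized as you add more tools is a dispatch table keyed by tool name. This is a hypothetical pattern, not part of the Inworld API; the get_weather body is a stub where a real handler would call an external weather API:

```javascript
// Hypothetical tool registry: route response.function_call_arguments.done
// events to local async handlers by tool name.
const TOOLS = {
  // Stub result; replace with a real external API call.
  get_weather: async ({ city }) => ({ city, temperature: 72, condition: 'sunny' }),
};

async function runTool(name, argsJson) {
  const handler = TOOLS[name];
  if (!handler) return { error: `unknown tool: ${name}` };
  return handler(JSON.parse(argsJson));
}
```

Inside the message handler above, you would then replace the hard-coded result with a call like `await runTool(msg.name, msg.arguments)` (making the handler async, or chaining .then).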

What should you change for production?

The tutorial code is a working starting point. For production, address these areas:
  1. Authentication. Replace Basic auth with JWT tokens. Mint short-lived JWTs on your backend and pass them to the client. Never expose your API key in client-side code. The server-side proxy in this tutorial already keeps the key off the client, but JWT auth adds an additional layer for browser-to-server auth.
  2. Reconnection. WebSocket connections drop. Add automatic reconnection with exponential backoff on both the browser-to-server and server-to-API connections.
  3. Error handling. Catch and surface errors from the API. Display connection state to the user. Log errors server-side with enough context to debug.
  4. Rate limiting. Protect your server from abuse. Limit connections per IP and messages per second.
  5. UI. Replace the minimal HTML with a real interface. Show a live transcript, connection state indicators, and a visual indicator when the agent is speaking or listening.
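
For the reconnection item, a common approach is exponential backoff with jitter. This sketch uses illustrative constants and the browser WebSocket global (swap in the ws package on the server):

```javascript
// Exponential backoff with jitter: the delay doubles per attempt, capped
// at maxMs, and is randomized to avoid synchronized reconnect storms.
function backoffDelay(attempt, baseMs = 500, maxMs = 15000) {
  const exp = Math.min(maxMs, baseMs * 2 ** attempt);
  return exp / 2 + Math.random() * (exp / 2); // jitter in [exp/2, exp)
}

function connectWithRetry(url, attempt = 0) {
  const ws = new WebSocket(url); // browser global; use the ws package in Node
  ws.onopen = () => { attempt = 0; }; // reset backoff on a good connection
  ws.onclose = () =>
    setTimeout(() => connectWithRetry(url, attempt + 1), backoffDelay(attempt));
  return ws;
}
```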

What are the key WebSocket events?

These are the events used in this tutorial, roughly in the order they first appear:
  • session.created — sent by the API when the connection opens; the signal to send your session configuration.
  • session.update — client to API; sets instructions, output modalities, temperature, voice, and model.
  • session.updated — API to client; confirms the configuration was applied.
  • conversation.item.create — client to API; adds a message or tool output to the conversation context.
  • response.create — client to API; tells the agent to generate a response from the current context.
  • input_audio_buffer.append — client to API; a base64-encoded chunk of PCM16 microphone audio.
  • input_audio_buffer.speech_started — API to client; VAD detected the user speaking, so stop playback.
  • response.output_audio.delta — API to client; a base64-encoded chunk of synthesized audio.
  • response.output_text.delta / response.output_text.done — API to client; partial transcript text and its completion.
  • response.done — API to client; the agent finished its turn.
  • response.function_call_arguments.done — API to client; the LLM requested a tool call.

Frequently Asked Questions

What is a voice agent?
A voice agent is software that listens to spoken input, processes it through a language model, and responds with synthesized speech. Unlike chatbots that operate on text, voice agents handle the full audio loop: speech recognition, reasoning, and speech generation.
How does the Inworld Realtime API differ from stitching together STT, LLM, and TTS?
The Realtime API handles all three stages over a single WebSocket connection. You send raw audio in, and get synthesized audio back. No intermediate transcription step to manage, no separate TTS calls to orchestrate, no turn-detection logic to build yourself.
What languages does the Realtime API support?
Inworld TTS supports 15 languages with native-quality pronunciation. The speech recognition component handles major languages automatically. Check the docs for the current supported language list.
Can I use a different LLM with the Realtime API?
Yes. Set session.model in the session.update payload to specify any supported model. You can also use Inworld's LLM Router for automatic model selection, A/B testing, and fallback routing across 200+ models.
Can I use a custom or cloned voice?
Yes. Inworld supports zero-shot voice cloning from 5-15 seconds of reference audio. You can use any cloned voice with the Realtime API by setting the voice name in session.audio.output.voice.
How does barge-in work?
The Realtime API uses voice activity detection (VAD) to detect when the user starts speaking. It emits an input_audio_buffer.speech_started event. Your client should stop audio playback immediately to prevent the agent from talking over the user.
Is this tutorial production-ready?
It is a complete, working starting point. For production, add JWT-based authentication (instead of Basic auth with the API key on the server), reconnection logic, rate limiting, error handling with retries, and a proper UI with transcript display.
What is the latency of the Realtime API?
With inworld-tts-1.5-mini, expect sub-130ms P90 TTS latency. End-to-end voice-in to voice-out latency depends on the LLM and network conditions, but the single-connection architecture eliminates the inter-service overhead you get when stitching three APIs together.
Copyright © 2021-2026 Inworld AI