Building conversational voice agents with Mastra + Inworld Realtime API

@mastra/voice-inworld-realtime collapses STT, LLM, and expressive, realtime TTS into one WebSocket, with semantic VAD, barge-in, and tool calling out of the box.

Most voice agents are cascaded pipelines: mic → STT → LLM → TTS → speaker. Each box is a different vendor, each hop adds latency, and the audio channel goes silent every time a tool call fires. The user notices.

Inworld's Realtime API folds the cascade into a single WebSocket session. The new @mastra/voice-inworld-realtime package binds that session to a Mastra Agent: semantic VAD ends the user's turn when they actually stop, barge-in cuts the assistant mid-sentence, and the LLM calls your Mastra tools without breaking the audio stream.

This reference CLI does all of this in fewer than 100 lines of TypeScript, and our example voice design agent shows how to integrate it into a real application.

Watch the demo here.

One WebSocket, one session

Inworld's Realtime API takes the STT/LLM/TTS cascade and ships it as a single session over WebSocket. You pick an LLM (OpenAI, Anthropic, Google, and others; the routing string is provider-agnostic), pick an Inworld voice, and the pipeline runs end-to-end on Inworld's side. On the wire it looks like this:

ws.send(JSON.stringify({
  type: "session.update",
  session: {
    type: "realtime",
    modelId: "anthropic/claude-sonnet-4-6",
    audio: {
      output: { model: "inworld-tts-2", voice: "Sarah" }
    }
  }
}));

Swap providers without touching your integration. Choose your STT engine independently of the LLM. And because the LLM emits Inworld's [steering] tags inline, like [whisper softly like you're sharing a secret], the TTS renders prosody and non-verbal cues in the same response stream, with no second model call and no SSML preprocessor.
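Since the routing string is just provider/model, it is cheap to validate before you open a socket. Here is a minimal sketch of that idea. splitRoutingString and buildSessionUpdate are hypothetical helpers written for this post, not part of the Inworld or Mastra SDKs, and the payload fields simply mirror the session.update example above.

```typescript
// Hypothetical helpers for illustration; not part of any SDK.
// The payload shape mirrors the session.update example in this post.
interface SessionUpdate {
  type: "session.update";
  session: {
    type: "realtime";
    modelId: string;
    audio: { output: { model: string; voice: string } };
  };
}

// Split "provider/model" on the first slash only, so a model name that
// itself contains a slash stays intact.
function splitRoutingString(modelId: string): { provider: string; model: string } {
  const slash = modelId.indexOf("/");
  if (slash <= 0 || slash === modelId.length - 1) {
    throw new Error(`Expected "provider/model", got "${modelId}"`);
  }
  return { provider: modelId.slice(0, slash), model: modelId.slice(slash + 1) };
}

// Build the message; the TTS model default is taken from the example above.
function buildSessionUpdate(
  modelId: string,
  voice: string,
  ttsModel: string = "inworld-tts-2",
): SessionUpdate {
  splitRoutingString(modelId); // throws early on a malformed routing string
  return {
    type: "session.update",
    session: {
      type: "realtime",
      modelId,
      audio: { output: { model: ttsModel, voice } },
    },
  };
}
```

With this in place, swapping the LLM is a one-argument change: ws.send(JSON.stringify(buildSessionUpdate("anthropic/claude-sonnet-4-6", "Sarah"))).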

For a Mastra developer, the interesting question is: how do you bind that to an Agent?

The voice package

@mastra/voice-inworld-realtime implements Mastra's MastraVoice interface against the Realtime API. It registers as an Agent's voice, surfaces the realtime stream as a typed event emitter, and routes the LLM's tool calls back to your Mastra createTool definitions.

The minimum config:

import { InworldRealtimeVoice } from '@mastra/voice-inworld-realtime';

const voice = new InworldRealtimeVoice({
  model: 'openai/gpt-5.4-nano',
  speaker: 'Jason',
});

model is provider/model, the same routing string the Realtime API uses on the wire. speaker is any voice in Inworld's built-in library, or one of your own cloned voices. Attach it to an Agent the same way you'd attach any other Mastra voice:

import { Agent } from '@mastra/core/agent';

// getCurrentTime is the tool defined in the next section.
new Agent({
  id: 'voice-demo',
  name: 'Voice Demo',
  instructions: 'You are a concise voice assistant. Reply in one or two short sentences. Use the get-current-time tool when asked the time.',
  tools: { getCurrentTime },
  voice,
});

Tool calls happen mid-conversation

Tool definitions look like any other Mastra tool:

import { createTool } from '@mastra/core/tools';
import { z } from 'zod';

const getCurrentTime = createTool({
  id: 'get-current-time',
  description: 'Returns the current local time.',
  inputSchema: z.object({}),
  outputSchema: z.object({ time: z.string() }),
  execute: async () => ({ time: new Date().toLocaleTimeString() }),
});

When the user asks for the time, the LLM inside the Realtime session emits a tool call, the voice package routes it to execute(), and the result streams back into the session as a tool result. The audio channel stays open the whole time. No "let me look that up for you" stall while a cascaded pipeline tears down and rebuilds around the tool call.
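To make that routing step concrete, here is a rough sketch of what a dispatcher of this kind does. It is not the package's actual implementation: ToolCall, ToolResult, and dispatchToolCall are hypothetical names, and the real wire messages may look different. The point it illustrates is that a failed or unknown tool becomes a result the session can speak about, rather than an exception that tears the stream down.

```typescript
// Hypothetical shapes for illustration only; the real package's internal
// types and wire messages may differ.
interface ToolCall { callId: string; toolName: string; args: unknown }
interface ToolResult { callId: string; ok: boolean; output: unknown }
type ToolExecutor = (args: unknown) => Promise<unknown>;

// Look up the tool by name, run it, and always return a result object so the
// caller can send something back into the session.
async function dispatchToolCall(
  tools: Map<string, ToolExecutor>,
  call: ToolCall,
): Promise<ToolResult> {
  const execute = tools.get(call.toolName);
  if (!execute) {
    return { callId: call.callId, ok: false, output: `Unknown tool: ${call.toolName}` };
  }
  try {
    return { callId: call.callId, ok: true, output: await execute(call.args) };
  } catch (err) {
    // Convert a thrown error into a result instead of propagating it.
    const message = err instanceof Error ? err.message : String(err);
    return { callId: call.callId, ok: false, output: message };
  }
}
```

Nothing here touches the audio path, which is why playback can keep going while execute() is in flight.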

The same contract works for anything you'd register on a non-voice agent: an HTTP call, a database query, an MCP tool. Mastra's tool surface doesn't change because the voice is realtime.

Semantic VAD, barge-in, and the interrupted event

Two things voice agents get wrong: they keep talking when the user starts talking, and they cut the user off because they think a pause means the turn is over.

Inworld's semantic VAD targets both. Instead of relying on silence thresholds, it detects intent boundaries: did this user actually finish their thought? The eagerness is tunable on the session config. The default settings give a natural conversational cadence; you can lean cautious or aggressive depending on your domain.

Barge-in is wired through the interrupted event:

// players maps each response ID to the process playing that response's audio.
voice.on('interrupted', ({ response_id }) =>
  players.get(response_id)?.kill('SIGTERM')
);

When the user starts speaking over the assistant, Inworld emits interrupted with the response ID of the assistant turn that should stop. The demo kills the matching play process; in a browser, you'd cancel the AudioBufferSourceNode for that response.
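The bookkeeping behind that handler is the same in a CLI and in a browser: keep a handle per response ID, and stop the right one when interrupted fires. Here is a minimal sketch. PlaybackRegistry and Stoppable are hypothetical names for this post, not part of the package; in a browser the Stoppable would be a thin wrapper around AudioBufferSourceNode.stop(), and in the CLI a wrapper around the child process's kill().

```typescript
// Anything that can stop playback. Hypothetical interface for illustration.
interface Stoppable { stop(): void }

class PlaybackRegistry {
  private readonly active = new Map<string, Stoppable>();

  // Call when audio for a response starts playing.
  register(responseId: string, player: Stoppable): void {
    this.active.set(responseId, player);
  }

  // Call when a response finishes playing on its own.
  release(responseId: string): void {
    this.active.delete(responseId);
  }

  // Call from the interrupted handler. Returns true if something was stopped,
  // false if that response had already finished or was never registered.
  interrupt(responseId: string): boolean {
    const player = this.active.get(responseId);
    if (!player) return false;
    player.stop();
    this.active.delete(responseId);
    return true;
  }
}
```

Wiring it up would then be one line: voice.on('interrupted', ({ response_id }) => registry.interrupt(response_id)). Removing the entry on interrupt keeps a late or duplicate event from stopping anything twice.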

The combination is what makes the conversation feel live: the assistant yields when interrupted, the user's turn ends when they actually finish, and tool calls don't break either side of that contract.

Streaming transcript

For debugging or for a UI transcript pane, the writing event streams text from both sides:

voice.on('writing', ({ text, role }) => {
  // role is 'user' | 'assistant'
});

User text comes from STT, assistant text from the LLM, both incremental. tool-call-start and error round out the event surface:

voice.on('tool-call-start', ({ toolName }) => console.log(`\n[tool] ${toolName}`));
voice.on('error', err => console.error('\n[error]', err));

tool-call-start fires before execute() runs; error surfaces anything the WebSocket or the pipeline raises. Both follow the same shape as the rest of the surface: typed events, no string parsing.
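Because writing delivers text incrementally, a transcript pane has to stitch chunks into lines. One way to do that is sketched below. It assumes each event carries only the new text (a delta, not the cumulative string) and that a change of role marks a new turn; both are assumptions worth checking against the events you actually receive. TranscriptBuffer is a hypothetical helper, not part of the package.

```typescript
type Role = "user" | "assistant";
interface TranscriptLine { role: Role; text: string }

// Hypothetical helper: merges incremental chunks into one line per turn.
class TranscriptBuffer {
  readonly lines: TranscriptLine[] = [];

  // Consecutive chunks from the same role extend the current line;
  // a change of role starts a new line.
  push(chunk: { text: string; role: Role }): void {
    const last = this.lines[this.lines.length - 1];
    if (last && last.role === chunk.role) {
      last.text += chunk.text;
    } else {
      this.lines.push({ role: chunk.role, text: chunk.text });
    }
  }
}
```

Hooking it up is a matter of passing each event through: voice.on('writing', chunk => transcript.push(chunk)), then rendering transcript.lines.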

Start building

The result is a Mastra Agent that can hold an engaging conversation. The Realtime API handles the speech-to-speech cascade, and @mastra/voice-inworld-realtime translates between that pipeline and Mastra's Agent and tool primitives.

From here, the places to extend are the usual Mastra ones. Add memory so the agent carries context across sessions. Add scorers and evals to grade voice behavior over time. Add more tools. Anything that fits createTool works. Swap the LLM by changing one string. Swap the voice the same way.

Learn more in Inworld's Realtime API docs. For the agent layer, see the Mastra docs.