Conversational AI

Build AI that holds a real conversation

Speech-to-text, any LLM, and the #1-ranked text-to-speech behind a single API. Voice profiling flows through to the model, so the AI responds to how your users sound, not just what they say.
Live pipeline
User
stt: inworld/inworld-stt-1 · voice_profile.emotion: stressed

I'm so behind on everything and I don't even know where to start.

Agent
session.model: anthropic/claude-sonnet-4-6 · voice: Sarah · tts.steering: calm, slower pace

That's completely valid. Let's take a breath first, then we'll pick one thing together.

Every layer, best-in-class, one contract.

Your agent hears the tone, reasons on the model you pick, and answers in the voice users vote best.
One pipeline, not a stitched stack
Works with
Realtime API

Stitched is someone else's problem. Unified is yours to ship.

Four vendor contracts and a month of glue code, replaced by one persistent connection. STT, Router, and TTS under one API. Swap any layer, never rewrite the stack.
How conversational AI is built
Stitched stack
Vendor A (STT)
Vendor B (LLM)
Vendor C (TTS)
+ glue code
four contracts
One pipeline
Inworld Realtime API
STT • Router • TTS
one endpoint, one key
one contract
Best-in-class at every layer
Works with
Realtime API · TTS

Best-in-class at every layer, without the integration tax.

Inworld TTS ranks #1 on Artificial Analysis with three of the top five voices. Router carries hundreds of LLMs. STT hears who's speaking and how they feel. No other pipeline stacks best-in-class at every layer.
Voice · #1 on Artificial Analysis
#1
Inworld TTS 1.5 Max
#2
Next best
#3
Inworld TTS 1 Max
#5
Inworld TTS 1.5 Mini
3 of top 5 · voice profile · cascade + speech-to-LLM
Voice profile flows through

Context that carries across every layer.

STT captures emotion, age, and vocal style. Router injects that context into the LLM. TTS adapts tone to match. Stitched stacks lose the signal at every handoff.
Voice profile flows through
STT
profile detected
emotion: stressed · age: 30s
Router
context injected
"user sounds stressed"
TTS
voice adapts
softer, slower, [sigh]
Cross-layer context carries through the pipeline. Stitched stacks can't do this.
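The adaptation step can be pictured as a small reduction from profile to steering hint. A minimal sketch, assuming a hypothetical profile shape (the real wire format may differ; the fields follow the ones named on this page: emotion, age, vocal style, with confidence scores):

```javascript
// Turn a detected voice profile into a TTS steering hint.
// The profile shape below is illustrative, not the documented format.
function steeringFromProfile(profile, threshold = 0.7) {
  const hints = [];
  const emotion = profile.emotion;
  if (emotion && emotion.confidence >= threshold && emotion.value === 'stressed') {
    hints.push('calm, slower pace'); // mirror the steering shown in the demo
  }
  return hints.join('; ');
}
```

In the unified pipeline this mapping happens inside the Router; the sketch only illustrates the kind of signal that crosses layers, and why it is lost at a vendor handoff.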
Pick every layer

Best component, every layer, any time.

Route across Inworld STT, Whisper, or AssemblyAI. Reach GPT, Claude, Gemini, Llama, Mistral, and hundreds more through one endpoint. Pair with Inworld TTS 1.5 Max or Mini. One field changes the stack.
Explore the Router
Pick every layer
session.update
STT
Inworld STT-1
Whisper
AssemblyAI
Router / LLM
GPT
Claude
Gemini
Llama
Mistral
… hundreds more
TTS
Inworld 1.5 Max
Inworld 1.5 Mini
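Swapping a layer is a one-field change on the session. A minimal sketch of the message, assuming the provider/model naming used in the code sample on this page (the exact model string is illustrative):

```javascript
// Build a session.update that swaps only the LLM; other layers keep
// their current configuration. Model strings are illustrative.
function buildModelSwap(model) {
  return {
    type: 'session.update',
    session: { type: 'realtime', model },
  };
}

// Send over the already-open WebSocket:
// ws.send(JSON.stringify(buildModelSwap('openai/gpt-4o')));
```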
Sub-second end-to-end
Works with
Realtime API

Speech in, speech out, under a second.

Full-duplex streaming across the whole pipeline. STT detects the pause, Router routes the thought, TTS starts speaking before the sentence finishes forming. Human response time, end to end.
End-to-end · full pipeline
<1s
Speech in, speech out
STT → Router → TTS on one persistent connection
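One way to check the sub-second claim against your own LLM choice is a client-side probe: stamp the clock when the user's turn ends, and again on the first audio delta. A minimal sketch with an injectable clock (so it can be tested without a live session):

```javascript
// Measure time-to-first-audio: call onTurnEnd() when the user's turn
// is detected, onFirstAudio() on the first audio delta. Returns ms
// elapsed, or null if no turn has been marked yet.
function makeLatencyProbe(now = Date.now) {
  let start = null;
  return {
    onTurnEnd() { start = now(); },
    onFirstAudio() { return start === null ? null : now() - start; },
  };
}
```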
Deploy anywhere
Works with
Realtime API

Every surface users call from.

WebSocket for servers, WebRTC for browsers, SIP for call centers, on-premise for regulated data. Same API, same voice, same LLM choice. GDPR and SOC 2 Type II compliant.
Deploy anywhere
WebSocket
servers + back end
WebRTC
browser, no codec plumbing
SIP
phone + call center
On-prem
your data center
GDPR and SOC 2 Type II compliant. Zero data retention on TTS.

The whole pipeline, in forty lines

Connect once, configure every layer, stream audio. The handoffs disappear so your AI just listens and answers.
import WebSocket from 'ws';

const ws = new WebSocket(
  `wss://api.inworld.ai/api/v1/realtime/session?key=voice-${Date.now()}&protocol=realtime`,
  { headers: { Authorization: `Basic ${process.env.INWORLD_API_KEY}` } }
);

ws.on('message', (data) => {
  const msg = JSON.parse(data.toString());

  if (msg.type === 'session.created') {
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        type: 'realtime',
        model: 'anthropic/claude-sonnet-4-6', // any Router model
        instructions: 'You are a warm conversational assistant.',
        output_modalities: ['audio', 'text'],
        audio: {
          input: {
            transcription: { model: 'inworld/inworld-stt-1' },
            turn_detection: {
              type: 'semantic_vad',
              eagerness: 'medium',
              create_response: true,
              interrupt_response: true,
            },
          },
          output: {
            model: 'inworld-tts-1.5-max',
            voice: 'Sarah',
          },
        },
      },
    }));
  }

  if (msg.type === 'response.output_audio.delta') {
    audioQueue.push(base64ToPcm16(msg.delta));
    if (!isPlaying) playNext();
  }
});

mic.on('data', (chunk) => {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: chunk.toString('base64'),
  }));
});
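The snippet leans on a few playback helpers it leaves undefined (`audioQueue`, `isPlaying`, `playNext`, `base64ToPcm16`, `mic`). A minimal sketch of the decode step, assuming the TTS stream is 16-bit little-endian PCM (check your session's output audio format before relying on this):

```javascript
// Decode a base64 audio delta into an Int16Array of PCM samples.
// Assumes 16-bit little-endian PCM frames from the server.
function base64ToPcm16(b64) {
  const buf = Buffer.from(b64, 'base64');
  const samples = new Int16Array(buf.length / 2);
  for (let i = 0; i < samples.length; i++) {
    samples[i] = buf.readInt16LE(i * 2);
  }
  return samples;
}
```

The queue and play loop are yours to own; the only contract with the API is that each `response.output_audio.delta` carries one base64 chunk.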

Hear the pipeline before you wire it.

Open the Realtime Playground, pick STT, pick the LLM, pick a voice, paste instructions, and click Connect. Speak to your pipeline end to end, hear the voice adapt to your tone, swap any layer mid-session. Copy the config into your code when it sounds right.
Open the playground

FAQ

Does this replace my separate STT, LLM, and TTS vendors?
Yes. The Realtime API exposes STT, Router, and TTS as a single connection, so you don't have to stitch together three vendor contracts or manage the glue code between them. Swap any layer with a one-field change in the session configuration.
Which LLMs can I use?
Any LLM through the Inworld Router, including OpenAI GPT, Anthropic Claude, Google Gemini, Meta Llama, Mistral, and hundreds more. Switch by changing one field on the session configuration. See pricing for current rates.
What is a voice profile, and how does it flow through the pipeline?
Inworld STT-1 produces a voice profile (emotion, age, accent, pitch, and vocal style with confidence scores) that flows into the Router context automatically. The LLM reasons with the signal, and Router can emit TTS steering instructions that adapt the voice output to the user's emotional state. This cross-layer context is the reason to unify the pipeline.
What latency should I expect?
Median first audio in under a second end-to-end (STT → LLM → TTS). TTS alone hits under 200ms; the remainder is your chosen LLM's reasoning time. Router lets you swap to a faster LLM when latency matters more than reasoning depth.
Which transports and deployment options are supported?
WebSocket, WebRTC, and SIP on the same endpoint, with on-premise deployment available for regulated environments. GDPR and SOC 2 Type II compliant. Zero data retention on TTS. Contact sales for on-premise or high-volume deployments.
Can users interrupt the AI mid-response?
Yes. Semantic VAD detects when the user starts speaking and cancels the in-flight response. The audio connection stays open, so the model can resume or pivot without closing the session. Barge-in works out of the box, no custom logic required.
Is it compatible with the OpenAI Realtime API?
Yes. The Realtime API follows the OpenAI Realtime protocol: same event names, same message flow. If you built on gpt-realtime, swap the base URL and keep your code. See the migration guide.
What is the availability status?
The Realtime API is in research preview. WebSocket is generally available; WebRTC and SIP are in early access. Reach out for early-access or on-premise deployment.
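The interruption flow described in the FAQ can be sketched on the client side: when the server signals that the user has started speaking, flush any locally queued audio so playback stops at once. The event name here follows the OpenAI Realtime protocol the API mirrors; treat the exact string and the `playback` shape as assumptions:

```javascript
// Flush local playback on barge-in. `playback` is a hypothetical
// { queue, isPlaying } object owned by your audio play loop.
function handleBargeIn(msg, playback) {
  if (msg.type === 'input_audio_buffer.speech_started') {
    playback.queue.length = 0;  // drop pending audio chunks
    playback.isPlaying = false; // let the play loop go idle
  }
  return playback;
}
```

The server cancels the in-flight response on its side; this sketch only covers the audio you have already buffered locally.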

One pipeline. One endpoint. Every layer yours.

The top-ranked voice, hundreds of models, and the industry's most aware STT, on one connection.
Copyright © 2021-2026 Inworld AI