Get started
Speech-to-Speech

Speech-to-speech. Full duplex. Sub-second. Your LLM.

Stream audio in, get audio out, over one open session. Built on OpenAI Realtime? Change the base URL and keep your code. Sub-second end-to-end, and every layer (STT model, LLM, voice) is your choice, not ours.
S2S session
Audio in
turn_detection: semantic_vad · stt.model: inworld/inworld-stt-1

PCM16 24kHz · 100ms chunks

Audio out
session.model: anthropic/claude-sonnet-4-6 · voice: Clive

response.output_audio.delta · streaming

Full-duplex speech-to-speech, without the black box.

Your agent hears, thinks, and speaks in under a second, with any LLM and any voice you pick.
Full-duplex audio

One connection. Audio in and out at the same time.

Bidirectional streaming over a single WebSocket session. No polling, no REST round-trips, no gap between user silence and agent voice. Responses form in real time.
One WebSocket · full duplex
input_audio_buffer.append
PCM16 · 24kHz · 100ms chunks
response.output_audio.delta
streaming · base64 PCM
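On the wire, full duplex reduces to the two event types above crossing the same socket at once: input_audio_buffer.append going up and response.output_audio.delta coming down. A minimal Node sketch of the payload shapes (transport wiring and playback omitted):

```javascript
// Build the event that streams one 100ms PCM16 chunk up to the session.
function appendAudioEvent(pcm16Chunk) {
  return JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: pcm16Chunk.toString('base64'),
  });
}

// Extract raw PCM16 bytes from a streamed output delta; null for other events.
function audioFromEvent(raw) {
  const event = JSON.parse(raw);
  if (event.type !== 'response.output_audio.delta') return null;
  return Buffer.from(event.delta, 'base64');
}
```

Because both directions are plain events on one socket, the uplink keeps appending while the downlink keeps decoding; neither waits for the other.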
OpenAI Realtime compatible

Same event names, same flow, same SDKs.

Every event in the OpenAI Realtime protocol works identically. Built on gpt-realtime? Change the base URL, keep the code. No rewrite, no new abstraction.
Protocol you already know
OpenAI Realtime compatible
session.update
input_audio_buffer.append
conversation.item.create
response.create
response.output_audio.delta
response.function_call_arguments.done
Same event names, same shape. Change the base URL, keep the code.
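If your client already speaks OpenAI Realtime over a raw WebSocket, the migration can be as small as the endpoint string. A sketch (the OpenAI URL is illustrative, and the key query parameter follows the example later on this page; auth details differ per provider):

```javascript
// Before: an OpenAI Realtime endpoint (illustrative).
const OPENAI_URL = 'wss://api.openai.com/v1/realtime';

// After: same protocol, different host. Event names, payload shapes,
// and message flow stay exactly as they were.
const INWORLD_URL = 'wss://api.inworld.ai/api/v1/realtime/session';

function sessionUrl(baseUrl, sessionId) {
  return `${baseUrl}?key=${encodeURIComponent(sessionId)}`;
}
```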
Sub-second end-to-end

Speech in, speech out, under a second.

The median first audio chunk lands in under a second, including STT, reasoning, and TTS. The wait is a beat your users barely notice.
Speech-to-speech · end-to-end
<1s
Median speech in, speech out
STT → reasoning → TTS. All on one connection.
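One way to check the sub-second claim against your own stack is to timestamp the end of the user's turn and the first response.output_audio.delta that comes back. A measurement sketch (local wall clock; the event hooks are yours to wire up):

```javascript
// Measure ms from end-of-user-speech to first audio back, per response.
function makeLatencyMeter(now = () => Date.now()) {
  let turnEndedAt = null;
  return {
    // Call when the user's turn is detected as finished.
    onTurnEnd() { turnEndedAt = now(); },
    // Call on the first response.output_audio.delta; returns ms elapsed.
    onFirstAudio() {
      if (turnEndedAt === null) return null;
      const ms = now() - turnEndedAt;
      turnEndedAt = null; // arm for the next turn
      return ms;
    },
  };
}
```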
Cascade by design
Works with
STT · Router · TTS

Swap the voice, the model, or the STT without a rewrite.

Unified S2S models hide the pipeline in a black box. Cascade exposes every layer so you can debug, tune, and change your mind without retraining.
Cascade by design
Layer | Unified S2S           | Inworld cascade
STT   | gpt-realtime internal | inworld/inworld-stt-1, Whisper, AssemblyAI
LLM   | gpt-realtime internal | claude-sonnet-4-6, gpt-5.4, Llama 4, +200
Voice | preset Alloy/Echo     | inworld-tts-1.5-max, custom cloned
Unified models hide each layer. Cascade exposes every knob so you can debug, swap, tune.
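Because each layer is addressed by its own field on session.update, swapping one is a config change, not a rewrite. A sketch of one payload touching all three layers (model and voice names are the ones used elsewhere on this page):

```javascript
// One session.update payload configures all three cascade layers independently.
function buildSessionUpdate({ sttModel, llm, ttsModel, voice }) {
  return {
    type: 'session.update',
    session: {
      type: 'realtime',
      model: llm, // Router layer: the LLM behind the agent
      audio: {
        input: { transcription: { model: sttModel } }, // STT layer
        output: { model: ttsModel, voice },            // TTS layer
      },
    },
  };
}

// Swap the LLM mid-project without touching STT or the voice.
const update = buildSessionUpdate({
  sttModel: 'inworld/inworld-stt-1',
  llm: 'anthropic/claude-sonnet-4-6',
  ttsModel: 'inworld-tts-1.5-max',
  voice: 'Clive',
});
```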
Semantic turn-taking

Reads the pause like a person does.

Semantic VAD tells a thinking pause apart from a finished turn, so agents never cut in mid-thought or wait out real silence. Barge-in works. Eagerness is configurable per session.
Semantic VAD
User says
"I was thinking..."
(0.8s pause)
Amplitude VAD
Cuts in
Silence triggers response
Semantic VAD
Waits
Trailing phrase is mid-thought
Reads grammar, not just audio amplitude. Configurable eagerness per session.
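Eagerness lives on the turn_detection block of the session config. A sketch, assuming the value set low / medium / high (only medium appears in this page's example; treat the other two as assumptions):

```javascript
// Turn-detection config for a session: semantic VAD with tunable eagerness.
// 'medium' is the value used in this page's example; 'low' and 'high' are assumed.
function turnDetectionConfig(eagerness = 'medium') {
  const allowed = ['low', 'medium', 'high'];
  if (!allowed.includes(eagerness)) {
    throw new Error(`unknown eagerness: ${eagerness}`);
  }
  return { type: 'semantic_vad', eagerness };
}
```

A lower eagerness gives trailing phrases like "I was thinking..." more room; a higher one answers faster at the cost of occasionally cutting in.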
Three transports, one API

Ship the same agent to the browser, the server, and the phone line.

WebSocket for servers. WebRTC for browsers and native apps. SIP for phone systems and call centers with G.711. One session config covers all three.
Three transports, one API
one session config
WebSocket
GA
Servers, back-end clients
WebRTC
Early access
Browsers, native clients
SIP
Early access
Phone, call center

Stream in forty lines

Open the session, pipe the mic in, play the audio out. Your agent sounds alive on the first packet.
const ws = new WebSocket(
  'wss://api.inworld.ai/api/v1/realtime/session?key=' + sessionId,
  ['realtime']
);

ws.on('message', (raw) => {
  const event = JSON.parse(raw.toString());

  if (event.type === 'session.created') {
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        type: 'realtime',
        model: 'anthropic/claude-sonnet-4-6',
        output_modalities: ['audio', 'text'],
        audio: {
          input: {
            transcription: { model: 'inworld/inworld-stt-1' },
            turn_detection: { type: 'semantic_vad', eagerness: 'medium' },
          },
          output: { model: 'inworld-tts-1.5-max', voice: 'Clive' },
        },
      },
    }));
  }

  if (event.type === 'response.output_audio.delta') {
    audioQueue.push(base64ToPcm16(event.delta));
  }
});

mic.on('data', (chunk) => {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: chunk.toString('base64'),
  }));
});
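The example above leaves base64ToPcm16 undefined. A minimal Node implementation, assuming little-endian samples to match the PCM16 24kHz format the session streams:

```javascript
// Decode a base64 audio delta into 16-bit signed samples (little-endian PCM16).
function base64ToPcm16(b64) {
  const buf = Buffer.from(b64, 'base64');
  const samples = new Int16Array(buf.length / 2);
  for (let i = 0; i < samples.length; i++) {
    samples[i] = buf.readInt16LE(i * 2);
  }
  return samples;
}
```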

FAQ

Is this compatible with OpenAI Realtime?
Yes. The speech-to-speech API follows the OpenAI Realtime protocol — same event names, same shape, same message flow. Change the base URL, keep your code. See the migration guide.

How fast is it?
Median first audio chunk under a second end-to-end (STT → reasoning → TTS). TTS alone hits under 200ms; the rest depends on your chosen LLM.

Which LLMs can I use?
Any LLM through the Inworld Router — OpenAI, Anthropic, Google, Meta, Mistral, xAI, Groq, and hundreds more. One field on session.update.

Which transports are supported?
WebSocket is GA. WebRTC and SIP are in early access — reach out for access.

What does "cascade by design" mean?
Cascade means STT, Router/LLM, and TTS are separate, swappable layers inside the session. Unified S2S models hide them in a black box. Cascade gives you control, debuggability, and per-layer tuning that end-to-end models can't match. It's a deliberate architectural choice.

Can users interrupt the agent mid-response?
Yes. Semantic VAD detects when the user starts speaking and cancels the in-flight response. The audio connection stays open.

Does it support tool calling?
Yes. Declare tools at session.update. When the model calls a function, the audio connection stays open while your server returns the result.
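Under the OpenAI Realtime shape, that looks roughly like declaring a function in session.update and answering response.function_call_arguments.done with a conversation.item.create. A sketch (the get_weather tool and its schema are illustrative):

```javascript
// Declare a tool at session.update time (tool name and schema are illustrative).
const toolDeclaration = {
  type: 'session.update',
  session: {
    type: 'realtime',
    tools: [{
      type: 'function',
      name: 'get_weather',
      description: 'Look up current weather for a city',
      parameters: {
        type: 'object',
        properties: { city: { type: 'string' } },
        required: ['city'],
      },
    }],
  },
};

// When the model finishes emitting arguments, return your server's result
// as a function_call_output item; audio keeps streaming on the same socket.
function toolResultEvent(callId, result) {
  return {
    type: 'conversation.item.create',
    item: {
      type: 'function_call_output',
      call_id: callId,
      output: JSON.stringify(result),
    },
  };
}
```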

What's the release status?
Research preview. WebSocket is GA; WebRTC and SIP are in early access.

Full-duplex speech-to-speech. Same session. Every model.

OpenAI Realtime compatible. Sub-second end-to-end. WebSocket, WebRTC, and SIP.
Copyright © 2021-2026 Inworld AI