Realtime STT

Speech-to-text that truly understands your users in realtime

Realtime streaming recognition with voice profiling — emotion, vocal style, accent, age, and pitch extracted from raw audio. Feed signals straight into your LLM and TTS for adaptive, expressive responses.
<100ms latency · 5 voice profile signals · 100+ languages

Choose your endpoint.

Stream audio in real time, transcribe complete files, or extract voice profile signals — all through one unified API.

  • Realtime bidirectional streaming over WebSocket
  • Synchronous transcription for complete audio files
  • Voice Profile signals on every streaming chunk
  • Multi-provider support via a single model ID
wscat -c 'wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional' \
  -H "Authorization: Basic $INWORLD_API_KEY"

# Send config as first message:
{
  "transcribeConfig": {
    "modelId": "inworld/inworld-stt-1",
    "audioEncoding": "LINEAR16",
    "sampleRateHertz": 16000,
    "language": "en-US",
    "voiceProfileConfig": { "enableVoiceProfile": true }
  }
}

Voice profiling hears who's speaking, not just their words.

Every audio chunk produces a realtime profile of the speaker: emotion, vocal style, accent, age, and pitch — extracted from raw audio with confidence scores. The signal that turns a transcript into context your LLM and TTS can act on.

  • 5 paralinguistic signals per audio chunk, with confidence scores
  • Configurable threshold to filter low-confidence results
  • Feeds into LLM context and Realtime TTS-2 steering downstream
  • Available on the inworld/inworld-stt-1 model
Voice profile signals (example output):

  Signal       Value       Confidence
  Emotion      Frustrated  84%
  Age          Adult       84%
  Accent       British     84%
  Pitch        High        84%
  Vocal Style  Shouting    84%

More signals coming soon.
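Enabling profiling is the voiceProfileConfig flag shown in the endpoint example above. The sketch below shows one way to read the per-chunk signals off the stream; the result.voiceProfile field name and per-signal keys are assumptions modeled on the table above, not a documented schema.

import json

CONFIDENCE_THRESHOLD = 0.5  # configurable filter for low-confidence results

def handle_voice_profile(raw: str) -> None:
    """Print high-confidence voice profile signals from a streaming message.

    NOTE: the "voiceProfile" field and per-signal keys ("name", "value",
    "confidence") are assumptions based on the signal table above; check
    the API reference for the actual response schema.
    """
    msg = json.loads(raw)
    for signal in msg.get("result", {}).get("voiceProfile", []):
        if signal.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD:
            print(f'{signal["name"]}: {signal["value"]} '
                  f'({signal["confidence"]:.0%})')
    # e.g. Emotion: Frustrated (84%), Vocal Style: Shouting (84%)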

Voice profile steers Realtime TTS-2 in realtime.

Voice profile signals flow into the LLM as context. The LLM emits Realtime TTS-2 steering tags and non-verbals inline, and Realtime TTS-2 renders an expressive response: natural pacing, soft delivery, and a real sigh, all driven by the user's voice profile.

  • Voice profile drops into LLM context as structured metadata
  • LLM emits inline steering tags like [Speak softly] and non-verbals like [sigh] [breathe]
  • Realtime TTS-2 renders the markup as natural, expressive audio
  • Wired end-to-end through the Realtime API
1. User audio → STT voice profile: emotion: sad · style: soft · pitch: low
2. LLM response: [Speak softly] I'm so sorry to hear that. [sigh] Let's figure this out together.
3. Realtime TTS-2 expressive output: voice: Sarah · model: inworld-tts-2
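As a sketch of how step 1 feeds step 2, the profile can be rendered into the LLM's context as structured metadata. The build_llm_context helper and prompt wording below are illustrative assumptions, not a fixed Inworld API:

def build_llm_context(profile: dict, threshold: float = 0.5) -> str:
    """Render high-confidence voice profile signals as LLM context."""
    kept = {
        name: signal["value"]
        for name, signal in profile.items()
        if signal["confidence"] >= threshold  # drop low-confidence signals
    }
    if not kept:
        return ""
    summary = " · ".join(f"{k}: {v}" for k, v in kept.items())
    return (
        f"[Voice profile] {summary}. Adapt your tone accordingly; you may "
        "emit inline steering tags such as [Speak softly] and non-verbals "
        "such as [sigh]."
    )

profile = {
    "emotion": {"value": "sad", "confidence": 0.84},
    "style": {"value": "soft", "confidence": 0.84},
    "pitch": {"value": "low", "confidence": 0.42},  # filtered out
}
print(build_llm_context(profile))
# [Voice profile] emotion: sad · style: soft. Adapt your tone accordingly; ...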

Realtime speech recognition, built for production.

Low-latency streaming over WebSocket with semantic VAD, word-level timestamps, speaker diarization (coming soon), and custom vocabulary. A single unified API across industry-leading transcription providers.

  • Bidirectional WebSocket streaming for live audio
  • Semantic & acoustic VAD detects intent, not just silence
  • Word-level timestamps and speaker diarization (coming soon)
  • Custom vocabulary to boost domain-specific terms
  • Unified API across 6+ models from multiple providers
One API, 6+ models:

  Provider    Model
  Inworld     inworld-stt-1
  Groq        whisper-large-v3
  AssemblyAI  universal-streaming-multilingual
  AssemblyAI  universal-streaming-english
  AssemblyAI  u3-rt-pro
  AssemblyAI  whisper-rt
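Because every provider sits behind the same API, switching models is a one-line change to the first config message. A minimal sketch; the groq/whisper-large-v3 ID below is inferred from the provider/model pattern of inworld/inworld-stt-1, so confirm exact IDs in the model list:

import json

def transcribe_config(model_id: str, language: str = "en-US") -> str:
    """Build the first WebSocket message selecting model and audio format."""
    return json.dumps({
        "transcribeConfig": {
            "modelId": model_id,  # e.g. "inworld/inworld-stt-1"
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "language": language,
        }
    })

config_inworld = transcribe_config("inworld/inworld-stt-1")
config_whisper = transcribe_config("groq/whisper-large-v3")  # ID format assumed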

Fully multilingual. One API, any language.

One STT API gives you access to any language with benchmark-leading quality, whether you specialize in one predominant language or need more than 100 languages at your fingertips.

  • Choose from models supporting more than 100 languages
  • Realtime streaming in English, Spanish, French, German, Italian, and Portuguese
  • Voice profiling available on the inworld/inworld-stt-1 model
  • Switch providers and languages with a single parameter
100+ languages: 🇬🇧 English · 🇪🇸 Español · 🇨🇳 中文 · 🇮🇳 हिन्दी · 🇯🇵 日本語 · 🇰🇷 한국어 · 🇫🇷 Français · 🇩🇪 Deutsch, and many more.
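Languages ride on that same single parameter. A small sketch reusing the transcribe_config helper above; the non-English codes are assumed BCP-47 examples patterned on the en-US shown earlier:

# Same connection flow; only the language (and optionally the model) changes.
for model_id, language in [
    ("inworld/inworld-stt-1", "en-US"),
    ("groq/whisper-large-v3", "es-ES"),  # Spanish (code assumed)
    ("groq/whisper-large-v3", "ja-JP"),  # Japanese (code assumed)
]:
    first_message = transcribe_config(model_id, language=language)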

Designed for realtime interactive audio.

Every feature is designed for low-latency, high-accuracy speech recognition in production voice applications.

  • Realtime bidirectional streaming over WebSocket
  • Semantic & acoustic VAD for natural turn-taking
  • Unified multi-provider API with consistent auth and formatting
  • High accuracy with custom vocabulary boosting
  • Word-level timestamps and speaker diarization (coming soon)
  • Voice & context profiling for user-aware responses
~100ms streaming latency
[Latency comparison chart: Realtime STT vs. OpenAI Whisper vs. Google Cloud]
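To sanity-check the latency figure on your own network, you can time the gap between endTurn and the final transcript using the same message protocol as the Get started example below. A rough probe, not an official benchmark (it includes your network round trip):

import asyncio
import json
import time

async def measure_final_latency(ws) -> float:
    """On an open stream, time from endTurn to the final transcript (ms)."""
    t0 = time.perf_counter()
    await ws.send(json.dumps({"endTurn": {}}))
    while True:
        msg = json.loads(await asyncio.wait_for(ws.recv(), timeout=10))
        if msg.get("result", {}).get("transcription", {}).get("isFinal"):
            return (time.perf_counter() - t0) * 1000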

Use cases

Realtime STT powers any application where understanding speech in realtime is critical.

Get started

Integrate Realtime STT with a few lines of code. Choose between realtime streaming over WebSocket or batch transcription.
import asyncio
import base64
import json
import wave

import websockets

API_KEY = "<YOUR_API_KEY>"
WS_URL = "wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional"

async def stream_transcribe():
    headers = {"Authorization": f"Basic {API_KEY}"}
    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        # Read WAV and extract raw PCM
        with wave.open("audio.wav", "rb") as wf:
            sample_rate = wf.getframerate()
            channels = wf.getnchannels()
            pcm = wf.readframes(wf.getnframes())

        # 1. Send transcription config (match the file's actual format)
        await ws.send(json.dumps({
            "transcribeConfig": {
                "modelId": "inworld/inworld-stt-1",
                "audioEncoding": "LINEAR16",
                "sampleRateHertz": sample_rate,
                "numberOfChannels": channels,
                "language": "en-US",
            }
        }))

        # 2. Stream audio in 100 ms chunks (base64-encoded)
        chunk_bytes = int(sample_rate * 2 * channels * 0.1)
        for i in range(0, len(pcm), chunk_bytes):
            chunk = pcm[i : i + chunk_bytes]
            await ws.send(json.dumps({
                "audioChunk": {"content": base64.b64encode(chunk).decode()}
            }))
            await asyncio.sleep(0.1)

        # 3. Signal end of turn
        await ws.send(json.dumps({"endTurn": {}}))

        # 4. Receive results until final
        while True:
            try:
                raw = await asyncio.wait_for(ws.recv(), timeout=10)
            except asyncio.TimeoutError:
                break
            msg = json.loads(raw)
            t = msg.get("result", {}).get("transcription", {})
            if t:
                tag = "[FINAL]" if t.get("isFinal") else "[partial]"
                print(f"{tag} {t.get('transcript', '')}")
                if t.get("isFinal"):
                    break

        # 5. Close the stream
        await ws.send(json.dumps({"closeStream": {}}))

asyncio.run(stream_transcribe())
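For complete files, the synchronous path skips the WebSocket entirely. A hedged sketch only: the transcribe:sync URL and the audio content field below are assumptions patterned on the streaming API above, so check the API reference for the real request shape:

import base64

import requests

API_KEY = "<YOUR_API_KEY>"
# Hypothetical path patterned on the streaming endpoint; verify in the docs.
SYNC_URL = "https://api.inworld.ai/stt/v1/transcribe:sync"

with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    SYNC_URL,
    headers={"Authorization": f"Basic {API_KEY}"},
    json={
        "transcribeConfig": {  # assumed to match the streaming config
            "modelId": "inworld/inworld-stt-1",
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "language": "en-US",
        },
        "audioChunk": {"content": audio_b64},  # field name assumed
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())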

FAQs

What is voice profiling?
Voice profiling extracts five paralinguistic signals from raw audio on every streaming chunk: emotion (happy, angry, sad, frustrated, calm, surprised, fearful, tender), vocal style (shouting, whispering, laughing, crying, singing, monotone, mumbling), accent, age (kid, young, adult, old), and pitch (high, medium, low). Each signal includes a confidence score, and a configurable threshold filters low-confidence results. Available on the inworld/inworld-stt-1 model.

How do voice profile signals drive expressive TTS?
Voice profile signals can be passed straight into the Realtime API as LLM context — for example, the user sounds sad and is speaking softly. The LLM can then emit Realtime TTS-2 steering instructions and non-verbals inline ([Speak softly] I'm so sorry [sigh]), which Realtime TTS-2 renders as expressive, context-aware audio. The full STT → LLM → TTS pipeline runs end-to-end in the Realtime API with no extra plumbing.

Which languages does Realtime STT support?
Realtime STT is multilingual, with language support depending on the underlying STT model chosen. Whisper Large v3 supports over 100 languages, while AssemblyAI's Multilingual Universal-Streaming model supports six languages: English, Spanish, French, German, Italian, and Portuguese.

How much does Realtime STT cost?
Rates for all models are listed on the pricing page.

Does Realtime STT support both streaming and batch transcription?
Realtime STT supports both real-time bidirectional streaming over WebSocket for live audio and synchronous transcription for complete audio files.

Does Realtime STT work with the Realtime API?
Realtime STT integrates seamlessly into the Realtime API, allowing you to easily create and deploy end-to-end, realtime voice pipelines.

Does Realtime STT support Zero Data Retention?
Yes. Inworld supports Zero Data Retention (ZDR) across various models and providers available through the STT API. With ZDR, audio and transcription data are processed in real time and never stored. Visit our Security page to learn more.

Start building

Join millions of developers building the next wave of AI applications.