Realtime STT

Speech-to-text that truly understands your users in realtime

Realtime streaming recognition with diarization, custom vocabularies, and voice profiling. Built for interactive audio applications.
<100ms latency · 5 voice profile signals · 100+ languages
```shell
wscat -c 'wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional' \
  -H "Authorization: Basic $INWORLD_API_KEY"

# Send config as first message:
{
  "transcribeConfig": {
    "modelId": "inworld/inworld-stt-1",
    "audioEncoding": "LINEAR16",
    "sampleRateHertz": 16000,
    "language": "en-US",
    "voiceProfileConfig": { "enableVoiceProfile": true }
  }
}
```

Choose your endpoint.

Stream audio in real time, transcribe complete files, or extract voice profile signals — all through one unified API.

  • Realtime bidirectional streaming over WebSocket
  • Synchronous transcription for complete audio files
  • Voice Profile signals on every streaming chunk
  • Multi-provider support via a single model ID
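
To illustrate the request shape, here is a minimal Python sketch that builds a synchronous-transcription body, assuming the batch payload mirrors the streaming `transcribeConfig` shown above. The exact batch endpoint path and field layout are assumptions here; consult the API reference for the authoritative schema.

```python
import base64
import json

def build_transcribe_request(pcm: bytes, sample_rate: int = 16000) -> str:
    """Build a JSON body pairing the transcribe config with base64 audio.

    Assumes the batch body reuses the streaming "transcribeConfig" shape;
    the real synchronous payload may differ.
    """
    return json.dumps({
        "transcribeConfig": {
            "modelId": "inworld/inworld-stt-1",
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": sample_rate,
            "language": "en-US",
        },
        "audio": {"content": base64.b64encode(pcm).decode()},
    })
```

The resulting string would then be POSTed to the synchronous endpoint with the same `Authorization: Basic` header used for the WebSocket connection.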

Understand users and their context, and engage them more effectively

Realtime profiling extracts emotion, accent, age, vocal style, environment, and language from every voice interaction. Updated with every audio chunk.
Voice profile signals (sample output):

| Signal      | Value      | Confidence |
|-------------|------------|------------|
| Emotion     | Frustrated | 84%        |
| Age         | Adult      | 84%        |
| Accent      | British    | 84%        |
| Pitch       | High       | 84%        |
| Vocal Style | Shouting   | 84%        |

More signals coming soon
Emotion detection: Happy, calm, angry, frustrated, neutral, tender. Adapt tone and routing based on how the user feels in realtime.
Language & accent: en-US, en-GB, en-IN, es-419, and more. Regional accent classification with confidence scores for each chunk.
Age & vocal pitch: Kid, young, adult, or old. High, mid, or low pitch. Personalize content and voice selection for the audience.
Vocal style & environment: Normal, whispering, shouting, laughing, singing. Quiet room, busy street, car. Classify both how and where they speak.
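
To use these signals in application code, a client needs to pull them out of each streaming result. The sketch below assumes a hypothetical message shape in which signals arrive under `result.voiceProfile` as `{value, confidence}` entries; the actual field names may differ, so verify against the streaming response schema.

```python
def top_signals(message: dict) -> dict:
    """Flatten voice-profile signals from a (hypothetical) streaming result
    message into {signal_name: (value, confidence)} pairs."""
    profile = message.get("result", {}).get("voiceProfile", {})
    return {
        name: (entry.get("value"), entry.get("confidence"))
        for name, entry in profile.items()
        if isinstance(entry, dict)
    }

# Illustrative message using the assumed shape:
sample = {
    "result": {
        "voiceProfile": {
            "emotion": {"value": "frustrated", "confidence": 0.84},
            "accent": {"value": "en-GB", "confidence": 0.84},
        }
    }
}
```

Because signals update with every audio chunk, an application would typically re-run this on each received message and keep the latest values.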
→ Condition Router: Pass profile fields as metadata to Router. CEL conditions route by emotion, language, or tier; for example, frustrated users get empathetic models and Spanish speakers get multilingual providers.
→ Condition TTS: Use profile signals as steering parameters for TTS. Calm users get a direct tone; frustrated users hear empathetic pacing. Vocal pitch informs voice selection automatically.
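
The routing idea above can be sketched in plain Python. This is an illustration of the logic a CEL condition such as `profile.emotion == "frustrated"` would express in Router, not Router's actual API; the model names are placeholders.

```python
def pick_model(profile: dict) -> str:
    """Route by voice-profile signals, mimicking CEL-style conditions.

    Placeholder model names; real Router targets are configured per project.
    """
    if profile.get("emotion") == "frustrated":
        return "empathetic-model"
    if profile.get("language", "").startswith("es"):
        return "multilingual-model"
    return "default-model"
```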

Built for realtime interactive audio

Every feature is designed for low-latency, high-accuracy speech recognition in production voice applications.

Realtime streaming

Realtime, bidirectional streaming over WebSocket for live audio, or synchronous transcription for complete audio files.

Semantic & acoustic VAD

Automatically detect when speech starts and stops, so conversations can follow natural speech patterns without cutting users off.
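
To give a feel for the acoustic side of VAD, here is a minimal energy-based sketch that classifies a frame of 16-bit mono PCM as speech or silence. This is a toy illustration of the concept, not the service's detector, which combines acoustic and semantic cues.

```python
import array
import math

def frame_is_speech(pcm16: bytes, threshold: float = 500.0) -> bool:
    """Classify one frame of 16-bit little-endian mono PCM by RMS energy.

    A real VAD also uses spectral and semantic features; this toy version
    only thresholds loudness.
    """
    samples = array.array("h", pcm16)
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > threshold
```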

Voice & context profiling

Understand the profile, context and state of your users to contextualize responses.

Unified multi-provider API

A single integration point for industry-leading, high-accuracy transcription providers, with consistent authentication, request formatting, and response handling.

High accuracy & custom vocabulary

Transcribe audio with industry-leading accuracy. Add domain-specific terms, product names, and specialized vocabulary to boost recognition further.
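
As a sketch of how custom vocabulary might be supplied, the config below extends the streaming `transcribeConfig` from earlier with an assumed `customVocabulary` field. That field name is an illustration only; check the STT API reference for the actual vocabulary or phrase-boost option.

```python
# "customVocabulary" is an assumed field name for illustration, not a
# confirmed part of the API schema.
config = {
    "transcribeConfig": {
        "modelId": "inworld/inworld-stt-1",
        "audioEncoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "language": "en-US",
        "customVocabulary": ["Inworld", "diarization", "Acme RoadRunner 3000"],
    }
}
```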

Word-level timestamps & diarization

Per-word timing for subtitles and search. Label speakers in multi-party conversations.
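
Per-word timing maps naturally onto subtitle formats. The sketch below converts word-level results into SRT cues, assuming each word arrives as a dict with `word`, `startTime`, and `endTime` in seconds; the real response field names may differ.

```python
def to_srt(words: list, max_words: int = 7) -> str:
    """Group (assumed-shape) word-level results into numbered SRT cues."""
    def fmt(t: float) -> str:
        # SRT timestamps: HH:MM:SS,mmm
        ms = int(round(t * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    cues = []
    for i in range(0, len(words), max_words):
        group = words[i : i + max_words]
        text = " ".join(w["word"] for w in group)
        cues.append(
            f"{i // max_words + 1}\n"
            f"{fmt(group[0]['startTime'])} --> {fmt(group[-1]['endTime'])}\n"
            f"{text}\n"
        )
    return "\n".join(cues)
```

With diarization enabled, each word would also carry a speaker label that could be prefixed to the cue text.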

Get started

Integrate Realtime STT with a few lines of code. Choose between realtime streaming over WebSocket or batch transcription.
```python
import asyncio
import base64
import json
import wave

import websockets

API_KEY = "<YOUR_API_KEY>"
WS_URL = "wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional"

async def stream_transcribe():
    headers = {"Authorization": f"Basic {API_KEY}"}
    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        # Read WAV and extract raw PCM
        with wave.open("audio.wav", "rb") as wf:
            sample_rate = wf.getframerate()
            channels = wf.getnchannels()
            pcm = wf.readframes(wf.getnframes())

        # 1. Send transcription config
        await ws.send(json.dumps({
            "transcribeConfig": {
                "modelId": "inworld/inworld-stt-1",
                "audioEncoding": "LINEAR16",
                "sampleRateHertz": 16000,
                "numberOfChannels": 1,
                "language": "en-US"
            }
        }))

        # 2. Stream audio in 100 ms chunks (base64-encoded)
        chunk_bytes = int(sample_rate * 2 * channels * 0.1)
        for i in range(0, len(pcm), chunk_bytes):
            chunk = pcm[i : i + chunk_bytes]
            await ws.send(json.dumps({
                "audioChunk": {"content": base64.b64encode(chunk).decode()}
            }))
            await asyncio.sleep(0.1)

        # 3. Signal end of turn
        await ws.send(json.dumps({"endTurn": {}}))

        # 4. Receive results until final
        while True:
            try:
                raw = await asyncio.wait_for(ws.recv(), timeout=10)
                msg = json.loads(raw)
                t = msg.get("result", {}).get("transcription", {})
                if t:
                    tag = "[FINAL]" if t.get("isFinal") else "[partial]"
                    print(f"{tag} {t.get('transcript', '')}")
                    if t.get("isFinal"):
                        break
            except asyncio.TimeoutError:
                break

        # 5. Close the stream
        await ws.send(json.dumps({"closeStream": {}}))

asyncio.run(stream_transcribe())
```
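
As a quick sanity check on the chunking in step 2 above, the chunk size works out as sample rate × bytes per sample × channels × duration:

```python
# 100 ms of 16 kHz, 16-bit (2-byte) mono PCM:
sample_rate, bytes_per_sample, channels = 16000, 2, 1
chunk_bytes = int(sample_rate * bytes_per_sample * channels * 0.1)
print(chunk_bytes)  # 3200 bytes per 100 ms chunk
```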

FAQs

Which languages does Realtime STT support?
Realtime STT is multilingual, with language support depending on the underlying STT model chosen. Whisper Large v3 supports 100+ languages, while AssemblyAI's Multilingual Universal-Streaming model supports six languages: English, Spanish, French, German, Italian, and Portuguese.

What does Realtime STT cost?
Rates for all models are available here.

Does it support both streaming and batch transcription?
Realtime STT supports both real-time bidirectional streaming over WebSocket for live audio and synchronous transcription for complete audio files.

Does it work with the Realtime API?
Realtime STT integrates seamlessly with the Realtime API, allowing you to easily create and deploy end-to-end, realtime voice pipelines.

Do you support Zero Data Retention?
Yes. Inworld supports Zero Data Retention (ZDR) across various models and providers available through the STT API. With ZDR, audio and transcription data are processed in real time and never stored. Visit our Security page to learn more.

Start building

Join millions of developers building the next wave of AI applications.