Inworld STT

Speech-to-text that truly understands your users in realtime

Realtime streaming recognition with custom vocabularies, voice profiling, and speaker diarization (coming soon). Built for interactive audio applications.


<100ms latency
Voice Profile signals
100+ languages

Choose your endpoint.

Stream audio in real time, transcribe complete files, or extract voice profile signals — all through one unified API.

Realtime bidirectional streaming over WebSocket
Synchronous transcription for complete audio files
Voice Profile signals on every streaming chunk
Multi-provider support via a single model ID

wscat -c 'wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional' \
  -H "Authorization: Basic $INWORLD_API_KEY"

# Send config as first message:
{
  "transcribeConfig": {
    "modelId": "inworld/inworld-stt-1",
    "audioEncoding": "LINEAR16",
    "sampleRateHertz": 16000,
    "language": "en-US",
    "voiceProfileConfig": {
      "enableVoiceProfile": true
    }
  }
}
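The same first message can be assembled in code rather than typed by hand. The following is a minimal sketch, not an official client: the helper name `build_transcribe_config` and its parameters are assumptions made for illustration, while the field names are the ones shown in the config above.

```python
import json


def build_transcribe_config(model_id="inworld/inworld-stt-1",
                            sample_rate_hz=16000,
                            language="en-US",
                            enable_voice_profile=True):
    """Return the JSON text to send as the first WebSocket message.

    Hypothetical helper: only the field names come from the example config.
    """
    config = {
        "modelId": model_id,
        "audioEncoding": "LINEAR16",
        "sampleRateHertz": sample_rate_hz,
        "language": language,
    }
    # Voice profiling is opt-in in the example config, so add it only on request.
    if enable_voice_profile:
        config["voiceProfileConfig"] = {"enableVoiceProfile": True}
    return json.dumps({"transcribeConfig": config})


if __name__ == "__main__":
    print(build_transcribe_config())
```

Serializing with `json.dumps` yields a single-line message, which is convenient for line-oriented tools.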

Understand user context to engage more effectively.

Every voice interaction builds a realtime profile of who is speaking. Emotion, accent, age, vocal style, and language — extracted from raw audio and updated with every chunk.

5 voice profile signals per audio chunk
Confidence scores for every classification
Configurable threshold to filter low-confidence results
Feed signals into Router or TTS for adaptive responses


Voice profile signals

Emotion: Frustrated (84%)
Age: Adult (84%)
Accent: British (84%)
Pitch: High (84%)
Vocal Style: Shouting (84%)

More signals coming soon
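To make the confidence threshold concrete, here is a small client-side sketch. The `Signal` shape and the `filter_signals` helper are hypothetical, since the page does not show the response schema, and the confidence values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """Hypothetical shape for one voice profile classification."""
    name: str          # e.g. "emotion"
    label: str         # e.g. "Frustrated"
    confidence: float  # 0.0 to 1.0


def filter_signals(signals, threshold=0.5):
    """Keep only classifications at or above the confidence threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [s for s in signals if s.confidence >= threshold]


# Illustrative values only, not real API output.
example_signals = [
    Signal("emotion", "Frustrated", 0.84),
    Signal("age", "Adult", 0.62),
    Signal("accent", "British", 0.41),
]

if __name__ == "__main__":
    for s in filter_signals(example_signals, threshold=0.5):
        print(f"{s.name}: {s.label} ({s.confidence:.0%})")
```

A higher threshold trades coverage for precision: fewer signals reach downstream components, but those that do are less likely to be misclassifications.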

Realtime speech recognition, built for production.

Low-latency streaming over WebSocket with semantic VAD, word-level timestamps, speaker diarization (coming soon), and custom vocabulary. A single unified API across industry-leading transcription providers.

Bidirectional WebSocket streaming for live audio
Semantic & acoustic VAD detects intent, not just silence
Word-level timestamps and speaker diarization (coming soon)
Custom vocabulary to boost domain-specific terms
Unified API across 6+ models from multiple providers


One API, 6+ models

Provider     Model
Inworld      inworld-stt-1
Groq         whisper-large-v3
AssemblyAI   universal-streaming-multilingual
AssemblyAI   universal-streaming-english
AssemblyAI   u3-rt-pro
AssemblyAI   whisper-rt

Fully multilingual. One API, any language.

One STT API gives you access to any language at benchmark-leading quality, whether you specialize in one predominant language or need 100+ languages at your fingertips.

Choose from models supporting 100+ languages
Realtime streaming in English, Spanish, French, German, Italian, and Portuguese
Voice profiling available across all models
Switch providers and languages with a single parameter
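
One possible selection policy, sketched under the language coverage described in the FAQ (AssemblyAI's multilingual Universal-Streaming model for six languages, Whisper Large v3 for 100+). The `pick_model` helper is hypothetical, and it returns the bare model names from the model table; the full `modelId` prefix for non-Inworld providers is not shown on this page and is not guessed here.

```python
# Primary language subtags covered by the six-language streaming model.
STREAMING_LANGUAGES = {"en", "es", "fr", "de", "it", "pt"}


def pick_model(language_tag):
    """Pick a model name for a BCP 47 style tag such as "es-MX".

    Hypothetical policy: prefer the streaming multilingual model when it
    covers the language, otherwise fall back to the broad-coverage model.
    """
    primary = language_tag.split("-")[0].lower()
    if primary in STREAMING_LANGUAGES:
        return "universal-streaming-multilingual"
    return "whisper-large-v3"


if __name__ == "__main__":
    for tag in ("en-US", "es-MX", "ja-JP"):
        print(tag, "->", pick_model(tag))
```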


100+ languages

🇬🇧 English · 🇪🇸 Español · 🇨🇳 中文 · 🇮🇳 हिन्दी · 🇯🇵 日本語 · 🇰🇷 한국어 · 🇫🇷 Français · 🇩🇪 Deutsch, and many more

Designed for realtime interactive audio.

Every feature is designed for low-latency, high-accuracy speech recognition in production voice applications.

Realtime bidirectional streaming over WebSocket
Semantic & acoustic VAD for natural turn-taking
Unified multi-provider API with consistent auth and formatting
High accuracy with custom vocabulary boosting
Word-level timestamps and speaker diarization (coming soon)
Voice & context profiling for user-aware responses


~100ms streaming latency (latency comparison chart: Inworld STT, OpenAI Whisper, Google Cloud)

Use cases

Inworld STT powers any application where understanding speech in realtime is critical.

Voice agents & customer support

Stream caller speech in realtime via WebSocket. Voice profiling detects emotion and vocal style to route calls and adapt agent behavior dynamically.
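
As a rough illustration of emotion-based routing, the sketch below escalates a call when a sufficiently confident emotion signal suggests the caller is upset. Everything here is an assumption: the `choose_route` helper, the route names, the label set (only "Frustrated" appears on this page), and the 0.7 cutoff.

```python
# Hypothetical set of emotion labels that should trigger escalation.
ESCALATION_EMOTIONS = {"frustrated", "angry"}


def choose_route(emotion, confidence, min_confidence=0.7):
    """Return a route name for a call based on the detected emotion.

    Low-confidence detections are ignored so that a noisy classification
    does not pull a caller out of the automated flow.
    """
    if confidence >= min_confidence and emotion.lower() in ESCALATION_EMOTIONS:
        return "human_agent"
    return "virtual_agent"


if __name__ == "__main__":
    print(choose_route("Frustrated", 0.84))
    print(choose_route("Frustrated", 0.50))
```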

Get started

Integrate Inworld STT with a few lines of code. Choose between realtime streaming over WebSocket or batch transcription.


import asyncio
import base64
import json
import wave
import websockets

API_KEY = "<YOUR_API_KEY>"
WS_URL = "wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional"

async def stream_transcribe():
    headers = {"Authorization": f"Basic {API_KEY}"}

    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
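        # Read the WAV file and extract raw PCM frames
        with wave.open("audio.wav", "rb") as wf:
            if wf.getsampwidth() != 2:
                raise ValueError("LINEAR16 requires 16-bit PCM audio")
            sample_rate = wf.getframerate()
            channels = wf.getnchannels()
            pcm = wf.readframes(wf.getnframes())

        # 1. Send transcription config describing the audio actually read,
        #    so the declared format matches the streamed bytes
        await ws.send(json.dumps({
            "transcribeConfig": {
                "modelId": "inworld/inworld-stt-1",
                "audioEncoding": "LINEAR16",
                "sampleRateHertz": sample_rate,
                "numberOfChannels": channels,
                "language": "en-US"
            }
        }))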

        # 2. Stream audio in 100 ms chunks (base64-encoded)
        chunk_bytes = int(sample_rate * 2 * channels * 0.1)
        for i in range(0, len(pcm), chunk_bytes):
            chunk = pcm[i : i + chunk_bytes]
            await ws.send(json.dumps({
                "audioChunk": {"content": base64.b64encode(chunk).decode()}
            }))
            await asyncio.sleep(0.1)

        # 3. Signal end of turn
        await ws.send(json.dumps({"endTurn": {}}))

        # 4. Receive results until final
        while True:
            try:
                raw = await asyncio.wait_for(ws.recv(), timeout=10)
                msg = json.loads(raw)
                t = msg.get("result", {}).get("transcription", {})
                if t:
                    tag = "[FINAL]" if t.get("isFinal") else "[partial]"
                    print(f"{tag} {t.get('transcript', '')}")
                    if t.get("isFinal"):
                        break
            except asyncio.TimeoutError:
                break

        # 5. Close the stream
        await ws.send(json.dumps({"closeStream": {}}))

asyncio.run(stream_transcribe())
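
The chunking arithmetic in step 2 can be isolated and checked without a network connection. This is a sketch only: the helper names are hypothetical, and it assumes 16-bit (2 bytes per sample) PCM as in the example above. At 16 kHz mono, 100 ms works out to 16000 × 2 × 0.1 = 3200 bytes.

```python
import base64
import json

BYTES_PER_SAMPLE = 2  # LINEAR16 means 16-bit samples


def chunk_size_bytes(sample_rate_hz, channels=1, chunk_ms=100):
    """Number of PCM bytes covering chunk_ms of audio."""
    return sample_rate_hz * channels * BYTES_PER_SAMPLE * chunk_ms // 1000


def audio_chunk_messages(pcm, sample_rate_hz, channels=1, chunk_ms=100):
    """Yield one JSON audioChunk message per chunk of raw PCM bytes."""
    size = chunk_size_bytes(sample_rate_hz, channels, chunk_ms)
    for start in range(0, len(pcm), size):
        chunk = pcm[start:start + size]
        content = base64.b64encode(chunk).decode("ascii")
        yield json.dumps({"audioChunk": {"content": content}})


if __name__ == "__main__":
    one_second_of_silence = bytes(16000 * BYTES_PER_SAMPLE)
    messages = list(audio_chunk_messages(one_second_of_silence, 16000))
    print(chunk_size_bytes(16000), "bytes per chunk,", len(messages), "messages")
```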

import WebSocket from 'ws';
import fs from 'fs';
import { execSync } from 'child_process';

const API_KEY = '<YOUR_API_KEY>';
const WS_URL = 'wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional';

// Convert WAV to raw PCM (16-bit LE, 16 kHz, mono)
execSync('ffmpeg -y -i audio.wav -f s16le -ar 16000 -ac 1 audio.raw');
const pcm = fs.readFileSync('audio.raw');

const ws = new WebSocket(WS_URL, {
    headers: { Authorization: `Basic ${API_KEY}` },
});

ws.on('open', () => {
    // 1. Send transcription config
    ws.send(JSON.stringify({
        transcribeConfig: {
            modelId: 'inworld/inworld-stt-1',
            audioEncoding: 'LINEAR16',
            sampleRateHertz: 16000,
            numberOfChannels: 1,
            language: 'en-US',
        },
    }));

    // 2. Stream audio in 100 ms chunks (3200 bytes @ 16 kHz mono 16-bit)
    const chunkBytes = 16000 * 2 * 0.1;
    let offset = 0;
    const interval = setInterval(() => {
        if (offset >= pcm.length) {
            clearInterval(interval);
            ws.send(JSON.stringify({ endTurn: {} }));
            return;
        }
        const chunk = pcm.subarray(offset, offset + chunkBytes);
        ws.send(JSON.stringify({
            audioChunk: { content: chunk.toString('base64') },
        }));
        offset += chunkBytes;
    }, 100);
});

ws.on('message', (data) => {
    const msg = JSON.parse(data);
    const t = msg?.result?.transcription;
    if (t) console.log(t.isFinal ? '[FINAL]' : '[partial]', t.transcript);
    if (t?.isFinal) {
        ws.send(JSON.stringify({ closeStream: {} }));
        ws.close();
    }
});
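
Both examples above read transcripts from `result.transcription` in each incoming message. That parsing step can be pulled into a small function and tested on its own. This is a sketch assuming only the fields those examples use (`transcript`, `isFinal`); the name `parse_transcription` is hypothetical, and any other message types the service may send are simply ignored.

```python
import json


def parse_transcription(raw):
    """Return (is_final, transcript), or None for non-transcription messages.

    Accepts str or bytes, as delivered by a WebSocket client.
    """
    msg = json.loads(raw)
    if not isinstance(msg, dict):
        return None
    transcription = (msg.get("result") or {}).get("transcription")
    if not transcription:
        return None
    return bool(transcription.get("isFinal")), transcription.get("transcript", "")


if __name__ == "__main__":
    sample = '{"result": {"transcription": {"transcript": "hello", "isFinal": true}}}'
    print(parse_transcription(sample))
```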

FAQs

Inworld STT is multilingual, with language support depending on the underlying STT model chosen. Whisper Large v3 supports 100+ languages, while AssemblyAI's Multilingual Universal-Streaming model supports six languages: English, Spanish, French, German, Italian, and Portuguese.

Rates for all models are available here.

Inworld STT supports both realtime bidirectional streaming over WebSocket for live audio and synchronous transcription for complete audio files.

Inworld STT integrates seamlessly into the Inworld Realtime API, allowing you to easily create and deploy end-to-end, realtime voice pipelines.

Inworld supports Zero Data Retention (ZDR) across various models and providers available through the STT API. With ZDR, audio and transcription data are processed in real time and never stored. Visit our Security page to learn more.