Inworld STT

Speech-to-text that truly understands your users in realtime

Realtime streaming recognition with diarization, custom vocabularies, and voice profiling. Built for interactive audio applications.
Live transcription
Customer (lang: en-US, age: adult, emotion: frustrated, style: shouting)

I've been looking all through my account and can't figure out why I was charged twice on order CX-204.

Agent (lang: en-US, age: adult, emotion: calm, style: reassuring)

I can see the duplicate charge on your account. I'm processing a refund right now. Sorry for the frustration.

Understand users and their context. Engage them more effectively

Realtime profiling extracts emotion, accent, age, vocal style, environment, and language from every voice interaction. Updated with every audio chunk.

Voice profile signals
Emotion: Frustrated (84%)
Age: Adult (84%)
Accent: British (84%)
Pitch: High (84%)
Vocal Style: Shouting (84%)
More signals coming soon
Emotion detection: Happy, calm, angry, frustrated, neutral, tender. Adapt tone and routing based on how the user feels in realtime.
Language & accent: en-US, en-GB, en-IN, es-419, and more. Regional accent classification with confidence scores for each chunk.
Age & vocal pitch: Kid, young, adult, or old. High, mid, or low pitch. Personalize content and voice selection for the audience.
Vocal style & environment: Normal, whispering, shouting, laughing, singing. Quiet room, busy street, car. Classify both how and where they speak.
→ Router: Pass profile fields as metadata to Router. CEL conditions route by emotion, language, or tier. For example, frustrated users get empathetic models, and Spanish speakers get multilingual providers.
→ TTS: Use profile signals as steering parameters for TTS. Calm users get a direct tone. Frustrated users hear empathetic pacing. Vocal pitch informs voice selection automatically.
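The routing idea above can be sketched as a small Python function. The profile field names (`emotion`, `language`, each with a `value` key) and the route names are assumptions for illustration, not the actual Router or CEL condition syntax:

```python
def pick_route(profile):
    """Map voice-profile signals to a model route.

    Illustrative only: the payload shape ({"emotion": {"value": ...},
    "language": {"value": ...}}) and route names are hypothetical.
    """
    emotion = profile.get("emotion", {}).get("value", "neutral")
    language = profile.get("language", {}).get("value", "en-US")
    if emotion in ("frustrated", "angry"):
        # Frustrated users get an empathetic model.
        return "empathetic-model"
    if not language.startswith("en"):
        # Non-English speakers get a multilingual provider.
        return "multilingual-model"
    return "default-model"
```

In a real pipeline, the equivalent logic would live in Router CEL conditions rather than application code; this just shows the shape of the decision.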

Built for realtime interactive audio

Every feature is designed for low-latency, high-accuracy speech recognition in production voice applications.

Realtime streaming

Realtime, bidirectional streaming over WebSocket for live audio, or synchronous transcription for complete audio files.

Semantic & acoustic VAD

Automatically detect when speech starts and stops, enabling natural conversational turn-taking.

Voice & context profiling

Understand the profile, context and state of your users to contextualize responses.

Unified multi-provider API

A single integration point for industry-leading, high-accuracy transcription providers, with consistent authentication, request formatting, and response handling.

High accuracy & custom vocabulary

Transcribe audio with industry-leading accuracy. Add domain-specific terms, product names, and specialized vocabulary to boost recognition further.
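A custom vocabulary would be supplied alongside the rest of the transcription config. A minimal sketch, assuming a `customVocabulary` field with a `phrases` list (the field names are assumptions; check the STT API reference for the exact schema):

```python
def build_transcribe_config(vocabulary, sample_rate=16000):
    """Build a transcription config that boosts domain-specific terms.

    The "customVocabulary" / "phrases" field names are assumptions
    for illustration; the rest mirrors the streaming example below.
    """
    return {
        "transcribeConfig": {
            "modelId": "inworld/inworld-stt-1",
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": sample_rate,
            "numberOfChannels": 1,
            "language": "en-US",
            "customVocabulary": {"phrases": vocabulary},
        }
    }
```

For example, passing `["CX-204", "Inworld"]` would bias recognition toward the order-ID format and product name used in the demo transcript above.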

Word-level timestamps & diarization

Per-word timing for subtitles and search. Label speakers in multi-party conversations.
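Combining the two, word-level results can be collapsed into per-speaker utterances. A sketch assuming each word carries `word`, `speaker`, `startTime`, and `endTime` fields (the exact field names are an assumption):

```python
def group_by_speaker(words):
    """Collapse word-level results into per-speaker turns.

    Assumed word shape (hypothetical field names):
    {"word": str, "speaker": str, "startTime": float, "endTime": float}
    """
    turns = []
    for w in words:
        if turns and turns[-1]["speaker"] == w.get("speaker"):
            # Same speaker is still talking: extend the current turn.
            turns[-1]["text"] += " " + w["word"]
            turns[-1]["end"] = w["endTime"]
        else:
            # Speaker changed: start a new turn.
            turns.append({
                "speaker": w.get("speaker"),
                "text": w["word"],
                "start": w["startTime"],
                "end": w["endTime"],
            })
    return turns
```

The same per-word timings can feed subtitle generation or transcript search; this grouping is just one common consumer.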

Get started

Integrate Inworld STT with a few lines of code. Choose between realtime streaming over WebSocket or batch transcription.
import asyncio
import base64
import json
import wave

import websockets

API_KEY = "<YOUR_API_KEY>"
WS_URL = "wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional"


async def stream_transcribe():
    headers = {"Authorization": f"Basic {API_KEY}"}
    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        # Read WAV and extract raw PCM
        with wave.open("audio.wav", "rb") as wf:
            sample_rate = wf.getframerate()
            channels = wf.getnchannels()
            pcm = wf.readframes(wf.getnframes())

        # 1. Send transcription config (use the file's actual rate and channel count)
        await ws.send(json.dumps({
            "transcribeConfig": {
                "modelId": "inworld/inworld-stt-1",
                "audioEncoding": "LINEAR16",
                "sampleRateHertz": sample_rate,
                "numberOfChannels": channels,
                "language": "en-US"
            }
        }))

        # 2. Stream audio in 100 ms chunks (base64-encoded)
        chunk_bytes = int(sample_rate * 2 * channels * 0.1)
        for i in range(0, len(pcm), chunk_bytes):
            chunk = pcm[i : i + chunk_bytes]
            await ws.send(json.dumps({
                "audioChunk": {"content": base64.b64encode(chunk).decode()}
            }))
            await asyncio.sleep(0.1)

        # 3. Signal end of turn
        await ws.send(json.dumps({"endTurn": {}}))

        # 4. Receive results until final
        while True:
            try:
                raw = await asyncio.wait_for(ws.recv(), timeout=10)
            except asyncio.TimeoutError:
                break
            msg = json.loads(raw)
            t = msg.get("result", {}).get("transcription", {})
            if t:
                tag = "[FINAL]" if t.get("isFinal") else "[partial]"
                print(f"{tag} {t.get('transcript', '')}")
                if t.get("isFinal"):
                    break

        # 5. Close the stream
        await ws.send(json.dumps({"closeStream": {}}))


asyncio.run(stream_transcribe())
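For the batch alternative, the request can be built the same way and sent over plain HTTPS. A minimal sketch: the synchronous endpoint path (`/stt/v1/transcribe`) and the `audio.content` field are assumptions, so verify them against the API reference before use.

```python
import base64
import json

API_KEY = "<YOUR_API_KEY>"
# Synchronous endpoint path is an assumption; check the API reference.
BATCH_URL = "https://api.inworld.ai/stt/v1/transcribe"


def build_batch_request(audio_bytes, sample_rate=16000, channels=1):
    """Build the URL, headers, and JSON body for a one-shot transcription call."""
    headers = {
        "Authorization": f"Basic {API_KEY}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "transcribeConfig": {
            "modelId": "inworld/inworld-stt-1",
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": sample_rate,
            "numberOfChannels": channels,
            "language": "en-US",
        },
        # Complete audio file, base64-encoded (field name is an assumption).
        "audio": {"content": base64.b64encode(audio_bytes).decode()},
    })
    return BATCH_URL, headers, body


# Send with any HTTP client, e.g.:
# import urllib.request
# url, headers, body = build_batch_request(pcm)
# req = urllib.request.Request(url, data=body.encode(), headers=headers)
# print(urllib.request.urlopen(req).read())
```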

FAQs

What languages does Inworld STT support?
Inworld STT is multilingual, with language support depending on the underlying STT model chosen. Whisper Large v3 supports 100+ languages, while AssemblyAI's Multilingual Universal-Streaming model supports six languages: English, Spanish, French, German, Italian, and Portuguese.

How much does Inworld STT cost?
Rates for all models are available here.

Does Inworld STT support streaming?
Inworld STT supports both real-time bidirectional streaming over WebSocket for live audio and synchronous transcription for complete audio files.

Does Inworld STT work with the Inworld Realtime API?
Inworld STT integrates seamlessly into the Inworld Realtime API, allowing you to easily create and deploy end-to-end, realtime voice pipelines.

Does Inworld STT support Zero Data Retention?
Yes. Inworld supports Zero Data Retention (ZDR) across various models and providers available through the STT API. With ZDR, audio and transcription data are processed in real time and never stored. Visit our Security page to learn more.

Start building

Join millions of developers building the next wave of AI applications.
Copyright © 2021-2026 Inworld AI