Realtime STT

Speech-to-text that truly understands your users in realtime

Realtime streaming recognition with diarization, custom vocabularies, and voice profiling. Built for interactive audio applications.
<100ms latency · 5 voice profile signals · 100+ languages
```shell
wscat -c 'wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional' \
  -H "Authorization: Basic $INWORLD_API_KEY"

# Send config as first message:
{
  "transcribeConfig": {
    "modelId": "inworld/inworld-stt-1",
    "audioEncoding": "LINEAR16",
    "sampleRateHertz": 16000,
    "language": "en-US",
    "voiceProfileConfig": { "enableVoiceProfile": true }
  }
}
```

Choose your endpoint.

Stream audio in real time, transcribe complete files, or extract voice profile signals — all through one unified API.

  • Realtime bidirectional streaming over WebSocket
  • Synchronous transcription for complete audio files
  • Voice Profile signals on every streaming chunk
  • Multi-provider support via a single model ID
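
To illustrate the request shape, here is a minimal Python sketch that builds a synchronous-transcription body, assuming the batch payload mirrors the streaming `transcribeConfig` shown above. The exact batch endpoint path and field layout are assumptions here; consult the API reference for the authoritative schema.

```python
import base64
import json

def build_transcribe_request(pcm: bytes, sample_rate: int = 16000) -> str:
    """Build a JSON body pairing the transcribe config with base64 audio.

    Assumes the batch body reuses the streaming "transcribeConfig" shape;
    the real synchronous payload may differ.
    """
    return json.dumps({
        "transcribeConfig": {
            "modelId": "inworld/inworld-stt-1",
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": sample_rate,
            "language": "en-US",
        },
        "audio": {"content": base64.b64encode(pcm).decode()},
    })
```

The resulting string would then be POSTed to the synchronous endpoint with the same `Authorization: Basic` header used for the WebSocket connection.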

Understand users and their context, and engage them more effectively

Realtime profiling extracts emotion, accent, age, vocal style, environment, and language from every voice interaction. Updated with every audio chunk.
Voice profile signals (sample output):

| Signal      | Value      | Confidence |
|-------------|------------|------------|
| Emotion     | Frustrated | 84%        |
| Age         | Adult      | 84%        |
| Accent      | British    | 84%        |
| Pitch       | High       | 84%        |
| Vocal Style | Shouting   | 84%        |

More signals coming soon
Emotion detection: Happy, calm, angry, frustrated, neutral, tender. Adapt tone and routing based on how the user feels in realtime.
Language & accent: en-US, en-GB, en-IN, es-419, and more. Regional accent classification with confidence scores for each chunk.
Age & vocal pitch: Kid, young, adult, or old. High, mid, or low pitch. Personalize content and voice selection for the audience.
Vocal style & environment: Normal, whispering, shouting, laughing, singing. Quiet room, busy street, car. Classify both how and where they speak.
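
To use these signals in application code, a client needs to pull them out of each streaming result. The sketch below assumes a hypothetical message shape in which signals arrive under `result.voiceProfile` as `{value, confidence}` entries; the actual field names may differ, so verify against the streaming response schema.

```python
def top_signals(message: dict) -> dict:
    """Flatten voice-profile signals from a (hypothetical) streaming result
    message into {signal_name: (value, confidence)} pairs."""
    profile = message.get("result", {}).get("voiceProfile", {})
    return {
        name: (entry.get("value"), entry.get("confidence"))
        for name, entry in profile.items()
        if isinstance(entry, dict)
    }

# Illustrative message using the assumed shape:
sample = {
    "result": {
        "voiceProfile": {
            "emotion": {"value": "frustrated", "confidence": 0.84},
            "accent": {"value": "en-GB", "confidence": 0.84},
        }
    }
}
```

Because signals update with every audio chunk, an application would typically re-run this on each received message and keep the latest values.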
→ Condition Router: Pass profile fields as metadata to Router. CEL conditions route by emotion, language, or tier; for example, frustrated users get empathetic models and Spanish speakers get multilingual providers.
→ Condition TTS: Use profile signals as steering parameters for TTS. Calm users get a direct tone; frustrated users hear empathetic pacing. Vocal pitch informs voice selection automatically.
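
The routing idea above can be sketched in plain Python. This is an illustration of the logic a CEL condition such as `profile.emotion == "frustrated"` would express in Router, not Router's actual API; the model names are placeholders.

```python
def pick_model(profile: dict) -> str:
    """Route by voice-profile signals, mimicking CEL-style conditions.

    Placeholder model names; real Router targets are configured per project.
    """
    if profile.get("emotion") == "frustrated":
        return "empathetic-model"
    if profile.get("language", "").startswith("es"):
        return "multilingual-model"
    return "default-model"
```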

Built for realtime interactive audio

Every feature is designed for low-latency, high-accuracy speech recognition in production voice applications.

Realtime streaming

Realtime, bidirectional streaming over WebSocket for live audio, or synchronous transcription for complete audio files.

Semantic & acoustic VAD

Automatically detect when speech starts and stops, so conversations can follow natural speech patterns without cutting users off.
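
To give a feel for the acoustic side of VAD, here is a minimal energy-based sketch that classifies a frame of 16-bit mono PCM as speech or silence. This is a toy illustration of the concept, not the service's detector, which combines acoustic and semantic cues.

```python
import array
import math

def frame_is_speech(pcm16: bytes, threshold: float = 500.0) -> bool:
    """Classify one frame of 16-bit little-endian mono PCM by RMS energy.

    A real VAD also uses spectral and semantic features; this toy version
    only thresholds loudness.
    """
    samples = array.array("h", pcm16)
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > threshold
```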

Voice & context profiling

Understand the profile, context and state of your users to contextualize responses.

Unified multi-provider API

A single integration point for industry-leading, high-accuracy transcription providers, with consistent authentication, request formatting, and response handling.

High accuracy & custom vocabulary

Transcribe audio with industry-leading accuracy. Add domain-specific terms, product names, and specialized vocabulary to boost recognition further.
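
As a sketch of how custom vocabulary might be supplied, the config below extends the streaming `transcribeConfig` from earlier with an assumed `customVocabulary` field. That field name is an illustration only; check the STT API reference for the actual vocabulary or phrase-boost option.

```python
# "customVocabulary" is an assumed field name for illustration, not a
# confirmed part of the API schema.
config = {
    "transcribeConfig": {
        "modelId": "inworld/inworld-stt-1",
        "audioEncoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "language": "en-US",
        "customVocabulary": ["Inworld", "diarization", "Acme RoadRunner 3000"],
    }
}
```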

Word-level timestamps & diarization

Per-word timing for subtitles and search. Label speakers in multi-party conversations.
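
Per-word timing maps naturally onto subtitle formats. The sketch below converts word-level results into SRT cues, assuming each word arrives as a dict with `word`, `startTime`, and `endTime` in seconds; the real response field names may differ.

```python
def to_srt(words: list, max_words: int = 7) -> str:
    """Group (assumed-shape) word-level results into numbered SRT cues."""
    def fmt(t: float) -> str:
        # SRT timestamps: HH:MM:SS,mmm
        ms = int(round(t * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    cues = []
    for i in range(0, len(words), max_words):
        group = words[i : i + max_words]
        text = " ".join(w["word"] for w in group)
        cues.append(
            f"{i // max_words + 1}\n"
            f"{fmt(group[0]['startTime'])} --> {fmt(group[-1]['endTime'])}\n"
            f"{text}\n"
        )
    return "\n".join(cues)
```

With diarization enabled, each word would also carry a speaker label that could be prefixed to the cue text.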

Get started

Integrate Realtime STT with a few lines of code. Choose between realtime streaming over WebSocket or batch transcription.
```python
import asyncio
import base64
import json
import wave

import websockets

API_KEY = "<YOUR_API_KEY>"
WS_URL = "wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional"

async def stream_transcribe():
    headers = {"Authorization": f"Basic {API_KEY}"}
    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        # Read WAV and extract raw PCM
        with wave.open("audio.wav", "rb") as wf:
            sample_rate = wf.getframerate()
            channels = wf.getnchannels()
            pcm = wf.readframes(wf.getnframes())

        # 1. Send transcription config
        await ws.send(json.dumps({
            "transcribeConfig": {
                "modelId": "inworld/inworld-stt-1",
                "audioEncoding": "LINEAR16",
                "sampleRateHertz": 16000,
                "numberOfChannels": 1,
                "language": "en-US"
            }
        }))

        # 2. Stream audio in 100 ms chunks (base64-encoded)
        chunk_bytes = int(sample_rate * 2 * channels * 0.1)
        for i in range(0, len(pcm), chunk_bytes):
            chunk = pcm[i : i + chunk_bytes]
            await ws.send(json.dumps({
                "audioChunk": {"content": base64.b64encode(chunk).decode()}
            }))
            await asyncio.sleep(0.1)

        # 3. Signal end of turn
        await ws.send(json.dumps({"endTurn": {}}))

        # 4. Receive results until final
        while True:
            try:
                raw = await asyncio.wait_for(ws.recv(), timeout=10)
                msg = json.loads(raw)
                t = msg.get("result", {}).get("transcription", {})
                if t:
                    tag = "[FINAL]" if t.get("isFinal") else "[partial]"
                    print(f"{tag} {t.get('transcript', '')}")
                    if t.get("isFinal"):
                        break
            except asyncio.TimeoutError:
                break

        # 5. Close the stream
        await ws.send(json.dumps({"closeStream": {}}))

asyncio.run(stream_transcribe())
```
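
As a quick sanity check on the chunking in step 2 above, the chunk size works out as sample rate × bytes per sample × channels × duration:

```python
# 100 ms of 16 kHz, 16-bit (2-byte) mono PCM:
sample_rate, bytes_per_sample, channels = 16000, 2, 1
chunk_bytes = int(sample_rate * bytes_per_sample * channels * 0.1)
print(chunk_bytes)  # 3200 bytes per 100 ms chunk
```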

FAQs

Which languages does Realtime STT support?
Realtime STT is multilingual, with language support depending on the underlying STT model chosen. Whisper Large v3 supports 100+ languages, while AssemblyAI's Multilingual Universal-Streaming model supports six languages: English, Spanish, French, German, Italian, and Portuguese.

What does Realtime STT cost?
Rates for all models are available here.

Does it support both streaming and batch transcription?
Realtime STT supports both real-time bidirectional streaming over WebSocket for live audio and synchronous transcription for complete audio files.

Does it work with the Realtime API?
Realtime STT integrates seamlessly with the Realtime API, allowing you to easily create and deploy end-to-end, realtime voice pipelines.

Do you support Zero Data Retention?
Yes. Inworld supports Zero Data Retention (ZDR) across various models and providers available through the STT API. With ZDR, audio and transcription data are processed in real time and never stored. Visit our Security page to learn more.

Start building

Join millions of developers building the next wave of AI applications.