Inworld STT

Speech-to-text that truly understands your users in realtime

Realtime streaming recognition with custom vocabularies, voice profiling, and speaker diarization (coming soon). Built for interactive audio applications.
  • <100ms latency
  • 5 voice profile signals
  • 100+ languages

Choose your endpoint.

Stream audio in real time, transcribe complete files, or extract voice profile signals — all through one unified API.

  • Realtime bidirectional streaming over WebSocket
  • Synchronous transcription for complete audio files
  • Voice Profile signals on every streaming chunk
  • Multi-provider support via a single model ID
wscat -c 'wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional' \
  -H "Authorization: Basic $INWORLD_API_KEY"

# Send config as first message:
{
  "transcribeConfig": {
    "modelId": "inworld/inworld-stt-1",
    "audioEncoding": "LINEAR16",
    "sampleRateHertz": 16000,
    "language": "en-US",
    "voiceProfileConfig": { "enableVoiceProfile": true }
  }
}
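The first message can also be assembled in code. The following is a minimal Python sketch that reuses only the field names from the wscat example above; `build_config` is a hypothetical helper for illustration, not part of any Inworld SDK.

```python
import json


def build_config(model_id="inworld/inworld-stt-1", sample_rate=16000,
                 language="en-US", voice_profile=True):
    """Return the JSON string to send as the first WebSocket message.

    Field names mirror the wscat example; this helper is hypothetical
    and only assembles the payload, it does not open a connection.
    """
    config = {
        "modelId": model_id,
        "audioEncoding": "LINEAR16",
        "sampleRateHertz": sample_rate,
        "language": language,
    }
    if voice_profile:
        # Opt in to voice profile signals on streaming chunks.
        config["voiceProfileConfig"] = {"enableVoiceProfile": True}
    return json.dumps({"transcribeConfig": config})


if __name__ == "__main__":
    print(build_config())
```

Keeping the payload in one function makes it easier to change the model or language in a single place before the stream starts.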

Understand user context to engage more effectively.

Every voice interaction builds a realtime profile of who is speaking. Emotion, accent, age, vocal style, and language — extracted from raw audio and updated with every chunk.

  • 5 voice profile signals per audio chunk
  • Confidence scores for every classification
  • Configurable threshold to filter low-confidence results
  • Feed signals into Router or TTS for adaptive responses
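To illustrate the idea of confidence-based filtering, here is a small client-side Python sketch. The response schema for voice profile signals is not shown on this page, so the dictionary shape (`signal`, `label`, `confidence`) and the `filter_signals` helper are assumptions made purely for illustration; the configurable threshold described above may well be applied server-side instead.

```python
from typing import Dict, List

# ASSUMPTION: each signal is a dict with "signal", "label", and "confidence"
# keys. The real response schema may differ.
Signal = Dict[str, object]


def filter_signals(signals: List[Signal], threshold: float = 0.5) -> List[Signal]:
    """Keep only classifications whose confidence meets the threshold."""
    return [s for s in signals if float(s["confidence"]) >= threshold]


if __name__ == "__main__":
    sample = [
        {"signal": "emotion", "label": "frustrated", "confidence": 0.84},
        {"signal": "accent", "label": "british", "confidence": 0.31},
    ]
    for s in filter_signals(sample, threshold=0.6):
        print(s["signal"], s["label"], s["confidence"])
```

Downstream components would then only react to signals that passed the filter, which avoids adapting a response to a weak guess.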
Voice profile signals (example values)
  • Emotion: Frustrated (84%)
  • Age: Adult (84%)
  • Accent: British (84%)
  • Pitch: High (84%)
  • Vocal Style: Shouting (84%)
More signals coming soon

Realtime speech recognition, built for production.

Low-latency streaming over WebSocket with semantic VAD, word-level timestamps, speaker diarization (coming soon), and custom vocabulary. A single unified API across industry-leading transcription providers.

  • Bidirectional WebSocket streaming for live audio
  • Semantic & acoustic VAD detects intent, not just silence
  • Word-level timestamps and speaker diarization (coming soon)
  • Custom vocabulary to boost domain-specific terms
  • Unified API across 6+ models from multiple providers
One API, 6+ models

Provider      Model
Inworld       inworld-stt-1
Groq          whisper-large-v3
AssemblyAI    universal-streaming-multilingual
AssemblyAI    universal-streaming-english
AssemblyAI    u3-rt-pro
AssemblyAI    whisper-rt
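Switching providers through a single model ID might look like the Python sketch below. Only `inworld/inworld-stt-1` appears verbatim on this page; the `<provider>/<model>` pattern for the other entries is an assumption extrapolated from that one example, and `model_id` and `config_for` are hypothetical helpers. Check the API reference for the exact identifiers.

```python
# Provider/model pairs copied from the list above.
MODELS = {
    "Inworld": ["inworld-stt-1"],
    "Groq": ["whisper-large-v3"],
    "AssemblyAI": [
        "universal-streaming-multilingual",
        "universal-streaming-english",
        "u3-rt-pro",
        "whisper-rt",
    ],
}


def model_id(provider: str, model: str) -> str:
    """Build a model ID string for a listed provider/model pair."""
    if model not in MODELS.get(provider, []):
        raise ValueError(f"unknown model {model!r} for provider {provider!r}")
    # ASSUMPTION: IDs follow "<provider>/<model>", as in "inworld/inworld-stt-1".
    return f"{provider.lower()}/{model}"


def config_for(provider: str, model: str, language: str = "en-US") -> dict:
    """Return a transcribeConfig payload that differs only in modelId."""
    return {
        "transcribeConfig": {
            "modelId": model_id(provider, model),
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "language": language,
        }
    }


if __name__ == "__main__":
    print(config_for("Inworld", "inworld-stt-1"))
```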

Fully multilingual. One API, any language.

One STT API gives you access to any language with benchmark-leading quality, whether you specialize in one predominant language or need 100+ languages at your fingertips.

  • Choose from models supporting 100+ languages
  • Realtime streaming in English, Spanish, French, German, Italian, and Portuguese
  • Voice profiling available across all models
  • Switch providers and languages with a single parameter
100+ languages: 🇬🇧 English, 🇪🇸 Español, 🇨🇳 中文, 🇮🇳 हिन्दी, 🇯🇵 日本語, 🇰🇷 한국어, 🇫🇷 Français, 🇩🇪 Deutsch, and many more
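Before opening a stream, a client may want to check whether a requested language falls within the six realtime streaming languages listed above. The Python sketch below assumes language tags shaped like `en-US` (the only tag shown on this page) and compares only the primary subtag; the two-letter codes and the `supports_streaming` helper are illustrative assumptions, not an official list.

```python
# ASSUMPTION: two-letter codes for English, Spanish, French, German,
# Italian, and Portuguese, the realtime streaming languages named above.
STREAMING_LANGUAGES = {"en", "es", "fr", "de", "it", "pt"}


def primary_subtag(language_tag: str) -> str:
    """Return the lowercase primary subtag, e.g. "en" for "en-US"."""
    return language_tag.split("-")[0].lower()


def supports_streaming(language_tag: str) -> bool:
    """Check a tag against the listed realtime streaming languages."""
    return primary_subtag(language_tag) in STREAMING_LANGUAGES


if __name__ == "__main__":
    for tag in ("en-US", "pt-BR", "ja-JP"):
        print(tag, supports_streaming(tag))
```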

Designed for realtime interactive audio.

Every feature is designed for low-latency, high-accuracy speech recognition in production voice applications.

  • Realtime bidirectional streaming over WebSocket
  • Semantic & acoustic VAD for natural turn-taking
  • Unified multi-provider API with consistent auth and formatting
  • High accuracy with custom vocabulary boosting
  • Word-level timestamps and speaker diarization (coming soon)
  • Voice & context profiling for user-aware responses
~100ms streaming latency
[Latency comparison chart: Inworld STT, OpenAI Whisper, Google Cloud]

Use cases

Inworld STT powers any application where understanding speech in realtime is critical.

Get started

Integrate Inworld STT with a few lines of code. Choose between realtime streaming over WebSocket or batch transcription.
import asyncio
import base64
import json
import wave

import websockets

API_KEY = "<YOUR_API_KEY>"
WS_URL = "wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional"


async def stream_transcribe():
    headers = {"Authorization": f"Basic {API_KEY}"}
    # Recent websockets releases take `additional_headers`;
    # older releases use `extra_headers` instead.
    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        # Read WAV and extract raw PCM
        with wave.open("audio.wav", "rb") as wf:
            sample_rate = wf.getframerate()
            channels = wf.getnchannels()
            pcm = wf.readframes(wf.getnframes())

        # 1. Send transcription config (rate and channels taken from the file)
        await ws.send(json.dumps({
            "transcribeConfig": {
                "modelId": "inworld/inworld-stt-1",
                "audioEncoding": "LINEAR16",
                "sampleRateHertz": sample_rate,
                "numberOfChannels": channels,
                "language": "en-US"
            }
        }))

        # 2. Stream audio in 100 ms chunks (base64-encoded)
        # LINEAR16 means 2 bytes per sample
        chunk_bytes = int(sample_rate * 2 * channels * 0.1)
        for i in range(0, len(pcm), chunk_bytes):
            chunk = pcm[i : i + chunk_bytes]
            await ws.send(json.dumps({
                "audioChunk": {"content": base64.b64encode(chunk).decode()}
            }))
            await asyncio.sleep(0.1)  # pace the upload at realtime speed

        # 3. Signal end of turn
        await ws.send(json.dumps({"endTurn": {}}))

        # 4. Receive results until final (or until 10 s pass with no message)
        while True:
            try:
                raw = await asyncio.wait_for(ws.recv(), timeout=10)
                msg = json.loads(raw)
                t = msg.get("result", {}).get("transcription", {})
                if t:
                    tag = "[FINAL]" if t.get("isFinal") else "[partial]"
                    print(f"{tag} {t.get('transcript', '')}")
                    if t.get("isFinal"):
                        break
            except asyncio.TimeoutError:
                break

        # 5. Close the stream
        await ws.send(json.dumps({"closeStream": {}}))


asyncio.run(stream_transcribe())
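The chunking arithmetic and message handling in the streaming example can be pulled into small, testable helpers. The Python sketch below uses only the message shapes that appear in that example (`audioChunk.content`, `result.transcription.transcript`, `isFinal`); the helper names are hypothetical and not part of any Inworld SDK.

```python
import base64
import json


def chunk_size_bytes(sample_rate, channels, sample_width=2, chunk_ms=100):
    """Bytes of PCM in one chunk; LINEAR16 uses 2 bytes per sample."""
    return sample_rate * channels * sample_width * chunk_ms // 1000


def audio_messages(pcm, size):
    """Yield one JSON audioChunk message per slice of raw PCM bytes."""
    for i in range(0, len(pcm), size):
        content = base64.b64encode(pcm[i:i + size]).decode()
        yield json.dumps({"audioChunk": {"content": content}})


def parse_result(raw):
    """Return (transcript, is_final), or None if there is no transcription."""
    t = json.loads(raw).get("result", {}).get("transcription", {})
    if not t:
        return None
    return t.get("transcript", ""), bool(t.get("isFinal"))


if __name__ == "__main__":
    # 16 kHz * 1 channel * 2 bytes * 0.1 s = 3200 bytes per 100 ms chunk
    print(chunk_size_bytes(16000, 1))
```

For 16 kHz mono LINEAR16 audio this gives 3200 bytes per 100 ms chunk, so one second of audio is sent as ten messages.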

FAQs

Inworld STT is multilingual, with language support depending on the underlying STT model chosen. Whisper Large v3 supports 100+ languages, while AssemblyAI's Multilingual Universal-Streaming model supports six languages: English, Spanish, French, German, Italian, and Portuguese.
Rates for all models are available here.
Inworld STT supports both realtime bidirectional streaming over WebSocket for live audio and synchronous transcription for complete audio files.
Inworld STT integrates seamlessly into the Inworld Realtime API, allowing you to easily create and deploy end-to-end, realtime voice pipelines.
Yes. Inworld supports Zero Data Retention (ZDR) across various models and providers available through the STT API. With ZDR, audio and transcription data are processed in real time and never stored. Visit our Security page to learn more.

Start building

Join millions of developers building the next wave of AI applications.
Copyright © 2021-2026 Inworld AI
Speech-to-Text: Realtime Streaming STT API