Realtime STT

Speech-to-text that truly understands your users in realtime

Realtime streaming recognition with voice profiling — emotion, vocal style, accent, age, and pitch extracted from raw audio. Feed signals straight into your LLM and TTS for adaptive, expressive responses.
<100ms latency · 5 voice profile signals · 100+ languages

Choose your endpoint.

Stream audio in real time, transcribe complete files, or extract voice profile signals — all through one unified API.

  • Realtime bidirectional streaming over WebSocket
  • Synchronous transcription for complete audio files
  • Voice Profile signals on every streaming chunk
  • Multi-provider support via a single model ID
wscat -c 'wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional' \
  -H "Authorization: Basic $INWORLD_API_KEY"

# Send config as first message:
{
  "transcribeConfig": {
    "modelId": "inworld/inworld-stt-1",
    "audioEncoding": "LINEAR16",
    "sampleRateHertz": 16000,
    "language": "en-US",
    "voiceProfileConfig": { "enableVoiceProfile": true }
  }
}

Voice profiling hears who's speaking, not just their words.

Every audio chunk produces a realtime profile of the speaker: emotion, vocal style, accent, age, and pitch — extracted from raw audio with confidence scores. The signal that turns a transcript into context your LLM and TTS can act on.

  • 5 paralinguistic signals per audio chunk, with confidence scores
  • Configurable threshold to filter low-confidence results
  • Feeds into LLM context and Realtime TTS-2 steering downstream
  • Available on the inworld/inworld-stt-1 model
Voice profile signals (example output):

  Signal       Value       Confidence
  Emotion      Frustrated  84%
  Age          Adult       84%
  Accent       British     84%
  Pitch        High        84%
  Vocal Style  Shouting    84%

More signals coming soon.
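Enabling profiling is the voiceProfileConfig flag shown in the endpoint example above. The sketch below shows one way to read the per-chunk signals off the stream; the result.voiceProfile field name and per-signal keys are assumptions modeled on the table above, not a documented schema.

import json

CONFIDENCE_THRESHOLD = 0.5  # configurable filter for low-confidence results

def handle_voice_profile(raw: str) -> None:
    """Print high-confidence voice profile signals from a streaming message.

    NOTE: the "voiceProfile" field and per-signal keys ("name", "value",
    "confidence") are assumptions based on the signal table above; check
    the API reference for the actual response schema.
    """
    msg = json.loads(raw)
    for signal in msg.get("result", {}).get("voiceProfile", []):
        if signal.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD:
            print(f'{signal["name"]}: {signal["value"]} '
                  f'({signal["confidence"]:.0%})')
    # e.g. Emotion: Frustrated (84%), Vocal Style: Shouting (84%)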

Voice profile steers Realtime TTS-2 in realtime.

Voice profile signals flow into the LLM as context. The LLM emits Realtime TTS-2 steering tags and non-verbals inline, and Realtime TTS-2 renders an expressive response: natural pacing, soft delivery, and a real sigh, all driven by the user's voice profile.

  • Voice profile drops into LLM context as structured metadata
  • LLM emits inline steering tags like [Speak softly] and non-verbals like [sigh] [breathe]
  • Realtime TTS-2 renders the markup as natural, expressive audio
  • Wired end-to-end through the Realtime API
1. User audio → STT voice profile: emotion: sad · style: soft · pitch: low
2. LLM response: [Speak softly] I'm so sorry to hear that. [sigh] Let's figure this out together.
3. Realtime TTS-2 expressive output: voice: Sarah · model: inworld-tts-2
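As a sketch of how step 1 feeds step 2, the profile can be rendered into the LLM's context as structured metadata. The build_llm_context helper and prompt wording below are illustrative assumptions, not a fixed Inworld API:

def build_llm_context(profile: dict, threshold: float = 0.5) -> str:
    """Render high-confidence voice profile signals as LLM context."""
    kept = {
        name: signal["value"]
        for name, signal in profile.items()
        if signal["confidence"] >= threshold  # drop low-confidence signals
    }
    if not kept:
        return ""
    summary = " · ".join(f"{k}: {v}" for k, v in kept.items())
    return (
        f"[Voice profile] {summary}. Adapt your tone accordingly; you may "
        "emit inline steering tags such as [Speak softly] and non-verbals "
        "such as [sigh]."
    )

profile = {
    "emotion": {"value": "sad", "confidence": 0.84},
    "style": {"value": "soft", "confidence": 0.84},
    "pitch": {"value": "low", "confidence": 0.42},  # filtered out
}
print(build_llm_context(profile))
# [Voice profile] emotion: sad · style: soft. Adapt your tone accordingly; ...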

Realtime speech recognition, built for production.

Low-latency streaming over WebSocket with semantic VAD, word-level timestamps, speaker diarization (coming soon), and custom vocabulary. A single unified API across industry-leading transcription providers.

  • Bidirectional WebSocket streaming for live audio
  • Semantic & acoustic VAD detects intent, not just silence
  • Word-level timestamps and speaker diarization (coming soon)
  • Custom vocabulary to boost domain-specific terms
  • Unified API across 6+ models from multiple providers
One API, 6+ models:

  Provider    Model
  Inworld     inworld-stt-1
  Groq        whisper-large-v3
  AssemblyAI  universal-streaming-multilingual
  AssemblyAI  universal-streaming-english
  AssemblyAI  u3-rt-pro
  AssemblyAI  whisper-rt
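Because every provider sits behind the same API, switching models is a one-line change to the first config message. A minimal sketch; the groq/whisper-large-v3 ID below is inferred from the provider/model pattern of inworld/inworld-stt-1, so confirm exact IDs in the model list:

import json

def transcribe_config(model_id: str, language: str = "en-US") -> str:
    """Build the first WebSocket message selecting model and audio format."""
    return json.dumps({
        "transcribeConfig": {
            "modelId": model_id,  # e.g. "inworld/inworld-stt-1"
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "language": language,
        }
    })

config_inworld = transcribe_config("inworld/inworld-stt-1")
config_whisper = transcribe_config("groq/whisper-large-v3")  # ID format assumed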

Fully multilingual. One API, any language.

One STT API gives you access to any language with benchmark-leading quality, whether you specialize in one predominant language or need more than 100 languages at your fingertips.

  • Choose from models supporting more than 100 languages
  • Realtime streaming in English, Spanish, French, German, Italian, and Portuguese
  • Voice profiling available on the inworld/inworld-stt-1 model
  • Switch providers and languages with a single parameter
100+ languages: 🇬🇧 English · 🇪🇸 Español · 🇨🇳 中文 · 🇮🇳 हिन्दी · 🇯🇵 日本語 · 🇰🇷 한국어 · 🇫🇷 Français · 🇩🇪 Deutsch, and many more.
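Languages ride on that same single parameter. A small sketch reusing the transcribe_config helper above; the non-English codes are assumed BCP-47 examples patterned on the en-US shown earlier:

# Same connection flow; only the language (and optionally the model) changes.
for model_id, language in [
    ("inworld/inworld-stt-1", "en-US"),
    ("groq/whisper-large-v3", "es-ES"),  # Spanish (code assumed)
    ("groq/whisper-large-v3", "ja-JP"),  # Japanese (code assumed)
]:
    first_message = transcribe_config(model_id, language=language)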

Designed for realtime interactive audio.

Every feature is designed for low-latency, high-accuracy speech recognition in production voice applications.

  • Realtime bidirectional streaming over WebSocket
  • Semantic & acoustic VAD for natural turn-taking
  • Unified multi-provider API with consistent auth and formatting
  • High accuracy with custom vocabulary boosting
  • Word-level timestamps and speaker diarization (coming soon)
  • Voice & context profiling for user-aware responses
~100ms streaming latency
[Latency comparison chart: Realtime STT vs. OpenAI Whisper vs. Google Cloud]
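To sanity-check the latency figure on your own network, you can time the gap between endTurn and the final transcript using the same message protocol as the Get started example below. A rough probe, not an official benchmark (it includes your network round trip):

import asyncio
import json
import time

async def measure_final_latency(ws) -> float:
    """On an open stream, time from endTurn to the final transcript (ms)."""
    t0 = time.perf_counter()
    await ws.send(json.dumps({"endTurn": {}}))
    while True:
        msg = json.loads(await asyncio.wait_for(ws.recv(), timeout=10))
        if msg.get("result", {}).get("transcription", {}).get("isFinal"):
            return (time.perf_counter() - t0) * 1000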

Use cases

Realtime STT powers any application where understanding speech in realtime is critical.

Get started

Integrate Realtime STT with a few lines of code. Choose between realtime streaming over WebSocket or batch transcription.
import asyncio
import base64
import json
import wave

import websockets

API_KEY = "<YOUR_API_KEY>"
WS_URL = "wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional"

async def stream_transcribe():
    headers = {"Authorization": f"Basic {API_KEY}"}
    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        # Read WAV and extract raw PCM
        with wave.open("audio.wav", "rb") as wf:
            sample_rate = wf.getframerate()
            channels = wf.getnchannels()
            pcm = wf.readframes(wf.getnframes())

        # 1. Send transcription config (match the file's actual format)
        await ws.send(json.dumps({
            "transcribeConfig": {
                "modelId": "inworld/inworld-stt-1",
                "audioEncoding": "LINEAR16",
                "sampleRateHertz": sample_rate,
                "numberOfChannels": channels,
                "language": "en-US",
            }
        }))

        # 2. Stream audio in 100 ms chunks (base64-encoded)
        chunk_bytes = int(sample_rate * 2 * channels * 0.1)
        for i in range(0, len(pcm), chunk_bytes):
            chunk = pcm[i : i + chunk_bytes]
            await ws.send(json.dumps({
                "audioChunk": {"content": base64.b64encode(chunk).decode()}
            }))
            await asyncio.sleep(0.1)

        # 3. Signal end of turn
        await ws.send(json.dumps({"endTurn": {}}))

        # 4. Receive results until final
        while True:
            try:
                raw = await asyncio.wait_for(ws.recv(), timeout=10)
            except asyncio.TimeoutError:
                break
            msg = json.loads(raw)
            t = msg.get("result", {}).get("transcription", {})
            if t:
                tag = "[FINAL]" if t.get("isFinal") else "[partial]"
                print(f"{tag} {t.get('transcript', '')}")
                if t.get("isFinal"):
                    break

        # 5. Close the stream
        await ws.send(json.dumps({"closeStream": {}}))

asyncio.run(stream_transcribe())
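For complete files, the synchronous path skips the WebSocket entirely. A hedged sketch only: the transcribe:sync URL and the audio content field below are assumptions patterned on the streaming API above, so check the API reference for the real request shape:

import base64

import requests

API_KEY = "<YOUR_API_KEY>"
# Hypothetical path patterned on the streaming endpoint; verify in the docs.
SYNC_URL = "https://api.inworld.ai/stt/v1/transcribe:sync"

with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    SYNC_URL,
    headers={"Authorization": f"Basic {API_KEY}"},
    json={
        "transcribeConfig": {  # assumed to match the streaming config
            "modelId": "inworld/inworld-stt-1",
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "language": "en-US",
        },
        "audioChunk": {"content": audio_b64},  # field name assumed
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())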

FAQs

What is voice profiling?
Voice profiling extracts five paralinguistic signals from raw audio on every streaming chunk: emotion (happy, angry, sad, frustrated, calm, surprised, fearful, tender), vocal style (shouting, whispering, laughing, crying, singing, monotone, mumbling), accent, age (kid, young, adult, old), and pitch (high, medium, low). Each signal includes a confidence score, and a configurable threshold filters low-confidence results. Available on the inworld/inworld-stt-1 model.

How do voice profile signals drive expressive TTS?
Voice profile signals can be passed straight into the Realtime API as LLM context — for example, the user sounds sad and is speaking softly. The LLM can then emit Realtime TTS-2 steering instructions and non-verbals inline ([Speak softly] I'm so sorry [sigh]), which Realtime TTS-2 renders as expressive, context-aware audio. The full STT → LLM → TTS pipeline runs end-to-end in the Realtime API with no extra plumbing.

Which languages does Realtime STT support?
Realtime STT is multilingual, with language support depending on the underlying STT model chosen. Whisper Large v3 supports over 100 languages, while AssemblyAI's Multilingual Universal-Streaming model supports six languages: English, Spanish, French, German, Italian, and Portuguese.

How much does Realtime STT cost?
Rates for all models are listed on the pricing page.

Does Realtime STT support both streaming and batch transcription?
Realtime STT supports both real-time bidirectional streaming over WebSocket for live audio and synchronous transcription for complete audio files.

Does Realtime STT work with the Realtime API?
Realtime STT integrates seamlessly into the Realtime API, allowing you to easily create and deploy end-to-end, realtime voice pipelines.

Does Realtime STT support Zero Data Retention?
Yes. Inworld supports Zero Data Retention (ZDR) across various models and providers available through the STT API. With ZDR, audio and transcription data are processed in real time and never stored. Visit our Security page to learn more.

Start building

Join millions of developers building the next wave of AI applications.