Speech-to-text that truly understands your users in realtime.

Realtime streaming recognition with diarization, custom vocabularies, and voice profiling. Built for interactive audio applications.
Live transcription

Learner (lang: fr-FR, gender: female, style: hesitant, age: young)

Je voudrais réserver une table pour deux personnes, s'il vous plaît.
("I would like to book a table for two, please.")

Tutor (lang: en-US, gender: male, style: encouraging, age: old)

Excellent pronunciation! Try softening the 'r' in réserver. Let's hear it one more time.

Built for realtime interactive audio

Every feature is designed for low-latency, high-accuracy speech recognition in production voice applications.

Realtime streaming

Realtime, bidirectional streaming over WebSocket for live audio, or synchronous transcription for complete audio files.

Semantic & acoustic VAD

Automatically detect when speech starts and stops, so turn-taking feels natural and end-of-utterance latency stays low.
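As a rough illustration of the acoustic side of voice activity detection (a local sketch of the concept, not Inworld's server-side implementation), an energy-based endpointer labels fixed-size PCM frames by RMS energy:

```python
import math
import struct

def rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def detect_speech(pcm: bytes, sample_rate: int = 16000,
                  frame_ms: int = 30, threshold: float = 500.0) -> list[bool]:
    """Label each frame as speech (True) or silence (False) by energy."""
    frame_bytes = sample_rate * 2 * frame_ms // 1000  # 2 bytes per sample
    return [rms(pcm[i:i + frame_bytes]) > threshold
            for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]
```

Semantic VAD goes further, using the transcript itself to decide whether a pause is a hesitation or the end of a turn; the threshold and frame size here are illustrative defaults.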

Voice & context profiling

Infer attributes of your users from their audio, such as language, gender, speaking style, and approximate age, to contextualize responses.

Unified multi-provider API

A single integration point for industry-leading, high-accuracy transcription providers, with consistent authentication, request formatting, and response handling.

High accuracy & custom vocabulary

Transcribe audio with industry-leading accuracy. Add domain-specific terms, product names, and specialized vocabulary to boost recognition further.
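A custom vocabulary is typically supplied alongside the transcription config. A minimal sketch, assuming a `customVocabulary` field (this field name is an assumption, not confirmed by this page; check the API reference for the exact shape):

```python
import json

# Hypothetical config: the "customVocabulary" field name is an assumption.
transcribe_config = {
    "transcribeConfig": {
        "modelId": "assemblyai/universal-streaming-multilingual",
        "audioEncoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "numberOfChannels": 1,
        "language": "en-US",
        # Domain-specific terms the recognizer should favor
        "customVocabulary": ["Inworld", "diarization", "réserver"],
    }
}

payload = json.dumps(transcribe_config)
```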

Word-level timestamps & diarization

Per-word timing for subtitles and search. Label speakers in multi-party conversations.
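Per-word timings make subtitle generation straightforward client-side. A minimal sketch that groups hypothetical word objects (`word`, `start`, `end` in seconds, optional `speaker`; the real response shape may differ) into numbered SRT cues:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group word-level timings into SRT cues, tagged by speaker label."""
    cues = []
    for n, i in enumerate(range(0, len(words), max_words), start=1):
        group = words[i:i + max_words]
        text = " ".join(w["word"] for w in group)
        speaker = group[0].get("speaker")
        if speaker is not None:
            text = f"[{speaker}] {text}"
        cues.append(f"{n}\n{srt_timestamp(group[0]['start'])} --> "
                    f"{srt_timestamp(group[-1]['end'])}\n{text}")
    return "\n\n".join(cues)
```

Diarization labels slot naturally into the speaker tag, so multi-party captions stay attributable.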

Get started

Integrate Inworld STT with a few lines of code. Choose between realtime streaming over WebSocket or batch transcription.
import asyncio
import base64
import json
import wave

import websockets

API_KEY = "<YOUR_API_KEY>"
WS_URL = "wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional"

async def stream_transcribe():
    headers = {"Authorization": f"Basic {API_KEY}"}
    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        # Read WAV and extract raw PCM
        with wave.open("audio.wav", "rb") as wf:
            sample_rate = wf.getframerate()
            channels = wf.getnchannels()
            pcm = wf.readframes(wf.getnframes())

        # 1. Send transcription config (use the format read from the WAV,
        #    rather than hardcoding a rate that may not match the file)
        await ws.send(json.dumps({
            "transcribeConfig": {
                "modelId": "assemblyai/universal-streaming-multilingual",
                "audioEncoding": "LINEAR16",
                "sampleRateHertz": sample_rate,
                "numberOfChannels": channels,
                "language": "en-US"
            }
        }))

        # 2. Stream audio in 100 ms chunks (base64-encoded)
        chunk_bytes = int(sample_rate * 2 * channels * 0.1)  # 2 bytes per 16-bit sample
        for i in range(0, len(pcm), chunk_bytes):
            chunk = pcm[i : i + chunk_bytes]
            await ws.send(json.dumps({
                "audioChunk": {"content": base64.b64encode(chunk).decode()}
            }))
            await asyncio.sleep(0.1)

        # 3. Signal end of turn
        await ws.send(json.dumps({"endTurn": {}}))

        # 4. Receive results until the final transcript arrives
        while True:
            try:
                raw = await asyncio.wait_for(ws.recv(), timeout=10)
            except asyncio.TimeoutError:
                break
            msg = json.loads(raw)
            t = msg.get("result", {}).get("transcription", {})
            if t:
                tag = "[FINAL]" if t.get("isFinal") else "[partial]"
                print(f"{tag} {t.get('transcript', '')}")
                if t.get("isFinal"):
                    break

        # 5. Close the stream
        await ws.send(json.dumps({"closeStream": {}}))

asyncio.run(stream_transcribe())
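The 100 ms chunk size above follows directly from the audio format: bytes per chunk = sample rate × bytes per sample × channels × chunk duration. For 16 kHz, 16-bit (2-byte) mono audio:

```python
sample_rate = 16000   # samples per second
bytes_per_sample = 2  # 16-bit LINEAR16 PCM
channels = 1
chunk_seconds = 0.1   # 100 ms

chunk_bytes = int(sample_rate * bytes_per_sample * channels * chunk_seconds)
print(chunk_bytes)  # 3200 bytes per 100 ms chunk
```

Pacing sends with `asyncio.sleep(0.1)` keeps the stream close to realtime, which matches how live microphone audio would arrive.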

FAQ

Which languages are supported?
Inworld STT is multilingual, with language support depending on the underlying STT model chosen. Whisper Large v3 supports 99 languages, while AssemblyAI's Multilingual Universal-Streaming model supports six languages: English, Spanish, French, German, Italian, and Portuguese.

How is it priced?
While Inworld STT is in Research Preview, you pay provider rates directly, with no markup or margin added. Rates for all models are listed in the pricing documentation.

Does it support both streaming and batch transcription?
Inworld STT supports both synchronous transcription for complete audio files and realtime bidirectional streaming over WebSocket for live audio.

Does it work with the rest of Inworld?
Inworld STT integrates seamlessly into the Inworld Realtime API, allowing you to easily create and deploy end-to-end, realtime voice pipelines.

Start building

Join millions of developers building the next wave of AI applications.
Copyright © 2021-2026 Inworld AI