Audio-to-Text

Audio-to-text that understands your users

Your agent hears the tone, not just the words. Every transcript comes back with a voice profile (emotion, age, accent, pitch, vocal style) so your app knows who's talking. Multi-provider, multilingual, streaming out of the box.
Transcription + profile
Input
modelId: inworld/inworld-stt-1
language: en-US

Streaming audio · 16kHz · 98KB/s

Output
profile.emotion: stressed (0.87)
profile.accent: American (0.84)
profile.vocalStyle: tentative (0.78)

I've been putting this off for months. I don't even know where to start.

More than words. Context your pipeline can use.

Your agent hears the tone, transcribes every language your users actually speak, and streams fast enough to feel live.
Voice profile, not just words

Know who's talking, not just what they said.

Every transcript comes back with a voice profile: emotion, age, accent, pitch, vocal style, each with a confidence score. Signal commodity transcription throws away.
Voice profile, alongside the transcript (inworld-stt-1)

Trait        Value         Confidence
Emotion      stressed      87
Age          30s           92
Accent       British (RP)  88
Vocal style  measured      81
Pitch        medium-low    90

Voice profile with confidence scores. Not just a transcript.
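
Because every trait carries a confidence score, a consumer can decide which signals to trust before acting on them. The sketch below is one way to do that, not Inworld's documented schema: it assumes each trait arrives as a `{ value, confidence }` object with confidence on a 0 to 1 scale (as in "stressed (0.87)"), and the `confidentTraits` helper and its 0.8 cutoff are hypothetical.

```javascript
// Assumed shape (not confirmed by this page): each trait is
// { value: string, confidence: number } with confidence between 0 and 1.
function confidentTraits(profile, minConfidence = 0.8) {
  const kept = {};
  for (const [trait, reading] of Object.entries(profile ?? {})) {
    const usable =
      reading &&
      typeof reading.confidence === 'number' &&
      reading.confidence >= minConfidence;
    if (usable) {
      kept[trait] = reading.value;
    }
  }
  return kept;
}

// Sample values taken from the demo card on this page.
const sampleProfile = {
  emotion: { value: 'stressed', confidence: 0.87 },
  accent: { value: 'American', confidence: 0.84 },
  vocalStyle: { value: 'tentative', confidence: 0.78 },
};

console.log(confidentTraits(sampleProfile));
// -> { emotion: 'stressed', accent: 'American' }
```

If the API reports confidence as a percentage instead, scale the threshold to match.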
Multi-provider, one API

Pick the engine that fits the call.

Inworld STT-1 for voice profiling. Whisper Large v3 for 99+ languages. AssemblyAI for specialized accuracy. Switch engines with one modelId field. Your code stays the same.
Four engines, one modelId field

Engine                Strength        Langs
Inworld STT-1         Voice profile   EN
Whisper Large v3      Language reach  99+
AssemblyAI Universal  Streaming       6
AssemblyAI U3-RT-Pro  Accuracy        EN
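
Switching engines should then come down to changing one string in the request body. The sketch below illustrates the idea under stated assumptions: only `inworld/inworld-stt-1` appears on this page, so the other model IDs are deliberate placeholders to be replaced from the API reference, and the `buildTranscribeBody` helper and its `need` keys are hypothetical.

```javascript
// Only 'inworld/inworld-stt-1' is taken from this page. The other IDs are
// placeholders, not real model IDs; look them up in the API reference.
const MODEL_IDS = {
  voiceProfile: 'inworld/inworld-stt-1',
  languageReach: 'REPLACE_WITH_WHISPER_MODEL_ID',
  streaming: 'REPLACE_WITH_ASSEMBLYAI_UNIVERSAL_MODEL_ID',
  accuracy: 'REPLACE_WITH_ASSEMBLYAI_U3_RT_PRO_MODEL_ID',
};

// Builds the same request body for every engine; only modelId varies.
function buildTranscribeBody(need, audioBase64, language = 'en-US') {
  const modelId = MODEL_IDS[need];
  if (!modelId) {
    throw new Error(`Unknown engine need: ${need}`);
  }
  return {
    transcribeConfig: { modelId, audioEncoding: 'AUTO_DETECT', language },
    audioData: { content: audioBase64 },
  };
}

const body = buildTranscribeBody('voiceProfile', 'AAAA');
console.log(body.transcribeConfig.modelId); // -> inworld/inworld-stt-1
```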
99+ languages

Transcribe every language your users actually speak.

Whisper Large v3 on Inworld infrastructure gives you 99+ languages out of the box. AssemblyAI adds six high-accuracy pipelines when precision matters more than breadth.
Whisper Large v3 · Inworld infrastructure
99+
Languages transcribed
AssemblyAI covers 6 extra-high-accuracy languages on top
Realtime streaming
Works with
Realtime API

Partial hypotheses as they land, finals as they settle.

WebSocket streaming returns partial transcripts in under a second, then finals as the speaker lands on phrases. Enough to drive a live UI or an interruption-aware agent.
Realtime streaming (WebSocket)
partial · 0.4s: I was hoping to move my reservat-
partial · 0.9s: I was hoping to move my reservation to
final · 1.6s: I was hoping to move my reservation to eight o'clock.
Partial hypotheses, then final transcripts. Enough to drive a live UI.
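
On the client, partials and finals usually need different handling: a partial overwrites the in-progress line, while a final commits it. The sketch below shows that pattern as a pure reducer. The event shape `{ type: 'partial' | 'final', text }` is an assumption made for illustration, not the documented wire format, and both helpers are hypothetical.

```javascript
// Assumed event shape (hypothetical): { type: 'partial' | 'final', text: string }.
function applyTranscriptEvent(state, event) {
  if (event.type === 'final') {
    // A final replaces whatever partial was pending and becomes permanent.
    return { committed: [...state.committed, event.text], pending: '' };
  }
  if (event.type === 'partial') {
    // Each partial supersedes the previous one for the current phrase.
    return { committed: state.committed, pending: event.text };
  }
  return state; // Ignore event types this sketch does not know about.
}

// Joins committed phrases with the in-progress partial for display.
function renderCaption(state) {
  return [...state.committed, state.pending].filter(Boolean).join(' ');
}

const events = [
  { type: 'partial', text: 'I was hoping to move my reservation to' },
  { type: 'final', text: "I was hoping to move my reservation to eight o'clock." },
];

let captionState = { committed: [], pending: '' };
for (const event of events) {
  captionState = applyTranscriptEvent(captionState, event);
  console.log(renderCaption(captionState));
}
```

Keeping the reducer free of socket code makes it easy to test and to reuse for live captions or interruption detection.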
Feeds the rest of your stack

The only STT that makes the Router smarter and the TTS more appropriate.

Voice profile flows into Router context so the LLM reasons with how the user sounds. Router emits TTS steering that adapts the response tone. Context no other STT delivers.
See the Realtime API
Profile flows into the pipeline
STT: Profile + transcript
Router: Context-aware routing
TTS: Tone-matched response
Standalone STT gives you words. Inworld STT gives your stack context.
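
Inside the Realtime API this hand-off is described as automatic. For teams calling STT on its own and feeding a separate LLM, the sketch below illustrates the same idea by folding the profile into the prompt context. It is an illustration only, not how Router works internally: the `{ value, confidence }` trait shape, the `profileToContext` helper, and the 0.8 cutoff are all assumptions.

```javascript
// Hypothetical helper: turns a transcript plus voice profile into a context
// string for an LLM. Assumes each trait is { value, confidence } on a 0-1 scale.
function profileToContext(transcript, profile, minConfidence = 0.8) {
  const signals = Object.entries(profile ?? {})
    .filter(([, reading]) => reading && reading.confidence >= minConfidence)
    .map(([trait, reading]) => `${trait}: ${reading.value}`);

  const header = signals.length
    ? `Speaker signals (${signals.join(', ')}).`
    : 'No confident speaker signals.';

  return `${header}\nUser said: "${transcript}"`;
}

const context = profileToContext(
  "I've been putting this off for months. I don't even know where to start.",
  {
    emotion: { value: 'stressed', confidence: 0.87 },
    vocalStyle: { value: 'tentative', confidence: 0.78 },
  },
);
console.log(context);
```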
Production-ready

Already transcribing where it matters.

Voice agents, call-center pipelines, and Talkpal-style learning apps run on the same multi-provider STT. It is SOC 2 Type II and GDPR compliant, with on-premise deployment available for regulated environments.
Where it runs
Voice agents: Realtime API
Call-center transcription: SIP + diarization
Language learning: Talkpal
Meeting intelligence: Async batch

Transcribe in three lines, profile included

Base64 the audio, call the endpoint, unpack transcript + voice profile. Streaming mode returns partials in under a second.
import fs from 'fs';

const audio = fs.readFileSync('call.wav').toString('base64');

const resp = await fetch('https://api.inworld.ai/stt/v1/transcribe', {
  method: 'POST',
  headers: {
    Authorization: `Basic ${process.env.INWORLD_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    transcribeConfig: {
      modelId: 'inworld/inworld-stt-1',
      audioEncoding: 'AUTO_DETECT',
      language: 'en-US',
    },
    audioData: { content: audio },
  }),
});

const { transcription, voiceProfile } = await resp.json();
// transcription.transcript -> string
// voiceProfile -> { emotion, age, accent, pitch, vocalStyle } with confidence scores
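
In production the same call usually needs a status check and safe defaults. The sketch below wraps the request from the snippet above in a function. The endpoint, headers, and body fields come from that snippet; the `transcribeBase64` name, the injectable `fetchImpl` parameter (included so the function can be tested without a network), and the fallback values are assumptions.

```javascript
// Hypothetical wrapper around the sync endpoint shown on this page.
// fetchImpl is injectable for testing; it defaults to the global fetch (Node 18+).
async function transcribeBase64(audioBase64, options = {}) {
  const {
    apiKey,
    modelId = 'inworld/inworld-stt-1',
    language = 'en-US',
    fetchImpl = fetch,
  } = options;

  const resp = await fetchImpl('https://api.inworld.ai/stt/v1/transcribe', {
    method: 'POST',
    headers: {
      Authorization: `Basic ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      transcribeConfig: { modelId, audioEncoding: 'AUTO_DETECT', language },
      audioData: { content: audioBase64 },
    }),
  });

  // Surface HTTP failures instead of trying to destructure an error payload.
  if (!resp.ok) {
    throw new Error(`Transcription failed with HTTP ${resp.status}`);
  }

  const { transcription, voiceProfile } = await resp.json();
  return {
    text: transcription?.transcript ?? '',
    // Models other than Inworld STT-1 return no profile, so default to null.
    voiceProfile: voiceProfile ?? null,
  };
}
```

Call it with `transcribeBase64(audio, { apiKey: process.env.INWORLD_API_KEY })` and check `voiceProfile` for `null` before reading traits.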

Prefer clicking? Transcribe in the playground.

Open the STT Playground, pick an engine, upload a clip or record from the browser, and see the transcript plus the voice profile land in real time. Copy the config when the output looks right.
Open the playground

FAQ

The voice profile captures emotion, age, accent, pitch, and vocal style on every utterance, each returned with a confidence score alongside the transcript. Only Inworld STT-1 returns voice profiles; other models (Whisper, AssemblyAI) return transcripts only.
Whisper Large v3 on Inworld infrastructure covers 99+ languages. AssemblyAI's streaming models cover six high-accuracy languages (English, Spanish, French, German, Italian, Portuguese). Inworld STT-1 is English-only at launch.
Four engines through one API: Inworld STT-1 (voice profiling), Groq Whisper Large v3 (99+ languages), AssemblyAI Universal-Streaming (multilingual), and AssemblyAI U3-RT-Pro (highest accuracy). Switch by changing the modelId field; your code stays the same.
Yes. The WebSocket streaming endpoint (wss://.../stt/v1/transcribe:streamBidirectional) returns partial hypotheses in under a second and final transcripts as the speaker lands on phrases. Ideal for live captions, interruption-aware agents, and conversational UIs.
Yes. In the Realtime API, the voice profile from STT flows into Router context automatically. The LLM reasons with the user's emotional state, and Router can emit TTS steering instructions that adapt the response tone. This cross-layer context carry is unique to the Inworld pipeline.
Sync accepts LINEAR16 (PCM), MP3, OGG_OPUS, FLAC, and AUTO_DETECT. Streaming WebSocket accepts LINEAR16 (PCM) and AUTO_DETECT. Recommended specs: 16,000 Hz, 16-bit, mono.
Inworld STT is GDPR and SOC 2 Type II compliant. On-premise deployment is available for regulated environments; contact sales.
Inworld STT is in research preview. The APIs are stable enough to build on; features and pricing may adjust as the product matures.
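
For the recommended streaming format above (LINEAR16 at 16,000 Hz, 16-bit, mono), the data rate works out to 16,000 samples × 2 bytes × 1 channel = 32,000 bytes per second. The sketch below uses that arithmetic to slice raw PCM into fixed-duration frames before sending them over a socket. The 100 ms frame length is an arbitrary assumption, not a documented requirement, and the helper names are hypothetical.

```javascript
// Recommended specs from the FAQ: 16,000 Hz, 16-bit, mono.
const SAMPLE_RATE_HZ = 16000;
const BYTES_PER_SAMPLE = 2;
const CHANNELS = 1;

// Bytes of LINEAR16 audio needed to cover the given duration.
function bytesForDuration(ms) {
  return (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * ms) / 1000;
}

// Splits a Uint8Array (or Node Buffer) of raw PCM into frames of chunkMs each.
// The last frame may be shorter. subarray creates views, so no bytes are copied.
function chunkPcm(pcm, chunkMs = 100) {
  const size = bytesForDuration(chunkMs);
  const chunks = [];
  for (let offset = 0; offset < pcm.length; offset += size) {
    chunks.push(pcm.subarray(offset, offset + size));
  }
  return chunks;
}

const quarterSecond = new Uint8Array(bytesForDuration(250)); // 8,000 bytes
const frames = chunkPcm(quarterSecond);
console.log(frames.map((frame) => frame.length)); // -> [ 3200, 3200, 1600 ]
```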

Transcribe with context, not just words.

Voice profile included. Multi-provider. 99+ languages. Realtime streaming. Built to feed the pipeline.
Copyright © 2021-2026 Inworld AI
Audio-to-Text API: Transcription That Understands Your Users | Inworld AI