Audio-to-Text

Audio-to-text that understands your users

Your agent hears the tone, not just the words. Every transcript comes back with a voice profile (emotion, age, accent, pitch, vocal style) so your app knows who's talking. Multi-provider, multilingual, streaming out of the box.

Transcribe audio · Read the docs

Transcription + profile

Input

modelId: inworld/inworld-stt-1 · language: en-US

Streaming audio · 16kHz · 98KB/s

Output

profile.emotion: stressed (0.87)
profile.accent: American (0.84)
profile.vocalStyle: tentative (0.78)

I've been putting this off for months. I don't even know where to start.

Works with

STT · Realtime API · Router

More than words. Context your pipeline can use.

Your agent hears the tone, transcribes every language your users actually speak, and streams fast enough to feel live.

Voice profile, not just words

Know who's talking, not just what they said.

Every transcript comes back with a voice profile: emotion, age, accent, pitch, vocal style, each with a confidence score. Signal commodity transcription throws away.

Voice profile, alongside the transcript

inworld-stt-1

Emotion: stressed
Age: 30s
Accent: British (RP)
Vocal style: measured
Pitch: medium-low

Voice profile with confidence scores. Not just a transcript.
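
Confidence scores are only useful if your app acts on them selectively. The sketch below keeps the profile fields that clear a threshold and drops the rest. It assumes each field arrives as an object with `label` and `confidence` keys; the page does not spell out the response shape, so treat those names, the `confidentSignals` helper, and the 0.8 cutoff as illustrative.

```javascript
// Assumed shape (not confirmed by this page): each profile field is
// { label: string, confidence: number between 0 and 1 }.
function confidentSignals(voiceProfile, minConfidence = 0.8) {
  const signals = {};
  for (const [field, value] of Object.entries(voiceProfile ?? {})) {
    if (value && typeof value.confidence === 'number' && value.confidence >= minConfidence) {
      signals[field] = value.label;
    }
  }
  return signals;
}

// Values taken from the sample output shown earlier on this page.
const profile = {
  emotion: { label: 'stressed', confidence: 0.87 },
  accent: { label: 'American', confidence: 0.84 },
  vocalStyle: { label: 'tentative', confidence: 0.78 },
};

console.log(confidentSignals(profile));
// { emotion: 'stressed', accent: 'American' }
```

With the default cutoff, the 0.78 vocal-style reading is dropped, so downstream logic only reacts to signals the model is reasonably sure about.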

Multi-provider, one API

Pick the engine that fits the call.

Realtime STT-1 for voice profiling. Whisper Large v3 for 99+ languages. AssemblyAI for specialized accuracy. Switch engines with one modelId field. Your code stays the same.

Four engines, one modelId field

Engine · Strength · Langs
Realtime STT-1 · Voice profile
Whisper Large v3 · Language reach · 99+
AssemblyAI Universal · Streaming
AssemblyAI U3-RT-Pro · Accuracy
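
One way to keep engine choice to a single field is to build the request body in one place and pass the model ID in. This is a minimal sketch that reuses the field names from the quick-start sample on this page; `buildTranscribeBody` is a hypothetical helper, and the second model ID is a placeholder, since `inworld/inworld-stt-1` is the only ID this page shows.

```javascript
// Field names follow the request shown in this page's quick-start sample.
function buildTranscribeBody(modelId, base64Audio, language = 'en-US') {
  return {
    transcribeConfig: { modelId, audioEncoding: 'AUTO_DETECT', language },
    audioData: { content: base64Audio },
  };
}

// 'example/other-engine-id' is a hypothetical placeholder, not a confirmed
// model identifier. Look up the real IDs in the docs before using one.
const profiled = buildTranscribeBody('inworld/inworld-stt-1', 'AAAA');
const otherEngine = buildTranscribeBody('example/other-engine-id', 'AAAA', 'fr-FR');

console.log(profiled.transcribeConfig.modelId);
console.log(otherEngine.transcribeConfig.modelId);
```

Everything except `modelId` and `language` is identical between the two bodies, which is the point of the single-field switch.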

99+ languages

Transcribe every language your users actually speak.

Whisper Large v3 on Inworld infrastructure gives you 99+ languages out of the box. AssemblyAI adds six high-accuracy pipelines when precision matters more than breadth.

Whisper Large v3 · Inworld infrastructure

99+

Languages transcribed

AssemblyAI covers 6 extra-high-accuracy languages on top
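
If you route by language, the coverage described on this page can be expressed as a small lookup. The sketch below is one possible policy, not a recommendation from Inworld: it uses the six AssemblyAI languages and the English-only note for Realtime STT-1 from the FAQ, and it returns plain engine labels instead of model IDs, because the page does not list those. `pickEngine` and its options are hypothetical names.

```javascript
// English, Spanish, French, German, Italian, Portuguese (per the FAQ).
const ASSEMBLYAI_LANGS = new Set(['en', 'es', 'fr', 'de', 'it', 'pt']);

// Returns an engine label (not a model ID) for a BCP 47 tag such as 'en-US'.
function pickEngine(languageTag, { needsVoiceProfile = false } = {}) {
  const primary = languageTag.toLowerCase().split('-')[0];
  if (needsVoiceProfile) {
    // The FAQ states Realtime STT-1 is English-only at launch.
    if (primary !== 'en') {
      throw new Error('Voice profiles are English-only at launch');
    }
    return 'realtime-stt-1';
  }
  // Assumed policy: prefer AssemblyAI where it is strong, Whisper for reach.
  return ASSEMBLYAI_LANGS.has(primary) ? 'assemblyai' : 'whisper-large-v3';
}

console.log(pickEngine('en-US', { needsVoiceProfile: true }));
console.log(pickEngine('fr-FR'));
console.log(pickEngine('ja-JP'));
```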

Realtime streaming

Works with

Realtime API

Partial hypotheses as they land, finals as they settle.

WebSocket streaming returns partial transcripts in under a second, then finals as the speaker lands on phrases. Enough to drive a live UI or an interruption-aware agent.

Realtime streaming

WebSocket

partial · 0.4s

I was hoping to move my reservat—

partial · 0.9s

I was hoping to move my reservation to

final · 1.6s

I was hoping to move my reservation to eight o'clock.

Partial hypotheses, then final transcripts. Enough to drive a live UI.
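
On the client, the partial-then-final pattern usually reduces to two pieces of state: the phrases that have settled, and the latest hypothesis that may still change. Here is a minimal sketch of that reducer. The message shape (`type` and `text`) and the `LiveTranscript` class are assumptions for illustration; check the streaming docs for the actual event format.

```javascript
// Assumed message shape: { type: 'partial' | 'final', text: string }.
class LiveTranscript {
  constructor() {
    this.finals = [];  // settled phrases, never rewritten
    this.partial = ''; // latest unsettled hypothesis
  }

  apply(message) {
    if (message.type === 'final') {
      this.finals.push(message.text);
      this.partial = ''; // the final supersedes any pending partial
    } else if (message.type === 'partial') {
      this.partial = message.text; // each partial replaces the previous one
    }
    return this.render();
  }

  render() {
    return [...this.finals, this.partial].filter(Boolean).join(' ');
  }
}

const live = new LiveTranscript();
live.apply({ type: 'partial', text: 'I was hoping to move my reservat' });
live.apply({ type: 'partial', text: 'I was hoping to move my reservation to' });
console.log(
  live.apply({ type: 'final', text: "I was hoping to move my reservation to eight o'clock." })
);
```

Calling `render()` after every message gives you text that grows as the speaker talks and never shows a stale partial once the final arrives.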

Feeds the rest of your stack

Works with

Router · Realtime API · TTS

The only STT that makes the Router smarter and the TTS more appropriate.

Voice profile flows into Router context so the LLM reasons with how the user sounds. Router emits TTS steering that adapts the response tone. Context no other STT delivers.

See the Realtime API

Profile flows into the pipeline

STT · Profile + transcript
Router · Context-aware routing
TTS · Tone-matched response

Standalone STT gives you words. Realtime STT gives your stack context.
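
Inside the Realtime API this hand-off is described as automatic. If you call STT on its own and want a similar effect in your own LLM prompt, the idea can be approximated by folding the profile into the context you send. The sketch below is purely illustrative: `profileToContext` is a hypothetical helper, the `{ label, confidence }` field shape is assumed, and this is not how the Router builds its context.

```javascript
// Hypothetical helper; assumes each profile field is { label, confidence }.
function profileToContext(transcript, voiceProfile) {
  const traits = Object.entries(voiceProfile)
    .map(([field, { label, confidence }]) => `${field}=${label} (${confidence.toFixed(2)})`)
    .join(', ');
  return `User said: "${transcript}"\nHow they sounded: ${traits}`;
}

const context = profileToContext("I don't even know where to start.", {
  emotion: { label: 'stressed', confidence: 0.87 },
  vocalStyle: { label: 'tentative', confidence: 0.78 },
});

console.log(context);
```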

Production-ready

Already transcribing where it matters.

Voice agents, call-center pipelines, and Talkpal-style learning apps run on the same multi-provider STT. SOC 2 Type II and GDPR compliant, with on-premise deployment available for regulated environments.

Where it runs

Voice agents · Realtime API
Call-center transcription · SIP + diarization
Language learning · Talkpal
Meeting intelligence · Async batch

Transcribe in three lines, profile included

Base64 the audio, call the endpoint, unpack transcript + voice profile. Streaming mode returns partials in under a second.

import fs from 'node:fs';

// Read the clip and base64-encode it for the JSON request body.
const audio = fs.readFileSync('call.wav').toString('base64');

const resp = await fetch('https://api.inworld.ai/stt/v1/transcribe', {
  method: 'POST',
  headers: {
    Authorization: `Basic ${process.env.INWORLD_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    transcribeConfig: {
      modelId: 'inworld/inworld-stt-1', // change this field to switch engines
      audioEncoding: 'AUTO_DETECT',
      language: 'en-US',
    },
    audioData: { content: audio },
  }),
});

// Surface HTTP errors instead of destructuring an error payload.
if (!resp.ok) {
  throw new Error(`Transcription failed: ${resp.status} ${await resp.text()}`);
}

const { transcription, voiceProfile } = await resp.json();
// transcription.transcript -> string
// voiceProfile -> { emotion, age, accent, pitch, vocalStyle } with confidence scores

Prefer clicking? Transcribe in the playground.

Open the STT Playground, pick an engine, upload a clip or record from the browser, and see the transcript plus the voice profile land in real time. Copy the config when the output looks right.

Open the playground

FAQ

The voice profile captures emotion, age, accent, pitch, and vocal style on every utterance, each returned with a confidence score alongside the transcript. Only Realtime STT-1 returns voice profiles; other models (Whisper, AssemblyAI) return transcripts only.

Whisper Large v3 on Inworld infrastructure covers 99+ languages. AssemblyAI's streaming models cover six high-accuracy languages (English, Spanish, French, German, Italian, Portuguese). Realtime STT-1 is English-only at launch.

Four engines through one API: Realtime STT-1 (voice profiling), Groq Whisper Large v3 (99+ languages), AssemblyAI Universal-Streaming (multilingual), and AssemblyAI U3-RT-Pro (highest accuracy). Switch by changing the modelId field; your code stays the same.

Yes. The WebSocket streaming endpoint (wss://.../stt/v1/transcribe:streamBidirectional) returns partial hypotheses in under a second and final transcripts as the speaker lands on phrases. Ideal for live captions, interruption-aware agents, and conversational UIs.

Yes. In the Realtime API, the voice profile from STT flows into Router context automatically. The LLM reasons with the user's emotional state, and Router can emit TTS steering instructions that adapt the response tone. This cross-layer context carry is unique to the Inworld pipeline.

Sync accepts LINEAR16 (PCM), MP3, OGG_OPUS, FLAC, and AUTO_DETECT. Streaming WebSocket accepts LINEAR16 (PCM) and AUTO_DETECT. Recommended specs: 16,000 Hz, 16-bit, mono.
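
If you want to confirm a WAV file matches the recommended specs before uploading, you can read the format fields straight from its header. The sketch below assumes a canonical 44-byte PCM WAV header with the `fmt ` chunk first; files with extra chunks ahead of it need a real parser. `readWavFormat` and `matchesRecommendedSpecs` are hypothetical helper names.

```javascript
// Reads the format fields of a canonical 44-byte PCM WAV header.
function readWavFormat(buf) {
  const tag = (start, end) => buf.toString('ascii', start, end);
  if (buf.length < 44 || tag(0, 4) !== 'RIFF' || tag(8, 12) !== 'WAVE' || tag(12, 16) !== 'fmt ') {
    throw new Error('Not a canonical RIFF/WAVE header');
  }
  return {
    pcm: buf.readUInt16LE(20) === 1, // audio format 1 = linear PCM
    channels: buf.readUInt16LE(22),
    sampleRate: buf.readUInt32LE(24),
    bitsPerSample: buf.readUInt16LE(34),
  };
}

// Recommended specs from this page: 16,000 Hz, 16-bit, mono.
function matchesRecommendedSpecs(fmt) {
  return fmt.pcm && fmt.channels === 1 && fmt.sampleRate === 16000 && fmt.bitsPerSample === 16;
}

// Build a header in memory so the example runs without a file on disk.
const header = Buffer.alloc(44);
header.write('RIFF', 0, 'ascii');
header.write('WAVE', 8, 'ascii');
header.write('fmt ', 12, 'ascii');
header.writeUInt32LE(16, 16);    // fmt chunk size for PCM
header.writeUInt16LE(1, 20);     // audio format: PCM
header.writeUInt16LE(1, 22);     // channels: mono
header.writeUInt32LE(16000, 24); // sample rate
header.writeUInt32LE(32000, 28); // byte rate = 16000 * 1 channel * 2 bytes
header.writeUInt16LE(2, 32);     // block align
header.writeUInt16LE(16, 34);    // bits per sample
header.write('data', 36, 'ascii');

console.log(matchesRecommendedSpecs(readWavFormat(header))); // true
```

For a real file, pass `fs.readFileSync('call.wav')` in place of the in-memory header.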

GDPR and SOC 2 Type II compliant. On-premise deployment is available for regulated environments; contact sales.

Realtime STT is in research preview. The APIs are stable enough to build on; features and pricing may adjust as the product matures.