Emotional Understanding

Voice AI that hears how people feel

Your agent hears the tone, not just the words. It reads how someone sounds on every utterance, then lets that signal flow through the LLM and into the reply, so every response sounds like someone was actually listening.
Voice profile
Utterance
profile.emotion stressed (0.87)
profile.style tentative (0.78)

I've been putting this off for months.

Agent
tts.tone calm, slower
voice Luna

Let's take a breath and pick one thing together.

Voice AI that actually sounds like it's listening.

Tone, age, accent, and mood travel with every utterance, and the reply adapts to match.
Every utterance, understood

Your agent reads who's speaking and how they feel, automatically.

Every transcript from STT-1 ships with a voice profile plus confidence scores. Threshold it, decay it, reason with it. Not a label. A signal.
Every utterance, understood with confidence

Emotion: stressed (0.87)
Age: 30s (0.92)
Accent: British (RP) (0.88)
Vocal style: measured (0.81)
Pitch: medium-low (0.90)
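The threshold-it-yourself pattern can be sketched in a few lines. This follows the `{ value, confidence }` shape of the `voiceProfile` returned by the transcription example on this page; the per-dimension cutoffs themselves are arbitrary choices for illustration, not API defaults.

```javascript
// Keep only the profile dimensions we can trust, given a per-dimension cutoff.
// Cutoffs are illustrative, not API defaults.
const MIN_CONFIDENCE = { emotion: 0.8, age: 0.9, accent: 0.85, pitch: 0.8, vocalStyle: 0.85 };

function trustedSignals(voiceProfile) {
  const trusted = {};
  for (const [dim, signal] of Object.entries(voiceProfile)) {
    // Unknown dimensions get a conservative default cutoff.
    if (signal.confidence >= (MIN_CONFIDENCE[dim] ?? 0.9)) {
      trusted[dim] = signal.value;
    }
  }
  return trusted;
}

// The profile card above, with confidence as numbers between 0 and 1:
const profile = {
  emotion:    { value: 'stressed',     confidence: 0.87 },
  age:        { value: '30s',          confidence: 0.92 },
  accent:     { value: 'British (RP)', confidence: 0.88 },
  vocalStyle: { value: 'measured',     confidence: 0.81 },
  pitch:      { value: 'medium-low',   confidence: 0.9 },
};

trustedSignals(profile);
// → { emotion: 'stressed', age: '30s', accent: 'British (RP)', pitch: 'medium-low' }
//   (vocalStyle dropped: 0.81 is below its 0.85 cutoff)
```

Acting only above a cutoff is the simplest policy; the sections below add decay over time and agreement across utterances on top of it.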
Flows through the pipeline

Tone travels with every turn, from the mic to the answer.

The profile is injected into Router context, so the LLM sees the stress. Router steers TTS to reply softer and slower. Standalone emotion APIs stop at a score.
Emotion flows through the pipeline

STT detects emotion: stressed · 0.87
Router reasons with it: "match user tone"
TTS responds in kind: softer · [sigh]

Standalone emotion APIs return a score. Ours becomes part of the conversation.
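The STT → Router → TTS hand-off above can be sketched as a plain mapping. The steering values here are illustrative stand-ins, not the Router's actual vocabulary; inside the Realtime API this reasoning happens in the Router, not a lookup table.

```javascript
// Map a detected emotion to a TTS steering hint, mirroring the
// STT → Router → TTS flow. The mapping is illustrative.
const TTS_STEERING = {
  stressed: { tone: 'calm',       pace: 'slower' },
  anxious:  { tone: 'reassuring', pace: 'slower' },
  relieved: { tone: 'warm',       pace: 'normal' },
  neutral:  { tone: 'neutral',    pace: 'normal' },
};

function steerReply(utteranceProfile) {
  const { value: emotion, confidence } = utteranceProfile.emotion;
  // Below the cutoff, fall back to a neutral delivery rather than guessing.
  if (confidence < 0.7) return TTS_STEERING.neutral;
  return TTS_STEERING[emotion] ?? TTS_STEERING.neutral;
}

steerReply({ emotion: { value: 'stressed', confidence: 0.87 } });
// → { tone: 'calm', pace: 'slower' }
```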
Respond in kind
Works with
Realtime API · TTS

Your agent hears the tone, not just the words.

A stressed user gets a calm reply, not a checklist. The profile feeds the prompt directly. No separate emotion model. No glue code between layers.
User utterance
"I've been putting this off for months. I don't even know where to start."
emotion: stressed
pace: slow
agent responds: calm, short
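"The profile feeds the prompt directly" can be sketched as a string-building step. The prompt wording here is ours, purely for illustration; inside the Realtime API this injection happens automatically via Router context.

```javascript
// Fold high-confidence profile dimensions into the system prompt so the
// LLM sees them. Wording and the 0.7 cutoff are illustrative assumptions.
function buildSystemPrompt(voiceProfile) {
  const hints = Object.entries(voiceProfile)
    .filter(([, s]) => s.confidence >= 0.7)
    .map(([dim, s]) => `${dim}: ${s.value} (${s.confidence.toFixed(2)})`)
    .join(', ');
  return (
    `You are a voice agent. The caller currently sounds like this: ${hints}. ` +
    `Adapt to their state: a stressed caller gets a calm, short reply, not a checklist.`
  );
}

buildSystemPrompt({
  emotion: { value: 'stressed', confidence: 0.87 },
  pitch:   { value: 'medium-low', confidence: 0.9 },
});
// → "...emotion: stressed (0.87), pitch: medium-low (0.90)..."
```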
Scored, not labeled

Signals you can act on, not just observe.

Every signal ships with a confidence score between 0 and 1. Threshold, decay over time, or require agreement across utterances. Never stuck with a binary label.
Confidence, not a label
5 dimensions, scored per utterance
Threshold or decay on confidence. Your app, your rules.
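One way to decay a signal over time and require agreement across utterances, as a sketch. The half-life, window size, and cutoff are arbitrary choices, not API behavior.

```javascript
// Exponentially decay confidence as an utterance ages, and only act on an
// emotion once the last N utterances agree at sufficient decayed confidence.
const HALF_LIFE_MS = 30_000; // arbitrary half-life

function decayedConfidence(confidence, ageMs) {
  return confidence * Math.pow(0.5, ageMs / HALF_LIFE_MS);
}

function agreedEmotion(utterances, nowMs, { window = 2, cutoff = 0.5 } = {}) {
  const recent = utterances.slice(-window);
  if (recent.length < window) return null;
  const first = recent[0].emotion;
  const allAgree = recent.every(
    (u) => u.emotion === first &&
           decayedConfidence(u.confidence, nowMs - u.atMs) >= cutoff
  );
  return allAgree ? first : null;
}

const turns = [
  { emotion: 'anxious',  confidence: 0.72, atMs: 12_000 },
  { emotion: 'stressed', confidence: 0.87, atMs: 18_000 },
  { emotion: 'stressed', confidence: 0.83, atMs: 24_000 },
];
agreedEmotion(turns, 30_000); // → 'stressed' (last two turns agree)
```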
Tracks shifts in real time
Works with
Realtime API

A call isn't a single emotion. It's a curve.

The profile updates per utterance. Watch a call move from anxious to relieved and have your agent behave differently at each beat. De-escalation falls out naturally.
Tracks shifts in real time
call timeline
0:00 neutral
0:12 anxious
0:18 stressed
0:24 relieved
0:30 calm
A call isn't a single emotion. It's a curve. Your agent can follow it.
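Following the curve can be sketched as scanning the per-utterance timeline for calming transitions. Which emotions count as "tense" versus "calm" is our assumption for illustration, not part of the API.

```javascript
// Flag de-escalation: any move from a tense state to a calmer one.
// The emotion groupings are illustrative assumptions.
const TENSE = new Set(['anxious', 'stressed', 'frustrated']);
const CALM  = new Set(['relieved', 'calm', 'neutral']);

function deEscalations(timeline) {
  const events = [];
  for (let i = 1; i < timeline.length; i++) {
    const prev = timeline[i - 1];
    const curr = timeline[i];
    if (TENSE.has(prev.emotion) && CALM.has(curr.emotion)) {
      events.push({ at: curr.at, from: prev.emotion, to: curr.emotion });
    }
  }
  return events;
}

// The call timeline shown above:
const call = [
  { at: '0:00', emotion: 'neutral' },
  { at: '0:12', emotion: 'anxious' },
  { at: '0:18', emotion: 'stressed' },
  { at: '0:24', emotion: 'relieved' },
  { at: '0:30', emotion: 'calm' },
];
deEscalations(call);
// → [{ at: '0:24', from: 'stressed', to: 'relieved' }]
```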
Already in production
Works with
Realtime API

Already shipping in products people pay for.

Wellness apps match tone to user state. Support agents catch de-escalation. Companions feel less scripted. Voice profiling is the signal customers tell us surprises them.
Where empathic understanding ships
Wellness companion
Tone-matched support
Support agent
De-escalation detection
Coaching
Stress tracking over time
Consumer companion
Emotionally aware dialogue

Transcript + profile, in one call

Sync transcription returns both the text and the voice profile. Stream them over WebSocket for live profile updates.
```javascript
import fs from 'fs';

const audio = fs.readFileSync('clip.wav').toString('base64');

const resp = await fetch('https://api.inworld.ai/stt/v1/transcribe', {
  method: 'POST',
  headers: {
    Authorization: `Basic ${process.env.INWORLD_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    transcribeConfig: {
      modelId: 'inworld/inworld-stt-1',
      audioEncoding: 'AUTO_DETECT',
      language: 'en-US',
    },
    audioData: { content: audio },
  }),
});

const { transcription, voiceProfile } = await resp.json();
// voiceProfile = { emotion, age, accent, pitch, vocalStyle }
// each with { value, confidence }
```

FAQ

What does the voice profile include?
The profile covers emotion, age, accent, pitch, and vocal style. Each comes with a confidence score between 0 and 1, so you can threshold or decay depending on how the dimension is used downstream.

Which models return voice profiling?
Voice profiling is returned by the `inworld/inworld-stt-1` model. Other models on the Inworld STT API (Whisper, AssemblyAI) return transcripts only.

Which languages are supported?
`inworld/inworld-stt-1` is English-only at research-preview launch. Transcription in other languages is available through the Whisper and AssemblyAI models on the same endpoint.

Does it work with the Realtime API?
Yes. Inside the Realtime API, the profile flows automatically into Router context, so the LLM reasons with the user's emotional state. Router can emit TTS steering instructions that adjust voice tone. This cross-layer carry is unique to the Inworld pipeline.

How accurate is it?
The profile returns with confidence scores so you can decide when to trust it. An ensemble of a classifier and an LLM-based profiler is being rolled out to improve descriptive accuracy (emotion, style) while keeping user-context accuracy (age, accent) strong. This is a research preview; expect iteration.

How is the data handled?
Voice profiling runs on the same zero-data-retention STT pipeline as transcription. GDPR and SOC 2 Type II compliant. On-premise deployment is available for regulated environments; contact sales.

How is this different from sentiment analysis?
Sentiment analysis gives you a score on text after the fact. Voice profiling returns acoustic-derived signal on the audio itself, per utterance, across tone, identity, and acoustic style. The point isn't to label a conversation; it's to change how your agent behaves during one.

Is it production-ready?
Research preview. APIs are stable enough to build on; the profile model itself is improving weekly.

Hear how they feel. Respond like you did.

Voice profile on every utterance. Flows through the pipeline. Makes every reply sound like it knew.
Copyright © 2021-2026 Inworld AI