Speech-to-Text API with Voice Profiling: Emotion, Accent, Age, and Pitch Detection

Last updated: April 5, 2026
Inworld AI STT detects caller emotion, accent, age, pitch, and vocal style alongside transcription in a single API call. Most STT APIs return a text string and nothing else. Inworld STT returns a full voice profile: how the speaker feels, their accent and linguistic background, and vocal characteristics like age range and pitch. The result is a speech recognition API that applications can act on in real time, not just transcribe.
Transcription supports 99+ languages via the Groq/Whisper provider, while voice profiling is available through the inworld/inworld-stt-1 model (currently English-only). According to Grand View Research, the global speech recognition market reached $14.8 billion in 2025, driven by demand for voice analytics that go beyond raw transcription.

How Does Inworld STT Compare to Other Speech-to-Text APIs?

The core difference: most STT providers treat voice as a text extraction problem. Inworld treats it as a signal-rich data source. This table compares the capabilities that matter for developers building voice-aware applications.
Voice profiling is what separates a transcription service from a voice intelligence API. A call center agent assist tool that only receives text has no idea whether the caller is frustrated or satisfied. With Inworld STT, that context arrives alongside every transcription result.

What Is Voice Profiling and Why Does It Matter?

Voice profiling extracts structured metadata from speech that text alone cannot capture. When a user speaks, the audio signal contains far more information than the words themselves: pitch variation indicates emotional state, speaking rate signals urgency, accent patterns reveal linguistic background, and vocal quality reveals age and style characteristics.
Traditional STT pipelines discard this information. They convert audio to text and throw away the acoustic features. Building voice-aware applications on top of a text-only STT requires bolting on separate ML models for sentiment analysis, speaker identification, and vocal analysis, each adding latency, cost, and integration complexity.
Inworld STT extracts all of this in a single API call. When using the inworld/inworld-stt-1 model, the response includes the transcript alongside a voice profile object containing emotion labels, accent detection, age estimation, pitch analysis, and vocal style classification, each returned as {label, confidence} arrays.
Five dimensions define Inworld's voice profiling (via the inworld/inworld-stt-1 model):
  • Emotion detection. Per-utterance emotional state derived from acoustic features like pitch, energy, and tempo. Not sentiment analysis on the transcript text, but actual vocal analysis. Returns labels with confidence scores.
  • Accent detection. Automatic detection of the speaker's accent and linguistic background, enabling localized routing and personalized responses without requiring the user to self-select a language.
  • Age estimation. Estimates the speaker's age range from vocal characteristics, useful for age-appropriate content routing and demographic analytics.
  • Pitch analysis. Measures the speaker's vocal pitch characteristics, enabling voice-aware application logic.
  • Vocal style classification. Categorizes the speaker's vocal delivery style, providing context for how (not just what) the speaker is communicating.
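Since each dimension is returned as a {label, confidence} array, a common first step is selecting the top label per dimension. The sketch below assumes that array shape; the dimension and field names in the sample data are illustrative, not a confirmed response schema.

```python
# Sketch: pick the highest-confidence label per profiling dimension.
# Dimension names ("emotion", "accent") are assumptions based on the five
# dimensions described above, not a documented response schema.

def top_labels(voice_profile: dict) -> dict:
    """Return the highest-confidence label for each profiling dimension."""
    result = {}
    for dimension, candidates in voice_profile.items():
        if candidates:  # each dimension is a list of {label, confidence}
            best = max(candidates, key=lambda c: c["confidence"])
            result[dimension] = best["label"]
    return result

sample_profile = {
    "emotion": [{"label": "frustrated", "confidence": 0.81},
                {"label": "neutral", "confidence": 0.12}],
    "accent": [{"label": "en-GB", "confidence": 0.67}],
}
print(top_labels(sample_profile))
# {'emotion': 'frustrated', 'accent': 'en-GB'}
```

Keeping the full array (rather than only the top label) is useful when downstream logic wants to treat low-confidence detections as "unknown."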

How Do You Transcribe Audio with Voice Profiling in Python?

The Inworld STT API accepts audio data via a standard POST request and returns both the transcript and a structured voice profile. Here is a complete Python example using the /stt/v1/transcribe endpoint.
import requests
import base64

# pip install requests
API_KEY = "YOUR_API_KEY"  # From https://platform.inworld.ai

STT_ENDPOINT = "https://api.inworld.ai/stt/v1/transcribe"

# Read and base64-encode the audio file
with open("call_recording.wav", "rb") as audio_file:
    audio_b64 = base64.b64encode(audio_file.read()).decode("utf-8")

# Transcribe with Inworld STT
# Voice profiling features (emotion, accent, age, pitch, style) are available
# with the Inworld STT model. Check https://docs.inworld.ai/stt/overview
# for the latest supported features and response fields.
response = requests.post(
    STT_ENDPOINT,
    headers={
        "Authorization": f"Basic {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "transcribeConfig": {
            "modelId": "inworld/inworld-stt-1",
            "audioEncoding": "AUTO_DETECT",
            "language": "en-US"
        },
        "audioData": {
            "content": audio_b64
        }
    },
    timeout=60
)
response.raise_for_status()

result = response.json()

# Access the transcript
transcription = result["transcription"]
print("Transcript:", transcription["transcript"])
print("Is final:", transcription["isFinal"])

# Access word-level timestamps
for word in transcription.get("wordTimestamps", []):
    print(f"  '{word['word']}' at {word['startTime']}-{word['endTime']}")

# Voice profiling fields (emotion, accent, age, pitch, style) may be available
# with the inworld/inworld-stt-1 model. See STT docs for current schema.
print("Usage:", result.get("usage", {}))
The response includes the transcript with word-level timestamps and usage data. Voice profiling features (emotion, accent, age, pitch, vocal style) are available with the inworld/inworld-stt-1 model. Check the STT documentation for the latest supported features and response fields.

How Do You Use the STT Voice Profiling API in JavaScript?

For Node.js applications and serverless functions, the same endpoint works with standard HTTP libraries. This example uses the built-in fetch API available in Node.js 18+.
import { readFile } from 'node:fs/promises';

const STT_ENDPOINT = 'https://api.inworld.ai/stt/v1/transcribe';
const API_KEY = 'YOUR_API_KEY'; // From https://platform.inworld.ai

async function transcribeWithProfiling(audioPath) {
  const audioBuffer = await readFile(audioPath);
  const audioB64 = audioBuffer.toString('base64');

  // Voice profiling features (emotion, accent, age, pitch, style) are available
  // with the Inworld STT model. Check https://docs.inworld.ai/stt/overview
  // for the latest supported features and response fields.
  const response = await fetch(STT_ENDPOINT, {
    method: 'POST',
    headers: {
      'Authorization': `Basic ${API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      transcribeConfig: {
        modelId: 'inworld/inworld-stt-1',
        audioEncoding: 'AUTO_DETECT',
        language: 'en-US',
      },
      audioData: {
        content: audioB64,
      },
    }),
  });

  if (!response.ok) {
    throw new Error(`STT API error ${response.status}: ${await response.text()}`);
  }

  const result = await response.json();

  console.log('Transcript:', result.transcription.transcript);
  console.log('Is final:', result.transcription.isFinal);

  // Access word-level timestamps
  for (const word of result.transcription.wordTimestamps || []) {
    console.log(`  '${word.word}' at ${word.startTime}-${word.endTime}`);
  }

  // Voice profiling fields (emotion, accent, age, pitch, style) may be available
  // with the inworld/inworld-stt-1 model. See STT docs for current schema.
  console.log('Usage:', result.usage);

  return result;
}

transcribeWithProfiling('call_recording.wav');
Both examples follow the same pattern: base64-encode the audio, send a JSON request to the transcribe endpoint with a transcribeConfig and audioData body, then parse the structured JSON response. Setting audioEncoding to AUTO_DETECT handles format detection automatically for WAV, MP3, OGG, and FLAC inputs.

What Are the Best Use Cases for STT with Voice Profiling?

Voice profiling enables application patterns that text-only transcription cannot support. Three use cases where acoustic analysis directly improves the user experience:

Call center quality monitoring and agent assist

Contact centers generate millions of hours of voice data. Traditional quality assurance reviews 1-3% of calls manually. With voice profiling, every call is automatically scored on customer emotion trajectory, vocal style changes, and agent empathy patterns.
A real-time agent assist tool built on Inworld STT can detect rising frustration in a caller's voice and surface de-escalation prompts to the human agent before the situation deteriorates. The emotion detection operates on acoustic features, so it identifies frustration even when the caller's words remain polite. McKinsey research shows that AI-assisted agents resolve issues 14% faster than unassisted agents. Real-time voice analytics add another dimension to that assist.
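A de-escalation trigger like the one described above can be sketched as a small client-side policy over per-utterance emotion results. The negative-emotion labels, the 0.7 confidence threshold, and the two-utterance streak below are illustrative assumptions, not Inworld-defined values.

```python
# Sketch of a frustration trigger for agent assist. Labels, threshold, and
# streak length are illustrative assumptions, not API-defined values.

NEGATIVE_EMOTIONS = {"frustrated", "angry", "annoyed"}

def should_prompt_deescalation(utterance_emotions, threshold=0.7, streak=2):
    """Flag when the last `streak` utterances are confidently negative."""
    recent = utterance_emotions[-streak:]
    if len(recent) < streak:
        return False
    return all(
        e["label"] in NEGATIVE_EMOTIONS and e["confidence"] >= threshold
        for e in recent
    )

call_so_far = [
    {"label": "neutral", "confidence": 0.90},
    {"label": "frustrated", "confidence": 0.75},
    {"label": "frustrated", "confidence": 0.82},
]
print(should_prompt_deescalation(call_so_far))  # True
```

Requiring a streak of confident detections, rather than reacting to a single utterance, reduces false alarms from one-off misclassifications.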

Language learning and accent analysis

Language learning applications need more than correct words. Pronunciation quality, accent drift, and confidence level all affect learning outcomes. Inworld STT's accent detection identifies the speaker's native language influence on their target language pronunciation.
A language tutor built on this API can tell a Spanish-speaking English learner that their vowel sounds are shifting toward Spanish phonemes, or that their speaking confidence (measured by pace and pitch stability) has improved over the past week. This level of feedback was previously only available from human tutors.
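A pitch-stability metric of the kind mentioned above can be computed client-side from pitch samples the application tracks itself. The score below (inverse coefficient of variation) is an illustrative metric, not part of the Inworld API.

```python
# Sketch: a simple pitch-stability score (higher = steadier), computed from
# pitch samples the application collects per session. This metric is an
# illustrative assumption, not an Inworld API feature.
import statistics

def pitch_stability(pitch_hz: list[float]) -> float:
    """1 / (1 + coefficient of variation); 1.0 means perfectly steady."""
    mean = statistics.fmean(pitch_hz)
    cv = statistics.pstdev(pitch_hz) / mean
    return 1 / (1 + cv)

week1 = [180, 220, 150, 240, 170]  # erratic pitch
week2 = [190, 200, 185, 205, 195]  # steadier pitch
print(pitch_stability(week1) < pitch_stability(week2))  # True: week 2 is steadier
```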

Voice agent emotion-aware routing

Conversational AI agents that use STT as their input layer typically route based on transcript text and keyword matching. Voice profiling adds a second routing dimension: vocal emotion and style signals.
A customer saying "I'd like to cancel my subscription" in a calm tone versus the same words delivered with audible frustration represent two different situations requiring different handling. Voice profiling enables the routing logic to detect the frustrated caller's emotional state and escalate to a human agent, while handling the calm cancellation through automated self-service. Combined with Inworld's Realtime API, this creates a full voice agent pipeline with both understanding and response capabilities.
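The two-dimensional routing described above (transcript intent plus vocal emotion) can be sketched as a small decision function. The keyword check, emotion labels, threshold, and route names below are all illustrative assumptions.

```python
# Sketch: route on transcript intent plus vocal emotion. The keyword
# matching, labels, threshold, and route names are illustrative assumptions.

def route(transcript: str, emotion: dict) -> str:
    wants_cancel = "cancel" in transcript.lower()
    frustrated = (emotion["label"] == "frustrated"
                  and emotion["confidence"] >= 0.6)
    if wants_cancel and frustrated:
        return "human_agent"            # escalate the frustrated caller
    if wants_cancel:
        return "self_service_cancellation"  # calm caller: automated flow
    return "default_flow"

print(route("I'd like to cancel my subscription",
            {"label": "frustrated", "confidence": 0.84}))  # human_agent
print(route("I'd like to cancel my subscription",
            {"label": "calm", "confidence": 0.91}))  # self_service_cancellation
```

In production, the intent check would come from an NLU model rather than a keyword match, but the routing structure is the same: identical transcripts diverge on the emotion signal.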

How Does Voice Profiling Work at the API Level?

The Inworld STT API processes audio through multiple parallel analysis stages. When a request arrives at POST /stt/v1/transcribe with the inworld/inworld-stt-1 model, the pipeline executes:
  1. Acoustic feature extraction. The raw audio is decomposed into mel-spectrograms and prosodic features (pitch contour, energy envelope, speaking rate, pause patterns).
  2. Speech recognition. The acoustic features feed into the transcription model, producing word-level timestamps and the full transcript.
  3. Voice profile analysis. The acoustic features are independently analyzed across five dimensions: emotion (categorical labels plus confidence scores), accent characteristics (detected linguistic background), age estimation, pitch analysis, and vocal style classification. Each dimension returns {label, confidence} arrays.
  4. Response assembly. All outputs are merged into a single JSON response with the transcript, word timestamps, voice profile object, and usage data.
This parallel architecture means voice profiling adds minimal latency over standard transcription. The acoustic features extracted for speech recognition are reused by the profiling models, avoiding redundant computation.
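Putting the four stages together, an assembled response might look like the sketch below. The `transcription` and `usage` fields match those accessed in the code examples earlier; `voiceProfile` and the dimension names inside it are assumed names for the profiling object, so check the STT docs for the actual schema.

```python
# Illustrative response shape after stage 4 (response assembly).
# `transcription` and `usage` match fields used in the examples above;
# `voiceProfile` and its keys are assumptions, not a confirmed schema.
example_response = {
    "transcription": {
        "transcript": "I'd like to cancel my subscription.",
        "isFinal": True,
        "wordTimestamps": [
            {"word": "I'd", "startTime": "0.00s", "endTime": "0.21s"},
        ],
    },
    "voiceProfile": {
        "emotion": [{"label": "frustrated", "confidence": 0.81}],
        "accent": [{"label": "en-US", "confidence": 0.74}],
        "age": [{"label": "30-39", "confidence": 0.52}],
        "pitch": [{"label": "medium", "confidence": 0.66}],
        "style": [{"label": "conversational", "confidence": 0.70}],
    },
    "usage": {"durationSeconds": 4.2},
}

print(sorted(example_response["voiceProfile"]))  # the five dimensions
```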

How Does Inworld STT Fit Into a Full Voice Pipeline?

STT with voice profiling is one component of a complete voice AI stack. For applications that need to both understand and respond to speech, the full Inworld pipeline includes:
  • Speech-to-Text (this API): Audio in, transcript + voice profile out.
  • LLM Router: Routes the transcript to the optimal LLM based on task complexity, cost, and latency requirements. Routes to 200+ models through a single endpoint.
  • Text-to-Speech: Converts the LLM response back to natural speech. #1 ranked on Artificial Analysis for voice quality.
  • Realtime API: Orchestrates the full STT-LLM-TTS pipeline with sub-second end-to-end latency, turn-taking, and barge-in support.
Each component works standalone via REST API, or together through the Realtime API for end-to-end conversational AI. The voice profile data from STT can inform LLM prompting (e.g., "the caller sounds frustrated, respond with empathy") and TTS voice selection (e.g., match the response tone to the detected emotion).
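Feeding the voice profile into LLM prompting, as suggested above, can be as simple as mapping the detected emotion to a system-prompt hint. The mapping and wording below are illustrative assumptions, not an Inworld feature.

```python
# Sketch: turn a detected emotion label into an LLM system-prompt hint.
# The mapping is an illustrative assumption, not an Inworld API feature.

TONE_HINTS = {
    "frustrated": "The caller sounds frustrated; respond with empathy first.",
    "happy": "The caller sounds upbeat; keep the response light and brief.",
}

def prompt_hint(emotion_label: str) -> str:
    return TONE_HINTS.get(emotion_label, "Respond in a neutral, helpful tone.")

system_prompt = "You are a support agent. " + prompt_hint("frustrated")
print(system_prompt)
```

The same lookup pattern can drive TTS voice selection, so the response delivery matches the detected emotion as well as the wording.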
Start building with Inworld STT voice profiling: Get your API key
Copyright © 2021-2026 Inworld AI