Speech-to-Text API with Voice Profiling: Emotion, Accent, Age, and Pitch Detection

Last updated: April 5, 2026
Inworld AI STT detects caller emotion, accent, age, pitch, and vocal style alongside transcription in a single API call. Most STT APIs return a text string and nothing else. Inworld STT returns a full voice profile: how the speaker feels, their accent and linguistic background, and vocal characteristics like age range and pitch. The result is a speech recognition API that applications can act on in real time, not just transcribe.
Transcription supports 99+ languages via the Groq/Whisper provider, while voice profiling is available through the inworld/inworld-stt-1 model (currently English-only). According to Grand View Research, the global speech recognition market reached $14.8 billion in 2025, driven by demand for voice analytics that go beyond raw transcription.

How Does Inworld STT Compare to Other Speech-to-Text APIs?

The core difference: most STT providers treat voice as a text extraction problem. Inworld treats it as a signal-rich data source. This table compares the capabilities that matter for developers building voice-aware applications.
Voice profiling is what separates a transcription service from a voice intelligence API. A call center agent assist tool that only receives text has no idea whether the caller is frustrated or satisfied. With Inworld STT, that context arrives alongside every transcription result.

What Is Voice Profiling and Why Does It Matter?

Voice profiling extracts structured metadata from speech that text alone cannot capture. When a user speaks, the audio signal contains far more information than the words themselves: pitch variation indicates emotional state, speaking rate signals urgency, accent patterns reveal linguistic background, and vocal quality reveals age and style characteristics.
Traditional STT pipelines discard this information. They convert audio to text and throw away the acoustic features. Building voice-aware applications on top of a text-only STT requires bolting on separate ML models for sentiment analysis, speaker identification, and vocal analysis, each adding latency, cost, and integration complexity.
Inworld STT extracts all of this in a single API call. When using the inworld/inworld-stt-1 model, the response includes the transcript alongside a voice profile object containing emotion labels, accent detection, age estimation, pitch analysis, and vocal style classification, each returned as {label, confidence} arrays.
Five dimensions define Inworld's voice profiling (via the inworld/inworld-stt-1 model):
  • Emotion detection. Per-utterance emotional state derived from acoustic features like pitch, energy, and tempo. Not sentiment analysis on the transcript text, but actual vocal analysis. Returns labels with confidence scores.
  • Accent detection. Automatic detection of the speaker's accent and linguistic background, enabling localized routing and personalized responses without requiring the user to self-select a language.
  • Age estimation. Estimates the speaker's age range from vocal characteristics, useful for age-appropriate content routing and demographic analytics.
  • Pitch analysis. Measures the speaker's vocal pitch characteristics, enabling voice-aware application logic.
  • Vocal style classification. Categorizes the speaker's vocal delivery style, providing context for how (not just what) the speaker is communicating.
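Since each dimension is returned as a {label, confidence} array, a common first step is selecting the top label per dimension. The sketch below assumes that array shape; the dimension and field names in the sample data are illustrative, not a confirmed response schema.

```python
# Sketch: pick the highest-confidence label per profiling dimension.
# Dimension names ("emotion", "accent") are assumptions based on the five
# dimensions described above, not a documented response schema.

def top_labels(voice_profile: dict) -> dict:
    """Return the highest-confidence label for each profiling dimension."""
    result = {}
    for dimension, candidates in voice_profile.items():
        if candidates:  # each dimension is a list of {label, confidence}
            best = max(candidates, key=lambda c: c["confidence"])
            result[dimension] = best["label"]
    return result

sample_profile = {
    "emotion": [{"label": "frustrated", "confidence": 0.81},
                {"label": "neutral", "confidence": 0.12}],
    "accent": [{"label": "en-GB", "confidence": 0.67}],
}
print(top_labels(sample_profile))
# {'emotion': 'frustrated', 'accent': 'en-GB'}
```

Keeping the full array (rather than only the top label) is useful when downstream logic wants to treat low-confidence detections as "unknown."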

How Do You Transcribe Audio with Voice Profiling in Python?

The Inworld STT API accepts audio data via a standard POST request and returns both the transcript and a structured voice profile. Here is a complete Python example using the /stt/v1/transcribe endpoint.
import requests
import base64

# pip install requests
API_KEY = "YOUR_API_KEY"  # From https://platform.inworld.ai

STT_ENDPOINT = "https://api.inworld.ai/stt/v1/transcribe"

# Read and base64-encode the audio file
with open("call_recording.wav", "rb") as audio_file:
    audio_b64 = base64.b64encode(audio_file.read()).decode("utf-8")

# Transcribe with Inworld STT
# Voice profiling features (emotion, accent, age, pitch, style) are available
# with the Inworld STT model. Check https://docs.inworld.ai/stt/overview
# for the latest supported features and response fields.
response = requests.post(
    STT_ENDPOINT,
    headers={
        "Authorization": f"Basic {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "transcribeConfig": {
            "modelId": "inworld/inworld-stt-1",
            "audioEncoding": "AUTO_DETECT",
            "language": "en-US"
        },
        "audioData": {
            "content": audio_b64
        }
    },
    timeout=60
)
response.raise_for_status()

result = response.json()

# Access the transcript
transcription = result["transcription"]
print("Transcript:", transcription["transcript"])
print("Is final:", transcription["isFinal"])

# Access word-level timestamps
for word in transcription.get("wordTimestamps", []):
    print(f"  '{word['word']}' at {word['startTime']}-{word['endTime']}")

# Voice profiling fields (emotion, accent, age, pitch, style) may be available
# with the inworld/inworld-stt-1 model. See STT docs for current schema.
print("Usage:", result.get("usage", {}))
The response includes the transcript with word-level timestamps and usage data. Voice profiling features (emotion, accent, age, pitch, vocal style) are available with the inworld/inworld-stt-1 model. Check the STT documentation for the latest supported features and response fields.

How Do You Use the STT Voice Profiling API in JavaScript?

For Node.js applications and serverless functions, the same endpoint works with standard HTTP libraries. This example uses the built-in fetch API available in Node.js 18+.
import { readFile } from 'node:fs/promises';

const STT_ENDPOINT = 'https://api.inworld.ai/stt/v1/transcribe';
const API_KEY = 'YOUR_API_KEY'; // From https://platform.inworld.ai

async function transcribeWithProfiling(audioPath) {
  const audioBuffer = await readFile(audioPath);
  const audioB64 = audioBuffer.toString('base64');

  // Voice profiling features (emotion, accent, age, pitch, style) are available
  // with the Inworld STT model. Check https://docs.inworld.ai/stt/overview
  // for the latest supported features and response fields.
  const response = await fetch(STT_ENDPOINT, {
    method: 'POST',
    headers: {
      'Authorization': `Basic ${API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      transcribeConfig: {
        modelId: 'inworld/inworld-stt-1',
        audioEncoding: 'AUTO_DETECT',
        language: 'en-US',
      },
      audioData: {
        content: audioB64,
      },
    }),
  });

  if (!response.ok) {
    throw new Error(`STT API error ${response.status}: ${await response.text()}`);
  }

  const result = await response.json();

  console.log('Transcript:', result.transcription.transcript);
  console.log('Is final:', result.transcription.isFinal);

  // Access word-level timestamps
  for (const word of result.transcription.wordTimestamps || []) {
    console.log(`  '${word.word}' at ${word.startTime}-${word.endTime}`);
  }

  // Voice profiling fields (emotion, accent, age, pitch, style) may be available
  // with the inworld/inworld-stt-1 model. See STT docs for current schema.
  console.log('Usage:', result.usage);

  return result;
}

transcribeWithProfiling('call_recording.wav');
Both examples follow the same pattern: base64-encode the audio, send a JSON request to the transcribe endpoint with a transcribeConfig and audioData body, then parse the structured JSON response. Setting audioEncoding to AUTO_DETECT handles format detection automatically for WAV, MP3, OGG, and FLAC inputs.

What Are the Best Use Cases for STT with Voice Profiling?

Voice profiling enables application patterns that text-only transcription cannot support. Three use cases where acoustic analysis directly improves the user experience:

Call center quality monitoring and agent assist

Contact centers generate millions of hours of voice data. Traditional quality assurance reviews 1-3% of calls manually. With voice profiling, every call is automatically scored on customer emotion trajectory, vocal style changes, and agent empathy patterns.
A real-time agent assist tool built on Inworld STT can detect rising frustration in a caller's voice and surface de-escalation prompts to the human agent before the situation deteriorates. The emotion detection operates on acoustic features, so it identifies frustration even when the caller's words remain polite. McKinsey research shows that AI-assisted agents resolve issues 14% faster than unassisted agents. Real-time voice analytics add another dimension to that assist.
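A de-escalation trigger like the one described above can be sketched as a small client-side policy over per-utterance emotion results. The negative-emotion labels, the 0.7 confidence threshold, and the two-utterance streak below are illustrative assumptions, not Inworld-defined values.

```python
# Sketch of a frustration trigger for agent assist. Labels, threshold, and
# streak length are illustrative assumptions, not API-defined values.

NEGATIVE_EMOTIONS = {"frustrated", "angry", "annoyed"}

def should_prompt_deescalation(utterance_emotions, threshold=0.7, streak=2):
    """Flag when the last `streak` utterances are confidently negative."""
    recent = utterance_emotions[-streak:]
    if len(recent) < streak:
        return False
    return all(
        e["label"] in NEGATIVE_EMOTIONS and e["confidence"] >= threshold
        for e in recent
    )

call_so_far = [
    {"label": "neutral", "confidence": 0.90},
    {"label": "frustrated", "confidence": 0.75},
    {"label": "frustrated", "confidence": 0.82},
]
print(should_prompt_deescalation(call_so_far))  # True
```

Requiring a streak of confident detections, rather than reacting to a single utterance, reduces false alarms from one-off misclassifications.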

Language learning and accent analysis

Language learning applications need more than correct words. Pronunciation quality, accent drift, and confidence level all affect learning outcomes. Inworld STT's accent detection identifies the speaker's native language influence on their target language pronunciation.
A language tutor built on this API can tell a Spanish-speaking English learner that their vowel sounds are shifting toward Spanish phonemes, or that their speaking confidence (measured by pace and pitch stability) has improved over the past week. This level of feedback was previously only available from human tutors.
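A pitch-stability metric of the kind mentioned above can be computed client-side from pitch samples the application tracks itself. The score below (inverse coefficient of variation) is an illustrative metric, not part of the Inworld API.

```python
# Sketch: a simple pitch-stability score (higher = steadier), computed from
# pitch samples the application collects per session. This metric is an
# illustrative assumption, not an Inworld API feature.
import statistics

def pitch_stability(pitch_hz: list[float]) -> float:
    """1 / (1 + coefficient of variation); 1.0 means perfectly steady."""
    mean = statistics.fmean(pitch_hz)
    cv = statistics.pstdev(pitch_hz) / mean
    return 1 / (1 + cv)

week1 = [180, 220, 150, 240, 170]  # erratic pitch
week2 = [190, 200, 185, 205, 195]  # steadier pitch
print(pitch_stability(week1) < pitch_stability(week2))  # True: week 2 is steadier
```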

Voice agent emotion-aware routing

Conversational AI agents that use STT as their input layer typically route based on transcript text and keyword matching. Voice profiling adds a second routing dimension: vocal emotion and style signals.
A customer saying "I'd like to cancel my subscription" in a calm tone versus the same words delivered with audible frustration represent two different situations requiring different handling. Voice profiling enables the routing logic to detect the frustrated caller's emotional state and escalate to a human agent, while handling the calm cancellation through automated self-service. Combined with Inworld's Realtime API, this creates a full voice agent pipeline with both understanding and response capabilities.
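The two-dimensional routing described above (transcript intent plus vocal emotion) can be sketched as a small decision function. The keyword check, emotion labels, threshold, and route names below are all illustrative assumptions.

```python
# Sketch: route on transcript intent plus vocal emotion. The keyword
# matching, labels, threshold, and route names are illustrative assumptions.

def route(transcript: str, emotion: dict) -> str:
    wants_cancel = "cancel" in transcript.lower()
    frustrated = (emotion["label"] == "frustrated"
                  and emotion["confidence"] >= 0.6)
    if wants_cancel and frustrated:
        return "human_agent"            # escalate the frustrated caller
    if wants_cancel:
        return "self_service_cancellation"  # calm caller: automated flow
    return "default_flow"

print(route("I'd like to cancel my subscription",
            {"label": "frustrated", "confidence": 0.84}))  # human_agent
print(route("I'd like to cancel my subscription",
            {"label": "calm", "confidence": 0.91}))  # self_service_cancellation
```

In production, the intent check would come from an NLU model rather than a keyword match, but the routing structure is the same: identical transcripts diverge on the emotion signal.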

How Does Voice Profiling Work at the API Level?

The Inworld STT API processes audio through multiple parallel analysis stages. When a request arrives at POST /stt/v1/transcribe with the inworld/inworld-stt-1 model, the pipeline executes:
  1. Acoustic feature extraction. The raw audio is decomposed into mel-spectrograms and prosodic features (pitch contour, energy envelope, speaking rate, pause patterns).
  2. Speech recognition. The acoustic features feed into the transcription model, producing word-level timestamps and the full transcript.
  3. Voice profile analysis. The acoustic features are independently analyzed across five dimensions: emotion (categorical labels plus confidence scores), accent characteristics (detected linguistic background), age estimation, pitch analysis, and vocal style classification. Each dimension returns {label, confidence} arrays.
  4. Response assembly. All outputs are merged into a single JSON response with the transcript, word timestamps, voice profile object, and usage data.
This parallel architecture means voice profiling adds minimal latency over standard transcription. The acoustic features extracted for speech recognition are reused by the profiling models, avoiding redundant computation.
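Putting the four stages together, an assembled response might look like the sketch below. The `transcription` and `usage` fields match those accessed in the code examples earlier; `voiceProfile` and the dimension names inside it are assumed names for the profiling object, so check the STT docs for the actual schema.

```python
# Illustrative response shape after stage 4 (response assembly).
# `transcription` and `usage` match fields used in the examples above;
# `voiceProfile` and its keys are assumptions, not a confirmed schema.
example_response = {
    "transcription": {
        "transcript": "I'd like to cancel my subscription.",
        "isFinal": True,
        "wordTimestamps": [
            {"word": "I'd", "startTime": "0.00s", "endTime": "0.21s"},
        ],
    },
    "voiceProfile": {
        "emotion": [{"label": "frustrated", "confidence": 0.81}],
        "accent": [{"label": "en-US", "confidence": 0.74}],
        "age": [{"label": "30-39", "confidence": 0.52}],
        "pitch": [{"label": "medium", "confidence": 0.66}],
        "style": [{"label": "conversational", "confidence": 0.70}],
    },
    "usage": {"durationSeconds": 4.2},
}

print(sorted(example_response["voiceProfile"]))  # the five dimensions
```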

How Does Inworld STT Fit Into a Full Voice Pipeline?

STT with voice profiling is one component of a complete voice AI stack. For applications that need to both understand and respond to speech, the full Inworld pipeline includes:
  • Speech-to-Text (this API): Audio in, transcript + voice profile out.
  • LLM Router: Routes the transcript to the optimal LLM based on task complexity, cost, and latency requirements. Routes to 200+ models through a single endpoint.
  • Text-to-Speech: Converts the LLM response back to natural speech. #1 ranked on Artificial Analysis for voice quality.
  • Realtime API: Orchestrates the full STT-LLM-TTS pipeline with sub-second end-to-end latency, turn-taking, and barge-in support.
Each component works standalone via REST API, or together through the Realtime API for end-to-end conversational AI. The voice profile data from STT can inform LLM prompting (e.g., "the caller sounds frustrated, respond with empathy") and TTS voice selection (e.g., match the response tone to the detected emotion).
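Feeding the voice profile into LLM prompting, as suggested above, can be as simple as mapping the detected emotion to a system-prompt hint. The mapping and wording below are illustrative assumptions, not an Inworld feature.

```python
# Sketch: turn a detected emotion label into an LLM system-prompt hint.
# The mapping is an illustrative assumption, not an Inworld API feature.

TONE_HINTS = {
    "frustrated": "The caller sounds frustrated; respond with empathy first.",
    "happy": "The caller sounds upbeat; keep the response light and brief.",
}

def prompt_hint(emotion_label: str) -> str:
    return TONE_HINTS.get(emotion_label, "Respond in a neutral, helpful tone.")

system_prompt = "You are a support agent. " + prompt_hint("frustrated")
print(system_prompt)
```

The same lookup pattern can drive TTS voice selection, so the response delivery matches the detected emotion as well as the wording.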
Start building with Inworld STT voice profiling: Get your API key
Copyright © 2021-2026 Inworld AI