By Kylan Gibbs, CEO and Co-founder, Inworld AI
Last updated: April 2026
Semantic VAD (Voice Activity Detection) is a turn-taking technique that uses language understanding (not just audio energy) to decide when a speaker has finished a turn. Inworld AI's Realtime API ships with semantic VAD built in: rather than waiting for a fixed silence threshold, the system listens for natural turn boundaries using meaning, prosody, and conversational rhythm. Energy-based VAD cuts users off mid-sentence when they pause to think; semantic VAD waits until the user is actually done. The difference shows up immediately in voice agent quality: callers feel heard rather than interrupted.
This guide explains how semantic VAD works, when to use it, and how it compares with the alternatives shipping in 2026 (Deepgram Flux, LiveKit's End-of-Utterance models, Pipecat Smart Turn).
How Semantic VAD Works
Traditional VAD measures audio energy. If the energy stays below a threshold for N milliseconds, the system declares the user has stopped speaking. This is fast and simple, but it cuts users off whenever they pause to think, breathe, or hesitate.
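For contrast, here is a minimal sketch of the energy-based approach: an RMS threshold plus a silence timer. The threshold, frame size, and timeout values are illustrative, not taken from any particular product.

import math
import struct

FRAME_MS = 20                # 20 ms frames of 16 kHz mono, 16-bit PCM
SAMPLE_RATE = 16_000
ENERGY_THRESHOLD = 500.0     # illustrative RMS cutoff
SILENCE_TIMEOUT_MS = 600     # "N milliseconds" of quiet = end of turn

def rms(frame: bytes) -> float:
    """Root-mean-square energy of one 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def turn_ended(frames) -> bool:
    """True once silence persists past the timeout, however mid-thought."""
    silent_ms = 0
    for frame in frames:
        silent_ms = silent_ms + FRAME_MS if rms(frame) < ENERGY_THRESHOLD else 0
        if silent_ms >= SILENCE_TIMEOUT_MS:
            return True
    return False

Any pause longer than the timeout ends the turn whether or not the sentence was finished, which is exactly the failure mode semantic VAD fixes.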
Semantic VAD uses a language model to evaluate whether the user's utterance is complete (a toy scoring sketch follows the list). Inputs to the decision include:
- Linguistic completeness. Has the user finished a thought (full sentence, complete phrase, syntactically resolved)?
- Prosodic cues. Does the pitch contour suggest a finished thought or an unfinished one?
- Conversational context. Was this a question (likely awaiting answer) or a statement (likely complete)?
- Filler tokens. "Um", "uh", and trailing pauses often indicate the user is still composing the next thought.
- Hesitation patterns. Repeated stops, restarts, and stalls signal an incomplete turn.
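To make these cues concrete, here is a toy transcript-only scorer. It is not Inworld's model (real semantic VAD runs a trained language model over audio and text); it just checks a few of the surface signals listed above.

import re

FILLERS = {"um", "uh", "er", "hmm"}
TRAILING_INCOMPLETE = {"and", "but", "or", "so", "because", "to", "from", "the", "a", "my"}

def completeness_score(transcript: str) -> float:
    """Toy heuristic: 1.0 = likely complete turn, 0.0 = likely still talking."""
    words = re.findall(r"[a-z']+", transcript.lower())
    if not words:
        return 0.0
    score = 1.0
    if words[-1] in FILLERS:              # "I want to, um..."
        score -= 0.7
    if words[-1] in TRAILING_INCOMPLETE:  # "a flight from New York to..."
        score -= 0.7
    if transcript.rstrip().endswith(("...", ",")):
        score -= 0.3
    return max(score, 0.0)

completeness_score("I want to book a flight from New York to")   # low
completeness_score("I want to book a flight to San Francisco.")  # high

A production system combines signals like these with prosody and conversational state rather than transcript text alone.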
The Realtime API exposes semantic VAD through an eagerness parameter that lets engineering teams tune the trade-off:
| Setting | Behavior | When to use |
|---|---|---|
| auto | Server picks based on context | Default for most voice agents |
| low | Wait longer for clear completion | Coaching apps, language tutors, deliberate users |
| medium | Balanced | General-purpose conversational AI |
| high | Respond faster, accept some interruptions | Customer support, transactional agents |
Semantic VAD vs. Energy-Based VAD: When the Difference Matters
| Scenario | Energy VAD | Semantic VAD |
|---|---|---|
| User says "I want to book a flight from New York... [pause to think] ...to San Francisco" | Cuts off after "New York" because of the pause; agent responds to wrong intent | Recognizes the utterance is incomplete; waits; full intent captured |
| Caller says "Hi, my name is..." | Cuts off after the trailing "is..." | Waits for the name |
| User says "Yes." | Both detect it correctly | Both detect it correctly |
| Caller speaks with hesitation: "I... I think... I want to upgrade my account" | Cuts off at first pause | Recognizes hesitation as part of the same turn |
| Background noise during silence | May trigger false speech detection | Filters non-linguistic audio |
Energy VAD is acceptable for short, transactional voice (push-to-talk apps, keyword-triggered assistants). For real conversation, semantic VAD is the difference between an agent that listens and an agent that interrupts.
Semantic VAD in 2026: The Landscape
| Provider | Approach | Tunable | Integrated with TTS / LLM |
|---|---|---|---|
| Realtime API (Inworld) | Semantic VAD built into the speech pipeline; eagerness=auto/low/medium/high | Yes | Yes (with Realtime STT, Realtime Router, Realtime TTS) |
| Deepgram Flux | Voice agent model with built-in turn detection | Yes | Yes (Voice Agent API) |
| LiveKit Agents (EOU model) | End-of-Utterance ML models for turn detection | Yes | Component-level; pair with your TTS/LLM |
| Pipecat Smart Turn | Turn-taking model in the open-source Pipecat framework | Yes | Component-level |
| OpenAI Realtime API | Server-side VAD with semantic awareness | Limited tuning | Yes (OpenAI-only stack) |
All five approaches use language signals beyond raw energy. The differences are in tunability, integration depth, and how they behave under real telephony conditions.
Why Semantic VAD Belongs in the Speech Pipeline, Not the Client
Several BYO-orchestration frameworks treat VAD as a client-side concern (the browser or telephony layer detects silence and tells the server when to listen). This works for energy VAD but fails for semantic VAD because the client has no view of the full audio stream, the speaker profile, or the conversational state.
Inside the Realtime API, VAD runs server-side with full session context. STT acoustic signals (emotion, hesitation, speaker profile) flow into the VAD decision. The same context that informs LLM routing informs turn detection. This is what gives the Realtime API its turn-taking quality: the components reinforce each other.
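As an illustration of the architectural point, here is a hypothetical sketch of a server-side turn decision consuming shared session state. The type and field names are invented for this example; they are not the Realtime API's internals.

from dataclasses import dataclass

@dataclass
class SessionContext:
    """Hypothetical shared state a server-side pipeline maintains."""
    partial_transcript: str      # live STT output
    pitch_trending_down: bool    # prosodic cue from the audio stream
    speaker_pause_habit_ms: int  # learned per-speaker pause profile

def end_of_turn(ctx: SessionContext, silence_ms: int) -> bool:
    # A client-side detector sees only silence_ms; a server-side
    # detector can also weigh transcript, prosody, and speaker profile.
    if silence_ms < ctx.speaker_pause_habit_ms:
        return False  # this speaker habitually pauses longer
    words = ctx.partial_transcript.lower().split()
    looks_complete = bool(words) and words[-1] not in {"and", "to", "um"}
    return looks_complete and ctx.pitch_trending_down

The heuristics themselves are beside the point; what matters is that every input to the decision lives on the server, and a client-side silence detector has access to none of them.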
Code Example: Configuring Semantic VAD
import asyncio
import json
import websockets

URL = (
    "wss://api.inworld.ai/api/v1/realtime/session"
    "?key=<session-id>&protocol=realtime"
)

async def session():
    async with websockets.connect(
        URL,
        # websockets >= 14 renamed extra_headers to additional_headers
        extra_headers={"Authorization": "Basic <your-api-key>"}
    ) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "model": "gpt-5.5",
                "audio": {
                    "input": {
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "turn_detection": {
                            "type": "semantic_vad",
                            "eagerness": "auto",  # or low / medium / high
                            "create_response": True,
                            "interrupt_response": True
                        }
                    },
                    "output": {
                        "voice": "Sarah",
                        "model": "inworld-tts-1.5-mini",
                        "speed": 1.0
                    }
                }
            }
        }))
        # Stream audio in, receive responses out.
        # The server will fire "input_audio_buffer.speech_stopped"
        # only when semantic VAD declares the turn complete.

asyncio.run(session())
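Once the session is configured, the client needs a receive loop to act on the server's turn decisions. This sketch assumes only the event name quoted in the comment above (input_audio_buffer.speech_stopped); the rest of the handling is illustrative.

async def listen(ws):
    # Companion loop for session(): react when the server declares
    # the user's turn complete.
    async for message in ws:
        event = json.loads(message)
        if event.get("type") == "input_audio_buffer.speech_stopped":
            print("Semantic VAD closed the turn; agent response follows")
        # Audio deltas, transcripts, and other events are handled here.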
How to Choose an Eagerness Setting
- Companion apps and coaching: low. Users expect to be heard fully. Cutting them off destroys the relationship dynamic.
- General-purpose voice agents: auto or medium. Balanced behavior.
- Customer support voice agents: medium. Callers are usually direct; faster turn detection improves perceived responsiveness.
- Transactional agents (booking, lookup, IVR replacement): high. Callers expect short, fast exchanges. (A minimal mapping sketch follows this list.)
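If one codebase serves several agent types, the eagerness setting can come from configuration rather than being hard-coded. A minimal sketch; the agent-type keys are invented for illustration.

# Hypothetical mapping from agent type to eagerness, following the
# recommendations above.
EAGERNESS_BY_AGENT = {
    "companion": "low",
    "coaching": "low",
    "general": "auto",
    "support": "medium",
    "transactional": "high",
}

def turn_detection_config(agent_type: str) -> dict:
    """Build the turn_detection block sent in session.update."""
    return {
        "type": "semantic_vad",
        "eagerness": EAGERNESS_BY_AGENT.get(agent_type, "auto"),
        "create_response": True,
        "interrupt_response": True,
    }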
FAQ
What is semantic VAD?
Semantic Voice Activity Detection uses language understanding (linguistic completeness, prosody, hesitation patterns) to decide when a user has finished speaking, rather than purely audio-energy thresholds. It enables more natural turn-taking in voice agents.
How is semantic VAD different from energy-based VAD?
Energy-based VAD detects silence; semantic VAD detects when a turn is meaningfully complete. Energy VAD cuts users off when they pause to think; semantic VAD waits until the user is actually done. For real conversation, semantic VAD is the difference between an agent that listens and an agent that interrupts.
Does the Realtime API support semantic VAD?
Yes. Semantic VAD is built into the Realtime API and tunable through the eagerness parameter (auto, low, medium, high). It runs server-side with full session context including STT acoustic signals.
How does semantic VAD compare to Deepgram Flux or LiveKit EOU?
All three approaches use language signals beyond raw energy. The Realtime API's semantic VAD is integrated into the full speech pipeline alongside Realtime STT, the Realtime Router, and Realtime TTS, so the same session context informs turn detection, LLM selection, and voice output. Deepgram Flux is integrated with their Voice Agent API. LiveKit EOU and Pipecat Smart Turn are component-level models that pair with the TTS and LLM provider of your choice.
When should I tune eagerness lower?
For companion apps, language learning, coaching, and any use case where the user speaks deliberately or with hesitation. Lower eagerness reduces interruption rate at the cost of slightly slower response time.