By Kylan Gibbs, CEO and Co-founder, Inworld AI
Last updated: April 2026
Semantic VAD (Voice Activity Detection) is a turn-taking technique that uses language understanding (not just audio energy) to decide when a speaker has finished a turn. Inworld AI's Realtime API ships with semantic VAD built in: rather than waiting for a fixed silence threshold, the system listens for natural turn boundaries using meaning, prosody, and conversational rhythm. Energy-based VAD cuts users off mid-sentence when they pause to think; semantic VAD waits until the user is actually done. The difference shows up immediately in voice agent quality: callers feel heard rather than interrupted.
This guide explains how semantic VAD works, when to use it, and how it compares with the alternatives shipping in 2026 (Deepgram Flux, LiveKit's End-of-Utterance models, Pipecat Smart Turn).
How Semantic VAD Works
Traditional VAD measures audio energy. If the energy stays below a threshold for N milliseconds, the system declares the user has stopped speaking. This is fast and simple, but it cuts users off whenever they pause to think, breathe, or hesitate.
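For contrast, here is a minimal sketch of the energy-based approach: an RMS threshold plus a silence timer. The threshold, frame size, and timeout values are illustrative, not taken from any particular product.

import math
import struct

FRAME_MS = 20                # 20 ms frames of 16 kHz mono, 16-bit PCM
SAMPLE_RATE = 16_000
ENERGY_THRESHOLD = 500.0     # illustrative RMS cutoff
SILENCE_TIMEOUT_MS = 600     # "N milliseconds" of quiet = end of turn

def rms(frame: bytes) -> float:
    """Root-mean-square energy of one 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def turn_ended(frames) -> bool:
    """True once silence persists past the timeout, however mid-thought."""
    silent_ms = 0
    for frame in frames:
        silent_ms = silent_ms + FRAME_MS if rms(frame) < ENERGY_THRESHOLD else 0
        if silent_ms >= SILENCE_TIMEOUT_MS:
            return True
    return False

Any pause longer than the timeout ends the turn whether or not the sentence was finished, which is exactly the failure mode semantic VAD fixes.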
Semantic VAD uses a language model to evaluate whether the user's utterance is complete (a toy scoring sketch follows the list). Inputs to the decision include:
- Linguistic completeness. Has the user finished a thought (full sentence, complete phrase, syntactically resolved)?
- Prosodic cues. Does the pitch contour suggest a finished thought or an unfinished one?
- Conversational context. Was this a question (likely awaiting answer) or a statement (likely complete)?
- Filler tokens. "Um", "uh", and trailing pauses often indicate the user is still composing the next thought.
- Hesitation patterns. Repeated stops, restarts, and stalls signal an incomplete turn.
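To make these cues concrete, here is a toy transcript-only scorer. It is not Inworld's model (real semantic VAD runs a trained language model over audio and text); it just checks a few of the surface signals listed above.

import re

FILLERS = {"um", "uh", "er", "hmm"}
TRAILING_INCOMPLETE = {"and", "but", "or", "so", "because", "to", "from", "the", "a", "my"}

def completeness_score(transcript: str) -> float:
    """Toy heuristic: 1.0 = likely complete turn, 0.0 = likely still talking."""
    words = re.findall(r"[a-z']+", transcript.lower())
    if not words:
        return 0.0
    score = 1.0
    if words[-1] in FILLERS:              # "I want to, um..."
        score -= 0.7
    if words[-1] in TRAILING_INCOMPLETE:  # "a flight from New York to..."
        score -= 0.7
    if transcript.rstrip().endswith(("...", ",")):
        score -= 0.3
    return max(score, 0.0)

completeness_score("I want to book a flight from New York to")   # low
completeness_score("I want to book a flight to San Francisco.")  # high

A production system combines signals like these with prosody and conversational state rather than transcript text alone.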
The Realtime API exposes semantic VAD through an eagerness parameter that lets engineering teams tune the trade-off:
| Setting | Behavior | When to use |
|---|---|---|
| auto | Server picks based on context | Default for most voice agents |
| low | Wait longer for clear completion | Coaching apps, language tutors, deliberate users |
| medium | Balanced | General-purpose conversational AI |
| high | Respond faster, accept some interruptions | Customer support, transactional agents |
Semantic VAD vs. Energy-Based VAD: When the Difference Matters
| Scenario | Energy VAD | Semantic VAD |
|---|---|---|
| User says "I want to book a flight from New York... [pause to think] ...to San Francisco" | Cuts off after "New York" because of the pause; agent responds to wrong intent | Recognizes the utterance is incomplete; waits; full intent captured |
| Caller says "Hi, my name is..." | Cuts off after the trailing "is..." | Waits for the name |
| User says "Yes." | Both detect it correctly | Both detect it correctly |
| Caller speaks with hesitation: "I... I think... I want to upgrade my account" | Cuts off at first pause | Recognizes hesitation as part of the same turn |
| Background noise during silence | May trigger false speech detection | Filters non-linguistic audio |
Energy VAD is acceptable for short, transactional voice (push-to-talk apps, keyword-triggered assistants). For real conversation, semantic VAD is the difference between an agent that listens and an agent that interrupts.
Semantic VAD in 2026: The Landscape
| Provider | Approach | Tunable | Integrated with TTS / LLM |
|---|---|---|---|
| Realtime API (Inworld) | Semantic VAD built into the speech pipeline; eagerness=auto/low/medium/high | Yes | Yes (with Realtime STT, Realtime Router, Realtime TTS) |
| Deepgram Flux | Voice agent model with built-in turn detection | Yes | Yes (Voice Agent API) |
| LiveKit Agents (EOU model) | End-of-Utterance ML models for turn detection | Yes | Component-level; pair with your TTS/LLM |
| Pipecat Smart Turn | Turn-taking model in the open-source Pipecat framework | Yes | Component-level |
| OpenAI Realtime API | Server-side VAD with semantic awareness | Limited tuning | Yes (OpenAI-only stack) |
All five approaches use language signals beyond raw energy. The differences are in tunability, integration depth, and how they behave under real telephony conditions.
Why Semantic VAD Belongs in the Speech Pipeline, Not the Client
Several BYO-orchestration frameworks treat VAD as a client-side concern (the browser or telephony layer detects silence and tells the server when to listen). This works for energy VAD but fails for semantic VAD because the client has no view of the full audio stream, the speaker profile, or the conversational state.
Inside the Realtime API, VAD runs server-side with full session context. STT acoustic signals (emotion, hesitation, speaker profile) flow into the VAD decision. The same context that informs LLM routing informs turn detection. This is what gives the Realtime API its turn-taking quality: the components reinforce each other.
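As an illustration of the architectural point, here is a hypothetical sketch of a server-side turn decision consuming shared session state. The type and field names are invented for this example; they are not the Realtime API's internals.

from dataclasses import dataclass

@dataclass
class SessionContext:
    """Hypothetical shared state a server-side pipeline maintains."""
    partial_transcript: str      # live STT output
    pitch_trending_down: bool    # prosodic cue from the audio stream
    speaker_pause_habit_ms: int  # learned per-speaker pause profile

def end_of_turn(ctx: SessionContext, silence_ms: int) -> bool:
    # A client-side detector sees only silence_ms; a server-side
    # detector can also weigh transcript, prosody, and speaker profile.
    if silence_ms < ctx.speaker_pause_habit_ms:
        return False  # this speaker habitually pauses longer
    words = ctx.partial_transcript.lower().split()
    looks_complete = bool(words) and words[-1] not in {"and", "to", "um"}
    return looks_complete and ctx.pitch_trending_down

The heuristics themselves are beside the point; what matters is that every input to the decision lives on the server, and a client-side silence detector has access to none of them.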
Code Example: Configuring Semantic VAD
import asyncio
import json
import websockets

URL = (
    "wss://api.inworld.ai/api/v1/realtime/session"
    "?key=<session-id>&protocol=realtime"
)

async def session():
    async with websockets.connect(
        URL,
        # websockets >= 14 renamed extra_headers to additional_headers
        extra_headers={"Authorization": "Basic <your-api-key>"}
    ) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "model": "gpt-5.5",
                "audio": {
                    "input": {
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "turn_detection": {
                            "type": "semantic_vad",
                            "eagerness": "auto",  # or low / medium / high
                            "create_response": True,
                            "interrupt_response": True
                        }
                    },
                    "output": {
                        "voice": "Sarah",
                        "model": "inworld-tts-1.5-mini",
                        "speed": 1.0
                    }
                }
            }
        }))
        # Stream audio in, receive responses out.
        # The server will fire "input_audio_buffer.speech_stopped"
        # only when semantic VAD declares the turn complete.

asyncio.run(session())
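Once the session is configured, the client needs a receive loop to act on the server's turn decisions. This sketch assumes only the event name quoted in the comment above (input_audio_buffer.speech_stopped); the rest of the handling is illustrative.

async def listen(ws):
    # Companion loop for session(): react when the server declares
    # the user's turn complete.
    async for message in ws:
        event = json.loads(message)
        if event.get("type") == "input_audio_buffer.speech_stopped":
            print("Semantic VAD closed the turn; agent response follows")
        # Audio deltas, transcripts, and other events are handled here.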
How to Choose an Eagerness Setting
- Companion apps and coaching: low. Users expect to be heard fully. Cutting them off destroys the relationship dynamic.
- General-purpose voice agents: auto or medium. Balanced behavior.
- Customer support voice agents: medium. Callers are usually direct; faster turn detection improves perceived responsiveness.
- Transactional agents (booking, lookup, IVR replacement): high. Callers expect short, fast exchanges. (A minimal mapping sketch follows this list.)
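If one codebase serves several agent types, the eagerness setting can come from configuration rather than being hard-coded. A minimal sketch; the agent-type keys are invented for illustration.

# Hypothetical mapping from agent type to eagerness, following the
# recommendations above.
EAGERNESS_BY_AGENT = {
    "companion": "low",
    "coaching": "low",
    "general": "auto",
    "support": "medium",
    "transactional": "high",
}

def turn_detection_config(agent_type: str) -> dict:
    """Build the turn_detection block sent in session.update."""
    return {
        "type": "semantic_vad",
        "eagerness": EAGERNESS_BY_AGENT.get(agent_type, "auto"),
        "create_response": True,
        "interrupt_response": True,
    }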
FAQ
What is semantic VAD?
Semantic Voice Activity Detection uses language understanding (linguistic completeness, prosody, hesitation patterns) to decide when a user has finished speaking, rather than purely audio-energy thresholds. It enables more natural turn-taking in voice agents.
How is semantic VAD different from energy-based VAD?
Energy-based VAD detects silence; semantic VAD detects when a turn is meaningfully complete. Energy VAD cuts users off when they pause to think; semantic VAD waits until the user is actually done. For real conversation, semantic VAD is the difference between an agent that listens and an agent that interrupts.
Does the Realtime API support semantic VAD?
Yes. Semantic VAD is built into the Realtime API and tunable through the eagerness parameter (auto, low, medium, high). It runs server-side with full session context including STT acoustic signals.
How does semantic VAD compare to Deepgram Flux or LiveKit EOU?
All three approaches use language signals beyond raw energy. The Realtime API's semantic VAD is integrated into the full speech pipeline alongside Realtime STT, the Realtime Router, and Realtime TTS, so the same session context informs turn detection, LLM selection, and voice output. Deepgram Flux is integrated with their Voice Agent API. LiveKit EOU and Pipecat Smart Turn are component-level models that pair with the TTS and LLM provider of your choice.
When should I tune eagerness lower?
For companion apps, language learning, coaching, and any use case where the user speaks deliberately or with hesitation. Lower eagerness reduces interruption rate at the cost of slightly slower response time.