By Kylan Gibbs, CEO and Co-founder, Inworld AI
Last updated: April 2026
Voice AI for patient intake automates the structured information collection that traditionally consumes nurse and front-desk time: identity verification, symptom triage, medication review, insurance lookup, and pre-visit screening. Inworld AI provides the production voice infrastructure healthtech teams build on:
Realtime TTS (#1 on the Artificial Analysis Speech Arena, holding three of the top five spots), Realtime STT with voice profiling for triage signals, the Realtime Router for model selection, and the Realtime API for end-to-end voice pipelines. This is the infrastructure healthtech builders use to ship HIPAA-aligned voice products: SOC 2 Type II, GDPR, on-premise deployment options, and BAAs available on enterprise contracts.
This guide is for healthtech builders evaluating voice AI for patient intake. Inworld is not a turnkey HIPAA-as-a-service offering; HIPAA compliance is a property of your overall system (your covered-entity status, your BAAs, your data flows). What we provide is the voice AI infrastructure layer that supports compliant deployment.
What Patient Intake Voice AI Looks Like in 2026
Patient intake voice AI typically handles five workflows:
- Pre-visit triage. Patient describes symptoms; system captures structured data and assigns urgency.
- Identity and insurance verification. Confirms patient identity, looks up coverage, captures policy changes.
- Medication and history review. Walks through current prescriptions, allergies, prior conditions.
- Appointment scheduling. Books, reschedules, or cancels appointments.
- Pre-procedure screening. Pre-surgical or pre-procedure checklists (fasting, medication holds, transportation).
The voice layer must do three things well: sound empathetic, listen carefully (especially to elderly or distressed callers), and never lose the thread when the conversation goes off-script.
Why Voice Quality Matters in Healthcare
Patient-facing voice has a higher quality bar than voice agents in other industries. Three reasons:
- Sub-second response builds trust. A delayed or mechanical voice signals "this is a robot" and patients disengage. Sub-200ms time-to-first-audio feels human.
- Empathetic prosody affects disclosure. Patients share more accurate symptom information when the voice sounds warm and unhurried. Realtime TTS is engineered for natural conversational pacing rather than uniform reads.
- Voice profiling improves triage. Inside the Realtime API, Realtime STT extracts speaker profile signals (emotion, hesitation, pitch variation) that the Router uses to route to a more capable LLM when the patient sounds distressed, and that the TTS layer uses to adjust pacing on the response.
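As a sketch of what that routing decision could look like, the snippet below picks an LLM tier from per-turn voice-profile signals. The field names (`emotion`, `hesitation_ratio`, `pitch_variation`) and thresholds are illustrative assumptions, not the actual Realtime STT schema; consult the API documentation for the real payload shape.

```python
# Hypothetical routing on voice-profile signals. Field names and
# thresholds are illustrative, not the Realtime STT schema.

DISTRESS_EMOTIONS = {"fear", "sadness", "anger"}

def choose_model(profile: dict) -> str:
    """Pick an LLM tier from per-turn voice-profile signals."""
    distressed = (
        profile.get("emotion") in DISTRESS_EMOTIONS
        or profile.get("hesitation_ratio", 0.0) > 0.3
        or profile.get("pitch_variation", 0.0) > 0.8
    )
    # Distressed callers get a frontier model; routine turns stay on budget.
    return "frontier-llm" if distressed else "budget-llm"
```

In production this logic lives inside the Realtime Router's conditional routing rules rather than application code, but the shape of the decision is the same.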
Infrastructure Posture for HIPAA-Aligned Voice
Inworld AI infrastructure provides the building blocks for HIPAA-compliant voice AI, while the overall compliance posture is yours to maintain:
- SOC 2 Type II. Independent audit of security controls.
- GDPR. EU privacy compliance for international deployments.
- On-premise deployment. Realtime TTS, Realtime STT, and the Realtime Router are available for full on-premise deployment on customer-controlled hardware (H100, A100, or comparable GPUs).
- Zero data retention policy on TTS for production tiers.
- BAAs available on enterprise contracts. Required for HIPAA-covered entities.
For most healthtech use cases, SOC 2 + GDPR + on-prem is the right pattern. HIPAA compliance is a function of your overall system; we provide the infrastructure layer that supports it.
Architecture: A Patient Intake Voice Agent
The simplest production architecture:
[Patient phone call] --> Twilio/Telnyx SIP gateway -->
[WebSocket bridge] --> Realtime API (server-side)
    --> Realtime STT (with voice profiling for triage signals)
    --> Realtime Router (selects LLM based on patient state)
    --> Realtime TTS (delivers response)
    --> [Patient phone audio]

[Function calls during conversation] --> EHR lookup, scheduling, insurance verification
[Session log] --> Encrypted storage in customer-controlled environment
For on-premise deployment, the entire stack runs inside the customer's perimeter. No PHI leaves customer control.
Code Example: Triage Intake with Voice Profiling
```python
import asyncio
import json
import websockets

# Placeholders: substitute your own session ID and API key.
URL = (
    "wss://api.inworld.ai/api/v1/realtime/session"
    "?key=<session-id>&protocol=realtime"
)

INTAKE_INSTRUCTIONS = """
You are a patient intake assistant for a primary care practice.
Your job is to gather: chief complaint, symptom onset, severity (1-10),
relevant medical history, current medications, and allergies.
Speak warmly and unhurriedly. Confirm understanding by paraphrasing.
If the patient sounds distressed or describes a potentially urgent
condition, escalate to a human staff member immediately.
"""

async def patient_intake_session():
    async with websockets.connect(
        URL,
        extra_headers={"Authorization": "Basic <your-api-key>"},
    ) as ws:
        # Configure the session: model, instructions, audio formats, tools.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "model": "gpt-5.5",
                "instructions": INTAKE_INSTRUCTIONS,
                "output_modalities": ["audio", "text"],
                "audio": {
                    "input": {
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "turn_detection": {
                            "type": "semantic_vad",
                            "eagerness": "low",  # patient-friendly: don't cut off
                            "create_response": True,
                            "interrupt_response": True,
                        },
                    },
                    "output": {
                        "voice": "Sarah",
                        "model": "inworld-tts-1.5-mini",
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "speed": 1.0,
                    },
                },
                "tools": [
                    {
                        "type": "function",
                        "name": "lookup_patient",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "dob": {"type": "string"},
                            },
                        },
                    },
                    {
                        "type": "function",
                        "name": "escalate_to_human",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "reason": {"type": "string"},
                            },
                        },
                    },
                ],
            },
        }))
        # From here: stream patient audio in, play TTS audio out,
        # and dispatch tool calls (lookup_patient, escalate_to_human).

if __name__ == "__main__":
    asyncio.run(patient_intake_session())
```
For telephony deployment, bridge this to a SIP provider (Telnyx, Twilio) or use an orchestration partner (LiveKit, Vapi) as the transport layer.
What Good Patient-Voice Looks Like in Practice
Five practical guidelines for healthtech builders:
- Set eagerness=low. Patients (especially elderly or in pain) speak slowly and pause. Energy-based VAD or aggressive semantic VAD will cut them off.
- Use empathetic system instructions. Direct the model to confirm understanding by paraphrasing, to slow down when the patient sounds distressed, and to escalate immediately on red-flag symptoms.
- Keep voice consistent. Pin one voiceId for the session. Switching voices mid-call breaks the relationship.
- Route urgent cases to a more capable LLM. Use the Realtime Router with conditional routing on detected acoustic signals so distressed patients automatically route to a frontier model (Claude Opus 4.7 or GPT-5.5) rather than a budget tier.
- Always provide a human escalation path. Function calling makes this clean. The agent should be able to hand off mid-call without losing context.
Production Examples in Health and Wellness
Voice AI in health and wellness is an emerging category for the Inworld stack. The broader category includes companies like Wysa and Calm, whose products demand the same empathetic voice quality and compliance posture; the patterns described in this guide apply directly.
FAQ
Is Inworld AI HIPAA compliant?
HIPAA compliance is a property of your overall system, not a single vendor. Inworld AI provides the infrastructure layer (Realtime TTS, Realtime STT, Realtime Router, Realtime API) with SOC 2 Type II, GDPR, on-premise deployment options, and BAAs available on enterprise contracts. The overall HIPAA posture (covered entity status, your data flows, your access controls, your audit trail) is yours to maintain. For most healthtech use cases, SOC 2 + GDPR + on-prem is the right pattern.
Can I deploy patient intake voice AI on-premise?
Yes. Realtime TTS, Realtime STT, and the Realtime Router are available for full on-premise deployment on customer-controlled hardware (H100, A100, or comparable GPUs). On-prem deployment includes enterprise support and SLAs.
What voice should I use for patient intake?
Start with a warm, calm voice and tune the conversational pacing through SSML breaks. The default voice "Sarah" works for many use cases; the full library of 271+ voices includes options across demographics. For branded patient experiences, clone your healthcare team's voice or a licensed voice via instant voice cloning (5-15 seconds of audio).
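As a small sketch of pacing via SSML breaks: the helper below joins sentences with explicit pauses so the read feels unhurried. The `<break>` tag is standard SSML; confirm which tags Inworld's Realtime TTS accepts before relying on it.

```python
def paced(sentences: list[str], pause_ms: int = 400) -> str:
    """Join sentences with SSML-style breaks for an unhurried read."""
    brk = f'<break time="{pause_ms}ms"/>'
    return brk.join(sentences)
```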
How do I keep PHI out of vendor logs?
Two patterns: (1) on-premise deployment keeps all data in your environment; (2) cloud deployment with zero data retention plus a BAA. Discuss specifics with enterprise sales for your deployment.
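For either pattern, scrubbing obvious identifiers from transcripts before they reach any log sink adds a defensive layer. The sketch below is a best-effort regex redactor (patterns are illustrative and incomplete); it supplements, not replaces, on-prem deployment or a BAA.

```python
import re

# Illustrative patterns only; real PHI redaction needs a vetted library
# and review of your own data flows.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DOB": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched identifiers with bracketed labels before logging."""
    for label, pat in PATTERNS.items():
        text = pat.sub(f"[{label}]", text)
    return text
```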
How does voice profiling help with triage?
Inside the Realtime API, Realtime STT extracts acoustic signals (emotion, hesitation, pitch variation) alongside the transcript. These signals can route distressed patients to a more capable LLM and trigger escalation function calls. The voice profiling layer is what enables clinically-aware turn-taking and routing.