By Kylan Gibbs, CEO and Co-founder, Inworld AI
Last updated: April 2026
Patient intake is a use case for Inworld AI's voice infrastructure, not a vertical we specialize in. Our primary focus is realtime voice AI for consumer-facing applications: companions, character chat, and roleplay. That said, the same voice AI stack that powers production consumer apps is suitable for HIPAA-aligned patient intake when deployed with the right compliance posture: Realtime TTS (#1 realtime TTS on the Artificial Analysis Realtime TTS Arena, May 2026), Realtime STT with voice profiling for triage signals, the Router for routing across 200+ LLMs, and the Realtime API for end-to-end voice pipelines. SOC 2 Type II, GDPR, on-premise deployment options, and BAAs available on enterprise contracts are part of the standard infrastructure offering.
This guide is for builders evaluating voice AI for patient intake workflows. It is not a turnkey HIPAA-as-a-service offering; HIPAA compliance is a property of your overall system (you, the covered entity, the BAA, your data flows). What we provide is the voice AI infrastructure layer that supports compliant deployment.
What Patient Intake Voice AI Looks Like in 2026
Patient intake voice AI typically handles five workflows:
- Pre-visit triage. Patient describes symptoms; system captures structured data and assigns urgency.
- Identity and insurance verification. Confirms patient identity, looks up coverage, captures policy changes.
- Medication and history review. Walks through current prescriptions, allergies, prior conditions.
- Appointment scheduling. Books, reschedules, or cancels appointments.
- Pre-procedure screening. Pre-surgical or pre-procedure checklists (fasting, medication holds, transportation).
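As a rough illustration of the "structured data" a pre-visit triage call might capture, here is a minimal sketch. The `TriageRecord` class and its field names are hypothetical; they mirror the items the intake prompt later in this guide asks the agent to gather (chief complaint, onset, severity on a 1-10 scale, history, medications, allergies) and are not an Inworld schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TriageRecord:
    """Hypothetical container for data gathered during a triage call."""
    chief_complaint: Optional[str] = None
    symptom_onset: Optional[str] = None
    severity: Optional[int] = None  # patient-reported, 1-10
    medical_history: List[str] = field(default_factory=list)
    medications: List[str] = field(default_factory=list)
    allergies: List[str] = field(default_factory=list)

    def set_severity(self, value: int) -> None:
        # Reject values outside the 1-10 scale the agent asks for.
        if not 1 <= value <= 10:
            raise ValueError("severity must be between 1 and 10")
        self.severity = value

    def missing_fields(self) -> List[str]:
        # Fields the agent still needs to ask about before wrapping up.
        required = ("chief_complaint", "symptom_onset", "severity")
        return [name for name in required if getattr(self, name) is None]


record = TriageRecord()
record.chief_complaint = "headache"
record.set_severity(6)
print(record.missing_fields())
```

A record like this gives the agent a simple way to decide which question to ask next, and gives downstream systems one object to validate before anything is written to the EHR.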
The voice layer must do three things well: sound empathetic, listen carefully (especially to elderly or distressed callers), and never lose the thread when the conversation goes off-script.
Why Voice Quality Matters in Healthcare
Patient-facing voice has a higher quality bar than voice agents in other industries. Three reasons:
- Sub-second response builds trust. A delayed or mechanical voice signals "this is a robot" and patients disengage. Realtime latency (sub-second time-to-first-audio) feels human.
- Empathetic prosody affects disclosure. Patients share more accurate symptom information when the voice sounds warm and unhurried. Realtime TTS-2 (research preview) supports natural-language steering across 8 dimensions (emotion, articulation, intonation, volume, pitch, range, speed, vocal style) for tuning conversational pacing rather than uniform reads.
- Voice profiling improves triage. Inside the Realtime API, Realtime STT extracts speaker profile signals (emotion, hesitation, pitch variation) that the Router can use to route to a more capable LLM when the patient sounds distressed, and that the TTS layer uses to adjust pacing on the response.
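To make the routing idea concrete, the sketch below shows the kind of decision logic involved, written as plain application code. It is not the Router's configuration format. The `VoiceProfile` fields, the emotion labels, the thresholds, and the `<budget-model-id>` placeholder are all assumptions for illustration, and the thresholds are not clinically validated.

```python
from dataclasses import dataclass

FRONTIER_MODEL = "anthropic/claude-opus-4-7"  # frontier model named in this guide
BUDGET_MODEL = "<budget-model-id>"            # placeholder, not a real model ID

# Hypothetical emotion labels treated as distress.
DISTRESS_EMOTIONS = {"distressed", "fearful", "in_pain"}


@dataclass
class VoiceProfile:
    """Hypothetical shape for per-turn voice profiling signals."""
    emotion: str
    hesitation: float       # assumed to be normalized to 0.0-1.0
    pitch_variation: float  # assumed to be normalized to 0.0-1.0


def select_model(
    profile: VoiceProfile,
    hesitation_threshold: float = 0.6,
    pitch_threshold: float = 0.7,
) -> str:
    """Pick the frontier model when the caller appears distressed."""
    acoustic_distress = (
        profile.hesitation >= hesitation_threshold
        and profile.pitch_variation >= pitch_threshold
    )
    if profile.emotion in DISTRESS_EMOTIONS or acoustic_distress:
        return FRONTIER_MODEL
    return BUDGET_MODEL


print(select_model(VoiceProfile("calm", 0.2, 0.3)))
print(select_model(VoiceProfile("distressed", 0.2, 0.3)))
```

Requiring both acoustic signals to cross their thresholds, rather than either one, is a deliberate choice in this sketch: hesitation alone is common among elderly callers and should not, by itself, be read as distress.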
Infrastructure Posture for HIPAA-Aligned Voice
Inworld AI infrastructure provides the building blocks for HIPAA-compliant voice AI, while the overall compliance posture is yours to maintain:
- SOC 2 Type II. Independent audit of security controls.
- GDPR. EU privacy compliance for international deployments.
- On-premise deployment. Realtime TTS, Realtime STT, and the Router are available for full on-premise deployment on customer-controlled hardware (H100, A100, or comparable GPUs).
- Zero data retention policy on TTS for production tiers.
- BAAs available on enterprise contracts. Required for HIPAA-covered entities.
For patient intake use cases, SOC 2 + GDPR + on-prem is the right pattern. HIPAA compliance is a function of your overall system; Inworld provides the infrastructure layer that supports it.
Architecture: A Patient Intake Voice Agent
A typical production architecture: patient phone audio enters through a SIP gateway (Twilio, Telnyx) and is bridged to the Realtime API over WebSocket. Inside the Realtime API, Realtime STT transcribes and emits voice profiling signals, the Router selects an LLM based on patient state, and Realtime TTS delivers the response back through the same WebSocket. Function calls made during the conversation handle EHR lookup, scheduling, and insurance verification. Session transcripts are encrypted and stored in a customer-controlled environment.
For on-premise deployment, the entire stack runs inside the customer's perimeter. No PHI leaves customer control.
Code Example: Triage Intake with Voice Profiling
import asyncio
import json

import websockets

# Replace <session-id> and <your-api-key> with the values for your deployment.
URL = (
    "wss://api.inworld.ai/api/v1/realtime/session"
    "?key=<session-id>&protocol=realtime"
)

INTAKE_INSTRUCTIONS = """
You are a patient intake assistant for a primary care practice.
Your job is to gather: chief complaint, symptom onset, severity (1-10),
relevant medical history, current medications, and allergies.
Speak warmly and unhurriedly. Confirm understanding by paraphrasing.
If the patient sounds distressed or describes a potentially urgent
condition, escalate to a human staff member immediately.
"""


async def patient_intake_session():
    # In websockets 14 and later, the default client takes
    # additional_headers instead of extra_headers.
    async with websockets.connect(
        URL,
        extra_headers={"Authorization": "Basic <your-api-key>"}
    ) as ws:
        # Configure the session: LLM, instructions, audio formats, and tools.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "model": "openai/gpt-5.5",
                "instructions": INTAKE_INSTRUCTIONS,
                "output_modalities": ["audio", "text"],
                "audio": {
                    "input": {
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "turn_detection": {
                            "type": "semantic_vad",
                            "eagerness": "low",  # patient-friendly: don't cut off
                            "create_response": True,
                            "interrupt_response": True
                        }
                    },
                    "output": {
                        "voice": "Sarah",
                        "model": "inworld-tts-1.5-mini",
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "speed": 1.0
                    }
                },
                # Functions the model may call during the conversation.
                "tools": [
                    {
                        "type": "function",
                        "name": "lookup_patient",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "dob": {"type": "string"}
                            }
                        }
                    },
                    {
                        "type": "function",
                        "name": "escalate_to_human",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "reason": {"type": "string"}
                            }
                        }
                    }
                ]
            }
        }))
        # Stream patient audio and tool calls


if __name__ == "__main__":
    asyncio.run(patient_intake_session())
For telephony deployment, bridge this to a SIP provider (Telnyx, Twilio) or use an orchestration partner (LiveKit, Vapi) as the transport layer.
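One practical detail when bridging telephony: phone audio commonly arrives as 8 kHz G.711 mu-law, while the session above is configured for 24 kHz PCM. The sketch below shows one way to convert between the two using only the standard library. It assumes mu-law input and 16-bit little-endian PCM output, both of which should be checked against your SIP provider and the Realtime API documentation. The linear-interpolation upsampler is deliberately crude; a production bridge would normally use a proper resampler with low-pass filtering.

```python
import struct
from typing import List


def ulaw_byte_to_pcm16(u: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~u & 0xFF                      # mu-law bytes are stored inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def upsample_3x(samples: List[int]) -> List[int]:
    """Triple the sample rate (8 kHz -> 24 kHz) by linear interpolation."""
    out: List[int] = []
    for i, a in enumerate(samples):
        # The last sample has no successor, so it is simply held.
        b = samples[i + 1] if i + 1 < len(samples) else a
        out.extend((a, a + (b - a) // 3, a + 2 * (b - a) // 3))
    return out


def ulaw_8k_to_pcm16_24k(payload: bytes) -> bytes:
    """Convert an 8 kHz mu-law frame to 24 kHz 16-bit little-endian PCM."""
    samples = upsample_3x([ulaw_byte_to_pcm16(b) for b in payload])
    return struct.pack(f"<{len(samples)}h", *samples)


# A 20 ms telephony frame is 160 mu-law bytes; 0xFF decodes to silence.
frame = bytes([0xFF] * 160)
print(len(ulaw_8k_to_pcm16_24k(frame)))
```

Audio flowing back to the caller needs the reverse path (downsample to 8 kHz, then mu-law encode). Orchestration partners such as LiveKit or Vapi typically handle both directions for you.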
What Good Patient-Voice Looks Like in Practice
Five practical guidelines for healthtech builders:
- Set eagerness=low. Patients (especially elderly or in pain) speak slowly and pause. Energy-based VAD or aggressive semantic VAD will cut them off.
- Use empathetic system instructions. Direct the model to confirm understanding by paraphrasing, to slow down when the patient sounds distressed, and to escalate immediately on red-flag symptoms.
- Keep voice consistent. Pin one voiceId for the session. Switching voices mid-call breaks the relationship.
- Route urgent cases to a more capable LLM. Use the Router with conditional routing on detected acoustic signals so distressed patients automatically route to a frontier model (anthropic/claude-opus-4-7 or openai/gpt-5.5) rather than a budget tier. The Router covers 200+ LLMs across both third-party providers and Inworld-hosted optimized open-source models.
- Always provide a human escalation path. Function calling makes this clean. The agent should be able to hand off mid-call without losing context.
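On the application side, the escalation path comes down to handling tool calls. Below is a minimal dispatcher sketch for the two tools declared in the code example (`lookup_patient` and `escalate_to_human`). The handler bodies are placeholders, and the assumption that arguments arrive as a JSON string and results go back as a JSON string is ours; check the Realtime API documentation for the actual event shapes.

```python
import json
from typing import Any, Callable, Dict


def lookup_patient(name: str, dob: str) -> Dict[str, Any]:
    # Placeholder: a real handler would query the EHR here.
    return {"found": False, "name": name, "dob": dob}


def escalate_to_human(reason: str) -> Dict[str, Any]:
    # Placeholder: a real handler would page staff and transfer the call.
    return {"escalated": True, "reason": reason}


HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "lookup_patient": lookup_patient,
    "escalate_to_human": escalate_to_human,
}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the named tool and return its result as a JSON string."""
    handler = HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json or "{}")
        result = handler(**args)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON, or arguments that do not match the handler.
        # The raw arguments are not echoed back, since they may contain PHI.
        return json.dumps({"error": f"invalid arguments for {name}"})
    return json.dumps(result)


print(dispatch_tool_call("escalate_to_human", '{"reason": "chest pain"}'))
```

Returning a structured error instead of raising keeps the call alive: the model can apologize, ask the patient to repeat the detail, or fall back to `escalate_to_human`.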
Where This Fits in Inworld's Focus
Inworld's primary focus is realtime voice AI for consumer-facing applications across companions, character chat, and roleplay. Patient intake is one of several enterprise use cases where the same infrastructure (Realtime TTS, Realtime STT with voice profiling, the Router, the Realtime API) applies, paired with the SOC 2 + GDPR + on-premise + BAA posture required for HIPAA-aligned deployments.
FAQ
Is Inworld AI HIPAA compliant?
HIPAA compliance is a property of your overall system, not a single vendor. Inworld AI provides the infrastructure layer (Realtime TTS, Realtime STT, Router, Realtime API) with SOC 2 Type II, GDPR, on-premise deployment options, and BAAs available on enterprise contracts. The overall HIPAA posture (covered entity status, your data flows, your access controls, your audit trail) is yours to maintain. For patient intake use cases, SOC 2 + GDPR + on-prem is the right pattern. Note: patient intake is a use case for our infrastructure, not a primary vertical focus.
Can I deploy patient intake voice AI on-premise?
Yes. Realtime TTS, Realtime STT, and the Router are available for full on-premise deployment on customer-controlled hardware (H100, A100, or comparable GPUs). On-prem deployment includes enterprise support and SLAs.
What voice should I use for patient intake?
Start with a warm, calm voice and tune the conversational pacing through SSML breaks. The default voice "Sarah" works for many use cases; the voice library includes options across demographics. For branded patient experiences, clone your healthcare team's voice or a licensed voice via instant voice cloning (5-15 seconds of audio).
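As a small illustration of pacing with SSML breaks, the helper below inserts a standard SSML `<break>` element between sentences. The function name and the 400 ms default are hypothetical, and whether Realtime TTS expects a `<speak>` wrapper, or which SSML subset it accepts, is an assumption to confirm in the TTS documentation.

```python
import re
from xml.sax.saxutils import escape


def add_sentence_breaks(text: str, pause_ms: int = 400) -> str:
    """Wrap text in <speak> and insert a pause between sentences."""
    # Split after sentence-ending punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [p.strip() for p in parts if p.strip()]
    pause = f'<break time="{pause_ms}ms"/>'
    # Escape &, <, > so patient-facing text cannot break the markup.
    body = f" {pause} ".join(escape(s) for s in sentences)
    return f"<speak>{body}</speak>"


print(add_sentence_breaks("Thank you. Take your time.", pause_ms=500))
```

Longer pauses after questions than after statements are a common refinement; start conservative and tune against recordings of real calls.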
How do I keep PHI out of vendor logs?
Two patterns: (1) on-premise deployment keeps all data in your environment; (2) cloud deployment with zero data retention plus a BAA. Discuss specifics with enterprise sales for your deployment.
How does voice profiling help with triage?
Inside the Realtime API, Realtime STT extracts acoustic signals (emotion, hesitation, pitch variation) alongside the transcript. These signals can route distressed patients to a more capable LLM and trigger escalation function calls. The voice profiling layer is what enables clinically-aware turn-taking and routing.