
This application demonstrates a simple chat interface with an AI agent that can respond to text and voice inputs, powered by Inworld AI Runtime.
```shell
git clone https://github.com/inworld-ai/voice-agent-node
cd voice-agent-node
```
Copy `server/.env-sample` to `server/.env` and fill in all required variables. Optional variables can be left empty; in that case, default values will be used.
Get your API key from the Inworld Portal.
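For reference, a filled-in `server/.env` might look like the following. The variable names are the ones required by this guide; the values are placeholders you replace with your own keys:

```
INWORLD_API_KEY=your-inworld-api-key
ASSEMBLY_AI_API_KEY=your-assembly-ai-api-key
```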
The client supports optional environment variables to customize its behavior. Create a .env file in the client directory if you want to override defaults:
- `VITE_ENABLE_LATENCY_REPORTING` - Set to `true` to enable latency reporting in the UI (shows a latency chart and latency badges on agent messages). Default: `false`
- `VITE_APP_PORT` - Server port to connect to. Default: `4000`
- `VITE_APP_LOAD_URL` - Custom load endpoint URL
- `VITE_APP_UNLOAD_URL` - Custom unload endpoint URL
- `VITE_APP_SESSION_URL` - Custom session WebSocket URL

Install dependencies for both server and client:
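For example, a `client/.env` that turns on latency reporting and keeps the default server port might look like this (values are illustrative):

```
VITE_ENABLE_LATENCY_REPORTING=true
VITE_APP_PORT=4000
```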
```shell
# Install server dependencies
cd server
npm install

# Start the server
npm start
```
The server will start on port 4000.
```shell
# Install client dependencies
cd ../client
npm install

# Start the client
npm start
```
The client will start on port 3000 and should automatically open in your default browser. If port 3000 is already in use, the next available port will be used instead.
To use the app:
1. Define the agent settings.
2. Interact with the agent via text or voice.
```
voice-agent-node/
├── server/                    # Backend handling Inworld's LLM, STT, and TTS services
│   ├── components/
│   │   ├── graph.ts           # Main graph-based pipeline orchestration
│   │   ├── stt_graph.ts       # Speech-to-text graph configuration
│   │   ├── message_handler.ts # WebSocket message handling
│   │   ├── audio_handler.ts   # Audio stream processing
│   │   └── nodes/             # Graph node implementations (STT, LLM, TTS processing)
│   ├── models/
│   │   └── silero_vad.onnx    # VAD model for voice activity detection
│   ├── index.ts               # Server entry point
│   ├── package.json
│   └── tsconfig.json
├── client/                    # Frontend React application
│   ├── src/
│   │   ├── app/               # UI components (chat, configuration, shared components)
│   │   ├── App.tsx
│   │   └── index.tsx
│   ├── public/
│   ├── package.json
│   └── vite.config.mts
├── constants.ts
└── LICENSE
```
The voice agent server uses Inworld's Graph Framework with two main processing pipelines:
```mermaid
flowchart TB
    subgraph AUDIO["AUDIO INPUT PIPELINE"]
        AudioInput[AudioInput]
        subgraph OPT1["Assembly.AI STT Pipeline"]
            AssemblyAI[AssemblyAI STT]
            TranscriptExtractor[TranscriptExtractor]
            SpeechNotif1[SpeechCompleteNotifier<br/>terminal node]
            AssemblyAI -->|interaction_complete| TranscriptExtractor
            AssemblyAI -->|interaction_complete| SpeechNotif1
            AssemblyAI -->|stream_exhausted=false<br/>loop| AssemblyAI
        end
        AudioInput --> OPT1
        TranscriptExtractor --> InteractionQueue
    end
    subgraph TEXT["TEXT PROCESSING & TTS PIPELINE"]
        TextInput[TextInput]
        DialogPrompt[DialogPromptBuilder]
        LLM[LLM]
        TextChunk[TextChunking]
        TextAgg[TextAggregator]
        TTS[TTS<br/>end]
        StateUpdate[StateUpdate]
        TextInput --> DialogPrompt
        DialogPrompt --> LLM
        LLM --> TextChunk
        LLM --> TextAgg
        TextChunk --> TTS
        TextAgg --> StateUpdate
        StateUpdate -.->|loop optional| InteractionQueue
    end
    InteractionQueue -->|text.length>0| TextInput
    style SpeechNotif1 fill:#f9f,stroke:#333,stroke-width:2px
    style TTS fill:#9f9,stroke:#333,stroke-width:2px
```
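The data flow of the text pipeline can be sketched as plain functions. This is an illustrative stand-in, not the actual Graph Framework node implementations: the names `dialogPromptBuilder` and `textChunking`, and the echo-style LLM stub, are hypothetical, and the real nodes stream data through Inworld's services rather than calling synchronous functions.

```typescript
// Hypothetical sketch of the TextInput -> DialogPromptBuilder -> LLM ->
// TextChunking stages from the diagram above.
type Node<I, O> = (input: I) => O;

// DialogPromptBuilder: wraps the user's text into a dialog prompt.
const dialogPromptBuilder: Node<string, string> = (userText) =>
  `User: ${userText}\nAgent:`;

// Stub LLM that echoes the user; the real node streams tokens from
// Inworld's LLM service.
const llm: Node<string, string> = (prompt) =>
  `You said: ${prompt.split('\n')[0].replace('User: ', '')}`;

// TextChunking: splits LLM output into sentence-sized chunks for TTS.
const textChunking: Node<string, string[]> = (text) =>
  text.split(/(?<=[.!?])\s+/).filter((chunk) => chunk.length > 0);

export function runTextPipeline(userText: string): string[] {
  return textChunking(llm(dialogPromptBuilder(userText)));
}
```

In the real graph, each stage is a node wired by edges (as drawn above), so chunks can flow to TTS while the aggregator updates dialog state in parallel.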
The server uses Assembly.AI as its speech-to-text provider, which offers high accuracy with built-in speech segmentation.
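As an illustration of the segmentation idea (not the repository's actual `TranscriptExtractor` code), a step downstream of the STT stream might keep only the finalized transcript segments and drop in-progress partials; the `isFinal` field name here is a hypothetical stand-in for whatever the STT events actually carry:

```typescript
// A finalized-or-partial transcript event, as a simplified illustration.
interface TranscriptEvent {
  text: string;
  isFinal: boolean; // hypothetical field name for illustration
}

// Keep only final segments and join them into the completed utterance.
export function extractFinalTranscript(events: TranscriptEvent[]): string {
  return events
    .filter((e) => e.isFinal)
    .map((e) => e.text.trim())
    .filter((t) => t.length > 0)
    .join(' ');
}
```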
Required variables in the server's `.env` file:

- `INWORLD_API_KEY` - Required for Inworld services
- `ASSEMBLY_AI_API_KEY` - Required for speech-to-text functionality

Bug Reports: GitHub Issues
General Questions: For general inquiries and support, please email us at support@inworld.ai
```shell
git checkout -b feature/amazing-feature
git commit -m 'Add amazing feature'
git push origin feature/amazing-feature
```

This project is licensed under the MIT License - see the LICENSE file for details.