Multimodal Companion

A multimodal companion that can see your surroundings

Input

text, images, audio, video

Output

text, audio, images, video

Use Cases

Customer Support, AI Companion, Games, Fitness Trainer

Type

full-stack app, callable endpoint

SDK

Node.js


This service is a Node.js backend built on Inworld Runtime that powers:

  • Real‑time Speech‑to‑Text (STT) over WebSocket
  • Image+Text → LLM → TTS streaming ("ImageChat") over WebSocket
  • Optional HTTP test endpoints for quick local validation

Unity connects via HTTP to create a session token, then upgrades to WebSocket for interactive audio/text/image exchange.

Prerequisites

  • Node.js 20+
  • An Inworld AI account and API key

Get Started

Step 1: Clone the Repository

git clone https://github.com/inworld-ai/multimodal-companion-node
cd multimodal-companion-node

Step 2: Configure Environment Variables

Copy the environment template and edit the values:

cp .env-sample .env
# Edit .env and set INWORLD_API_KEY (base64("apiKey:apiSecret")) and VAD_MODEL_PATH. Optionally set ALLOW_TEST_CLIENT for local HTML testing.

Get your API key from the Inworld Portal.
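The INWORLD_API_KEY value is the base64 encoding of apiKey:apiSecret. In Node.js it can be produced like this (the credentials below are placeholders; substitute the values from the Inworld Portal):

```typescript
// Encode "apiKey:apiSecret" as base64 for INWORLD_API_KEY.
// Placeholder credentials; use the pair from the Inworld Portal.
const apiKey = "my-api-key";
const apiSecret = "my-api-secret";

const inworldApiKey = Buffer.from(`${apiKey}:${apiSecret}`).toString("base64");
console.log(inworldApiKey); // base64 of "my-api-key:my-api-secret"
```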

Step 3: Install & Run

npm install
npm run build
npm start

Server output (expected):

  • "VAD client initialized"
  • "STT Graph initialized"
  • "Server running on http://localhost:<PORT>"
  • "WebSocket available at ws://localhost:<PORT>/ws?key=<session_key>"

Repository Layout

multimodal-companion-node/
├── src/
│   ├── index.ts              # Express HTTP server, WebSocket upgrade, session/token issuance, auth checks
│   ├── message_handler.ts    # Parses client messages (TEXT, AUDIO, AUDIO_SESSION_END, IMAGE_CHAT)
│   ├── stt_graph.ts          # Builds a single, long-lived STT GraphExecutor used across the process
│   ├── auth.ts               # HMAC auth verification for HTTP/WS (compatible with Unity InworldAuth)
│   ├── constants.ts          # Defaults for audio sample rates, VAD thresholds, text generation config
│   ├── event_factory.ts
│   ├── helpers.ts
│   └── types.ts
├── examples/
│   ├── test-audio.html       # Local test page for audio streaming
│   └── test-image-chat.html  # Local test page for image chat
├── assets/
│   └── models/
│       └── silero_vad.onnx   # VAD model file used for voice activity detection
├── package.json
├── tsconfig.json
└── LICENSE

Configuration

Environment Variables

  • Required
    • INWORLD_API_KEY: base64("apiKey:apiSecret")
    • VAD_MODEL_PATH: local path to the VAD model (e.g., silero_vad.onnx)
  • Optional
    • PORT: server port (default 3000)
    • HTTP_CHAT_MAX_CONCURRENCY: throttle HTTP /chat concurrency (if enabled)
    • ALLOW_TEST_CLIENT: when set to true, enables GET /get_access_token to issue short‑lived { sessionKey, wsToken } for local HTML tests. Do NOT enable in production.

Example .env (you can start from .env-sample):

INWORLD_API_KEY=xxxxxx_base64_apiKey_colon_apiSecret
VAD_MODEL_PATH=assets/models/silero_vad.onnx
PORT=3000
# Enable local HTML test helper endpoint (development only)
# ALLOW_TEST_CLIENT=true
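A minimal sketch of how these variables might be read at startup. The variable names and defaults come from the table above; the loader itself is illustrative, not the server's actual code:

```typescript
// Illustrative config loader for the environment variables above.
// Defaults mirror the documented ones (PORT falls back to 3000).
interface ServerConfig {
  inworldApiKey: string;
  vadModelPath: string;
  port: number;
  allowTestClient: boolean;
}

function loadConfig(env: Record<string, string | undefined>): ServerConfig {
  const inworldApiKey = env.INWORLD_API_KEY;
  const vadModelPath = env.VAD_MODEL_PATH;
  if (!inworldApiKey) throw new Error("INWORLD_API_KEY is required");
  if (!vadModelPath) throw new Error("VAD_MODEL_PATH is required");
  return {
    inworldApiKey,
    vadModelPath,
    port: Number(env.PORT ?? 3000),
    allowTestClient: env.ALLOW_TEST_CLIENT === "true",
  };
}
```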

Local HTML Testing

For local development and testing without Unity, you can use the HTML test pages:

  1. Enable the test client endpoint: Set ALLOW_TEST_CLIENT=true in your .env file and restart the server.
  2. Access test pages:
    • http://localhost:<PORT>/test-audio - Stream microphone audio
    • http://localhost:<PORT>/test-image - Submit prompts with images
  3. How it works: The pages call GET /get_access_token to obtain { sessionKey, wsToken }, then connect to ws://host/ws?key=...&wsToken=....

Security note: ALLOW_TEST_CLIENT is for local development only. Do NOT enable in production. Tokens are short‑lived (5 minutes) and single‑use.

API Reference

Auth Model

  • HTTP endpoints require HMAC header Authorization: IW1-HMAC-SHA256 ... (see auth.ts), generated by Unity InworldAuth.
  • WebSocket:
    • Preferred: obtain a short-lived wsToken from POST /create-session and connect to /ws?key=<sessionKey>&wsToken=<token>
    • Fallback: full Authorization header on the upgrade request (not recommended for clients)
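The exact canonical string, key derivation, and header layout are defined in auth.ts and mirrored by Unity's InworldAuth. The sketch below only illustrates the generic HMAC-SHA256 shape with an assumed canonicalization; do not treat it as the real signing scheme:

```typescript
import { createHmac } from "node:crypto";

// Illustrative only: the real canonical string and header format live in
// auth.ts. This shows the generic HMAC-SHA256 pattern such schemes use.
function signRequest(apiSecret: string, canonicalString: string): string {
  return createHmac("sha256", apiSecret).update(canonicalString).digest("hex");
}

// A hypothetical canonical string; the server's actual one will differ.
const signature = signRequest("apiSecret", "POST\n/create-session\n1700000000");
```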

HTTP Endpoints (optional)

  • POST /create-session (protected)
    • Returns { sessionKey, wsToken }
    • wsToken is single-use, short-lived
  • POST /chat (protected, optional)
    • Accepts prompt and an optional image file (multipart)
    • Runs a one‑off LLM graph (non‑streaming) and returns { response }

WebSocket Flow

  1. Client calls POST /create-session (with HMAC auth) → { sessionKey, wsToken }
  2. Client connects: ws://host/ws?key=<sessionKey>&wsToken=<token>
  3. Client sends messages; server returns text/audio and INTERACTION_END packets
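The connect URL in step 2 can be assembled like this (buildWsUrl is a hypothetical helper; the host and credentials are placeholders):

```typescript
// Build the WebSocket URL from a session created via POST /create-session.
// URL/URLSearchParams handle the query-string encoding of key and wsToken.
function buildWsUrl(host: string, sessionKey: string, wsToken: string): string {
  const url = new URL(`ws://${host}/ws`);
  url.searchParams.set("key", sessionKey);
  url.searchParams.set("wsToken", wsToken);
  return url.toString();
}

const wsUrl = buildWsUrl("localhost:3000", "abc123", "tok456");
// → "ws://localhost:3000/ws?key=abc123&wsToken=tok456"
```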

Client → Server messages:

  • { type: "text", text: string }
  • { type: "audio", audio: number[][] } // streamed float32 chunks
  • { type: "audioSessionEnd" } // finalize the STT turn
  • { type: "imageChat", text: string, image: string, voiceId?: string } // image is data URL (base64)

Server → Client messages:

  • TEXT: { text: { text, final }, routing: { source: { isAgent|isUser, name } } }
  • AUDIO: { audio: { chunk: base64_wav } } (streamed for TTS)
  • INTERACTION_END: marks end of one turn / execution
  • ERROR: { error }
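The client-side payloads above can be modeled with a small union type. This is a sketch based on the shapes listed (field names are taken verbatim from the lists; JSON is assumed as the wire format):

```typescript
// Client → server message shapes, as listed above.
type ClientMessage =
  | { type: "text"; text: string }
  | { type: "audio"; audio: number[][] } // streamed float32 chunks
  | { type: "audioSessionEnd" }          // finalize the STT turn
  | { type: "imageChat"; text: string; image: string; voiceId?: string };

// Serialize a message for ws.send(); JSON is assumed as the wire format.
function encode(msg: ClientMessage): string {
  return JSON.stringify(msg);
}

const end = encode({ type: "audioSessionEnd" });
// → '{"type":"audioSessionEnd"}'
```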

Technical Details

Graphs & Executors

  • STT Graph
    • Constructed once at server startup (stt_graph.ts), reused for all STT requests
    • For each STT turn: executor.start(input, { executionId: v4() }) → read first result → closeExecution(executionResult.outputStream)
  • ImageChat Graph (LLM→TextChunking→TTS)
    • Per WebSocket connection: one shared executor reused across image+text turns
    • Rebuilt only if voiceId changes for that connection
    • For each ImageChat turn: executor.start(LLMChatRequest, { executionId: v4() }) → stream TTS chunks → closeExecution(executionResult.outputStream)
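The per-turn pattern above (start with a fresh executionId, read results, then always close) can be sketched generically. The interfaces below are stand-ins for the Inworld Runtime types, not its real API, and randomUUID stands in for the uuid v4() call:

```typescript
import { randomUUID } from "node:crypto";

// Stand-in interfaces; the real GraphExecutor comes from the Inworld Runtime.
interface ExecutionResult<T> { outputStream: T; }
interface Executor<I, T> {
  start(input: I, opts: { executionId: string }): Promise<ExecutionResult<T>>;
  closeExecution(stream: T): void;
}

// Run one turn: fresh executionId, read the output, and close in `finally`
// so the execution is released even when reading throws.
async function runTurn<I, T, R>(
  executor: Executor<I, T>,
  input: I,
  read: (stream: T) => Promise<R>,
): Promise<R> {
  const result = await executor.start(input, { executionId: randomUUID() });
  try {
    return await read(result.outputStream);
  } finally {
    executor.closeExecution(result.outputStream);
  }
}
```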

Model Provider & Config

  • LLM provider/model examples
    • OpenAI: { provider: 'openai', modelName: 'gpt-4o-mini', stream: true|false }
    • Google Gemini: { provider: 'google', modelName: 'gemini-2.5-flash-lite', stream: true }
  • Text generation (see constants.ts TEXT_CONFIG)
    • temperature, topP, maxNewTokens, penalties, etc.
    • Typical: higher temperature/topP → more diverse; lower → more deterministic
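A representative config object for these knobs. The field names follow the description of TEXT_CONFIG above; the specific values are illustrative defaults, not the repository's:

```typescript
// Illustrative text-generation settings; see constants.ts TEXT_CONFIG
// for the values the server actually uses.
const textConfig = {
  temperature: 0.7,    // higher → more diverse output
  topP: 0.9,           // nucleus-sampling cutoff
  maxNewTokens: 256,   // cap on generated tokens per turn
  frequencyPenalty: 0, // penalties discourage repetition
  presencePenalty: 0,
};
```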

Audio & VAD

  • Input sample rate: 16 kHz (Unity mic)
  • VAD: Silero VAD (ONNX) local inference for voice activity detection
  • STT turn starts on voiced audio and finalizes when pauses exceed a threshold (PAUSE_DURATION_THRESHOLD_MS)
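The turn-finalization rule (finalize once silence exceeds PAUSE_DURATION_THRESHOLD_MS) can be sketched as a check over frame timestamps; the threshold value here is an assumed placeholder, not the one in constants.ts:

```typescript
// Decide whether an STT turn should finalize, given the timestamp of the
// last voiced frame. The threshold is a placeholder value; the real one
// lives in constants.ts.
const PAUSE_DURATION_THRESHOLD_MS = 800;

function shouldFinalizeTurn(
  lastVoicedAtMs: number | null,
  nowMs: number,
): boolean {
  // No voiced audio yet → no turn in progress, nothing to finalize.
  if (lastVoicedAtMs === null) return false;
  return nowMs - lastVoicedAtMs > PAUSE_DURATION_THRESHOLD_MS;
}
```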

Concurrency & Resource Guidelines

  • STT executor is global and reused (fast first token)
  • ImageChat executor is per WebSocket connection and reused; serialize turns per connection
  • Optionally enforce small concurrency limits for HTTP /chat if enabled
  • Always closeExecution(executionResult.outputStream) after reading results

Deployment Tips (e.g., Railway)

  • Keep concurrency conservative (2–4) unless the plan allows more resources
  • Prefer long-lived shared executors with per‑turn start(...)/closeExecution(...)
  • Expect GOAWAY after long idle periods; allow light retry or lazy re‑init on next turn

Troubleshooting

  • No image update: Confirm Unity captures a fresh image before each imageChat send
  • Long STT delay: Verify VAD thresholds and that audioSessionEnd is sent after speech
  • Frequent GOAWAY on idle: Acceptable; ensure executions are closed and executors are reused
  • gRPC Deadline Exceeded: Single execution timed out; treat as recoverable (retry once)
  • HTTP/2 GOAWAY ENHANCE_YOUR_CALM (too_many_pings): Server throttling of idle keepalives; treat as recoverable, rebuild channel/executor on next use
  • WebSocket "closed without close handshake": Usually process restart/crash or proxy idle-kill; implement client auto‑reconnect (backoff)
  • "Your graph is not registered": Informational warning about remote variants; safe to ignore unless you explicitly use registry-managed graphs
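A client auto-reconnect with exponential backoff (as suggested for the close-without-handshake case) might compute its delays like this; the base and cap values are illustrative:

```typescript
// Exponential backoff with a cap: 500 ms, 1 s, 2 s, ... up to 30 s.
// Values are illustrative; tune them for your deployment.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// attempt 0 → 500, attempt 3 → 4000, attempt 10 → 30000 (capped)
```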

Bug Reports: GitHub Issues

General Questions: For general inquiries and support, please email us at support@inworld.ai

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to contribute to this project.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Copyright © 2021-2025 Inworld AI