
This service is a Node.js backend built on the Inworld Runtime. Unity clients connect via HTTP to create a session token, then upgrade to WebSocket for interactive audio/text/image exchange.
git clone https://github.com/inworld-ai/multimodal-companion-node
cd multimodal-companion-node
Copy env template and edit values:
cp .env-sample .env
# Edit .env and set INWORLD_API_KEY (base64("apiKey:apiSecret")) and VAD_MODEL_PATH. Optionally set ALLOW_TEST_CLIENT for local HTML testing.
Get your API key from the Inworld Portal.
npm install
npm run build
npm start
Server output (expected):
"Server running on http://localhost:<PORT>""WebSocket available at ws://localhost:<PORT>/ws?key=<session_key>"multimodal-companion-node/
├── src/
│   ├── index.ts             # Express HTTP server, WebSocket upgrade, session/token issuance, auth checks
│   ├── message_handler.ts   # Parses client messages (TEXT, AUDIO, AUDIO_SESSION_END, IMAGE_CHAT)
│   ├── stt_graph.ts         # Builds a single, long-lived STT GraphExecutor used across the process
│   ├── auth.ts              # HMAC auth verification for HTTP/WS (compatible with Unity InworldAuth)
│   ├── constants.ts         # Defaults for audio sample rates, VAD thresholds, text generation config
│   ├── event_factory.ts
│   ├── helpers.ts
│   └── types.ts
├── examples/
│   ├── test-audio.html      # Local test page for audio streaming
│   └── test-image-chat.html # Local test page for image chat
├── assets/
│   └── models/
│       └── silero_vad.onnx  # VAD model file used for voice activity detection
├── package.json
├── tsconfig.json
└── LICENSE
Environment variables:

- INWORLD_API_KEY: base64("apiKey:apiSecret")
- VAD_MODEL_PATH: local path to the VAD model (e.g., silero_vad.onnx)
- PORT: server port (default 3000)
- HTTP_CHAT_MAX_CONCURRENCY: throttle HTTP /chat concurrency (if enabled)
- ALLOW_TEST_CLIENT: when set to true, enables GET /get_access_token to issue short-lived { sessionKey, wsToken } for local HTML tests. Do NOT enable in production.

Example .env (you can start from .env-sample):
INWORLD_API_KEY=xxxxxx_base64_apiKey_colon_apiSecret
VAD_MODEL_PATH=assets/models/silero_vad.onnx
PORT=3000
# Enable local HTML test helper endpoint (development only)
# ALLOW_TEST_CLIENT=true
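
INWORLD_API_KEY is the base64 encoding of apiKey:apiSecret from the Inworld Portal. As a quick sketch (the script and variable names below are illustrative, not part of this repo), the value can be produced in Node like so:

```ts
// build-key.ts — hypothetical helper for composing INWORLD_API_KEY.
// apiKey and apiSecret are placeholders; use the credentials from the Inworld Portal.
const apiKey = process.env.INWORLD_PORTAL_API_KEY ?? 'your-api-key';
const apiSecret = process.env.INWORLD_PORTAL_API_SECRET ?? 'your-api-secret';

// INWORLD_API_KEY must be base64("apiKey:apiSecret")
const inworldApiKey = Buffer.from(`${apiKey}:${apiSecret}`).toString('base64');
console.log(`INWORLD_API_KEY=${inworldApiKey}`);
```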
For local development and testing without Unity, you can use the HTML test pages:
1. Set ALLOW_TEST_CLIENT=true in your .env file and restart the server.
2. Open http://localhost:<PORT>/test-audio to stream microphone audio, or http://localhost:<PORT>/test-image to submit prompts with images.
3. The pages call GET /get_access_token to obtain { sessionKey, wsToken }, then connect to ws://host/ws?key=...&wsToken=....

Security note: ALLOW_TEST_CLIENT is for local development only. Do NOT enable in production. Tokens are short-lived (5 minutes) and single-use.
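
The same dev-only flow the test pages use can be scripted by hand. The sketch below assumes ALLOW_TEST_CLIENT=true and a server on localhost:3000, and relies only on the { sessionKey, wsToken } shape and the /ws query parameters described above:

```ts
// Dev-only sketch: obtain a short-lived test token and open the WebSocket.
// Assumes ALLOW_TEST_CLIENT=true and PORT=3000; field names follow { sessionKey, wsToken }.
const base = 'http://localhost:3000';

const res = await fetch(`${base}/get_access_token`);
const { sessionKey, wsToken } = await res.json() as { sessionKey: string; wsToken: string };

// The token is single-use and expires after ~5 minutes, so connect right away.
const ws = new WebSocket(`ws://localhost:3000/ws?key=${sessionKey}&wsToken=${wsToken}`);
ws.onopen = () => console.log('connected');
ws.onmessage = (event) => console.log('packet:', event.data);
```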
Authentication:

- HTTP requests are protected with an Authorization: IW1-HMAC-SHA256 ... header (see auth.ts), generated by Unity InworldAuth.
- WebSocket connections authenticate with a wsToken from POST /create-session, connecting to /ws?key=<sessionKey>&wsToken=<token>. Sending the Authorization header on the upgrade request also works but is not recommended for clients.

HTTP endpoints:

- POST /create-session (protected): returns { sessionKey, wsToken }. The wsToken is single-use and short-lived.
- POST /chat (protected, optional): accepts a prompt and an optional image file (multipart); returns { response }. See the example after the connection flow below.

WebSocket connection flow:

1. POST /create-session (with HMAC auth) → { sessionKey, wsToken }
2. Connect to ws://host/ws?key=<sessionKey>&wsToken=<token>
3. Exchange messages; the end of each turn is signaled by INTERACTION_END packets.
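
As a sketch of the optional POST /chat endpoint: the multipart field names (prompt, image) and the Authorization header value below are assumptions for illustration; the real signature comes from an InworldAuth-compatible signer (see auth.ts).

```ts
// Sketch: call the optional POST /chat endpoint with a prompt and an image.
// The multipart field names ('prompt', 'image') and the auth header value are
// assumptions for illustration; see auth.ts for the real IW1-HMAC-SHA256 scheme.
import { readFile } from 'node:fs/promises';

const form = new FormData();
form.append('prompt', 'What is in this picture?');
form.append('image', new Blob([await readFile('photo.jpg')], { type: 'image/jpeg' }), 'photo.jpg');

const res = await fetch('http://localhost:3000/chat', {
  method: 'POST',
  headers: { Authorization: 'IW1-HMAC-SHA256 <signature from your InworldAuth-compatible signer>' },
  body: form,
});
const { response } = await res.json() as { response: string };
console.log(response);
```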
{ type: "text", text: string }{ type: "audio", audio: number[][] } // streamed float32 chunks{ type: "audioSessionEnd" } // finalize the STT turn{ type: "imageChat", text: string, image: string, voiceId?: string } // image is data URL (base64)Server → Client messages:
Graph execution:

- A single, long-lived STT GraphExecutor is built once (stt_graph.ts) and reused for all STT requests.
- Each STT request runs executor.start(input, { executionId: v4() }) → read the first result → closeExecution(executionResult.outputStream).
- The per-connection LLM + TTS executor is rebuilt when the voiceId changes for that connection.
- Each chat turn runs executor.start(LLMChatRequest, { executionId: v4() }) → stream TTS chunks → closeExecution(executionResult.outputStream).

Model configuration:

- OpenAI: { provider: 'openai', modelName: 'gpt-4o-mini', stream: true|false }
- Google: { provider: 'google', modelName: 'gemini-2.5-flash-lite', stream: true }

Text generation defaults live in constants.ts → TEXT_CONFIG: temperature, topP, maxNewTokens, penalties, etc. Higher temperature/topP → more diverse output; lower → more deterministic. The VAD pause threshold is set by PAUSE_DURATION_THRESHOLD_MS, and HTTP_CHAT_MAX_CONCURRENCY throttles /chat if enabled.

Troubleshooting:

- Always call closeExecution(executionResult.outputStream) after reading results; pair every start(...) with a closeExecution(...).
- Images must be base64 data URLs in the imageChat send.
- Make sure audioSessionEnd is sent after speech so the STT turn is finalized.

Error handling:

- Deadline Exceeded: a single execution timed out; treat as recoverable (retry once).
- GOAWAY ENHANCE_YOUR_CALM (too_many_pings): server throttling of idle keepalives; treat as recoverable and rebuild the channel/executor on next use.

Support:

Bug Reports: GitHub Issues
General Questions: For general inquiries and support, please email us at support@inworld.ai
We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to contribute to this project.
This project is licensed under the MIT License - see the LICENSE file for details.