Multimodal Companion

A multimodal companion that can see your surroundings

Input

text, images, audio, video

Output

text, audio, images, video

Use Cases

Customer Support, AI Companion, Games, Fitness Trainer

Type

full-stack app, callable endpoint

SDK

Node.js


This service is a Node.js backend built on Inworld Runtime that powers:

  • Real‑time Speech‑to‑Text (STT) over WebSocket
  • Image+Text → LLM → TTS streaming ("ImageChat") over WebSocket
  • Optional HTTP test endpoints for quick local validation

Unity connects via HTTP to create a session token, then upgrades to WebSocket for interactive audio/text/image exchange.

Prerequisites

  • Node.js 20+
  • An Inworld AI account and API key

Get Started

Step 1: Clone the Repository

git clone https://github.com/inworld-ai/multimodal-companion-node
cd multimodal-companion-node

Step 2: Configure Environment Variables

Copy the environment template and edit the values:

cp .env-sample .env
# Edit .env and set INWORLD_API_KEY (base64("apiKey:apiSecret")) and VAD_MODEL_PATH. Optionally set ALLOW_TEST_CLIENT for local HTML testing.

Get your API key from the Inworld Portal.
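The INWORLD_API_KEY value is the base64 encoding of apiKey:apiSecret. In Node.js it can be produced like this (the credentials below are placeholders; substitute the values from the Inworld Portal):

```typescript
// Encode "apiKey:apiSecret" as base64 for INWORLD_API_KEY.
// Placeholder credentials; use the pair from the Inworld Portal.
const apiKey = "my-api-key";
const apiSecret = "my-api-secret";

const inworldApiKey = Buffer.from(`${apiKey}:${apiSecret}`).toString("base64");
console.log(inworldApiKey); // base64 of "my-api-key:my-api-secret"
```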

Step 3: Install & Run

npm install
npm run build
npm start

Server output (expected):

  • "VAD client initialized"
  • "STT Graph initialized"
  • "Server running on http://localhost:<PORT>"
  • "WebSocket available at ws://localhost:<PORT>/ws?key=<session_key>"

Repository Layout

multimodal-companion-node/
├── src/
│   ├── index.ts              # Express HTTP server, WebSocket upgrade, session/token issuance, auth checks
│   ├── message_handler.ts    # Parses client messages (TEXT, AUDIO, AUDIO_SESSION_END, IMAGE_CHAT)
│   ├── stt_graph.ts          # Builds a single, long-lived STT GraphExecutor used across the process
│   ├── auth.ts               # HMAC auth verification for HTTP/WS (compatible with Unity InworldAuth)
│   ├── constants.ts          # Defaults for audio sample rates, VAD thresholds, text generation config
│   ├── event_factory.ts
│   ├── helpers.ts
│   └── types.ts
├── examples/
│   ├── test-audio.html       # Local test page for audio streaming
│   └── test-image-chat.html  # Local test page for image chat
├── assets/
│   └── models/
│       └── silero_vad.onnx   # VAD model file used for voice activity detection
├── package.json
├── tsconfig.json
└── LICENSE

Configuration

Environment Variables

  • Required
    • INWORLD_API_KEY: base64("apiKey:apiSecret")
    • VAD_MODEL_PATH: local path to the VAD model (e.g., silero_vad.onnx)
  • Optional
    • PORT: server port (default 3000)
    • HTTP_CHAT_MAX_CONCURRENCY: throttle HTTP /chat concurrency (if enabled)
    • ALLOW_TEST_CLIENT: when set to true, enables GET /get_access_token to issue short‑lived { sessionKey, wsToken } for local HTML tests. Do NOT enable in production.

Example .env (you can start from .env-sample):

INWORLD_API_KEY=xxxxxx_base64_apiKey_colon_apiSecret
VAD_MODEL_PATH=assets/models/silero_vad.onnx
PORT=3000
# Enable local HTML test helper endpoint (development only)
# ALLOW_TEST_CLIENT=true
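A minimal sketch of how these variables might be read at startup. The variable names and defaults come from the table above; the loader itself is illustrative, not the server's actual code:

```typescript
// Illustrative config loader for the environment variables above.
// Defaults mirror the documented ones (PORT falls back to 3000).
interface ServerConfig {
  inworldApiKey: string;
  vadModelPath: string;
  port: number;
  allowTestClient: boolean;
}

function loadConfig(env: Record<string, string | undefined>): ServerConfig {
  const inworldApiKey = env.INWORLD_API_KEY;
  const vadModelPath = env.VAD_MODEL_PATH;
  if (!inworldApiKey) throw new Error("INWORLD_API_KEY is required");
  if (!vadModelPath) throw new Error("VAD_MODEL_PATH is required");
  return {
    inworldApiKey,
    vadModelPath,
    port: Number(env.PORT ?? 3000),
    allowTestClient: env.ALLOW_TEST_CLIENT === "true",
  };
}
```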

Local HTML Testing

For local development and testing without Unity, you can use the HTML test pages:

  1. Enable the test client endpoint: Set ALLOW_TEST_CLIENT=true in your .env file and restart the server.
  2. Access test pages:
    • http://localhost:<PORT>/test-audio - Stream microphone audio
    • http://localhost:<PORT>/test-image - Submit prompts with images
  3. How it works: The pages call GET /get_access_token to obtain { sessionKey, wsToken }, then connect to ws://host/ws?key=...&wsToken=....

Security note: ALLOW_TEST_CLIENT is for local development only. Do NOT enable in production. Tokens are short‑lived (5 minutes) and single‑use.

API Reference

Auth Model

  • HTTP endpoints require HMAC header Authorization: IW1-HMAC-SHA256 ... (see auth.ts), generated by Unity InworldAuth.
  • WebSocket:
    • Preferred: obtain a short-lived wsToken from POST /create-session and connect to /ws?key=<sessionKey>&wsToken=<token>
    • Fallback: full Authorization header on the upgrade request (not recommended for clients)
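The exact canonical string, key derivation, and header layout are defined in auth.ts and mirrored by Unity's InworldAuth. The sketch below only illustrates the generic HMAC-SHA256 shape with an assumed canonicalization; do not treat it as the real signing scheme:

```typescript
import { createHmac } from "node:crypto";

// Illustrative only: the real canonical string and header format live in
// auth.ts. This shows the generic HMAC-SHA256 pattern such schemes use.
function signRequest(apiSecret: string, canonicalString: string): string {
  return createHmac("sha256", apiSecret).update(canonicalString).digest("hex");
}

// A hypothetical canonical string; the server's actual one will differ.
const signature = signRequest("apiSecret", "POST\n/create-session\n1700000000");
```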

HTTP Endpoints (optional)

  • POST /create-session (protected)
    • Returns { sessionKey, wsToken }
    • wsToken is single-use, short-lived
  • POST /chat (protected, optional)
    • Accepts prompt and an optional image file (multipart)
    • Runs a one‑off LLM graph (non‑streaming) and returns { response }

WebSocket Flow

  1. Client calls POST /create-session (with HMAC auth) → { sessionKey, wsToken }
  2. Client connects: ws://host/ws?key=<sessionKey>&wsToken=<token>
  3. Client sends messages; server returns text/audio and INTERACTION_END packets
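The connect URL in step 2 can be assembled like this (buildWsUrl is a hypothetical helper; the host and credentials are placeholders):

```typescript
// Build the WebSocket URL from a session created via POST /create-session.
// URL/URLSearchParams handle the query-string encoding of key and wsToken.
function buildWsUrl(host: string, sessionKey: string, wsToken: string): string {
  const url = new URL(`ws://${host}/ws`);
  url.searchParams.set("key", sessionKey);
  url.searchParams.set("wsToken", wsToken);
  return url.toString();
}

const wsUrl = buildWsUrl("localhost:3000", "abc123", "tok456");
// → "ws://localhost:3000/ws?key=abc123&wsToken=tok456"
```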

Client → Server messages:

  • { type: "text", text: string }
  • { type: "audio", audio: number[][] } // streamed float32 chunks
  • { type: "audioSessionEnd" } // finalize the STT turn
  • { type: "imageChat", text: string, image: string, voiceId?: string } // image is data URL (base64)

Server → Client messages:

  • TEXT: { text: { text, final }, routing: { source: { isAgent|isUser, name } } }
  • AUDIO: { audio: { chunk: base64_wav } } (streamed for TTS)
  • INTERACTION_END: marks end of one turn / execution
  • ERROR: { error }
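The client-side payloads above can be modeled with a small union type. This is a sketch based on the shapes listed (field names are taken verbatim from the lists; JSON is assumed as the wire format):

```typescript
// Client → server message shapes, as listed above.
type ClientMessage =
  | { type: "text"; text: string }
  | { type: "audio"; audio: number[][] } // streamed float32 chunks
  | { type: "audioSessionEnd" }          // finalize the STT turn
  | { type: "imageChat"; text: string; image: string; voiceId?: string };

// Serialize a message for ws.send(); JSON is assumed as the wire format.
function encode(msg: ClientMessage): string {
  return JSON.stringify(msg);
}

const end = encode({ type: "audioSessionEnd" });
// → '{"type":"audioSessionEnd"}'
```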

Technical Details

Graphs & Executors

  • STT Graph
    • Constructed once at server startup (stt_graph.ts), reused for all STT requests
    • For each STT turn: executor.start(input, { executionId: v4() }) → read first result → closeExecution(executionResult.outputStream)
  • ImageChat Graph (LLM→TextChunking→TTS)
    • Per WebSocket connection: one shared executor reused across image+text turns
    • Rebuilt only if voiceId changes for that connection
    • For each ImageChat turn: executor.start(LLMChatRequest, { executionId: v4() }) → stream TTS chunks → closeExecution(executionResult.outputStream)
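The per-turn pattern above (start with a fresh executionId, read results, then always close) can be sketched generically. The interfaces below are stand-ins for the Inworld Runtime types, not its real API, and randomUUID stands in for the uuid v4() call:

```typescript
import { randomUUID } from "node:crypto";

// Stand-in interfaces; the real GraphExecutor comes from the Inworld Runtime.
interface ExecutionResult<T> { outputStream: T; }
interface Executor<I, T> {
  start(input: I, opts: { executionId: string }): Promise<ExecutionResult<T>>;
  closeExecution(stream: T): void;
}

// Run one turn: fresh executionId, read the output, and close in `finally`
// so the execution is released even when reading throws.
async function runTurn<I, T, R>(
  executor: Executor<I, T>,
  input: I,
  read: (stream: T) => Promise<R>,
): Promise<R> {
  const result = await executor.start(input, { executionId: randomUUID() });
  try {
    return await read(result.outputStream);
  } finally {
    executor.closeExecution(result.outputStream);
  }
}
```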

Model Provider & Config

  • LLM provider/model examples
    • OpenAI: { provider: 'openai', modelName: 'gpt-4o-mini', stream: true|false }
    • Google Gemini: { provider: 'google', modelName: 'gemini-2.5-flash-lite', stream: true }
  • Text generation (see constants.ts TEXT_CONFIG)
    • temperature, topP, maxNewTokens, penalties, etc.
    • Typical: higher temperature/topP → more diverse; lower → more deterministic
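A representative config object for these knobs. The field names follow the description of TEXT_CONFIG above; the specific values are illustrative defaults, not the repository's:

```typescript
// Illustrative text-generation settings; see constants.ts TEXT_CONFIG
// for the values the server actually uses.
const textConfig = {
  temperature: 0.7,    // higher → more diverse output
  topP: 0.9,           // nucleus-sampling cutoff
  maxNewTokens: 256,   // cap on generated tokens per turn
  frequencyPenalty: 0, // penalties discourage repetition
  presencePenalty: 0,
};
```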

Audio & VAD

  • Input sample rate: 16 kHz (Unity mic)
  • VAD: Silero VAD (ONNX) local inference for voice activity detection
  • STT turn starts on voiced audio and finalizes when pauses exceed a threshold (PAUSE_DURATION_THRESHOLD_MS)
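The turn-finalization rule (finalize once silence exceeds PAUSE_DURATION_THRESHOLD_MS) can be sketched as a check over frame timestamps; the threshold value here is an assumed placeholder, not the one in constants.ts:

```typescript
// Decide whether an STT turn should finalize, given the timestamp of the
// last voiced frame. The threshold is a placeholder value; the real one
// lives in constants.ts.
const PAUSE_DURATION_THRESHOLD_MS = 800;

function shouldFinalizeTurn(
  lastVoicedAtMs: number | null,
  nowMs: number,
): boolean {
  // No voiced audio yet → no turn in progress, nothing to finalize.
  if (lastVoicedAtMs === null) return false;
  return nowMs - lastVoicedAtMs > PAUSE_DURATION_THRESHOLD_MS;
}
```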

Concurrency & Resource Guidelines

  • STT executor is global and reused (fast first token)
  • ImageChat executor is per WebSocket connection and reused; serialize turns per connection
  • Optionally enforce small concurrency limits for HTTP /chat if enabled
  • Always closeExecution(executionResult.outputStream) after reading results

Deployment Tips (e.g., Railway)

  • Keep concurrency conservative (2–4) unless the plan allows more resources
  • Prefer long-lived shared executors with per‑turn start(...)/closeExecution(...)
  • Expect GOAWAY after long idle periods; allow light retry or lazy re‑init on next turn

Troubleshooting

  • No image update: Confirm Unity captures a fresh image before each imageChat send
  • Long STT delay: Verify VAD thresholds and that audioSessionEnd is sent after speech
  • Frequent GOAWAY on idle: Acceptable; ensure executions are closed and executors are reused
  • gRPC Deadline Exceeded: Single execution timed out; treat as recoverable (retry once)
  • HTTP/2 GOAWAY ENHANCE_YOUR_CALM (too_many_pings): Server throttling of idle keepalives; treat as recoverable, rebuild channel/executor on next use
  • WebSocket "closed without close handshake": Usually process restart/crash or proxy idle-kill; implement client auto‑reconnect (backoff)
  • "Your graph is not registered": Informational warning about remote variants; safe to ignore unless you explicitly use registry-managed graphs
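A client auto-reconnect with exponential backoff (as suggested for the close-without-handshake case) might compute its delays like this; the base and cap values are illustrative:

```typescript
// Exponential backoff with a cap: 500 ms, 1 s, 2 s, ... up to 30 s.
// Values are illustrative; tune them for your deployment.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// attempt 0 → 500, attempt 3 → 4000, attempt 10 → 30000 (capped)
```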

Bug Reports: GitHub Issues

General Questions: For general inquiries and support, please email us at support@inworld.ai

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to contribute to this project.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Copyright © 2021-2025 Inworld AI