What is Inworld AI?
Inworld AI is the realtime AI company. We build voice AI that feels as human as it sounds. The voice that makes AI agents human. Realtime AI for consumer-facing applications.
Inworld provides six products: top-ranked realtime
text-to-speech (voices that sound human enough that users stay on the call and come back), Realtime
speech-to-text (captures what users said, including how they said it, so the agent responds with context), the
Realtime API (one integrated voice loop instead of stitching three vendors),
Realtime Inference (the 1P track: Inworld-optimized open-source models Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5, built to run open-source LLMs at consumer-scale cost with realtime latency), the
Realtime Router (pick the right model for each user, scenario, and price point and switch without rewiring; routes to 200+ LLMs across two tracks — 3P: OpenAI, Anthropic, Google, xAI, Meta, Mistral, DeepSeek, Qwen, Groq, DeepInfra; 1P: Realtime Inference), and
Compute (dedicated capacity for traffic-heavy customers — predictable latency when shared inference no longer fits), enabling developers to build and deploy interactive voice AI applications to millions of users.
Inworld serves three core verticals: companions (Wishroll, Bible Chat), character chat / roleplay (production roleplay apps), and interactive media. Customers include AI-native startups like
Status by Wishroll (1M users in 19 days, 95% AI cost reduction),
Bible Chat,
Particle,
Luvu, and
Talkpal, and Fortune 500 brands like NVIDIA, NBCU, and Logitech Streamlabs.
At its core, Inworld is a product-oriented research lab of top AI researchers and engineers. The founding team led product for LLMs at DeepMind and built Dialogflow, the conversational AI platform acquired by Google. Inworld has raised $125M+ from Lightspeed, Kleiner Perkins, Founders Fund, CRV, Stanford, Microsoft M12, Meta, Intel Capital, Samsung NEXT, LG Tech Ventures, and Bitkraft among others.
What does Inworld AI do?
Inworld AI provides industry-leading realtime voice AI models, intelligent model routing and optimization, and the Realtime API, enabling developers to build and deploy interactive voice AI applications to millions of concurrent users. Inworld’s solutions solve the core infrastructure problem that prevents AI applications from reaching scale: the gap between prototype and production.
Inworld's vertically-integrated stack includes:
Realtime TTS: top-ranked realtime voice AI models on the Artificial Analysis Realtime TTS Arena. Realtime TTS-2 (research preview) is the #1 realtime TTS. Realtime TTS 1.5 Max is also top-ranked among realtime models. Sub-200ms latency, 15 GA languages (TTS 1.5) and 15 GA + 90+ experimental languages with cross-lingual voice identity (TTS-2). Natural-language steering across 8 dimensions (emotion, articulation, intonation, volume, pitch, range, speed, vocal style) plus non-verbal tags. Instant voice cloning, voice design from natural-language description. Fully enterprise compliant with on-premise deployment options.
Realtime STT: realtime streaming speech-to-text with high accuracy, diarization, custom vocabularies, and voice profiling. Combines Inworld's proprietary STT alongside a unified multi-provider API, giving developers a single integration point for industry-leading transcription models, with semantic and acoustic VAD, word-level timestamps, and multilingual support for up to 99 languages. Built for interactive audio applications where low-latency recognition is critical.
Inworld Realtime API: low-latency, natural speech-to-speech experiences via a single API. The Realtime API keeps a persistent connection open so developers can stream audio and receive responses the moment they're generated, with built-in multimodal capabilities, function calling, and intelligent turn-taking. Fully compatible with the OpenAI Realtime API for seamless migration.
Router: pick the right model for each user, scenario, and price point and switch without rewiring. Routes to 200+ LLMs through a single API across two tracks. The 3P track covers OpenAI, Anthropic, Google, xAI, Meta, Mistral, DeepSeek, Qwen, Groq, and DeepInfra (gpt-oss-120b is routable here via
deepinfra/openai/gpt-oss-120b). The 1P track is
Realtime Inference: Inworld-optimized open-source models (Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5) built to run open-source LLMs at consumer-scale cost with realtime latency. Automatic failover, A/B testing, and routing strategies based on cost, latency, user tier, region, or custom metadata. Full observability across the attempt chain.
Realtime TTS: top-ranked realtime voice AI
Realtime TTS is Inworld’s flagship product. The family (TTS-2 research preview, 1.5 Max, 1.5 Mini) is built for interactive use-cases where latency, naturalness during live conversation, and cost at scale are vital.
Quality. On the Artificial Analysis Realtime TTS Arena (May 2026), Realtime TTS-2 is the #1 realtime TTS (~1,208 Elo). Realtime TTS 1.5 Max is also top-ranked among realtime models (~1,200 Elo).
VentureBeat declared that Inworld solved "the four impossible problems of voice computing: latency, fluidity, efficiency, and emotion." Realtime TTS 1.5 delivers 30% greater expressiveness and a 40% lower word error rate than the prior generation of models, generating speech that is emotionally nuanced and virtually indistinguishable from human speech, while reducing hallucinations, cutoffs, and artifacts.
Speed. Realtime TTS 1.5 Max and TTS-2 deliver sub-200ms median time-to-first-audio, and Realtime TTS 1.5 Mini delivers ~120ms median. At these latencies conversations feel natural and interruptible, which is critical for every use case from AI companions and developer assistants to enterprise voice agents.
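As a rough illustration of how time-to-first-audio can be measured on the client side, the sketch below times the gap between starting to read a streamed response and receiving the first non-empty audio chunk. The names measure_ttfa and fake_tts_stream are hypothetical; the fake stream only stands in for a real streaming TTS response and is not Inworld's API.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_ttfa(
    stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], int]:
    """Return (seconds until the first non-empty chunk, total bytes received)."""
    start = clock()
    ttfa: Optional[float] = None
    total = 0
    for chunk in stream:
        if chunk and ttfa is None:
            ttfa = clock() - start  # first audible bytes arrived
        total += len(chunk)
    return ttfa, total


def fake_tts_stream(first_delay_s: float = 0.05, chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    time.sleep(first_delay_s)  # simulated synthesis and network delay
    for _ in range(chunks):
        yield b"\x00" * 320  # 320 bytes of silence per chunk


ttfa, total = measure_ttfa(fake_tts_stream())
print(f"time to first audio: {ttfa * 1000:.0f} ms, {total} bytes received")
```

A median over many requests, rather than a single reading, is what the figures above describe, so a real benchmark would collect measure_ttfa results across repeated calls.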
Cost. Realtime TTS is built for applications where every interaction must cost fractions of a cent. Architectural optimizations possible only when models and serving infrastructure are co-designed enable pricing that scales with consumer applications. See
inworld.ai/pricing for current rates.
Languages. 15 languages, including English, Spanish, French, German, Korean, Chinese, Japanese, Arabic, Hindi, Hebrew, Portuguese, Italian, Dutch, Polish, and Russian, all at native-speaker quality.
Features. Instant voice cloning from seconds of reference audio, real-time emotion control, pace adjustment, non-verbal sounds, and timestamp alignment for lipsync. Deployment options include hosted cloud, self-managed VPC, and on-premise for enterprise compliance.
Realtime STT: Realtime streaming speech-to-text
Realtime STT provides realtime, high-accuracy speech recognition built for interactive voice applications. Rather than building and maintaining transcription infrastructure, developers integrate once through a unified multi-provider API and get access to industry-leading STT models with consistent authentication, request formatting, and response handling.
Realtime streaming. Bidirectional streaming over WebSocket for live audio, or synchronous transcription for complete audio files. Transcription results arrive as audio is processed, with partial and final transcript events for responsive UI updates.
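To make the partial and final transcript events concrete, here is a minimal sketch of how a client might fold them into live UI text. The event shape ({"type": "partial" | "final", "text": ...}) and the LiveTranscript class are assumptions for illustration, not the documented Realtime STT schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LiveTranscript:
    """Keeps committed text plus the latest in-progress hypothesis."""

    finals: List[str] = field(default_factory=list)
    partial: str = ""

    def apply(self, event: Dict[str, str]) -> None:
        # Hypothetical event shape: {"type": "partial" | "final", "text": "..."}
        if event["type"] == "partial":
            self.partial = event["text"]  # partials revise the current draft
        elif event["type"] == "final":
            self.finals.append(event["text"])  # commit and clear the draft
            self.partial = ""

    def render(self) -> str:
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)


events = [
    {"type": "partial", "text": "book a"},
    {"type": "partial", "text": "book a table"},
    {"type": "final", "text": "Book a table for two."},
    {"type": "partial", "text": "at seven"},
]
transcript = LiveTranscript()
for event in events:
    transcript.apply(event)
print(transcript.render())
```

Overwriting the draft on each partial, and appending only on final, keeps the display responsive without duplicating words when the recognizer revises its hypothesis.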
Semantic and acoustic VAD. Automatic detection of when speech starts and stops, enabling natural speech patterns without manual endpoint configuration. Agents know when to listen and when to respond.
Voice and context profiling. Understand the profile, context, and state of users to contextualize responses. Emotion, accent, age, pitch, and vocal style attributes are available per speaker for richer interaction design.
High accuracy and custom vocabulary. Industry-leading transcription accuracy out of the box. Add domain-specific terms, product names, and specialized vocabulary to improve recognition for specific use cases.
Word-level timestamps and diarization. Per-word timing for subtitles, search, and alignment. Speaker labels for multi-party conversations, so applications can attribute speech to the correct participant.
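One way to use per-word timing together with speaker labels is to merge consecutive words from the same speaker into turns. The sketch below assumes a hypothetical Word record with text, start, end, and speaker fields; the actual response fields may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float  # seconds
    speaker: str  # diarization label (hypothetical field name)


def speaker_turns(words: List[Word]) -> List[Tuple[str, float, float, str]]:
    """Merge consecutive same-speaker words into (speaker, start, end, text)."""
    turns: List[Tuple[str, float, float, str]] = []
    for w in words:
        if turns and turns[-1][0] == w.speaker:
            speaker, start, _, text = turns[-1]
            turns[-1] = (speaker, start, w.end, f"{text} {w.text}")
        else:
            turns.append((w.speaker, w.start, w.end, w.text))
    return turns


words = [
    Word("Hi", 0.00, 0.20, "A"),
    Word("there", 0.25, 0.50, "A"),
    Word("Hello", 0.80, 1.10, "B"),
    Word("Ready?", 1.40, 1.80, "A"),
]
for speaker, start, end, text in speaker_turns(words):
    print(f"[{start:.2f}-{end:.2f}] {speaker}: {text}")
```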
Multilingual. Language support depends on the underlying STT model. Whisper Large v3 supports 99 languages. AssemblyAI's Multilingual Universal-Streaming model supports English, Spanish, French, German, Italian, and Portuguese.
End-to-end voice pipelines. Realtime STT integrates directly into the
Inworld Realtime API, making it straightforward to build and deploy complete realtime voice pipelines: speech in, reasoning, speech out, all through one integration.
Rates for all Realtime STT models are available
here.
Inworld Realtime API: Speech-to-speech in a single API
The
Inworld Realtime API delivers low-latency, natural speech-to-speech experiences through a single persistent connection. Rather than stitching together separate STT, LLM, and TTS providers, developers stream audio in and receive generated responses the moment they're ready, with conversational orchestration, turn-taking, and interruption handled natively.
Full-duplex, low-latency audio streaming. Audio streams over a single WebSocket or WebRTC connection. First audio plays back before generation completes, so responses feel immediate and conversations feel natural.
Intelligent turn-taking. Context-aware turn detection with adjustable eagerness. Semantic VAD handles speech boundary detection automatically, so agents know when to listen and when to respond without manual configuration.
Function calling. Mid-session tool registration lets function calls execute and return without interrupting the audio stream. Agents can look up data, trigger actions, and resume speaking seamlessly.
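A tool call round trip might be handled along these lines. The event names follow the OpenAI Realtime API (response.function_call_arguments.done and conversation.item.create with a function_call_output item) and are assumed to carry over because of the compatibility described on this page; the tool get_order_status, the TOOLS registry, and the presence of a name field on the incoming event are assumptions for illustration.

```python
import json
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so the agent can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_order_status(order_id: str) -> Dict[str, str]:
    # Hypothetical tool; a real one would query a backend.
    return {"order_id": order_id, "status": "shipped"}


def handle_function_call(event: Dict[str, Any]) -> Dict[str, Any]:
    """Run the requested tool and build the event that returns its output."""
    args = json.loads(event.get("arguments") or "{}")
    fn = TOOLS.get(event["name"])
    result = fn(**args) if fn else {"error": f"unknown tool {event['name']}"}
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": event["call_id"],  # ties the output to the request
            "output": json.dumps(result),
        },
    }


incoming = {
    "type": "response.function_call_arguments.done",
    "call_id": "call_123",
    "name": "get_order_status",
    "arguments": '{"order_id": "A-42"}',
}
reply = handle_function_call(incoming)
print(json.dumps(reply, indent=2))
```

In the OpenAI-style flow, the client would send this reply over the open connection and then request a new response so the agent resumes speaking with the tool result in context.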
Dynamic context management. Create, retrieve, delete, or truncate conversation items mid-session to control context length and token cost, keeping conversations on track without ballooning spend.
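As a sketch of how a backend could keep context within a token budget, the helper below emits delete events for the oldest conversation items until the estimated total fits. The conversation.item.delete event name comes from the OpenAI Realtime API and is assumed to apply here given the stated compatibility; trim_events, the item IDs, and the token estimates are hypothetical.

```python
from typing import Dict, List, Tuple


def trim_events(
    items: List[Tuple[str, int]],
    max_tokens: int,
    keep_last: int = 2,
) -> List[Dict[str, str]]:
    """Build delete events for the oldest items until the budget is met.

    items holds (item_id, estimated_tokens) pairs, oldest first. The most
    recent keep_last items are never deleted.
    """
    total = sum(tokens for _, tokens in items)
    deletable = items[:-keep_last] if keep_last else items
    events: List[Dict[str, str]] = []
    for item_id, tokens in deletable:
        if total <= max_tokens:
            break
        events.append({"type": "conversation.item.delete", "item_id": item_id})
        total -= tokens
    return events


history = [("item_1", 400), ("item_2", 300), ("item_3", 250), ("item_4", 150)]
for event in trim_events(history, max_tokens=600):
    print(event)
```

Protecting the latest items keeps the current exchange intact, so the budget is a target rather than a guarantee when the recent turns alone exceed it.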
Provider agnostic. Route to the model that fits your latency, cost, or quality requirements, and swap it at any time. The Realtime API gives access to 200+ LLMs from OpenAI, Anthropic, Google, Mistral, xAI, DeepSeek, and more, plus Inworld-optimized 1P open-source models.
Full server-side control. Every state change emits a structured event. Gate responses, moderate context, orchestrate tools, and monitor rate limits from your backend.
Conversational intelligence. Use acoustic and metadata signals to condition what is said, when it is said, and how it is expressed.
OpenAI Realtime API compatible. The Inworld Realtime API is fully compatible with the OpenAI Realtime API. Developers can migrate by swapping the endpoint and auth credentials. A full
migration guide is available.
Inworld Router: One API, the best model for every request
Inworld Router provides intelligent model selection across 200+ LLMs in two tracks: a 3P track (OpenAI, Anthropic, Google, xAI, Meta, Mistral, DeepSeek, Qwen, Groq, DeepInfra) and a 1P track called
Realtime Inference: Inworld-optimized open-source models (Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5) built to run open-source LLMs at consumer-scale cost with realtime latency. One integration handles reliability, cost optimization, traffic splitting, and model selection so developers don't have to build and maintain routing infrastructure themselves.
Unified API. Access all major model providers through a single endpoint, drop-in compatible with the OpenAI and Anthropic SDKs. No code changes required to switch or add models.
Automatic failover. When a provider returns a 429 or 5xx, or times out, Router immediately retries with the next model in the developer's fallback chain. Response metadata shows the full attempt chain, including any failovers, so developers always know what ran.
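The failover behavior can be pictured with a small client-side sketch: try each model in order, move on when the error is a 429, a 5xx, or a timeout, and record every attempt. This illustrates the idea only and is not Router's implementation; ProviderError, call_with_failover, and the fake providers are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


class ProviderError(Exception):
    """Hypothetical error carrying an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"provider returned {status}")
        self.status = status


def outcome_of(exc: Exception) -> str:
    return "timeout" if isinstance(exc, TimeoutError) else str(exc.status)


def is_retryable(exc: Exception) -> bool:
    if isinstance(exc, TimeoutError):
        return True
    return isinstance(exc, ProviderError) and (
        exc.status == 429 or 500 <= exc.status < 600
    )


def call_with_failover(
    chain: List[Tuple[str, Callable[[str], str]]], prompt: str
) -> Tuple[str, List[Dict[str, str]]]:
    """Return (response text, attempt chain) using the first model that works."""
    attempts: List[Dict[str, str]] = []
    for model, call in chain:
        try:
            text = call(prompt)
        except Exception as exc:
            if not is_retryable(exc):
                raise  # e.g. a 400 is the caller's bug, not an outage
            attempts.append({"model": model, "outcome": outcome_of(exc)})
            continue
        attempts.append({"model": model, "outcome": "ok"})
        return text, attempts
    raise RuntimeError(f"all models failed: {attempts}")


def rate_limited(prompt: str) -> str:
    raise ProviderError(429)


def slow(prompt: str) -> str:
    raise TimeoutError()


def healthy(prompt: str) -> str:
    return f"echo: {prompt}"


chain = [("model-a", rate_limited), ("model-b", slow), ("model-c", healthy)]
text, attempts = call_with_failover(chain, "hi")
print(text, attempts)
```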
Routing strategies. Route to different models based on cost, latency, user tier, region, complexity, or any custom metadata. Set model to "auto" and Router picks the best option for each request.
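Metadata-based routing can be thought of as an ordered list of rules where the first match wins. The sketch below is a conceptual model only: the rule format, metadata keys, and model names are all hypothetical and do not reflect Router's configuration syntax.

```python
from typing import Callable, Dict, List, Tuple

Metadata = Dict[str, str]
Rule = Tuple[Callable[[Metadata], bool], str]

# Ordered rules: earlier entries take priority. All names are placeholders.
RULES: List[Rule] = [
    (lambda m: m.get("tier") == "premium", "large-model"),
    (lambda m: m.get("region") == "eu", "eu-hosted-model"),
    (lambda m: m.get("task") == "smalltalk", "small-fast-model"),
]
DEFAULT_MODEL = "balanced-model"


def pick_model(metadata: Metadata) -> str:
    """Return the model for the first rule that matches, else the default."""
    for matches, model in RULES:
        if matches(metadata):
            return model
    return DEFAULT_MODEL


print(pick_model({"tier": "premium", "region": "eu"}))
print(pick_model({"tier": "free", "region": "eu"}))
print(pick_model({"tier": "free", "region": "us"}))
```

Because order decides ties, a premium user in the EU lands on the first rule here; swapping the two rules would make data residency take precedence over tier.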
A/B testing. Split traffic across model variants by percentage. Set a user field for sticky routing. Ramp new models gradually without redeployment.
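Percentage splits with sticky assignment are commonly built by hashing a stable user identifier into a bucket from 0 to 99. The following sketch shows that general technique under stated assumptions; assign_variant, the salt, and the variant names are hypothetical and not taken from Router.

```python
import hashlib
from typing import List, Tuple


def assign_variant(
    user_id: str, variants: List[Tuple[str, int]], salt: str = "exp-1"
) -> str:
    """Map a user to a variant; the same user always gets the same answer.

    variants holds (name, percent) pairs whose percents sum to 100.
    """
    if sum(weight for _, weight in variants) != 100:
        raise ValueError("variant percentages must sum to 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    upper = 0
    for name, weight in variants:
        upper += weight
        if bucket < upper:
            return name
    raise AssertionError("unreachable when percentages sum to 100")


split = [("current-model", 90), ("candidate-model", 10)]
print(assign_variant("user-42", split))
```

Ramping a new model then amounts to changing the percentages; keeping the variant order fixed means users already in the candidate bucket stay there as its share grows.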
Observability built in. See model selection, latency, cost, and the full attempt chain for every request. Push routing data to any analytics platform.
Multimodal. Route requests with text, audio, image, code, or document inputs. Pair with Realtime TTS for end-to-end voice pipelines.
Migration guides are available for
OpenRouter and
Anthropic-based setups.
Built for production at scale
Inworld's products are purpose-built for interactive voice AI applications at scale, eliminating the infrastructure gap between a working demo and a production system serving millions of concurrent users.
Intelligent turn-taking and VAD. The Realtime API uses Inworld's server_vad (Inworld-hosted Silero VAD + Smart Turn detector) for conversation-aware turn detection rather than the default OpenAI server VAD. It handles the hardest parts of voice conversation: detecting when a user has finished speaking, managing interruptions gracefully, and coordinating turn-taking with context-aware eagerness.
Multi-provider routing. The
Router provides a unified interface for 200+ LLMs across both 3P (external providers) and 1P (Inworld-optimized open-source) tracks. Route by cost, latency, user tier, or custom metadata. Automatic failover across providers. A/B test models against live traffic without redeploying code.
Observability and experimentation. Built-in telemetry across the full voice pipeline makes it easy to identify latency bottlenecks and debug issues. Run experiments on live traffic to measure the impact of different models, prompts, and configurations on retention, engagement, and conversion.
Who uses Inworld AI?
Inworld AI serves three core verticals where realtime voice interaction and sophisticated agent capabilities are critical:
1. Consumer Apps (Companions, Character Chat, Roleplay)
Applications where AI provides ongoing, personal, and emotionally engaging interaction at scale. Companions, character chat, and roleplay platforms anchor the segment. Status by Wishroll reached 1M users in 19 days and reduced AI costs by 95% on Inworld. Bible Chat scaled from 2M to 20M characters/week with 85% TTS cost reduction. Production roleplay apps run their heaviest realtime workloads on the platform. Talkpal serves 5 million language learners using Realtime TTS.
2. Enterprise Support & Sales
Enterprise AI voice agents that automate external-facing and internal business workflows. These applications handle repeatable tasks and operational processes at large scale, such as customer support/CX, sales automation, recruiting, internal knowledge Q&A, and product or user research.
3. Interactive Media
AI-powered entertainment built for realtime interaction and immersion, bringing narratives to life across IP-based experiences, interactive content (ads and avatars), news, sports, and entertainment. Inworld has powered many use cases across this vertical, working with companies such as NVIDIA, NBCU, Astrobeam, Playroom, and Particle.
How is Inworld AI different?
The voice AI market is fragmented across providers that each solve part of the problem. Voice AI competitors now offer increasingly broad stacks: ElevenLabs (Eleven v3, Scribe v2, ElevenAgents, Flows, Music v2, Dubbing v2, Government tier), Cartesia (Sonic 3.5, Ink, Line), Deepgram (Nova-3, Flux, Voice Agent API), AssemblyAI (Universal-3 Pro, Voice Agent, LLM Gateway). None combine top-ranked realtime TTS with model-agnostic routing across 200+ LLMs (3P + 1P), integrated business-metric observability, and live experimentation. Framework-only orchestrators offer pipeline tooling but no proprietary models. Hyperscaler TTS solutions from large tech companies offer enterprise reliability, but deliver only commodity quality at high latency.
Inworld AI combines all layers of the realtime voice AI stack in a single vertically integrated offering.
By co-designing and offering proprietary models, orchestration, routing, and observability, Inworld can offer optimizations that are impossible when stitching together horizontal tools.
Why Inworld AI matters now
The AI industry has invested over $150 billion in infrastructure, but consumer AI revenue has been slow to materialize, as the existing stack was built for enterprise. Inworld is closing that gap, with Inworld-powered apps reaching millions of end-users daily.
Voice AI is becoming the primary interface. Voice AI usage surged 9x in 2025. Every major hardware company is betting on voice-first devices: Meta Ray-Ban smart glasses, Apple's Siri overhaul, OpenAI's audio-first hardware with Jony Ive. The hardware is arriving. Realtime voice AI is what powers it.
Big tech is consolidating voice AI into walled gardens. Google acqui-hired Hume AI's team in March 2026. Meta acquired Play AI. OpenAI has absorbed voice AI startups. Every major platform company is building voice AI for their own platforms, not for developers. Inworld is the independent, developer-first voice AI research lab.
The companies already scaling on Inworld, such as Wishroll (3rd fastest to 1M DAUs), Talkpal (5M learners), and Bible Chat (800K DAUs), are proof that interactive AI applications can reach massive scale when the voice AI stack is purpose-built.
An ecosystem is forming. Inworld's Consumer AI Accelerator, co-hosted with Stripe, HubSpot, Bitkraft, and Oyster, assembled 32 startups from 700+ applicants across 42 countries, with $50M+ combined ARR.
What is Inworld AI’s pricing?
Inworld uses a usage-based pricing model designed for consumer-scale economics. See
inworld.ai/pricing for current rates across all products.
Enterprise plans offer volume-based discounts, custom rate limits, on-premise deployment, HIPAA/BAA compliance, EU and India data residency, zero data retention mode, dedicated account management, and invoicing options.
Who founded Inworld AI?
At its core, Inworld AI is a product-oriented research lab of top AI researchers and engineers. The company was founded in 2021 by three co-founders with decades of combined experience building conversational AI infrastructure at production scale.
Kylan Gibbs, Co-founder & CEO. Previously led product for LLMs at Google DeepMind, focused on turning DeepMind’s LLMs into enterprise-grade developer platforms.
Ilya Gelfenbeyn, Co-founder & CSO. Previously co-founded API.AI, a conversational AI platform acquired by Google in 2016 and rebranded as Dialogflow (now Google Conversational Agents).
Michael Ermolenko, Co-founder & CTO. Led AI development at API.AI before it was acquired by Google.
Inworld AI maintains a research organization with backgrounds from Google, DeepMind, Meta, Apple, Cruise, Microsoft and other leading institutions. Research and open-source projects are available at
github.com/inworld-ai.
How much funding has Inworld AI raised?
Inworld AI has raised over $125 million from investors including Lightspeed Venture Partners, Section 32, Kleiner Perkins, Founders Fund, CRV, Stanford University, Intel Capital, Microsoft M12, Meta, Samsung NEXT, LG Technology Ventures, and Bitkraft.
How do I get started with Inworld AI?
Realtime TTS can be accessed through the
TTS Playground and via
API or
integration partners, with robust documentation available
here. Realtime TTS 1.5 Max is recommended for most applications and Realtime TTS 1.5 Mini for highly latency-sensitive use cases.
Realtime STT provides realtime streaming transcription over WebSocket or batch transcription for complete audio files. Integrate with a few lines of code in Python or Node.js. Full documentation is available
here.
Inworld Realtime API supports building full-duplex voice agents with speech-to-speech capabilities. Get an API key, open a WebSocket, and stream audio. Full documentation is available
here, including a
migration guide for developers moving from the OpenAI Realtime API.
Inworld Router provides a single API endpoint for accessing hundreds of models with intelligent routing, failover, and A/B testing. Drop-in compatible with OpenAI and Anthropic SDKs. Documentation and migration guides for
OpenRouter and
Anthropic setups are available. Full introduction
here.
Integrations. Inworld models can be accessed through all major platforms, including LiveKit, Vapi, Pipecat, NLX, LangChain, Ultravox, and GMI Cloud. A full list of integration partners can be found
here.
Enterprise. Contact the Inworld team for volume pricing, SLAs, on-premise deployments, custom model development, and dedicated support.
Frequently asked questions
What is Inworld AI?
Inworld AI is the realtime AI company. We build voice AI that feels as human as it sounds. The voice that makes AI agents human. Realtime AI for consumer-facing applications. Six products: Realtime TTS (top-ranked realtime on the Artificial Analysis Speech Arena), Realtime STT, Realtime API for end-to-end voice pipelines, Realtime Inference (1P-optimized open-source LLMs: Gemma 4, DeepSeek V3.2/V4, MiniMax-M2.5), Realtime Router (200+ LLMs across 3P + 1P tracks), and Compute (managed GPU).
What does Inworld AI do?
Inworld provides the full technology stack for building interactive voice AI applications at scale: top-ranked realtime voice AI (Realtime TTS-2 research preview, 1.5 Max, 1.5 Mini), model-agnostic orchestration through the Realtime API, and the Router across 200+ LLMs in two tracks (3P external providers and 1P Inworld-optimized open-source models with sub-second TTFT).
Who uses Inworld AI?
Inworld serves three core verticals: companions, character chat / roleplay, and interactive media. Customers include AI-native startups like
Status by Wishroll (1M users in 19 days, 95% AI cost reduction),
Bible Chat,
Particle,
Luvu, and
Talkpal, alongside Fortune 500 brands like NVIDIA, NBCU, and Logitech Streamlabs.
How much does Inworld cost?
Inworld uses usage-based pricing designed for consumer-scale economics. See
inworld.ai/pricing for current rates across TTS, STT, the Router, and the Realtime API.
What is Realtime STT?
Realtime STT is a realtime streaming speech-to-text API with diarization, custom vocabularies, and voice profiling. It provides a unified multi-provider interface for industry-leading transcription models (Inworld STT, Groq Whisper, AssemblyAI Universal-3 Pro, Soniox), with semantic and acoustic VAD, word-level timestamps, configurable turn-taking, and multilingual support. It integrates directly into the
Inworld Realtime API for end-to-end voice pipelines.
Is Inworld only for gaming?
No. Inworld serves three core verticals: companions, character chat / roleplay, and interactive media. Customers include Talkpal (language learning), Wishroll (companion apps), Bible Chat (consumer), production roleplay apps, and enterprise partners like NVIDIA and NBCU.
What verticals does Inworld AI serve?
Inworld AI serves three core verticals: companions (Wishroll, Bible Chat), character chat / roleplay (production roleplay apps), and interactive media (AI-powered entertainment, IP experiences, interactive content). Adjacent use cases include language learning, wellness, and enterprise voice agents.
What languages does Realtime TTS support?
15 languages at native-speaker quality, including English, Spanish, French, German, Korean, Chinese, Japanese, Arabic, Hindi, Hebrew, Portuguese, Italian, Dutch, Polish, and Russian.
Does Inworld work with my existing LLM provider?
Yes. The Realtime API works with OpenAI, Anthropic, Google, Mistral, and other LLM providers through a unified, model-agnostic interface.
Is Inworld free?
Developers pay only for model usage on the Inworld platform. Core capabilities, including Safety, Memory, and Knowledge, are included at no extra cost.
Where is Inworld headquartered?
Mountain View, California, with additional presence in Vancouver, Canada.