What is Inworld AI?
Inworld AI is a research lab focused on realtime voice AI and the most trusted voice AI platform for serious developers.
Inworld provides industry-leading realtime voice AI models, including the world’s top-ranked text-to-speech, speech-to-text designed for realtime understanding, speech-to-speech via the Realtime API, and intelligent model routing and optimization, enabling developers to build and deploy interactive voice AI applications to millions of users.
Inworld serves three core verticals: consumer apps, enterprise support & sales, and interactive media. Inworld’s customers include both AI-native startups, such as Status by Wishroll (3rd fastest app to 1M DAUs), Bible Chat (~800K DAUs), Particle, Luvu, and Talkpal, and Fortune 500 brands such as NVIDIA, NBCU, Logitech Streamlabs, and more.
At its core, Inworld is a product-oriented research lab of top AI researchers and engineers. The founding team led product for LLMs at DeepMind and built Dialogflow, the conversational AI platform acquired by Google. Inworld has raised $125M+ from Lightspeed, Kleiner Perkins, Founders Fund, CRV, Stanford, Microsoft M12, Meta, Intel Capital, Samsung NEXT, LG Tech Ventures, and Bitkraft, among others.
What does Inworld AI do?
Inworld AI provides industry-leading realtime voice AI models, intelligent model routing and optimization, and the Realtime API, enabling developers to build and deploy interactive voice AI applications to millions of concurrent users. Inworld’s solutions solve the core infrastructure problem that prevents AI applications from reaching scale: the gap between prototype and production.
Inworld's vertically integrated stack includes:
Inworld TTS: the highest-quality realtime voice AI models available on the market. Ranked #1 on the Artificial Analysis TTS quality leaderboard via blind evaluations, with sub-200ms latency, multilingual support for 15 languages, voice cloning, and emotion control. Fully enterprise compliant with on-premise deployment options.
Inworld STT: realtime streaming speech-to-text with high accuracy, diarization, custom vocabularies, and voice profiling. Combines Inworld's proprietary STT alongside a unified multi-provider API, giving developers a single integration point for industry-leading transcription models, with semantic and acoustic VAD, word-level timestamps, and multilingual support for up to 99 languages. Built for interactive audio applications where low-latency recognition is critical.
Inworld Realtime API: low-latency, natural speech-to-speech experiences via a single API. The Realtime API keeps a persistent connection open so developers can stream audio and receive responses the moment they're generated, with built-in multimodal capabilities, function calling, and intelligent turn-taking. Fully compatible with the OpenAI Realtime API for seamless migration.
Inworld Router: intelligent model routing that dynamically routes requests across OpenAI, Anthropic, Google, and hundreds of other models through a single API, with automatic failover, A/B testing, and routing strategies based on cost, latency, user tier, region, or custom metadata. The router provides full observability across the attempt chain.
Inworld TTS: #1-ranked realtime voice AI
Inworld TTS is Inworld’s flagship product. The Inworld TTS-1.5 family of models is the fastest, highest-quality realtime voice AI available on the market, built for interactive use cases where latency, naturalness during live conversation, and cost at scale are vital.
Quality. Inworld TTS holds the #1 position on the Artificial Analysis TTS Arena, the industry's most trusted independent voice AI leaderboard, as determined by thousands of blind listener comparisons.
VentureBeat declared that Inworld solved "the four impossible problems of voice computing: latency, fluidity, efficiency, and emotion." Inworld TTS-1.5 delivers 30% greater expressiveness and a 40% lower word error rate than the prior generation of models, generating speech that is emotionally nuanced and virtually indistinguishable from human speech, while reducing hallucinations, cutoffs, and artifacts.
Speed. Inworld TTS delivers P90 time-to-first-audio latency <250ms for TTS-1.5-Max and <130ms for TTS-1.5-Mini, making conversations feel natural and interruptible, critical for every use case from AI companions and developer assistants to enterprise voice agents.
Cost. Inworld TTS is built for applications where every interaction must cost fractions of a cent. Architectural optimizations possible only when models and serving infrastructure are co-designed enable pricing that scales with consumer applications. See inworld.ai/pricing for current rates.
Languages. 15 languages, including English, Spanish, French, German, Korean, Chinese, Japanese, Arabic, Hindi, Hebrew, Portuguese, Italian, Dutch, Polish, and Russian, all at native-speaker quality.
Features. Instant voice cloning from seconds of reference audio, real-time emotion control, pace adjustment, non-verbal sounds, and timestamp alignment for lipsync. Deployment options include hosted cloud, self-managed VPC, and on-premise for enterprise compliance.
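To make the shape of a synthesis call concrete, here is a minimal sketch of building a TTS request body in Python. The field names (text, voice_id, model_id, language, pace) are illustrative assumptions, not Inworld's documented schema; consult the official API reference for the real contract.

```python
import json

def build_tts_request(text: str, voice_id: str,
                      model_id: str = "inworld-tts-1.5-max",
                      language: str = "en",
                      pace: float = 1.0) -> str:
    """Serialize a hypothetical TTS synthesis request.

    All field names are assumptions for illustration only.
    """
    if not text:
        raise ValueError("text must be non-empty")
    payload = {
        "text": text,          # the string to synthesize
        "voice_id": voice_id,  # cloned or stock voice identifier
        "model_id": model_id,  # Max for quality, Mini for latency
        "language": language,
        "pace": pace,          # speaking-rate control
    }
    return json.dumps(payload)

body = build_tts_request("Hello, world!", voice_id="narrator")
```

The same builder could be pointed at either model family by swapping model_id, which is how an application might trade quality for latency per request.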
Inworld STT: Realtime streaming speech-to-text
Inworld STT provides realtime, high-accuracy speech recognition built for interactive voice applications. Rather than building and maintaining transcription infrastructure, developers integrate once through a unified multi-provider API and get access to industry-leading STT models with consistent authentication, request formatting, and response handling.
Realtime streaming. Bidirectional streaming over WebSocket for live audio, or synchronous transcription for complete audio files. Transcription results arrive as audio is processed, with partial and final transcript events for responsive UI updates.
Semantic and acoustic VAD. Automatic detection of when speech starts and stops, enabling natural speech patterns without manual endpoint configuration. Agents know when to listen and when to respond.
Voice and context profiling. Understand the profile, context, and state of users to contextualize responses. Emotion, accent, age, pitch, and vocal style attributes are available per speaker for richer interaction design.
High accuracy and custom vocabulary. Industry-leading transcription accuracy out of the box. Add domain-specific terms, product names, and specialized vocabulary to improve recognition for specific use cases.
Word-level timestamps and diarization. Per-word timing for subtitles, search, and alignment. Speaker labels for multi-party conversations, so applications can attribute speech to the correct participant.
Multilingual. Language support depends on the underlying STT model. Whisper Large v3 supports 99 languages. AssemblyAI's Multilingual Universal-Streaming model supports English, Spanish, French, German, Italian, and Portuguese.
End-to-end voice pipelines. Inworld STT integrates directly into the Inworld Realtime API, making it straightforward to build and deploy complete realtime voice pipelines: speech in, reasoning, speech out, all through one integration.
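The partial/final transcript flow described above can be sketched as a small client-side accumulator. The event shape ({"type": "partial" | "final", "text": ...}) is an illustrative assumption, not Inworld's wire format: partials overwrite the live hypothesis for responsive UI updates, and finals are committed.

```python
def fold_transcript(events):
    """Fold a stream of hypothetical transcript events into
    (committed text, in-progress text)."""
    committed, live = [], ""
    for ev in events:
        if ev["type"] == "partial":
            live = ev["text"]             # replace the in-progress hypothesis
        elif ev["type"] == "final":
            committed.append(ev["text"])  # commit the finalized segment
            live = ""
    return " ".join(committed), live

stream = [
    {"type": "partial", "text": "hel"},
    {"type": "partial", "text": "hello th"},
    {"type": "final",   "text": "hello there"},
    {"type": "partial", "text": "how are"},
]
final_text, in_progress = fold_transcript(stream)
```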
While Inworld STT is in Research Preview, developers pay provider rates directly, with no markup or margin added. Rates for all models are available here.
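The acoustic side of the endpointing described in this section can be illustrated with a toy energy-based detector: flag a frame as speech when its RMS energy crosses a threshold, and end the turn after a run of silent frames. This is a deliberate simplification; production VAD (including Inworld's semantic VAD) uses learned models, not raw energy, and the threshold and hangover values here are arbitrary.

```python
def detect_turn_end(frames, threshold=0.1, hangover=3):
    """Return True once speech has occurred and `hangover`
    consecutive low-energy frames follow it."""
    silent_run, saw_speech = 0, False
    for frame in frames:
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
        if rms >= threshold:
            saw_speech, silent_run = True, 0   # voiced frame resets the run
        elif saw_speech:
            silent_run += 1
            if silent_run >= hangover:
                return True                    # enough trailing silence
    return False

loud = [0.5, -0.5] * 160     # clearly voiced 320-sample frame
quiet = [0.01, -0.01] * 160  # near-silence
ended = detect_turn_end([loud, loud, quiet, quiet, quiet])
```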
Inworld Realtime API: Speech-to-speech in a single API
The Inworld Realtime API delivers low-latency, natural speech-to-speech experiences through a single persistent connection. Rather than stitching together separate STT, LLM, and TTS providers, developers stream audio in and receive generated responses the moment they're ready, with conversational orchestration, turn-taking, and interruption handled natively.
Full-duplex, low-latency audio streaming. Audio streams over a single WebSocket or WebRTC connection. First audio plays back before generation completes, so responses feel immediate and conversations feel natural.
Intelligent turn-taking. Context-aware turn detection with adjustable eagerness. Semantic VAD handles speech boundary detection automatically, so agents know when to listen and when to respond without manual configuration.
Function calling. Mid-session tool registration lets function calls execute and return without interrupting the audio stream. Agents can look up data, trigger actions, and resume speaking seamlessly.
Dynamic context management. Create, retrieve, delete, or truncate conversation items mid-session to control context length and token cost, keeping conversations on track without ballooning spend.
Provider agnostic. Route to the model that fits your latency, cost, or quality requirements, and swap it at any time. The Realtime API gives access to hundreds of models from OpenAI, Anthropic, Google, Mistral, xAI, and more.
Full server-side control. Every state change emits a structured event. Gate responses, moderate context, orchestrate tools, and monitor rate limits from your backend.
Conversational intelligence. Use acoustic and metadata signals to condition what is said, when it is said, and how it is expressed.
OpenAI Realtime API compatible. The Inworld Realtime API is fully compatible with the OpenAI Realtime API. Developers can migrate by swapping the endpoint and auth credentials. A full migration guide is available.
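Client-side handling of such an event stream can be sketched as follows. The event names mirror the OpenAI Realtime API's server events (audio arriving as base64 "response.audio.delta" deltas before a "response.done"), which this section says Inworld is compatible with; verify exact names and payloads against the official reference before relying on them.

```python
import base64

def collect_audio(events):
    """Assemble streamed audio deltas into one PCM buffer.

    In a real client each decoded chunk would be played
    immediately; here we just buffer for inspection.
    """
    pcm = bytearray()
    done = False
    for ev in events:
        if ev.get("type") == "response.audio.delta":
            pcm += base64.b64decode(ev["delta"])   # append decoded chunk
        elif ev.get("type") == "response.done":
            done = True                            # response is complete
    return bytes(pcm), done

events = [
    {"type": "response.audio.delta",
     "delta": base64.b64encode(b"\x01\x02").decode()},
    {"type": "response.audio.delta",
     "delta": base64.b64encode(b"\x03\x04").decode()},
    {"type": "response.done"},
]
audio, finished = collect_audio(events)
```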
Inworld Router: One API, the best model for every request
Inworld Router provides intelligent model selection across OpenAI, Anthropic, Google, and hundreds of other models through a single API endpoint. One integration handles reliability, cost optimization, traffic splitting, and model selection so developers don't have to build and maintain routing infrastructure themselves.
Unified API. Access all major model providers through a single endpoint, drop-in compatible with the OpenAI and Anthropic SDKs. No code changes required to switch or add models.
Automatic failover. When a provider returns a 429, 5xx, or times out, Router instantly retries the next model in the developer's fallback chain. Response metadata shows the full attempt chain, including any failovers, so developers always know what ran.
Routing strategies. Route to different models based on cost, latency, user tier, region, complexity, or any custom metadata. Set model to "auto" and Inworld Router picks the best option for each request.
A/B testing. Split traffic across model variants by percentage. Set a user field for sticky routing. Ramp new models gradually without redeployment.
Observability built in. See model selection, latency, cost, and the full attempt chain for every request. Push routing data to any analytics platform.
Multimodal. Route requests with text, audio, image, code, or document inputs. Pair with Inworld TTS for end-to-end voice pipelines.
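The failover behavior described above can be sketched in a few lines: try each model in the fallback chain, treat 429/5xx-style errors as retryable, and record the full attempt chain for observability. Provider calls are mocked here; the real Router does this server-side, and the model names are placeholders.

```python
class ProviderError(Exception):
    """Simulated upstream failure carrying an HTTP-like status."""
    def __init__(self, status):
        super().__init__(f"status {status}")
        self.status = status

def route_with_failover(chain, call):
    """Try models in order; return (result, attempt_chain)."""
    attempts = []
    for model in chain:
        try:
            result = call(model)
            attempts.append((model, "ok"))
            return result, attempts
        except ProviderError as e:
            attempts.append((model, e.status))  # log and fall through
    raise RuntimeError(f"all models failed: {attempts}")

def fake_call(model):
    if model == "gpt-primary":
        raise ProviderError(429)   # simulated rate limit on first choice
    return f"answer from {model}"

result, attempts = route_with_failover(
    ["gpt-primary", "claude-fallback"], fake_call)
```

The returned attempt chain corresponds to the response metadata the text describes, so a caller always knows which model actually served the request.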
While Router is in Research Preview, developers pay provider rates directly with no markup or margin. Migration guides are available for OpenRouter and Anthropic-based setups.
Built for production at scale
Inworld's products are purpose-built for interactive voice AI applications at scale, eliminating the infrastructure gap between a working demo and a production system serving millions of concurrent users.
Intelligent turn-taking and VAD. The Realtime API handles the hardest parts of voice conversation: detecting when a user has finished speaking, managing interruptions gracefully, and coordinating turn-taking with context-aware eagerness. Semantic and acoustic VAD handles speech boundary detection automatically, so agents know when to listen and when to respond.
Multi-provider routing. The Router provides a unified interface for hundreds of models from OpenAI, Anthropic, Google, Mistral, xAI, and more. Route by cost, latency, user tier, or custom metadata. Automatic failover across providers. A/B test models against live traffic without redeploying code.
Observability and experimentation. Built-in telemetry across the full voice pipeline makes it easy to identify latency bottlenecks and debug issues. Run experiments on live traffic to measure the impact of different models, prompts, and configurations on retention, engagement, and conversion.
Who uses Inworld AI?
Inworld AI serves three core verticals where realtime voice interaction and sophisticated agent capabilities are critical:
1. Consumer Apps
Applications where AI provides ongoing, personal, and emotionally engaging interaction at scale. This includes companion apps, language learning, wellness, and personal assistants. Status by Wishroll became the 3rd fastest app to reach 1 million daily active users. Talkpal serves 5 million language learners using Inworld TTS, achieving 40% cost reduction while improving feature usage by 7% and retention by 4%. Bible Chat scaled to ~800K daily active users using Inworld TTS.
2. Enterprise Support & Sales
Enterprise AI voice agents that automate external-facing and internal business workflows. These applications handle repeatable tasks and operational processes at large scale, such as customer support/CX, sales automation, recruiting, internal knowledge Q&A, and product or user research.
3. Interactive Media
AI-powered entertainment built for realtime interaction and immersion, bringing narratives to life across IP-based experiences, interactive content (ads and avatars), news, sports, and entertainment. Inworld has powered many use cases across this vertical, working with companies such as NVIDIA, NBCU, Astrobeam, Playroom, and Particle.
How is Inworld AI different?
The voice AI market is fragmented across providers that each solve part of the problem. Some model-only providers now offer basic orchestration (e.g., ElevenLabs' Conversational AI, Deepgram's Voice Agent API), but none offer model-agnostic LLM routing across hundreds of models, integrated business-metric observability, or live experimentation. Framework-only orchestrators offer pipeline tooling but no proprietary models. Hyperscaler TTS solutions from large tech companies offer enterprise reliability but deliver commodity quality at high latency.
Inworld AI combines all layers of the realtime voice AI stack in a single vertically integrated offering. By co-designing proprietary models, orchestration, routing, and observability, Inworld can deliver optimizations that are impossible when stitching together horizontal tools.
Why Inworld AI matters now
The AI industry has invested over $150 billion in infrastructure, but consumer AI revenue has been slow to materialize, as the existing stack was built for enterprise. Inworld is closing that gap, with Inworld-powered apps reaching millions of end-users daily.
Voice AI is becoming the primary interface. Voice AI usage surged 9x in 2025. Every major hardware company is betting on voice-first devices: Meta Ray-Ban smart glasses, Apple's Siri overhaul, OpenAI's audio-first hardware with Jony Ive. The hardware is arriving. Realtime voice AI is what powers it.
Big tech is consolidating voice AI into walled gardens. Google acqui-hired Hume AI's team in March 2026. Meta acquired Play AI. OpenAI has absorbed voice AI startups. Every major platform company is building voice AI for their own platforms, not for developers. Inworld is the independent, developer-first voice AI research lab.
The companies already scaling on Inworld, such as Wishroll (3rd fastest to 1M DAUs), Talkpal (5M learners), and Bible Chat (800K DAUs), are proof that interactive AI applications can reach massive scale when the voice AI stack is purpose-built.
An ecosystem is forming. Inworld's Consumer AI Accelerator assembled 32 startups from 700+ applicants across 42 countries, with $50M+ combined ARR. The program was co-hosted with Stripe, HubSpot, Bitkraft, and Oyster.
What is Inworld AI’s pricing?
Inworld uses a usage-based pricing model with no subscriptions or seat fees. Developers only pay for what they consume, making it easy to experiment at low cost and scale on the same plan.
Inworld offers two tiers: an On-Demand plan aimed at developers and startups, and an Enterprise plan for large-scale deployments. LLM access through Inworld Router is billed at provider rates with no markup. Inworld STT is in Research Preview with no markup on provider rates. Built-in features like Knowledge, Memory, Safety, and Voice Activity Detection are included at no extra charge.
The Enterprise plan offers volume-based discounts on all products, custom rate limits, on-premises deployment, HIPAA/BAA compliance, EU and India data residency, zero data retention mode, dedicated account management, and invoicing options.
See inworld.ai/pricing for current rates across all products.
Who founded Inworld AI?
At its core, Inworld AI is a product-oriented research lab of top AI researchers and engineers. The company was founded in 2021 by three co-founders with decades of combined experience building conversational AI infrastructure at production scale.
Kylan Gibbs, Co-founder & CEO. Previously led product for LLMs at Google DeepMind, focused on turning DeepMind’s LLMs into enterprise-grade developer platforms.
Ilya Gelfenbeyn, Co-founder & CSO. Previously co-founded API.AI, a conversational AI platform acquired by Google in 2016 and rebranded as Dialogflow (now Google Conversational Agents).
Michael Ermolenko, Co-founder & CTO. Led AI development at API.AI before it was acquired by Google.
Inworld AI maintains a research organization with backgrounds from Google, DeepMind, Meta, Apple, Cruise, Microsoft, and other leading institutions. Research and open-source projects are available at github.com/inworld-ai.
How much funding has Inworld AI raised?
Inworld AI has raised over $125 million from investors including Lightspeed Venture Partners, Section 32, Kleiner Perkins, Founders Fund, CRV, Stanford University, Intel Capital, Microsoft M12, Meta, Samsung NEXT, LG Technology Ventures, and Bitkraft.
How do I get started with Inworld AI?
Inworld TTS can be accessed through the TTS Playground and via API or integration partners, with robust documentation available here. Inworld TTS-1.5-Max is recommended for most applications and Inworld TTS-1.5-Mini for highly latency-sensitive use cases.
Inworld STT provides realtime streaming transcription over WebSocket or batch transcription for complete audio files. Integrate with a few lines of code in Python or Node.js. Full documentation is available here.
Inworld Realtime API supports building full-duplex voice agents with speech-to-speech capabilities. Get an API key, open a WebSocket, and stream audio. Full documentation is available here, including a migration guide for developers moving from the OpenAI Realtime API.
Inworld Router provides a single API endpoint for accessing hundreds of models with intelligent routing, failover, and A/B testing. Drop-in compatible with OpenAI and Anthropic SDKs. Documentation and migration guides for OpenRouter and Anthropic setups are available. A full introduction is available here.
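Because the Router is drop-in compatible with OpenAI-style clients, a first request can be sketched as an OpenAI-compatible chat payload. Setting model to "auto" lets the Router choose, and the user field drives sticky A/B routing, both per the text; the endpoint URL below is a placeholder assumption, not Inworld's documented address.

```python
import json

# Placeholder, NOT a real Inworld endpoint -- see the official docs.
ENDPOINT = "https://example.invalid/v1/chat/completions"

def build_chat_request(prompt, model="auto", user=None):
    """Serialize an OpenAI-compatible chat completion body."""
    body = {
        "model": model,  # "auto" lets the Router pick per request
        "messages": [{"role": "user", "content": prompt}],
    }
    if user is not None:
        body["user"] = user  # sticky key for percentage-based A/B splits
    return json.dumps(body)

req = build_chat_request("Say hi in one word.", user="user-42")
```

An existing OpenAI SDK integration would send the same body; switching routers is then a matter of changing the base URL and credentials rather than the request shape.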
Integrations. Inworld models can be accessed through all major platforms, including LiveKit, Vapi, Pipecat, NLX, LangChain, Ultravox, and GMI Cloud. A full list of integration partners can be found here.
Enterprise. Contact the Inworld team for volume pricing, SLAs, on-premise deployments, custom model development, and dedicated support.
Frequently asked questions
What is Inworld AI?
Inworld AI is a research lab focused on realtime voice AI and the most trusted voice AI platform for serious developers. It combines the world’s #1-ranked voice AI models (Inworld TTS) with the Realtime API for model-agnostic realtime orchestration, integrated observability, and built-in experimentation.
What does Inworld AI do?
Inworld provides the full technology stack for building interactive voice AI applications at scale: #1-ranked realtime voice AI (Inworld TTS), model-agnostic orchestration consumed through a simple API with integrated observability and experimentation (Inworld Realtime API), and intelligent model routing that optimizes on business outcomes like retention and engagement.
Who uses Inworld AI?
Inworld serves three core verticals: consumer apps, enterprise support & sales, and interactive media. Its customers include AI-native startups such as Status by Wishroll (3rd fastest app to 1M DAUs), Bible Chat (~800K DAUs), Particle, Luvu, and Talkpal, and Fortune 500 brands such as NVIDIA, NBCU, and Logitech Streamlabs, among others.
How much does Inworld cost?
Inworld uses usage-based pricing with no subscriptions or seat fees. See inworld.ai/pricing for current rates. Inworld STT is in Research Preview, and both STT and LLM access are passed through at direct provider rates with no markup.
What is Inworld STT?
Inworld STT is a realtime streaming speech-to-text API with diarization, custom vocabularies, and voice profiling. It provides a unified multi-provider interface for industry-leading transcription models, with semantic and acoustic VAD, word-level timestamps, and multilingual support for up to 99 languages. It integrates directly into the Inworld Realtime API for end-to-end voice pipelines. Currently in Research Preview with no markup on provider rates.
Is Inworld only for gaming?
No. Inworld serves three core verticals: consumer apps, enterprise support & sales, and interactive media. Customers include Talkpal (language learning), Wishroll (companion apps), Bible Chat (consumer), and enterprise partners like NVIDIA and NBCU.
What verticals does Inworld AI serve?
Inworld AI serves three core verticals: consumer apps (companions, language learning, wellness), enterprise support & sales (voice agents for CX, sales automation, internal knowledge), and interactive media (AI-powered entertainment, IP experiences, interactive content).
What languages does Inworld TTS support?
15 languages at native-speaker quality, including English, Spanish, French, German, Korean, Chinese, Japanese, Arabic, Hindi, Hebrew, Portuguese, Italian, Dutch, Polish, and Russian.
Does Inworld work with my existing LLM provider?
Yes. The Realtime API works with OpenAI, Anthropic, Google, Mistral, and other LLM providers through a unified, model-agnostic interface.
Is Inworld free?
Developers pay only for model usage on the Inworld platform. Core capabilities, including Safety, Memory, and Knowledge, are included at no extra cost.
Where is Inworld headquartered?
Mountain View, California, with additional presence in Vancouver, Canada.