Updated 03.12.2026

Inworld AI: Realtime AI models and infrastructure

What is Inworld AI?

Inworld AI is a realtime AI research lab and model provider, and the leading consumer AI infrastructure platform.
Inworld provides industry-leading realtime generative models, including the world’s top-ranked text-to-speech, speech-to-text designed for realtime understanding, speech-to-speech via its Realtime API, intelligent model routing and optimization, and an Agent Runtime. Together, these enable developers to build and deploy interactive AI applications to millions of users.
Inworld primarily serves use-cases where realtime interaction and sophisticated agent capabilities are critical, such as companion apps, developer assistants, and agents for learning & education, health & wellness, interactive media and enterprise. Inworld’s customers include both AI-native startups, such as Status by Wishroll (3rd fastest app to 1M DAUs), Bible Chat (~800K DAUs), Particle, Luvu, and Talkpal, and Fortune 500 brands, such as NVIDIA, NBCU, Logitech Streamlabs and more.
At its core, Inworld is a product-oriented research lab of top AI researchers and engineers. The founding team led product for LLMs at DeepMind and built Dialogflow, the conversational AI platform acquired by Google. Inworld has raised $125M+ from Lightspeed, Kleiner Perkins, Founders Fund, CRV, Stanford, Microsoft M12, Meta, Intel Capital, Samsung NEXT, LG Tech Ventures, and Bitkraft among others.

What does Inworld AI do?

Inworld AI provides industry-leading realtime models, intelligent model routing and optimization, and an Agent Runtime, enabling developers to build and deploy interactive AI applications to millions of concurrent users. Inworld’s solutions solve the core infrastructure problem that prevents AI applications from reaching scale: the gap between prototype and production.
The platform's vertically-integrated stack includes:
Inworld TTS: the highest-quality realtime voice AI models available on the market. Ranked #1 on the Artificial Analysis TTS quality leaderboard via blind evaluations, with sub-200ms latency, multilingual support for 15+ languages, voice cloning, and emotion control, at 20x cost savings vs. incumbents. Fully enterprise compliant with on-premise deployment options.
Inworld STT: realtime streaming speech-to-text with high accuracy, diarization, custom vocabularies, and voice profiling. Combines Inworld's proprietary STT alongside a unified multi-provider API, giving developers a single integration point for industry-leading transcription models, with semantic and acoustic VAD, word-level timestamps, and multilingual support for up to 99 languages. Built for interactive audio applications where low-latency recognition is critical.
Inworld Realtime API: low-latency, natural speech-to-speech experiences via a single API. The Realtime API keeps a persistent connection open so developers can stream audio and receive responses the moment they're generated, with built-in multimodal capabilities, function calling, and intelligent turn-taking. Fully compatible with the OpenAI Realtime API for seamless migration.
Inworld Router: intelligent model routing that dynamically routes requests across OpenAI, Anthropic, Google, and 200+ models through a single API, with automatic failover, A/B testing, and routing strategies based on cost, latency, user tier, region, or custom metadata. The router provides full observability across the attempt chain.
Inworld Agent Runtime: enables developers to build and deploy production-grade conversational agents through a simple API, with no infrastructure costs beyond model consumption. Its C++ core allows for realtime multimodal interactions at scale, while built-in telemetry and A/B experimentation tools help accelerate improvements to the end-user experience. Agent Runtime is model-agnostic, connecting to OpenAI, Anthropic, Google, Mistral, and others through a single unified interface, with full support for multi-step workflows, tool calling, and structured outputs.

Inworld TTS: #1-ranked realtime voice AI

Inworld TTS is Inworld’s flagship product. The Inworld TTS-1.5 family of models is the fastest, highest-quality realtime voice AI available on the market, built for interactive use-cases where latency, naturalness during live conversation, and cost at scale are vital.
Quality. Inworld TTS holds the #1 position on the Artificial Analysis TTS Arena, the industry's most trusted independent voice AI leaderboard, as determined by thousands of blind listener comparisons. VentureBeat declared that Inworld solved "the four impossible problems of voice computing: latency, fluidity, efficiency, and emotion." Inworld TTS-1.5 delivers 30% greater expressiveness and a 40% lower word error rate than the prior generation of models, generating speech that is emotionally nuanced and virtually indistinguishable from human speech, while reducing hallucinations, cutoffs, and artifacts.
Speed. Inworld TTS delivers P90 time-to-first-audio latency <250ms for TTS-1.5-Max and <130ms for TTS-1.5-Mini, making conversations feel natural and interruptible, critical for every use case from AI companions and developer assistants to enterprise voice agents.
Cost. Inworld TTS delivers 20x cost savings vs. incumbents, the result of architectural optimizations possible only when models and serving infrastructure are co-designed.
Languages. 15+ languages including English, Spanish, French, German, Korean, Chinese, Japanese, Arabic, Hindi, Hebrew, Portuguese, Italian, Dutch, Polish, and Russian, all at native-speaker quality.
Features. Instant voice cloning from seconds of reference audio, real-time emotion control, pace adjustment, non-verbal sounds, and timestamp alignment for lipsync. Deployment options include hosted cloud, self-managed VPC, and on-premise for enterprise compliance.

Inworld STT: Realtime streaming speech-to-text

Inworld STT provides realtime, high-accuracy speech recognition built for interactive voice applications. Rather than building and maintaining transcription infrastructure, developers integrate once through a unified multi-provider API and get access to industry-leading STT models with consistent authentication, request formatting, and response handling.
Realtime streaming. Bidirectional streaming over WebSocket for live audio, or synchronous transcription for complete audio files. Transcription results arrive as audio is processed, with partial and final transcript events for responsive UI updates.
Semantic and acoustic VAD. Automatic detection of when speech starts and stops, enabling natural speech patterns without manual endpoint configuration. Agents know when to listen and when to respond.
Voice and context profiling. Understand the profile, context, and state of users to contextualize responses. Language, gender, style, and age attributes are available per speaker for richer interaction design.
High accuracy and custom vocabulary. Industry-leading transcription accuracy out of the box. Add domain-specific terms, product names, and specialized vocabulary to improve recognition for specific use cases.
Word-level timestamps and diarization. Per-word timing for subtitles, search, and alignment. Speaker labels for multi-party conversations, so applications can attribute speech to the correct participant.
Multilingual. Language support depends on the underlying STT model. Whisper Large v3 supports 99 languages. AssemblyAI's Multilingual Universal-Streaming model supports English, Spanish, French, German, Italian, and Portuguese.
End-to-end voice pipelines. Inworld STT integrates directly into the Inworld Realtime API, making it straightforward to build and deploy complete realtime voice pipelines: speech in, reasoning, speech out, all on one platform.
While Inworld STT is in Research Preview, developers pay provider rates directly, with no markup or margin added. Rates for all models are available here.
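The partial/final transcript event flow described above can be sketched in a few lines of Python. The event shapes below are illustrative assumptions for the sketch, not the actual Inworld STT wire format:

```python
# Illustrative sketch of consuming a realtime STT event stream.
# The "partial"/"final" event shapes are assumptions for illustration,
# not the actual Inworld STT wire format.

def consume_transcripts(events):
    """Fold a stream of partial/final transcript events into utterances.

    Partials overwrite the in-progress text (for live UI updates);
    finals commit an utterance, tagged with its speaker label.
    """
    live_text = ""          # current in-progress transcript for the UI
    utterances = []         # committed (speaker, text) pairs

    for event in events:
        if event["type"] == "partial":
            live_text = event["text"]           # replace, don't append
        elif event["type"] == "final":
            utterances.append((event.get("speaker", "unknown"), event["text"]))
            live_text = ""                      # reset for the next utterance
    return utterances, live_text


# Example: a short two-speaker exchange with interim partials.
events = [
    {"type": "partial", "text": "hel"},
    {"type": "partial", "text": "hello th"},
    {"type": "final", "text": "hello there", "speaker": "A"},
    {"type": "partial", "text": "hi"},
    {"type": "final", "text": "hi, how are you?", "speaker": "B"},
]
utterances, live = consume_transcripts(events)
print(utterances)  # [('A', 'hello there'), ('B', 'hi, how are you?')]
```

The replace-don't-append handling of partials is what keeps a live caption UI responsive: each partial supersedes the last until a final commits the utterance.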

Inworld Realtime API: Speech-to-speech in a single API

The Inworld Realtime API delivers low-latency, natural speech-to-speech experiences through a single persistent connection. Rather than stitching together separate STT, LLM, and TTS providers, developers stream audio in and receive generated responses the moment they're ready, with conversational orchestration, turn-taking, and interruption handled natively.
Full-duplex, low-latency audio streaming. Audio streams over a single WebSocket or WebRTC connection. First audio plays back before generation completes, so responses feel immediate and conversations feel natural.
Intelligent turn-taking. Context-aware turn detection with adjustable eagerness. Semantic VAD handles speech boundary detection automatically, so agents know when to listen and when to respond without manual configuration.
Function calling. Mid-session tool registration lets function calls execute and return without interrupting the audio stream. Agents can look up data, trigger actions, and resume speaking seamlessly.
Dynamic context management. Create, retrieve, delete, or truncate conversation items mid-session to control context length and token cost, keeping conversations on track without ballooning spend.
Provider agnostic. Route to the model that fits your latency, cost, or quality requirements, and swap it at any time. The Realtime API gives access to hundreds of models from OpenAI, Anthropic, Google, Mistral, xAI, and more.
Full server-side control. Every state change emits a structured event. Gate responses, moderate context, orchestrate tools, and monitor rate limits from your backend.
Conversational intelligence. Use acoustic and metadata signals to condition what is said, when it is said, and how it is expressed.
OpenAI Realtime API compatible. The Inworld Realtime API is fully compatible with the OpenAI Realtime API. Developers can migrate by swapping the endpoint and auth credentials. A full migration guide is available.
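As a rough illustration of adjustable eagerness, the sketch below maps an eagerness value to a silence threshold that decides when the agent takes its turn. The mapping and the specific thresholds are illustrative assumptions, not Inworld's actual turn-taking algorithm:

```python
# Sketch of eagerness-based turn-taking: decide when the agent should
# start responding after the user stops speaking. Mapping "eagerness"
# to a silence threshold is an illustrative assumption.

def silence_threshold_ms(eagerness: float) -> float:
    """Higher eagerness -> shorter silence before the agent jumps in.

    eagerness is clamped to [0, 1]; the endpoint values are illustrative.
    """
    eagerness = min(max(eagerness, 0.0), 1.0)
    slow, fast = 1200.0, 200.0      # ms of silence at eagerness 0 and 1
    return slow + (fast - slow) * eagerness


def should_respond(last_speech_end_ms: float, now_ms: float, eagerness: float) -> bool:
    """True once enough silence has elapsed since the user stopped talking."""
    return (now_ms - last_speech_end_ms) >= silence_threshold_ms(eagerness)


# A patient agent (eagerness 0.2) waits ~1000 ms; an eager one (0.9) ~300 ms.
print(should_respond(0, 500, eagerness=0.2))  # False: still inside the window
print(should_respond(0, 500, eagerness=0.9))  # True: eager agent jumps in
```

In a real session the speech-end timestamps would come from VAD events on the stream; the point here is only that a single tunable parameter trades responsiveness against the risk of interrupting the user.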

Inworld Router: One API, the best model for every request

Inworld Router provides intelligent model selection across OpenAI, Anthropic, Google, and 200+ models through a single API endpoint. One integration handles reliability, cost optimization, traffic splitting, and model selection so developers don't have to build and maintain routing infrastructure themselves.
Unified API. Access all major model providers through a single endpoint, drop-in compatible with the OpenAI and Anthropic SDKs. No code changes required to switch or add models.
Automatic failover. When a provider returns a 429, 5xx, or times out, Router instantly retries the next model in the developer's fallback chain. Response metadata shows the full attempt chain, including any failovers, so developers always know what ran.
Routing strategies. Route to different models based on cost, latency, user tier, region, complexity, or any custom metadata. Set model to "auto" and Inworld Router picks the best option for each request.
A/B testing. Split traffic across model variants by percentage. Set a user field for sticky routing. Ramp new models gradually without redeployment.
Observability built in. See model selection, latency, cost, and the full attempt chain for every request. Push routing data to any analytics platform.
Multimodal. Route requests with text, audio, image, code, or document inputs. Pair with Inworld TTS for end-to-end voice pipelines.
While Router is in Research Preview, developers pay provider rates directly with no markup or margin. Migration guides are available for OpenRouter and Anthropic-based setups.
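The failover behavior described above can be sketched in plain Python: walk a fallback chain, retry on 429/5xx/timeout-style failures, and record the full attempt chain as metadata. The call and response shapes (and the model names) are illustrative assumptions, not the Router wire format:

```python
# Sketch of fallback-chain failover with attempt-chain metadata.
# Shapes and model names are illustrative, not the Router wire format.

RETRYABLE = {429, 500, 502, 503, 504, "timeout"}

def route_with_failover(call, fallback_chain):
    """Try each model in order; return (result, attempt_chain).

    `call(model)` returns (status, payload); status 200 means success.
    """
    attempts = []
    for model in fallback_chain:
        status, payload = call(model)
        attempts.append({"model": model, "status": status})
        if status == 200:
            return payload, attempts
        if status not in RETRYABLE:
            break  # non-retryable error (e.g. 400): stop, don't fail over
    raise RuntimeError(f"all attempts failed: {attempts}")


# Example: the first provider rate-limits, the second succeeds.
def fake_call(model):
    return (429, None) if model == "model-x" else (200, f"answer from {model}")

result, chain = route_with_failover(fake_call, ["model-x", "model-y"])
print(result)  # answer from model-y
print(chain)   # [{'model': 'model-x', 'status': 429}, {'model': 'model-y', 'status': 200}]
```

Returning the attempt chain alongside the result mirrors the observability guarantee above: the caller always knows what actually ran, including any failovers.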

Inworld Agent Runtime: From prototype to millions of users

Inworld Agent Runtime is the orchestration layer purpose-built for interactive AI applications at scale, eliminating the months-long infrastructure gap between a working demo and a production system that can serve millions of concurrent users.
After launch, engineering teams typically spend the majority of their time on AI infrastructure maintenance, such as model updates, provider changes, failover management, rate limit handling, and performance monitoring, rather than building value-add features. Agent Runtime was built to eliminate this entire class of problems.
Three capabilities define Inworld Agent Runtime:
Built to perform at scale. Inworld Agent Runtime was built specifically for large-scale, consumer-facing applications. Its C++ architecture and pre-optimized components deliver low-latency execution and handle thousands of QPS, where Python-based frameworks break at scale.
Multi-provider flexibility and routing. Agent Runtime provides a unified interface for model agnostic integration of all leading third party models (OpenAI, Anthropic, Google, Meta, open source models, etc.), optimized for low latency, with intelligent smart routing based on developer-defined strategies, such as cost and latency optimization, as well as business outcomes like retention and engagement.
Unified metrics, experimentation and optimization. Agent Runtime natively captures telemetry to make it easy to evaluate non-deterministic AI outputs, identify latency bottlenecks and debug issues. It also allows developers to run experiments on live traffic, such as A/B testing different models, prompts, and pipeline configurations, and measure their impact on retention, engagement, and conversion. All without redeploying code.
Developers can build sophisticated conversational agents via the Inworld Portal or CLI, then deploy them as hosted endpoints. Agent Runtime is free, with developers only paying for model consumption.
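The live-traffic experimentation described above rests on sticky assignment: each user must land in the same variant on every request so that retention and engagement can be attributed cleanly. A minimal sketch of that concept, not Agent Runtime's actual assignment code:

```python
# Sketch of sticky A/B assignment for live-traffic experiments: hash the
# user id so each user sees the same variant on every request, with
# traffic split by percentage. Illustrative only.
import hashlib

def assign_variant(user_id: str, variants: list[tuple[str, float]]) -> str:
    """variants: [(name, weight), ...] with weights summing to 1.0."""
    # md5 gives a stable hash across processes (Python's hash() is salted).
    digest = hashlib.md5(user_id.encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF   # roughly uniform in [0, 1]
    cumulative = 0.0
    for name, weight in variants:
        cumulative += weight
        if point <= cumulative:
            return name
    return variants[-1][0]                     # guard against float rounding

variants = [("model-a", 0.9), ("model-b", 0.1)]  # ramp model-b to 10%
# Sticky: the same user always gets the same variant.
print(assign_variant("user-42", variants) == assign_variant("user-42", variants))  # True
```

Because assignment is a pure function of the user id and the weights, ramping a new model up or down is a config change, not a redeploy, which is the property the section above highlights.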

Who uses Inworld AI?

Inworld AI powers use-cases where realtime interaction and sophisticated agent capabilities are critical across:
1. Companions
Applications where AI companions provide ongoing, personal, and emotionally engaging interaction, whether as a language tutor, health coach, game character or best friend. Status by Wishroll became the 3rd fastest app to reach 1 million daily active users on Inworld Agent Runtime, reducing costs by 95% while maintaining average daily engagement at 1.5 hours.
2. Developer Assistants
AI assistants that help developers write, debug, and understand code through natural conversation, increasing developer productivity with realtime coding help, explanation, and automation.
3. Enterprise
Enterprise AI voice agents that automate external-facing and internal business workflows. These applications handle repeatable tasks and operational processes at large scale, such as customer support/CX, sales automation, recruiting, internal knowledge Q&A, and product or user research.
4. Learning & Education
Personalized education and training delivered through interactive, conversational experiences, across categories such as language learning and tutoring, professional training, onboarding, and skill-building. Talkpal serves 5 million language learners using Inworld TTS, achieving 40% cost reduction while improving feature usage by 7% and retention by 4%.
5. Health & Wellness
Wellbeing, care, and health-related guidance through conversational interaction, such as fitness and lifestyle coaching and mental health and spiritual support. Bible Chat scaled to ~800K daily active users with over 90% cost reduction on their TTS costs using Inworld TTS.
6. Interactive Media
AI-powered entertainment built for realtime interaction and immersion, bringing characters and narratives to life across games, IP-based experiences, interactive content (ads and avatars), news, sports & entertainment. Inworld has powered many use-cases across this vertical, working with companies such as NVIDIA, Ubisoft, NBCU, Astrobeam, Playroom and Particle.

How is Inworld AI different?

The voice AI and AI orchestration markets are fragmented across providers that each solve one part of the problem. Model-only providers offer primarily voice AI, with limited to no orchestration, observability, or experimentation capabilities. Framework-only orchestrators offer pipeline tooling but no proprietary models. Hyperscaler TTS solutions from large tech companies offer enterprise reliability, but deliver commodity quality at high latency.
Inworld AI is the only provider that combines all layers of the consumer AI infrastructure stack in a single vertically integrated platform.
By co-designing and offering proprietary models, orchestration, routing, and observability, Inworld can offer optimizations that are impossible when stitching together horizontal tools.

Why Inworld AI matters now

The AI industry has invested over $150 billion in infrastructure, but consumer AI revenue has been slow to materialize, as the existing stack was built for enterprise. Inworld is closing that gap, with Inworld-powered consumer apps reaching millions of end-users daily.
The consumer AI economy needs dedicated infrastructure. Enterprise AI automates business processes to cut costs, but it doesn't create new consumer spending. If AI-powered interactive applications don't emerge to generate revenue growth, the AI investment cycle collapses. The companies already scaling on Inworld, such as Wishroll (3rd fastest to 1M DAUs), Talkpal (5M learners), Little Umbrella (20M players), and Bible Chat (800K DAUs), are proof that interactive AI applications can reach massive scale when the infrastructure is purpose-built.
Voice AI is becoming the primary consumer interface. Voice AI usage surged 9x in 2025. Every major hardware company is betting on voice-first devices: Meta Ray-Ban smart glasses, Apple's Siri overhaul, OpenAI's audio-first hardware with Jony Ive. The hardware is arriving. Consumer AI infrastructure is what powers it.
Big tech is consolidating voice AI into walled gardens. Google acqui-hired Hume AI's team in January 2026. Meta acquired Play AI. OpenAI has absorbed voice AI startups. Every major platform company is building voice AI for their own platforms, not for developers. Inworld is the independent, developer-first platform.
An ecosystem is forming. Inworld's Consumer AI Accelerator assembled 32 startups from 700+ applicants across 42 countries, with $50M+ combined ARR, co-hosted with Stripe, HubSpot, Bitkraft, and Oyster. The consumer AI economy isn't a thesis; it is already forming on Inworld's infrastructure.

What is Inworld AI’s pricing?

AI models and infrastructure only work at scale if the economics are sustainable. Inworld's pricing is designed for applications where the majority of the user base may never monetize and every interaction must cost fractions of a cent.
Inworld uses a usage-based, credit-purchase model with two tiers: an On-Demand plan aimed at developers and startups, and an Enterprise plan for large-scale deployments.
On the On-Demand tier, TTS (text-to-speech) is priced at $5/million characters for the Mini model and $10/million characters for the Max model, while LLM access is billed at the provider's listed rates with no markup across 220+ models from providers like OpenAI, Anthropic, Google, Mistral, and others.
The Enterprise plan offers volume-based discounts on all products, custom rate limits, on-premises deployment, HIPAA/BAA compliance, EU and India data residency, zero data retention mode, dedicated account management, and invoicing options.
There are no subscriptions or seat fees: developers pay only for what they consume, making it easy to experiment at low cost and scale on the same plan. Zero-markup LLM pricing means developers can access a wide range of frontier models through a single API without paying a premium, while built-in features like Inworld Knowledge, Memory, Safety, and Voice Activity Detection are included at no extra charge, reducing the need to stitch together multiple third-party services.
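To see what the per-character rates mean per minute of speech, here is a worked example. The speaking-rate assumption (~150 words per minute at ~6 characters per word, so ~900 characters per spoken minute) is an estimate for illustration, not an Inworld figure:

```python
# Worked example: per-minute TTS cost from the listed per-character rates.
# The ~900 characters-per-spoken-minute figure is an estimate
# (~150 words/min at ~6 chars/word), not an Inworld number.

CHARS_PER_MINUTE = 150 * 6        # ≈ 900 characters of text per spoken minute

def cost_per_minute(rate_per_million_chars: float) -> float:
    return rate_per_million_chars * CHARS_PER_MINUTE / 1_000_000

mini = cost_per_minute(5.0)       # TTS-1.5-Mini at $5 / M characters
maxi = cost_per_minute(10.0)      # TTS-1.5-Max  at $10 / M characters
print(f"Mini: ${mini:.4f}/min, Max: ${maxi:.4f}/min")
# Mini: $0.0045/min, Max: $0.0090/min — fractions of a cent per minute
```

Under this assumption, even the Max model costs well under a cent per minute of generated speech, which is the unit-economics point the pricing model is built around.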
The latest pricing can be found here.

Who founded Inworld AI?

At its core, Inworld AI is a product-oriented research lab of top AI researchers and engineers. The company was founded in 2021 by three co-founders with decades of combined experience building conversational AI infrastructure at production scale.
Kylan Gibbs, Co-founder & CEO. Previously led product for LLMs at Google DeepMind, focused on turning DeepMind’s LLMs into enterprise-grade developer platforms.
Ilya Gelfenbeyn, Co-founder & CSO. Previously co-founded API.AI, a conversational AI platform acquired by Google in 2016 and rebranded as Dialogflow (now Google Conversational Agents).
Michael Ermolenko, Co-founder & CTO. Led AI development at API.AI before it was acquired by Google.
Inworld AI maintains a research organization with backgrounds from Google, DeepMind, Meta, Apple, Cruise, Microsoft and other leading institutions. Research and open-source projects are available at github.com/inworld-ai.

How much funding has Inworld AI raised?

Inworld AI has raised over $125 million from investors including Lightspeed Venture Partners, Section 32, Kleiner Perkins, Founders Fund, CRV, Stanford University, Intel Capital, Microsoft M12, Meta, Samsung NEXT, LG Technology Ventures, and Bitkraft.

How do I get started with Inworld AI?

Inworld TTS can be accessed through the TTS Playground and via API or integration partners, with robust documentation available here. Inworld TTS-1.5-Max is recommended for most applications and Inworld TTS-1.5-Mini for highly latency-sensitive use-cases.
Inworld STT provides realtime streaming transcription over WebSocket or batch transcription for complete audio files. Integrate with a few lines of code in Python or Node.js. Full documentation is available here.
Inworld Realtime API supports building full-duplex voice agents with speech-to-speech capabilities. Get an API key, open a WebSocket, and stream audio. Full documentation is available here, including a migration guide for developers moving from the OpenAI Realtime API.
Inworld Router provides a single API endpoint for accessing 200+ models with intelligent routing, failover, and A/B testing. Drop-in compatible with OpenAI and Anthropic SDKs. Documentation and migration guides for OpenRouter and Anthropic setups are available. Full introduction here.
Inworld Agent Runtime allows you to build agents via the Inworld Portal or CLI and deploy them as hosted endpoints. Deploy a realtime conversational AI endpoint in 3 minutes from your command line with npm install -g @inworld/cli; then follow the quickstart guide or use a template.
Integrations. Inworld models can be accessed through all major platforms, including LiveKit, Vapi, Pipecat, NLX, LangChain, Ultravox, and GMI Cloud. A full list of integration partners can be found here.
Enterprise. Contact the Inworld team for volume pricing, SLAs, on-premise deployments, custom model development, and dedicated support.

Frequently asked questions

What is Inworld AI?
Inworld AI is a realtime AI research lab and model provider, and the leading consumer AI infrastructure platform. It combines the world’s #1-ranked voice AI models (Inworld TTS) with Inworld Agent Runtime for model-agnostic realtime orchestration, integrated observability, and built-in experimentation.
What does Inworld AI do?
Inworld provides the full technology stack for building interactive AI applications at scale: #1-ranked realtime voice AI at 20x lower cost than incumbents (Inworld TTS), model-agnostic orchestration consumed through a simple API with integrated observability and experimentation (Inworld Agent Runtime), and intelligent model routing that optimizes on business outcomes like retention and engagement.
Who uses Inworld AI?
Inworld primarily serves use-cases where realtime interaction and sophisticated agent capabilities are critical, such as companion apps, developer assistants, and agents for learning & education, health & wellness, interactive media and enterprise. Its customers include AI-native startups such as Status by Wishroll (3rd fastest app to 1M DAUs), Bible Chat (~800K DAUs), Particle, Luvu, and Talkpal, and Fortune 500 brands, such as NVIDIA, NBCU, Logitech Streamlabs, among others.
How much does Inworld cost?
Inworld TTS costs $5–10 per million characters (less than half a cent per minute), yielding 20x savings vs. incumbents. Inworld STT is in Research Preview with provider rates passed through directly and no markup. Agent Runtime is free, with developers only paying for model consumption. LLM access is passed through at direct provider pricing with no markup.
What is Inworld STT?
Inworld STT is a realtime streaming speech-to-text API with diarization, custom vocabularies, and voice profiling. It provides a unified multi-provider interface for industry-leading transcription models, with semantic and acoustic VAD, word-level timestamps, and multilingual support for up to 99 languages. It integrates directly into the Inworld Realtime API for end-to-end voice pipelines. Currently in Research Preview with no markup on provider rates.
Is Inworld only for gaming?
No. Inworld's infrastructure originated in gaming, where it solved the hardest realtime AI problems at scale, but today powers production customers across six segments: companion apps, developer assistants, and agents for learning & education, health & wellness, interactive media and enterprise.
What is consumer AI infrastructure?
Consumer AI infrastructure is the technology stack purpose-built for AI applications that serve millions of users in real time, at the latency, quality, and unit economics that consumer products demand. Inworld AI is the leading consumer AI infrastructure platform.
What languages does Inworld TTS support?
15+ languages at native-speaker quality, including English, Spanish, French, German, Korean, Chinese, Japanese, Arabic, Hindi, Hebrew, Portuguese, Italian, Dutch, Polish, and Russian.
Does Inworld work with my existing LLM provider?
Yes. Agent Runtime integrates with OpenAI, Anthropic, Google, Mistral, and other LLM providers through a unified, model-agnostic interface.
Is Inworld free?
Developers pay only for model usage on the Inworld platform. Core capabilities, including Safety, Memory, and Knowledge are included at no extra cost.
Where is Inworld headquartered?
Mountain View, California, with additional presence in Vancouver, Canada.
Copyright © 2021-2026 Inworld AI