Thinking Machines Lab released TML-Interaction-Small on May 11, 2026, introducing a new category they call "interaction models." Built by a team led by Mira Murati (former OpenAI CTO), the model processes audio, video, and text simultaneously and can listen while speaking. Inworld AI takes a different architectural approach: a modular pipeline with the #1-ranked TTS on Artificial Analysis, access to hundreds of LLMs through the Realtime Router, and a production-ready Realtime API available today. Both approaches solve the same problem from different angles, and both have real trade-offs worth understanding.
What Are Thinking Machines Interaction Models?
Thinking Machines Lab defines "interaction models" as AI systems designed from the ground up for real-time human interaction rather than text generation. Their first release, TML-Interaction-Small, is a 276B parameter mixture-of-experts (MoE) model with 12B active parameters at any given time.
The key architectural idea: the model processes input in continuous 200ms chunks rather than discrete conversational turns. Audio, video, and text flow into the model simultaneously, and the model can generate spoken output while still processing incoming audio. This is different from turn-based voice APIs where one party speaks at a time.
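To make the 200ms chunking concrete, here is a minimal sketch that slices a raw PCM buffer into fixed-duration frames. The audio format (16 kHz, 16-bit mono) and the function name `chunk_pcm` are assumptions for illustration only; Thinking Machines has not published its input format.

```python
from typing import Iterator


def chunk_pcm(pcm: bytes, sample_rate: int = 16_000, chunk_ms: int = 200,
              bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield consecutive fixed-duration frames from a mono PCM buffer.

    The final frame may be shorter if the buffer is not a whole number of chunks.
    """
    samples_per_chunk = sample_rate * chunk_ms // 1000
    chunk_bytes = samples_per_chunk * bytes_per_sample
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# One second of silence at the assumed format: 16,000 samples * 2 bytes each.
one_second = bytes(16_000 * 2)
frames = list(chunk_pcm(one_second))
print(len(frames), len(frames[0]))  # prints "5 6400"
```

Under these assumptions, each 200ms frame is 3,200 samples (6,400 bytes), so the model would see five frames per second of audio rather than one complete utterance per turn.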
A background reasoning model handles complex tasks like tool calls and multi-step thinking asynchronously, without blocking the real-time conversation loop.
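The non-blocking pattern described above can be sketched with `asyncio`. This is a generic illustration of running slow work alongside a real-time loop, not Thinking Machines' implementation; `slow_tool_call`, the tick count, and the sleep durations are hypothetical stand-ins.

```python
import asyncio


async def slow_tool_call() -> str:
    """Stand-in for a background reasoning step (hypothetical)."""
    await asyncio.sleep(0.03)
    return "tool result"


async def conversation(ticks: int = 10) -> list[str]:
    log: list[str] = []
    # Start the slow work as a task so the loop below is not blocked by it.
    task = asyncio.create_task(slow_tool_call())
    reported = False
    for i in range(ticks):
        log.append(f"chunk {i}")       # real-time work continues every tick
        if task.done() and not reported:
            log.append(task.result())  # fold the result in once it is ready
            reported = True
        await asyncio.sleep(0.01)      # stand-in for waiting on the next chunk
    if not reported:
        log.append(await task)         # make sure the result is never dropped
    return log


log = asyncio.run(conversation())
print(log)
```

The loop keeps handling chunks while the tool call is in flight, and the result is merged into the conversation on whichever tick it happens to become available.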
On Thinking Machines' self-reported benchmarks:
- Turn-taking latency: 0.40s (vs GPT-Realtime-2 at 1.18s, Gemini 3.1 Flash Live at 0.57s)
- FD-bench V1.5 interaction quality: 77.8 (vs GPT-Realtime-2 at 46.8, Gemini 3.1 Flash Live at 45.5)
These numbers are impressive if they hold up. The caveat: they have not been independently verified as of May 2026. FD-bench is a relatively new benchmark and hasn't seen broad adoption across the industry. Independent testing by third parties will be important for validating these results.
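For a sense of scale, the self-reported figures above work out to roughly 3x lower turn-taking latency than GPT-Realtime-2 and about 1.4x lower than Gemini 3.1 Flash Live. The short script below reproduces that arithmetic from the quoted numbers; it only restates the vendor's own figures and says nothing about whether they hold up.

```python
# Self-reported figures quoted above (not independently verified).
latency_s = {"TML-Interaction-Small": 0.40, "GPT-Realtime-2": 1.18,
             "Gemini 3.1 Flash Live": 0.57}
fd_bench = {"TML-Interaction-Small": 77.8, "GPT-Realtime-2": 46.8,
            "Gemini 3.1 Flash Live": 45.5}

baseline = "TML-Interaction-Small"
for name in latency_s:
    if name == baseline:
        continue
    ratio = latency_s[name] / latency_s[baseline]  # how many times slower
    gap = fd_bench[baseline] - fd_bench[name]      # FD-bench points behind
    print(f"{name}: {ratio:.2f}x the latency, {gap:.1f} points lower on FD-bench")
```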
The model is currently in research preview with limited partner access. Wider availability is planned for later in 2026.
How Does Thinking Machines Compare to OpenAI Realtime?
Both OpenAI's Realtime API (GPT-Realtime-2) and Thinking Machines' TML-Interaction-Small use natively multimodal architectures, but they differ in scope and availability.
Architecture. GPT-Realtime-2 is natively multimodal but still operates in a turn-based paradigm. One party speaks, the other listens. TML-Interaction-Small processes in continuous 200ms chunks, allowing the model to listen and speak simultaneously. This is a meaningful architectural difference if your use case requires overlapping speech.
Latency. Thinking Machines reports 0.40s turn-taking latency versus 1.18s for GPT-Realtime-2 (self-reported, not independently verified). If confirmed, that is a significant improvement.
Availability. GPT-Realtime-2 is generally available. TML-Interaction-Small is in research preview with limited partner access.
Protocol. OpenAI's Realtime API uses WebSocket and WebRTC with a well-documented event schema that has become a de facto standard. Other providers (including Inworld and xAI) have built compatible implementations. Thinking Machines has not published protocol documentation for external developers yet.
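As a rough sketch of what that event schema looks like on the wire, the snippet below builds two JSON client messages without opening a connection. The event names (`input_audio_buffer.append`, `response.create`) follow OpenAI's published Realtime client events as best understood at the time of writing; treat the exact payload fields, the helper function names, and the assumed audio format as assumptions, and check the current documentation of whichever provider you target.

```python
import base64
import json


def audio_append_event(pcm: bytes) -> str:
    """Serialize one audio frame as an `input_audio_buffer.append` event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        # Audio travels as base64 text because the message is JSON.
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


def response_create_event() -> str:
    """Ask the server to start generating a response."""
    return json.dumps({"type": "response.create"})


frame = bytes(6400)  # 200 ms of silence, assuming 16 kHz 16-bit mono
message = audio_append_event(frame)
decoded = json.loads(message)
print(decoded["type"], len(base64.b64decode(decoded["audio"])))
```

Because compatible providers accept the same message shapes, a client that serializes events this way can in principle switch endpoints without changing its message-building code.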
Ecosystem. OpenAI's Realtime API integrates with the broader OpenAI ecosystem (function calling, MCP, image inputs). Thinking Machines' ecosystem is nascent.
How Does Inworld's Approach Differ from Interaction Models?
This is the core architectural question: native multimodal vs modular pipeline. Both are valid. The trade-offs are real.
Thinking Machines: native multimodal. A single model handles everything. Audio understanding, reasoning, and audio generation happen inside one neural network. The advantage is no inter-component latency. The model can potentially achieve tighter coupling between what it hears and how it responds. The disadvantage is that you're locked to that one model for every part of the pipeline.
Inworld: modular pipeline. Separate best-in-class components for each stage. Realtime TTS ranks #1 on the Artificial Analysis Speech Arena (~1,208 ELO). Inworld holds 2 of the top 5 positions on that leaderboard. The Realtime Router accesses hundreds of LLMs from OpenAI, Anthropic, Google, and other providers through a single API key. The Realtime API handles VAD, turn detection, and interruption natively over WebSocket and WebRTC.
The advantage of modular: you can swap any component independently. When a better LLM ships, route to it. When Thinking Machines models become available through an API, route to those too. You're not locked to a single vendor's model for reasoning, voice quality, or speech recognition.
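The swap-one-component idea can be sketched in a few lines. This is a generic illustration of a modular pipeline, not Inworld's API: `VoicePipeline` and every stage function below are hypothetical stand-ins, and real stages would call provider services.

```python
from dataclasses import dataclass, replace
from typing import Callable

# Each stage is just a callable, so any one can be swapped independently.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass(frozen=True)
class VoicePipeline:
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)
        reply = self.llm(transcript)
        return self.tts(reply)


# Hypothetical stand-ins for real STT, LLM, and TTS services.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def model_a(prompt: str) -> str:
    return f"A:{prompt}"


def model_b(prompt: str) -> str:
    return f"B:{prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=model_a, tts=fake_tts)
upgraded = replace(pipeline, llm=model_b)  # swap only the LLM stage

print(pipeline.respond(b"hello"), upgraded.respond(b"hello"))
```

The upgraded pipeline reuses the same STT and TTS stages, which is the property the modular approach trades inter-component latency for.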
Which Voice AI Approach Should I Use?
The honest answer depends on what you're building and when you need it.
Choose Thinking Machines if:
- You need simultaneous listen-and-speak capability (overlapping conversation, not turn-based)
- Your use case benefits from native video input alongside audio
- You have access to the research preview and can tolerate pre-production stability
- You're willing to wait for wider availability and build on an unproven ecosystem
Choose Inworld if:
- You need to ship to production today, not later in 2026
- You want the highest-ranked TTS quality available. Realtime TTS ranks #1 on the Artificial Analysis Speech Arena, and Inworld holds 2 of the top 5 positions
- You want model flexibility. The Realtime Router lets you swap LLMs without changing your integration. If Thinking Machines models become available through an API, you could route to them too
- You want OpenAI protocol compatibility for easy migration from existing integrations
- You need WebSocket and WebRTC transport options
- Your application requires proven compliance (SOC 2 Type II, GDPR)
Choose OpenAI Realtime if:
- You're already deep in the OpenAI ecosystem and want the tightest model integration
- You need the most mature production track record in this category
- Your application requires 50+ languages
Choose Gemini Flash Live if:
- You need 90+ language support
- You're building on Google Cloud and want deep GCP integration
- Native video input matters for your use case
What Happens When Interaction Models Become Widely Available?
This is where architectural choices matter most. If Thinking Machines delivers on their benchmarks and ships a production API, teams locked into a single-model architecture will face a hard choice: rebuild on the new model, or stay on the old one.
Teams using a modular, model-agnostic architecture avoid that problem entirely. The Inworld Realtime Router already routes to hundreds of models. Adding a new provider is a routing decision, not a re-architecture. You could use Thinking Machines for reasoning while keeping Realtime TTS for voice quality, or route different user segments to different models for A/B testing.
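Routing user segments to different models for A/B testing usually comes down to a deterministic assignment function. The sketch below shows one common approach, hashing the user ID into a percentage bucket. It is not the Realtime Router's mechanism, and the model labels and the `assign_model` helper are hypothetical.

```python
import hashlib


def assign_model(user_id: str, variants: dict[str, int]) -> str:
    """Deterministically map a user to a variant, weighted by percentage.

    `variants` maps a model label to its share of traffic; shares must sum to 100.
    """
    if sum(variants.values()) != 100:
        raise ValueError("variant shares must sum to 100")
    # Hash the user id so the same user always lands in the same bucket.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    threshold = 0
    for label, share in variants.items():
        threshold += share
        if bucket < threshold:
            return label
    raise AssertionError("unreachable when shares sum to 100")


# Hypothetical labels; real model identifiers depend on the router's catalog.
split = {"current-llm": 90, "candidate-llm": 10}
assignments = [assign_model(f"user-{i}", split) for i in range(1000)]
print(assignments.count("candidate-llm"), "of 1000 users see the candidate")
```

Because the assignment depends only on the user ID and the split, a returning user keeps talking to the same model for the duration of the experiment.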
This is the core argument for modular over native multimodal: the best model today probably won't be the best model in six months. An architecture that lets you adopt new models without rebuilding your integration compounds its advantage over time.
The Bottom Line
Thinking Machines' interaction models represent a genuinely novel approach to voice AI. Processing audio in continuous 200ms chunks and enabling overlapping speech is architecturally interesting. The self-reported benchmarks, if independently confirmed, would represent a meaningful improvement in turn-taking latency.
But "architecturally interesting" and "production-ready" are different things. As of May 2026, TML-Interaction-Small is in research preview with limited access, no public API documentation, and unverified benchmarks. That will change, potentially quickly. Mira Murati's track record at OpenAI suggests this team can execute.
If you need to build and ship voice AI today, the Inworld Realtime API is available in production with the #1-ranked TTS, access to hundreds of LLMs, and an architecture that can integrate new models, including Thinking Machines, as they become available.
FAQs
What are Thinking Machines interaction models?
Interaction models are a new category of AI model from Thinking Machines Lab, founded by Mira Murati. The first release, TML-Interaction-Small, is a 276B parameter mixture-of-experts model (12B active parameters) that processes audio, video, and text simultaneously in continuous 200ms chunks. Unlike traditional voice APIs that process speech in discrete turns, interaction models can listen while speaking. As of May 2026, the model is in research preview with limited partner access.
How does TML-Interaction-Small compare to GPT-Realtime-2?
On Thinking Machines' self-reported benchmarks, TML-Interaction-Small shows 0.40s turn-taking latency versus 1.18s for GPT-Realtime-2, and scores 77.8 on FD-bench V1.5 versus 46.8. These benchmarks have not been independently verified. GPT-Realtime-2 is GA in production, while TML-Interaction-Small remains in research preview.
What is the difference between native multimodal and modular voice AI?
Native multimodal models like TML-Interaction-Small process audio and text inside a single model with no separate STT or TTS stage. Modular pipelines like Inworld's Realtime API chain best-in-class components: a dedicated STT model, an LLM router that accesses hundreds of models, and the #1-ranked TTS on Artificial Analysis. Native multimodal can achieve lower theoretical latency. Modular pipelines let you swap any component independently.
Can I use Thinking Machines interaction models in production today?
Not yet. TML-Interaction-Small is in research preview with limited partner access as of May 2026. Wider availability is planned for later in 2026. For a production-ready voice AI API available today, the Inworld Realtime API supports WebSocket and WebRTC with OpenAI protocol compatibility.
Could Inworld integrate Thinking Machines models?
Yes. Inworld's architecture is model-agnostic. The Realtime Router already routes to hundreds of models. When Thinking Machines models become available through an API, they could be added as another routing option without requiring changes to your integration code.
Who founded Thinking Machines Lab?
Thinking Machines Lab was founded by Mira Murati, former CTO of OpenAI. The company focuses on building AI models optimized for real-time human interaction.
What benchmarks does TML-Interaction-Small report?
Thinking Machines Lab reports 0.40s turn-taking latency and 77.8 on FD-bench V1.5 interaction quality. They compare against GPT-Realtime-2 (1.18s, 46.8) and Gemini 3.1 Flash Live (0.57s, 45.5). These are self-reported numbers and have not been independently verified as of May 2026.