Thinking Machines Interaction Models vs Alternatives (2026)

Q: Can I use Thinking Machines interaction models in production today?

Not yet. As of May 2026, TML-Interaction-Small is in research preview with limited partner access. Thinking Machines Lab has announced wider availability for later in 2026, but has not confirmed a specific date. If you need a production-ready voice AI API today, Inworld's Realtime API is available with expressive, realtime-optimized Realtime TTS, OpenAI protocol compatibility, support for image content parts, and Router access to 220+ LLMs through a single endpoint.

Thinking Machines Lab released TML-Interaction-Small on May 11, 2026, introducing a new category they call "interaction models." Built by a team led by Mira Murati (former OpenAI CTO), the model processes audio, video, and text simultaneously and can listen while speaking. Inworld AI takes a different architectural approach: a modular pipeline with expressive, realtime-optimized Realtime TTS, Router access to 220+ LLMs, and a production-ready Realtime API available today. Both approaches solve the same problem from different angles, and both have real trade-offs worth understanding.

What Are Thinking Machines Interaction Models?

Thinking Machines Lab defines "interaction models" as AI systems designed from the ground up for real-time human interaction rather than text generation. Their first release, TML-Interaction-Small, is a 276B parameter mixture-of-experts (MoE) model with 12B active parameters at any given time.

The key architectural idea: the model processes input in continuous 200ms chunks rather than discrete conversational turns. Audio, video, and text flow into the model simultaneously, and the model can generate spoken output while still processing incoming audio. This is different from turn-based voice APIs where one party speaks at a time.
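To make the chunking idea concrete, here is a minimal sketch of slicing a PCM audio stream into fixed 200ms windows. The 16 kHz sample rate and the function names are assumptions for illustration only; Thinking Machines has not published its audio format or input pipeline.

```python
from typing import Iterator, Sequence

CHUNK_MS = 200  # chunk duration stated in the article


def samples_per_chunk(sample_rate_hz: int, chunk_ms: int = CHUNK_MS) -> int:
    """Number of PCM samples in one chunk at the given sample rate."""
    return sample_rate_hz * chunk_ms // 1000


def iter_chunks(samples: Sequence[int], sample_rate_hz: int,
                chunk_ms: int = CHUNK_MS) -> Iterator[Sequence[int]]:
    """Yield consecutive fixed-duration chunks; the last one may be shorter."""
    step = samples_per_chunk(sample_rate_hz, chunk_ms)
    for start in range(0, len(samples), step):
        yield samples[start:start + step]


# One second of silence at an assumed 16 kHz sample rate.
one_second = [0] * 16_000
chunks = list(iter_chunks(one_second, 16_000))
print(len(chunks), "chunks of", len(chunks[0]), "samples")
```

At 16 kHz each 200ms chunk holds 3,200 samples, so one second of audio arrives as five chunks instead of a single end-of-turn utterance.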

A background reasoning model handles complex tasks like tool calls and multi-step thinking asynchronously, without blocking the real-time conversation loop.
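The non-blocking pattern described here can be sketched with Python's asyncio. This is an illustration of the general idea, not Thinking Machines' implementation: `slow_reasoning` and `conversation_loop` are hypothetical names, and the sleeps stand in for real model work.

```python
import asyncio


async def slow_reasoning(query: str) -> str:
    """Stand-in for a slow tool call or multi-step reasoning job."""
    await asyncio.sleep(0.05)
    return f"result for {query!r}"


async def conversation_loop(num_ticks: int, tick_s: float = 0.01) -> list[str]:
    """Keep handling fixed-interval ticks while a background task runs."""
    log: list[str] = []
    task = asyncio.create_task(slow_reasoning("weather lookup"))
    reported = False
    for tick in range(num_ticks):
        await asyncio.sleep(tick_s)  # stand-in for handling one audio chunk
        log.append(f"tick {tick}")
        if task.done() and not reported:
            log.append(task.result())  # fold the result in once it is ready
            reported = True
    if not reported:
        log.append(await task)  # loop ended first; wait for the result
    return log


print(asyncio.run(conversation_loop(10)))
```

The loop never awaits the reasoning task directly while ticks remain, so the conversation keeps moving and the result is picked up on whichever tick it becomes available.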

On Thinking Machines' self-reported benchmarks:

Turn-taking latency: 0.40s (vs GPT-Realtime-2 at 1.18s, Gemini 3.1 Flash Live at 0.57s)
FD-bench V1.5 interaction quality: 77.8 (vs GPT-Realtime-2 at 46.8, Gemini 3.1 Flash Live at 45.5)

These numbers are impressive if they hold up. The caveat: they have not been independently verified as of May 2026. FD-bench is a relatively new benchmark and hasn't seen broad adoption across the industry. Independent testing by third parties will be important for validating these results.
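For a sense of scale, the arithmetic behind those comparisons is simple: 1.18 / 0.40 is about 2.95, and 0.57 / 0.40 is about 1.4. The sketch below reproduces it from the self-reported figures quoted above. The helper names are illustrative, and the inputs remain unverified.

```python
# Self-reported figures quoted above; not independently verified.
TURN_TAKING_S = {
    "TML-Interaction-Small": 0.40,
    "GPT-Realtime-2": 1.18,
    "Gemini 3.1 Flash Live": 0.57,
}
FD_BENCH_V1_5 = {
    "TML-Interaction-Small": 77.8,
    "GPT-Realtime-2": 46.8,
    "Gemini 3.1 Flash Live": 45.5,
}
BASELINE = "TML-Interaction-Small"


def latency_ratios(table: dict, baseline: str = BASELINE) -> dict:
    """How many times the baseline's latency each other model reports."""
    base = table[baseline]
    return {name: value / base for name, value in table.items() if name != baseline}


def score_gaps(table: dict, baseline: str = BASELINE) -> dict:
    """Point difference between the baseline score and each other model."""
    base = table[baseline]
    return {name: base - value for name, value in table.items() if name != baseline}


for name, ratio in latency_ratios(TURN_TAKING_S).items():
    print(f"{name}: {ratio:.2f}x the reported turn-taking latency")
for name, gap in score_gaps(FD_BENCH_V1_5).items():
    print(f"{name}: {gap:.1f} points lower on FD-bench V1.5")
```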

The model is currently in research preview with limited partner access. Wider availability is planned for later in 2026.

How Does Thinking Machines Compare to OpenAI Realtime?

Both OpenAI's Realtime API (GPT-Realtime-2) and Thinking Machines' TML-Interaction-Small use natively multimodal architectures, but they differ in interaction paradigm, reported latency, availability, protocol, and ecosystem maturity.

Architecture. GPT-Realtime-2 is natively multimodal but still operates in a turn-based paradigm. One party speaks, the other listens. TML-Interaction-Small processes in continuous 200ms chunks, allowing the model to listen and speak simultaneously. This is a meaningful architectural difference if your use case requires overlapping speech.

Latency. Thinking Machines reports 0.40s turn-taking latency versus 1.18s for GPT-Realtime-2 (self-reported, not independently verified). If confirmed, that is a significant improvement.

Availability. GPT-Realtime-2 is generally available. TML-Interaction-Small is in research preview with limited partner access.

Protocol. OpenAI's Realtime API uses WebSocket and WebRTC with a well-documented event schema that has become a de facto standard. Other providers (including Inworld and xAI) have built compatible implementations. Thinking Machines has not published protocol documentation for external developers yet.
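As a rough illustration of what that event schema looks like on the wire, the sketch below builds three client events in the style of OpenAI's Realtime API: `session.update`, `input_audio_buffer.append`, and `response.create`. The helper function names are hypothetical, the payloads are deliberately minimal, and required fields vary by API version and provider, so treat this as a shape sketch and check the provider's reference before relying on it.

```python
import base64
import json


def session_update(instructions: str) -> str:
    """Client event that updates session-level settings."""
    return json.dumps({
        "type": "session.update",
        "session": {"instructions": instructions},
    })


def append_audio(pcm_bytes: bytes) -> str:
    """Client event that appends base64-encoded audio to the input buffer."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })


def create_response() -> str:
    """Client event that asks the server to generate a response."""
    return json.dumps({"type": "response.create"})


# Each string would be sent as one text frame over an open WebSocket.
for frame in (session_update("Be brief."), append_audio(b"\x00\x01"), create_response()):
    print(frame)
```

Because compatible providers accept the same event shapes, migrating is mostly a matter of pointing the same frames at a different endpoint.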

Ecosystem. OpenAI's Realtime API integrates with the broader OpenAI ecosystem (function calling, MCP, image inputs). Thinking Machines' ecosystem is nascent.

How Does Inworld's Approach Differ from Interaction Models?

This is the core architectural question: native multimodal vs modular pipeline. Both are valid. The trade-offs are real.

Thinking Machines: native multimodal. A single model handles everything. Audio understanding, reasoning, and audio generation happen inside one neural network. The advantage is no inter-component latency. The model can potentially achieve tighter coupling between what it hears and how it responds. The disadvantage is that you're locked to that one model for every part of the pipeline.

Inworld: modular pipeline. Separate best-in-class components handle each stage. Realtime TTS is a first-party realtime voice model tuned for expressive, low-latency speech. The Realtime TTS-2 research preview adds natural-language style control, and Realtime TTS 1.5 Max delivers sub-250ms P90 time-to-first-audio. The Inworld Router provides access to 220+ LLMs from OpenAI, Anthropic, Google, and other providers through a single API key. The Realtime API handles VAD (voice activity detection), turn detection, interruption, and image content parts natively over WebSocket and WebRTC.
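A "P90 time-to-first-audio" figure means 90% of requests produced their first audio within that time. The sketch below shows one common way to compute it, the nearest-rank percentile. The latency samples are invented for illustration, and the source does not say which percentile definition or measurement setup Inworld uses.

```python
def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in the range 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical time-to-first-audio samples in milliseconds.
ttfa_ms = [180, 190, 200, 210, 220, 230, 240, 245, 248, 400]
p90 = percentile_nearest_rank(ttfa_ms, 90)
print(f"P90 time-to-first-audio: {p90} ms")
```

Note that a single slow outlier (400 ms here) does not move the P90, which is why percentile targets are usually paired with a look at the tail.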

The advantage of modular: you can swap any component independently. When a better LLM ships, route to it. When Thinking Machines models become available through an API, route to those too. You're not locked to a single vendor's model for reasoning, voice quality, or speech recognition.
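In code, "swap any component independently" usually means each stage sits behind a small interface. The sketch below shows that pattern with toy stand-ins. Every class and method name here is hypothetical and none of it is Inworld's SDK.

```python
from dataclasses import dataclass
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, prompt: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoicePipeline:
    """Chains three independent stages; any one can be replaced on its own."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)
        answer = self.llm.reply(text)
        return self.tts.synthesize(answer)


# Toy stand-ins so the sketch runs without any external service.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UpperLLM:
    def reply(self, prompt: str) -> str:
        return prompt.upper()


class ReverseLLM:
    def reply(self, prompt: str) -> str:
        return prompt[::-1]


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = VoicePipeline(stt=EchoSTT(), llm=UpperLLM(), tts=BytesTTS())
print(pipeline.respond(b"hi"))
pipeline.llm = ReverseLLM()  # swap only the LLM stage
print(pipeline.respond(b"hi"))
```

Swapping the LLM changes one attribute; the speech-to-text and text-to-speech stages, and the calling code, stay untouched.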

Which Voice AI Approach Should I Use?

The honest answer depends on what you're building and when you need it.

Choose Thinking Machines if:

You need simultaneous listen-and-speak capability (overlapping conversation, not turn-based)
Your use case benefits from native video input alongside audio
You have access to the research preview and can tolerate pre-production stability
You're willing to wait for wider availability and build on an unproven ecosystem

Choose Inworld if:

You need to ship to production today, not later in 2026
You want expressive, low-latency realtime voice quality: the Realtime TTS-2 research preview adds natural-language style control, and Realtime TTS 1.5 Max delivers sub-250ms P90 time-to-first-audio
You want model flexibility. The Inworld Router lets you swap LLMs across 220+ models without changing your integration. If Thinking Machines models become available through an API, you could route to them too
You want OpenAI protocol compatibility for easy migration from existing integrations
You need WebSocket and WebRTC transport options
Your application requires proven compliance (SOC 2 Type II, GDPR)

Choose OpenAI Realtime if:

You're already deep in the OpenAI ecosystem and want the tightest model integration
You need the most mature production track record in this category
Your application requires 50+ languages

Choose Gemini Flash Live if:

You need 90+ language support
You're building on Google Cloud and want deep GCP integration
Native video input matters for your use case

What Happens When Interaction Models Become Widely Available?

This is where architectural choices matter most. If Thinking Machines delivers on their benchmarks and ships a production API, teams locked into a single-model architecture will face a hard choice: rebuild on the new model, or stay on the old one.

Teams using a modular, model-agnostic architecture avoid that problem entirely. The Inworld Router already routes across 220+ models. Adding a new provider is a routing decision, not a re-architecture. You could use Thinking Machines for reasoning while keeping Realtime TTS for voice quality, or route different user segments to different models for A/B testing.
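Routing user segments to different models for an A/B test can be as simple as a stable hash of the user ID. The sketch below is a generic illustration under that assumption: the function names and model identifiers are hypothetical, and it does not reflect how the Inworld Router is configured.

```python
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in the range 0..buckets-1."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def pick_model(user_id: str, treatment_model: str, control_model: str,
               treatment_pct: int) -> str:
    """Send roughly treatment_pct percent of users to the treatment model."""
    return treatment_model if bucket(user_id) < treatment_pct else control_model


# Hypothetical model identifiers; the same user always gets the same answer.
for uid in ("user-1", "user-2", "user-3"):
    print(uid, pick_model(uid, "hypothetical-new-model", "current-model", 10))
```

Hashing rather than random assignment keeps each user on the same model across sessions, which matters when you compare conversation quality over time.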

This is the core argument for modular over native multimodal: the best model today probably won't be the best model in six months. An architecture that lets you adopt new models without rebuilding your integration compounds its advantage over time.

The Bottom Line

Thinking Machines' interaction models represent a genuinely novel approach to voice AI. Processing audio in continuous 200ms chunks and enabling overlapping speech is architecturally interesting. The self-reported benchmarks, if independently confirmed, would represent a meaningful improvement in turn-taking latency.

But "architecturally interesting" and "production-ready" are different things. As of May 2026, TML-Interaction-Small is in research preview with limited access, no public API documentation, and unverified benchmarks. That will change, potentially quickly. Mira Murati's track record at OpenAI suggests this team can execute.

If you need to build and ship voice AI today, the Inworld Realtime API is available with expressive, realtime-optimized Realtime TTS, Router access to 220+ LLMs, and an architecture that can integrate new models, including Thinking Machines, as they become available.

Get started with the Inworld Realtime API

FAQs

What are Thinking Machines interaction models?

Interaction models are a new category of AI model from Thinking Machines Lab, founded by Mira Murati. The first release, TML-Interaction-Small, is a 276B parameter mixture-of-experts model (12B active parameters) that processes audio, video, and text simultaneously in continuous 200ms chunks. Unlike traditional voice APIs that process speech in discrete turns, interaction models can listen while speaking. As of May 2026, the model is in research preview with limited partner access.

How does TML-Interaction-Small compare to GPT-Realtime-2?

On Thinking Machines' self-reported benchmarks, TML-Interaction-Small shows 0.40s turn-taking latency versus 1.18s for GPT-Realtime-2, and scores 77.8 on FD-bench V1.5 versus 46.8. These benchmarks have not been independently verified. GPT-Realtime-2 is generally available, while TML-Interaction-Small remains in research preview.

What is the difference between native multimodal and modular voice AI?

Native multimodal models like TML-Interaction-Small process audio and text inside a single model with no separate STT or TTS stage. Modular pipelines like Inworld's Realtime API chain best-in-class components: a dedicated STT model, an LLM router across 220+ models, and expressive, realtime-optimized Realtime TTS. Native multimodal models can, in theory, achieve lower latency because there are no hand-offs between components. Modular pipelines let you swap any component independently.

Can I use Thinking Machines interaction models in production today?

Not yet. TML-Interaction-Small is in research preview with limited partner access as of May 2026. Wider availability is planned for later in 2026. For a production-ready voice AI API available today, the Inworld Realtime API supports WebSocket and WebRTC with OpenAI protocol compatibility and accepts image content parts alongside audio.

Could Inworld integrate Thinking Machines models?

Yes. Inworld's architecture is model-agnostic. The Inworld Router already routes across 220+ models. When Thinking Machines models become available through an API, they could be added as another routing option without requiring changes to your integration code.

Who founded Thinking Machines Lab?

Thinking Machines Lab was founded by Mira Murati, former CTO of OpenAI. The company focuses on building AI models optimized for real-time human interaction.

What benchmarks does TML-Interaction-Small report?

Thinking Machines Lab reports 0.40s turn-taking latency and 77.8 on FD-bench V1.5 interaction quality. They compare against GPT-Realtime-2 (1.18s, 46.8) and Gemini 3.1 Flash Live (0.57s, 45.5). These are self-reported numbers and have not been independently verified as of May 2026.
