
Deploy my fine-tuned Llama 70B on 4x B200s. Optimize for lowest latency.
Deployed and optimized. Live at api.inworld.ai/v1/chat/completions, routable alongside OpenAI, Anthropic, and hundreds of other models.
# Dedicated B200 nodes in your own Kubernetes cluster
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model
spec:
  replicas: 2
  selector:            # required by apps/v1 Deployments
    matchLabels:
      app: my-model
  template:
    metadata:
      labels:
        app: my-model
    spec:
      containers:
        - name: inference
          image: my-registry/my-model-server:latest
          resources:
            limits:
              nvidia.com/gpu: 4  # 4x B200 = 768 GB HBM3e
      nodeSelector:
        nvidia.com/gpu.product: B200

# Your cluster, your rules:
# - Any serving framework
# - Any orchestration
# - Any monitoring
# - Any model (OSS, fine-tuned, proprietary)
# - 8 GPUs per node, 1.5 TB aggregate memory

# We deploy and optimize. You call the API.
import openai

client = openai.OpenAI(
    base_url="https://api.inworld.ai/v1",
    api_key="your_inworld_api_key",
)

# Your model on dedicated B200s, served through
# the same API as OpenAI, Anthropic, Google
response = client.chat.completions.create(
    model="my-org/finetuned-llama-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing."},
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
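The streaming loop above prints tokens as they arrive; a common next step is collecting the deltas into a single string. A minimal offline sketch of that pattern, where the `Fake*` dataclasses are stand-ins for the chunk objects the client yields when `stream=True`, not part of the SDK:

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class FakeDelta:
    content: Optional[str]


@dataclass
class FakeChoice:
    delta: FakeDelta


@dataclass
class FakeChunk:
    choices: List[FakeChoice]


def collect_stream(chunks: Iterable) -> str:
    """Join the non-empty content deltas from a chat-completion stream."""
    parts = []
    for chunk in chunks:
        # Guard against chunks with no choices or a None delta,
        # as the final chunk of a stream often carries no text.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)


# Simulated stream, standing in for what the live endpoint emits.
stream = [
    FakeChunk([FakeChoice(FakeDelta("Quantum "))]),
    FakeChunk([FakeChoice(FakeDelta("computing..."))]),
    FakeChunk([FakeChoice(FakeDelta(None))]),  # final chunk, no text
]
print(collect_stream(stream))  # -> Quantum computing...
```

The same `collect_stream` works on the real `response` iterator, since it only touches the `choices[0].delta.content` path used above.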