Inworld Compute

NVIDIA B200 GPUs. Available now.

Dedicated NVIDIA B200 GPUs for custom, high-volume realtime workloads. Run any model with Kubernetes access or let us handle deployment. Integrates with Inworld TTS, STT, Router, and Realtime API. From $6 per GPU hour.
GPU compute
You
GPU: B200 x4 · Memory: 768 GB HBM3e · Price: $6/GPU/hr

Deploy my fine-tuned Llama 70B on 4x B200s. Optimize for lowest latency.

Inworld
Status: live · Throughput: ~4x H100

Deployed and optimized. Live at api.inworld.ai/v1/chat/completions. Routable alongside OpenAI, Anthropic, and hundreds of other models.

Why Inworld Compute

Dedicated GPU capacity for LLM inference, dedicated TTS serving, custom model hosting, and high-concurrency realtime workloads. Fully integrated with the Inworld voice AI stack.

Available today

NVIDIA B200 Blackwell GPUs ready to deploy. No waitlist. Global supply is constrained through mid-2026, but we have capacity now.

Router-compatible

Deploy a model and it appears in Inworld Router alongside hundreds of provider models. One API, one key. OpenAI SDK compatible.

Dedicated TTS serving

Run the #1-ranked TTS on dedicated GPUs with guaranteed capacity. Sub-200ms latency at any scale, no shared tenancy.

Expert optimization

We select and tune the inference stack for your workload. Quantization, caching, batching, and parallelism optimized by the team that runs the #1-ranked TTS.
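As a back-of-envelope illustration of why quantization matters for serving (a sketch, not Inworld's actual serving configuration; the 70B model size is an example):

```python
# Weight memory for a 70B-parameter model at different precisions
# (weights only; KV cache and activations add overhead on top).
def weight_memory_gb(params: float, bits_per_weight: int) -> float:
    return params * bits_per_weight / 8 / 1e9

PARAMS = 70e9  # e.g. a Llama-70B-class model

for name, bits in [("FP16/BF16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{name}: {weight_memory_gb(PARAMS, bits):.0f} GB")
# FP16/BF16: 140 GB -> spans multiple GPUs
# FP8:        70 GB -> fits on a single 192 GB B200
# FP4:        35 GB
```

Blackwell's native low-precision formats are one reason quantized serving on B200s cuts per-token cost; the right precision still depends on the workload's quality budget.
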

Transparent pricing

From $6 per GPU hour. On-demand or provisioned capacity with volume pricing. No egress fees, no hidden charges.

Production-ready

SOC 2 Type II. Isolated tenancy. 24/7 monitoring with auto-recovery. No noisy neighbors, no cold starts, no surprise evictions.

Two ways to deploy

Kubernetes

Full cluster access with dedicated B200 GPU nodes. Any container, any serving framework, any model. Full control over networking, storage, and orchestration.
# Dedicated B200 nodes in your own Kubernetes cluster
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: inference
          image: my-registry/my-model-server:latest
          resources:
            limits:
              nvidia.com/gpu: 4  # 4x B200 = 768 GB HBM3e
      nodeSelector:
        nvidia.com/gpu.product: B200

# Your cluster, your rules:
# - Any serving framework
# - Any orchestration
# - Any monitoring
# - Any model (OSS, fine-tuned, proprietary)
# - 8 GPUs per node, 1.5 TB aggregate memory

Managed deployment

Give us any open-source or custom model. We configure the optimal inference stack, deploy on dedicated B200s, and expose it through the Router API. You call the API.
# We deploy and optimize. You call the API.
import openai

client = openai.OpenAI(
    base_url="https://api.inworld.ai/v1",
    api_key="your_inworld_api_key",
)

# Your model on dedicated B200s, served through
# the same API as OpenAI, Anthropic, Google
response = client.chat.completions.create(
    model="my-org/finetuned-llama-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing."},
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

FAQ

What GPUs are available?

NVIDIA B200 GPUs (Blackwell architecture). Each GPU has 192 GB HBM3e memory with 8 TB/s bandwidth. Nodes have 8 GPUs each for 1.5 TB aggregate memory. B200s deliver approximately 4x the inference throughput of H100s.

Can I run my own model?

Yes. With Kubernetes access, deploy any containerized model using any serving framework you choose. With managed deployment, give us any open-source or custom model and we select the best inference stack, configure optimization, and deploy it on dedicated B200s.

How does Router integration work?

Models on your dedicated B200s are accessible through the Inworld Router API. Your self-hosted models sit alongside hundreds of provider models (OpenAI, Anthropic, Google, and others) behind a single OpenAI-compatible endpoint. Route traffic between your models and external providers with automatic failover.
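Router performs this failover server-side; as a client-side sketch of the same idea (the helper below is illustrative, not part of the Inworld API), routing between two OpenAI-compatible model calls might look like:

```python
# Illustrative client-side fallback between two model calls.
# (Inworld Router does this server-side; this helper is hypothetical.)
from typing import Callable, TypeVar

T = TypeVar("T")

def with_failover(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Try the primary model first; on any error, retry with the fallback."""
    try:
        return primary()
    except Exception:
        return fallback()
```

With the OpenAI SDK pointed at api.inworld.ai/v1, `primary` could be a `functools.partial` of `client.chat.completions.create` with your self-hosted model, and `fallback` the same call with a provider model.
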
What is the difference between Kubernetes access and managed deployment?

Kubernetes gives you full cluster access with dedicated B200 nodes. You choose the serving framework, orchestration tooling, and monitoring stack. Managed deployment means we handle everything: inference stack selection, optimization, deployment, and monitoring. You access the model through the Router API. Both options are compatible with all Inworld APIs.

What models can I run?

Any model that fits in GPU memory. A single B200 node (8 GPUs, 1.5 TB aggregate) serves models up to approximately 400B parameters. Larger models distribute across multiple nodes. Common workloads include open-source LLMs (Llama, Mistral, DeepSeek), fine-tuned models, and custom inference pipelines.
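The 400B figure follows from simple arithmetic. A weights-only sketch, assuming BF16 precision and an assumed 80% of HBM usable for weights (real deployments also budget for KV cache and activations):

```python
# Weights-only fit check against one node's aggregate HBM.
GPUS_PER_NODE = 8
HBM_PER_GPU_GB = 192
NODE_HBM_GB = GPUS_PER_NODE * HBM_PER_GPU_GB  # 1536 GB ≈ 1.5 TB

def fits_on_node(params_billions: float, bytes_per_param: float = 2.0,
                 usable_fraction: float = 0.8) -> bool:
    """usable_fraction (assumed) reserves headroom for KV cache etc."""
    weights_gb = params_billions * bytes_per_param
    return weights_gb <= NODE_HBM_GB * usable_fraction

print(fits_on_node(400))  # 800 GB of BF16 weights vs ~1229 GB usable -> True
print(fits_on_node(700))  # 1400 GB -> spans multiple nodes -> False
```
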
How is Inworld Compute priced?

Starting from $6 per GPU hour for NVIDIA B200s. On-demand capacity available immediately. Provisioned capacity with guaranteed allocation and volume pricing for longer commitments. Contact our team for enterprise configurations and SLAs.

How quickly can I get capacity?

B200 GPUs are available now and can generally be provisioned within days, depending on cluster size and current demand. We are one of the few providers with B200 capacity ready to deploy while global supply remains constrained through mid-2026. Availability is first-come, first-served, and demand is increasing as more teams move to Blackwell for inference cost savings. Talk to our team today to reserve capacity before the current allocation is committed.

Ready to get started?

Dedicated B200 GPUs for LLM inference, TTS serving, custom models, and high-volume realtime workloads. Kubernetes or managed deployment, fully integrated with the Inworld voice AI stack.
Copyright © 2021-2026 Inworld AI