Published April 13, 2026

NVIDIA B200 GPU: Specs, Pricing, and Cloud Availability (2026)

Last updated: April 13, 2026
The NVIDIA B200 GPU is the flagship data center accelerator in the Blackwell generation, built on a dual-die GB100 design with 208 billion transistors. It delivers 192 GB of HBM3e memory at 8 TB/s bandwidth and up to 9,000 TFLOPS of FP4 Tensor performance, roughly 4x the inference throughput of the H100 it replaces. Inworld Compute has B200 capacity available now, starting from $6/hr, while most hardware orders remain backordered through mid-2026 with an estimated 3.6 million units in the queue. Cloud rental is currently the fastest way to access B200 GPUs, with pricing ranging from $2.65/hr (reserved) to $14.24/hr (on-demand) depending on the provider.

What are the full specs of the NVIDIA B200 GPU?

The B200 uses NVIDIA's Blackwell architecture, which pairs two GB100 dies on a single module connected by a 10 TB/s chip-to-chip interconnect. This dual-die approach delivers substantially more compute density than Hopper while maintaining a single-GPU programming model.
The fifth-generation Tensor Cores introduce native FP4 support, which is the critical spec for LLM inference. FP4 halves the memory footprint compared to FP8 while maintaining acceptable quality for most inference workloads, effectively doubling the model size you can serve on a single GPU.
NVLink 5 at 1.8 TB/s enables efficient multi-GPU scaling for models that exceed 192 GB. An 8-GPU B200 NVLink domain provides 1.5 TB of aggregate memory with 14.4 TB/s of total NVLink bandwidth, enough to serve trillion-parameter models without PCIe bottlenecks.
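The domain-level figures follow directly from the per-GPU specs quoted above, as a quick back-of-envelope check shows:

```python
# Aggregate capacity of an 8-GPU B200 NVLink domain, derived from the
# per-GPU figures in this article (192 GB HBM3e, 1.8 TB/s NVLink 5).
GPUS = 8
HBM_PER_GPU_GB = 192
NVLINK_PER_GPU_TB_S = 1.8

total_memory_tb = GPUS * HBM_PER_GPU_GB / 1000  # 1536 GB ~= 1.5 TB
total_nvlink_tb_s = GPUS * NVLINK_PER_GPU_TB_S  # 14.4 TB/s

print(f"{total_memory_tb:.2f} TB memory, {total_nvlink_tb_s:.1f} TB/s NVLink")
```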

How does the B200 compare to H100 and A100 for inference?

The generational leap from H100 to B200 is the largest NVIDIA has shipped in the data center segment. The improvements are concentrated in memory capacity, bandwidth, and low-precision Tensor Core throughput, which are exactly the three bottlenecks that limit LLM inference.
The 4x inference throughput improvement comes from three compounding factors. First, 2.4x more memory bandwidth means the GPU can read model weights faster during the decode phase, which is the primary bottleneck for autoregressive text generation. Second, FP4 Tensor Cores reduce the bytes-per-parameter by 2x compared to FP8, doubling effective bandwidth utilization. Third, the larger 192 GB memory eliminates the need for tensor parallelism on models up to ~96B parameters (FP16) or ~192B (FP8), removing inter-GPU communication overhead entirely.
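The single-GPU capacity thresholds above are straightforward bytes-per-parameter arithmetic; a minimal sketch (weights only, ignoring the headroom KV cache and activations also need):

```python
# Largest model whose weights alone fit in a single B200's 192 GB,
# by precision. Real deployments need extra room for KV cache and
# activations, so practical limits sit somewhat below these ceilings.
CAPACITY_GB = 192
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    max_params_b = CAPACITY_GB / nbytes  # billions of parameters
    print(f"{precision}: ~{max_params_b:.0f}B parameters")
```

This reproduces the ~96B (FP16) and ~192B (FP8) thresholds cited above, and implies ~384B at FP4.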
The cost-per-token improvement is even more dramatic. At roughly $0.02 per million tokens versus $0.14 on H100, B200 delivers a 7x reduction in inference cost despite the higher hourly rental price. This is because the throughput gains outpace the price premium by a wide margin.
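The cost-per-token arithmetic itself is simple; what moves the number is sustained aggregate (batched) throughput. A sketch with illustrative inputs, where the throughput and rate figures are assumptions chosen to reproduce the article's $0.02 vs $0.14 comparison, not measured benchmarks:

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """USD per 1M generated tokens at a given GPU rental rate and
    sustained aggregate (batched) throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# Illustrative only: throughputs below are assumed, not benchmarked.
b200 = cost_per_million_tokens(6.00, 83_000)  # ~= $0.02 per M tokens
h100 = cost_per_million_tokens(3.00, 6_000)   # ~= $0.14 per M tokens
```

Note that throughput appears in the denominator: a 4x throughput gain more than absorbs a 2x hourly-rate premium.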

Where can you rent B200 GPUs in 2026?

Cloud rental is the fastest path to B200 access. Hardware purchases through NVIDIA or OEM partners carry multi-month lead times, and the estimated backlog stands at approximately 3.6 million units as of April 2026.
Across providers with available capacity, B200 rates as of April 2026 range from $2.65/hr on reserved contracts to $14.24/hr on-demand; figures cited in this article include Inworld Compute from $6/hr, RunPod at $5.98/hr, and Modal at $6.25/hr. Rates change frequently, so verify current pricing directly with each provider.
Pricing is expected to stabilize around $2.50-3.00/hr at major providers by Q4 2026 as TSMC ramps Blackwell production and more supply enters the market. Until then, reserved contracts and smaller cloud providers offer the best value.
Key considerations when choosing a provider:
  • Availability speed. Some providers have on-demand capacity now. Others require reserved contracts or waitlists. If you need GPUs this week, prioritize Lambda, RunPod, or Inworld Compute.
  • Billing model. Modal and RunPod offer per-second billing, which is better for burst inference workloads. Reserved contracts from CoreWeave deliver the lowest hourly rate but require upfront commitment.
  • Networking. Multi-GPU workloads (training or serving models >192B parameters) need high-bandwidth GPU-to-GPU interconnects. CoreWeave and Lambda provide NVLink-connected clusters. Verify NVLink availability before committing to a provider for multi-node jobs.
  • Managed inference vs raw compute. Fireworks provides optimized inference endpoints where you deploy a model and call an API. Lambda, CoreWeave, RunPod, and Inworld Compute provide raw GPU instances where you manage the inference stack.

Why is B200 availability so constrained?

Three factors are driving the supply shortage.
Demand exceeds manufacturing capacity. Every major hyperscaler, AI lab, and enterprise is competing for Blackwell GPUs simultaneously. The shift from training-dominated GPU demand (where a few large clusters suffice) to inference-dominated demand (where every production deployment needs GPUs continuously) has multiplied the total addressable market for high-end accelerators.
TSMC 4NP production ramp. The B200's dual-die GB100 design requires two large dies per GPU, each manufactured on TSMC's 4NP process. Yields on large dies are inherently lower, and the dual-die packaging adds complexity. TSMC is ramping capacity, but the estimated backlog of roughly 3.6 million units reflects the gap between orders placed and chips shipped.
Inference economics are compelling. At roughly $0.02 per million tokens versus $0.14 on H100, the B200 pays for itself quickly for high-throughput inference workloads. This creates a rational incentive for every organization running LLM inference at scale to upgrade, further concentrating demand on the newest generation.
The practical implication: if you need B200 access in the next 30-60 days, cloud rental from providers with existing inventory is the only reliable option. Hardware procurement timelines remain measured in quarters, not weeks.

What makes B200 memory bandwidth critical for LLM inference?

LLM inference is fundamentally a memory-bandwidth problem, not a compute problem. Understanding why explains both the B200's performance advantage and how to evaluate GPU options for inference workloads.
During autoregressive text generation (the decode phase), the GPU must read the entire model's weight matrix from memory for every single output token. A 70B parameter model in FP16 occupies 140 GB. Generating one token requires reading all 140 GB. At 100 tokens per second, that is 14 TB/s of memory reads just for the weights, before accounting for KV cache reads or any computation.
This is why memory bandwidth is the binding constraint:
  • A100 at 2 TB/s: Can sustain roughly 14 tokens/second on a 70B FP16 model (bandwidth-limited, ignoring compute)
  • H100 at 3.35 TB/s: Roughly 24 tokens/second on the same model
  • B200 at 8 TB/s: Roughly 57 tokens/second on the same model
The B200's 8 TB/s bandwidth advantage compounds with its FP4 Tensor Core support. Running the same 70B model in FP4 reduces the weight size to 35 GB, which means the B200 can sustain roughly 228 tokens/second on a single GPU. This is the math behind the "4x inference throughput" figure.
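The estimates above all come from the same bandwidth-bound decode model: tokens/second is at most memory bandwidth divided by the weight bytes read per token, ignoring compute, KV cache reads, and batching:

```python
def decode_tokens_per_second(bandwidth_tb_s: float, params_b: float,
                             bytes_per_param: float) -> float:
    """Bandwidth-bound upper estimate for single-stream decode:
    every output token must read all model weights once."""
    weight_gb = params_b * bytes_per_param        # model weight footprint
    return bandwidth_tb_s * 1000 / weight_gb      # (GB/s) / GB = tokens/s

# 70B model, per-GPU ceilings using the bandwidth figures in this article:
a100 = decode_tokens_per_second(2.0, 70, 2.0)       # ~14 tok/s (FP16)
h100 = decode_tokens_per_second(3.35, 70, 2.0)      # ~24 tok/s (FP16)
b200_fp16 = decode_tokens_per_second(8.0, 70, 2.0)  # ~57 tok/s
b200_fp4 = decode_tokens_per_second(8.0, 70, 0.5)   # ~228 tok/s
```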
For voice AI applications, this translates directly to lower latency and higher concurrency. A TTS model that needs 50ms per inference step on H100 might need only 12-15ms on B200. An STT pipeline processing multiple audio streams in parallel can serve 4x more concurrent sessions per GPU.

How to get started with B200 GPUs

The right path depends on your workload.
For LLM inference at scale: If you are serving a large language model (70B+ parameters) in production and need the lowest cost-per-token, B200 is the clear choice. Start with a single B200 instance from a provider with on-demand availability. Benchmark your model's throughput in FP8 and FP4 to quantify the improvement over your current H100 or A100 deployment.
For model training: B200's 8 TB/s bandwidth and 1.8 TB/s NVLink 5 make it excellent for training, but the value proposition is strongest for inference. If training is your primary workload, evaluate whether the hourly premium over H100 is justified by your training-time reduction.
For voice AI and realtime inference: The B200's bandwidth advantage is particularly impactful for latency-sensitive workloads like text-to-speech, speech-to-text, and realtime voice pipelines. Inworld Compute offers B200 instances optimized for voice AI inference workloads. Visit inworld.ai/compute or contact the team to discuss capacity and configuration.
For evaluation and prototyping: Several providers offer per-second or per-minute billing. Modal ($6.25/hr, per-second) and RunPod ($5.98/hr) are cost-effective for short evaluation runs where you want to benchmark B200 performance without a long-term commitment.
Regardless of provider, the key steps are:
  1. Quantize your model. B200's FP4 Tensor Cores deliver the best throughput-per-dollar, but your model must be quantized to FP4 or FP8. Test quality degradation on your specific use case before committing.
  2. Benchmark end-to-end latency. Throughput (tokens/second) matters, but so does time-to-first-token and P99 latency. Measure both under realistic concurrency.
  3. Calculate cost-per-token. Compare the total cost (hourly rate x time per request) against your current GPU. The 4x throughput improvement should deliver a meaningful cost reduction even at B200's higher hourly rate.
  4. Plan for multi-GPU if needed. Models above ~96B parameters (FP16) or ~192B (FP8) still require tensor parallelism. Ensure your provider offers NVLink-connected B200 clusters, not just isolated instances.

Why run B200 workloads on Inworld Compute?

Most B200 cloud providers sell hardware. Inworld AI is a research lab that runs production GPU inference at scale every day, and that expertise comes with the hardware.
The same team that built the #1-ranked TTS model on Artificial Analysis serves millions of realtime voice interactions on B200 GPUs. Building the world's best speech synthesis required solving the hard inference problems: maximizing tokens per second per GPU, minimizing time-to-first-token, tuning FP4/FP8 quantization without quality loss, and keeping P99 latency tight under production concurrency. That expertise applies directly to your workload.
What you get beyond the GPU:
  • Inference optimization expertise. We do not just provision hardware. With managed deployment, we select the right inference stack, configure quantization and caching, tune parallelism, and monitor throughput. You get an API endpoint, not a cluster to babysit.
  • Router integration. Deploy a model on Inworld B200s and it appears in Inworld Router alongside hundreds of provider models (OpenAI, Anthropic, Google, and more). One API for your self-hosted models and the rest of the ecosystem.
  • Voice AI stack on the same infrastructure. Combine your LLM workloads with Inworld TTS (#1 ranked), STT, and the Realtime API for end-to-end voice pipelines running on dedicated B200s.
  • Available now. B200 capacity ready to deploy within hours while most providers have waitlists through mid-2026.
  • SOC 2 Type II, isolated tenancy. Dedicated GPU nodes per customer. No shared hardware, no cold starts, no evictions.
Visit inworld.ai/compute or talk to our team to get started.
All specifications from NVIDIA's published Blackwell architecture documentation. Cloud pricing as of April 2026 and subject to change. Verify current availability and pricing directly with each provider.
Copyright © 2021-2026 Inworld AI