
Deploy my fine-tuned Llama 70B on 4x B200s. Optimize for lowest latency.
Deployed and optimized. Live at api.inworld.ai/v1/chat/completions, routable alongside OpenAI, Anthropic, and hundreds of other models.
# Dedicated B200 nodes in your own Kubernetes cluster
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model
spec:
  replicas: 2
  selector:            # required by apps/v1 Deployments
    matchLabels:
      app: my-model
  template:
    metadata:
      labels:
        app: my-model
    spec:
      containers:
        - name: inference
          image: my-registry/my-model-server:latest
          resources:
            limits:
              nvidia.com/gpu: 4  # 4x B200 = 768 GB HBM3e
      nodeSelector:
        nvidia.com/gpu.product: B200

# Your cluster, your rules:
# - Any serving framework
# - Any orchestration
# - Any monitoring
# - Any model (OSS, fine-tuned, proprietary)
# - 8 GPUs per node, 1.5 TB aggregate memory

# We deploy and optimize. You call the API.
import openai

client = openai.OpenAI(
    base_url="https://api.inworld.ai/v1",
    api_key="your_inworld_api_key",
)

# Your model on dedicated B200s, served through
# the same API as OpenAI, Anthropic, Google
response = client.chat.completions.create(
    model="my-org/finetuned-llama-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing."},
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
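The streaming loop above prints tokens as they arrive; a common next step is collecting the deltas into a single string. A minimal offline sketch of that pattern, where the `Fake*` dataclasses are stand-ins for the chunk objects the client yields when `stream=True`, not part of the SDK:

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class FakeDelta:
    content: Optional[str]


@dataclass
class FakeChoice:
    delta: FakeDelta


@dataclass
class FakeChunk:
    choices: List[FakeChoice]


def collect_stream(chunks: Iterable) -> str:
    """Join the non-empty content deltas from a chat-completion stream."""
    parts = []
    for chunk in chunks:
        # Guard against chunks with no choices or a None delta,
        # as the final chunk of a stream often carries no text.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)


# Simulated stream, standing in for what the live endpoint emits.
stream = [
    FakeChunk([FakeChoice(FakeDelta("Quantum "))]),
    FakeChunk([FakeChoice(FakeDelta("computing..."))]),
    FakeChunk([FakeChoice(FakeDelta(None))]),  # final chunk, no text
]
print(collect_stream(stream))  # -> Quantum computing...
```

The same `collect_stream` works on the real `response` iterator, since it only touches the `choices[0].delta.content` path used above.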