By Kylan Gibbs, CEO and Co-founder, Inworld AI
Last updated: April 2026
PlayHT shut down its developer API in mid-2025 and fully sunset the platform on December 31, 2025, deleting user voice data after the cutoff. Inworld AI's Realtime TTS is the natural migration target for PlayHT customers: ranked #1 on the Artificial Analysis Speech Arena with three of the top five positions, sub-200ms time-to-first-audio, instant voice cloning from 5-15 seconds of audio, 15 production languages, and an OpenAI-compatible Router for the rest of your stack. This guide walks through the migration: model mapping, voice recloning, code changes, and the operational gotchas that bite teams who try to swap providers in a weekend.
Why PlayHT Customers Need to Move Now
If you have not migrated yet:
- Existing PlayHT API integrations are dead. The endpoints have been offline since mid-2025.
- Voice clones are gone. PlayHT deleted user voice data at the December 31, 2025 sunset. There is no recovery path through PlayHT.
- Your only option is to reclone from your original audio sources into a new provider.
The good news: if you still have your original training audio (the human-recorded samples used to clone the voice), you can reclone into Realtime TTS in minutes. The bad news: do not try to reclone using AI-generated audio output from PlayHT. That introduces compounding artifacts and degrades quality across generations. Use original human audio.
Model Mapping: PlayHT to Realtime TTS
| PlayHT model | Realtime TTS equivalent | Notes |
|---|---|---|
| PlayHT 2.0 / Play 3.0 | inworld-tts-1.5-max | Best quality, sub-200ms TTFB, 15 languages |
| PlayHT Turbo | inworld-tts-1.5-mini | Fastest TTFB (~120ms), best for high-volume streaming |
| PlayHT voice clones | Re-clone via POST /voices/v1/voices:clone | 2-step process; use original human audio |
| PlayHT stock voices | Realtime TTS voice library (271+ voices) | Closest matches via voice search |
Migration Step 1: Reclone Your Voices
Voice cloning in Realtime TTS is a two-step process: clone first to get a voiceId, then use that voiceId in TTS calls. There is no referenceAudio field on the TTS endpoint.
```python
import base64
import requests

# Read the original human-recorded sample and base64-encode it
with open("original_voice_sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

# Step 1: Clone the voice
clone_response = requests.post(
    "https://api.inworld.ai/voices/v1/voices:clone",
    headers={"Authorization": "Basic <your-api-key>"},
    json={
        "displayName": "Customer Service Agent",
        "langCode": "EN_US",
        "voiceSamples": [
            {"audioData": audio_b64}
        ],
        "audioProcessingConfig": {"removeBackgroundNoise": True}
    }
)
voice_id = clone_response.json()["voice"]["voiceId"]

# Step 2: Use the cloned voice in TTS calls
tts_response = requests.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={"Authorization": "Basic <your-api-key>"},
    json={
        "text": "Hello, how can I help you today?",
        "voiceId": voice_id,
        "modelId": "inworld-tts-1.5-max",
        "audioConfig": {
            "audioEncoding": "MP3",
            "sampleRateHertz": 24000
        }
    }
)

with open("output.mp3", "wb") as f:
    f.write(base64.b64decode(tts_response.json()["audioContent"]))
```
Cloning requirements:
- 5-15 seconds of clean original audio (samples >15s auto-trimmed).
- Formats: WAV, MP3, WEBM. Max 4MB per sample.
- 1,000 cloned voices per account (higher limits via enterprise sales).
- Use original human-recorded audio, not AI-generated PlayHT output. Generation-on-generation cloning compounds artifacts.
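Before uploading archived samples, a quick pre-flight check can catch files that violate the limits above. This is a minimal sketch using only the standard library; the helper name `check_wav_sample` is illustrative (not part of the Inworld API), and it only handles WAV input — MP3 and WEBM would need a decoder such as ffmpeg.

```python
import os
import wave

# Documented limits: 5-15 seconds of audio, max 4 MB per sample
MAX_BYTES = 4 * 1024 * 1024

def check_wav_sample(path):
    """Return a list of problems with a WAV voice sample; empty means OK."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file exceeds the 4MB per-sample limit")
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    if duration < 5:
        problems.append(f"sample is {duration:.1f}s; need at least 5s")
    elif duration > 15:
        problems.append(f"sample is {duration:.1f}s; will be auto-trimmed to 15s")
    return problems
```

Run it over your sample directory before the clone step so a rejected upload does not surprise you mid-migration.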
Migration Step 2: Swap the Synthesis API
```python
# Before: PlayHT synthesis (legacy reference)
# response = playht.tts(text=text, voice="s3://voice-cloning-zero-shot/...", ...)

# After: Realtime TTS synthesis
import base64
import requests

response = requests.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={"Authorization": "Basic <your-api-key>"},
    json={
        "text": "Hello world",
        "voiceId": "Sarah",  # or your cloned voiceId
        "modelId": "inworld-tts-1.5-max",
        "audioConfig": {
            "audioEncoding": "MP3",
            "sampleRateHertz": 24000
        }
    }
)

audio_bytes = base64.b64decode(response.json()["audioContent"])
with open("output.mp3", "wb") as f:
    f.write(audio_bytes)
```
For real-time applications, use the streaming endpoint which returns NDJSON (newline-delimited JSON) with base64 audio chunks:
```python
import base64
import json
import requests

with requests.post(
    "https://api.inworld.ai/tts/v1/voice:stream",
    headers={"Authorization": "Basic <your-api-key>"},
    json={
        "text": "Hello world",
        "voiceId": "Sarah",
        "modelId": "inworld-tts-1.5-mini",  # mini for lowest TTFB
        "audioConfig": {
            "audioEncoding": "PCM",
            "sampleRateHertz": 24000
        }
    },
    stream=True
) as r:
    for line in r.iter_lines():
        if not line:
            continue  # skip keep-alive blank lines
        chunk_obj = json.loads(line)
        audio_bytes = base64.b64decode(chunk_obj["result"]["audioContent"])
        # play / forward audio_bytes to client
```
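Note that PCM output is raw sample data with no container header, so a file of concatenated chunks will not open in most players. For debugging or archival, you can wrap the accumulated chunks in a WAV container. This sketch assumes the stream is 16-bit mono little-endian PCM at the sampleRateHertz you requested; verify those parameters against the API reference before relying on them.

```python
import wave

def pcm_chunks_to_wav(chunks, path, sample_rate=24000):
    """Assemble raw PCM chunks (assumed 16-bit mono) into a playable WAV file."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)       # mono
        w.setsampwidth(2)       # 16-bit samples
        w.setframerate(sample_rate)
        for chunk in chunks:
            w.writeframes(chunk)
```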
Migration Step 3: Voice Library Mapping
If you used PlayHT stock voices rather than custom clones, browse the Realtime TTS voice library:
```python
import requests

response = requests.get(
    "https://api.inworld.ai/voices/v1/voices?languages=EN_US",
    headers={"Authorization": "Basic <your-api-key>"}
)
for voice in response.json()["voices"]:
    print(voice["voiceId"], voice["displayName"], voice.get("description"))
Realtime TTS ships with 271+ voices, more than three times PlayHT's stock voice library. The default for code examples is Sarah ("fast-talking young adult woman, with a questioning and curious tone"). Other production-grade voices: Jason, Alex, Olivia, Hana, Clive, Blake, Carter, Liam, Claire, Ethan.
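Once you have auditioned replacements, a small lookup table keeps the swap mechanical across your codebase. The legacy names and pairings below are hypothetical placeholders — build your own table from the voice-list output above; only the fallback default of Sarah reflects the examples in this guide.

```python
# Hypothetical mapping from legacy PlayHT stock-voice names to Realtime TTS
# voiceIds. Replace both sides with the pairings you chose by listening.
LEGACY_VOICE_MAP = {
    "larry": "Jason",
    "susan": "Olivia",
    "charlotte": "Claire",
}

def replacement_voice(legacy_name, default="Sarah"):
    """Resolve an old PlayHT voice name to a Realtime TTS voiceId."""
    return LEGACY_VOICE_MAP.get(legacy_name.lower(), default)
```

Centralizing the mapping in one place also gives you a single file to review when QA flags a voice that no longer sounds right.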
Migration Checklist
- [ ] Pull all original human-recorded voice samples used for PlayHT clones.
- [ ] Get an Inworld API key from the Portal.
- [ ] Reclone each voice via POST /voices/v1/voices:clone. Save the returned voiceId.
- [ ] Update synthesis calls: change endpoint, swap voice for voiceId, swap model for modelId, add audioConfig (audioEncoding + sampleRateHertz).
- [ ] Update streaming parsing: PlayHT used various streaming formats; Realtime TTS streaming is NDJSON with base64 audioContent per line.
- [ ] Decode base64 before writing audio. Both sync and streaming responses return base64.
- [ ] Test latency end-to-end. Realtime TTS Mini delivers ~120ms TTFB; Max delivers sub-200ms.
- [ ] If you used PlayHT for the full speech loop, consider migrating the LLM layer too. The Realtime Router is OpenAI SDK-compatible and routes to hundreds of models.
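One operational gotcha worth closing out the checklist with: a provider swap resets whatever retry behavior your old SDK gave you for free. The sketch below is generic hardening, not specific to either API — `with_retries` is an illustrative helper you would wrap around your synthesis calls, with tighter exception filtering (e.g. retry only on timeouts and 5xx responses) in real use.

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(), retrying on exception with exponential backoff."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** i))
```

Usage would look like `with_retries(lambda: requests.post(url, ...))`, keeping the retry policy in one place rather than scattered across call sites.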
Why Realtime TTS After PlayHT
Three reasons PlayHT customers consistently land on Realtime TTS:
- Voice quality. #1 on Artificial Analysis Speech Arena, three of top five. PlayHT was a strong product; Realtime TTS pushes further on expressiveness and conversational naturalness.
- 15 production languages with native-speaker quality. PlayHT's multilingual coverage was uneven; Realtime TTS is engineered for production deployment in all 15.
- Full pipeline integration. Realtime TTS pairs with Realtime STT, the Realtime Router, and the Realtime API for end-to-end voice applications. PlayHT was TTS-only.
FAQ
What happened to PlayHT?
PlayHT shut down its developer API in mid-2025 and fully sunset the platform on December 31, 2025. User voice data was deleted at the cutoff. There is no recovery path through PlayHT.
Can I migrate my PlayHT voice clones to a new provider?
You cannot transfer the clones directly because PlayHT deleted the underlying voice data. You must reclone from your original human-recorded audio samples. Realtime TTS supports instant cloning from 5-15 seconds of audio via the POST /voices/v1/voices:clone endpoint.
Should I reclone using AI-generated PlayHT audio?
No. Generation-on-generation cloning compounds artifacts and degrades quality. Always use the original human-recorded audio you used to create the PlayHT clone.
How does Realtime TTS compare to PlayHT on quality?
Realtime TTS ranks #1 on the Artificial Analysis Speech Arena, holding three of the top five positions. PlayHT was not consistently in the top tier on independent blind evaluation.
What is the easiest way to migrate code?
Three changes: switch the endpoint to https://api.inworld.ai/tts/v1/voice (sync) or /voice:stream (streaming), use voiceId and modelId field names, and add an audioConfig object with audioEncoding and sampleRateHertz. Authentication is Authorization: Basic <api-key> (Basic, not Bearer). Both endpoints return base64 in audioContent; decode before writing audio.