Cross-lingual voice cloning (2026): clone once, speak 100+ languages

Last updated: May 28, 2026

Cross-lingual voice cloning is the ability to clone a speaker once and synthesize speech in that same voice across many other languages, without re-recording reference audio per locale. Inworld AI Realtime TTS-2 (research preview, launched May 5, 2026) preserves voice identity across 100+ languages from a single clone, covering 15 GA languages at native-speaker quality plus 90+ experimental languages with cross-lingual coverage. Voice cloning on Inworld AI is a 2-step API: first call POST /voices/v1/voices:clone with reference audio to receive a voiceId, then pass that voiceId into TTS requests. The clone retains timbre, cadence, and style, while phonemes are produced natively in the target language.

This page is for developers comparing cross-lingual cloning options for production voice agents, companions, dubbing, and multilingual support workflows. It covers how cross-lingual cloning works, the verified Inworld AI 2-step API contract, language coverage, a fair comparison with ElevenLabs, Cartesia, and Resemble, working Python code for the full clone-and-speak flow, and the ethical guardrails (consent, original audio, deepfake detection) that should be part of any production cloning pipeline.

What is cross-lingual voice cloning?

Cross-lingual voice cloning separates two things that older TTS systems bundled: the speaker identity and the language of synthesis. A single-language clone reproduces the speaker only in the language of the reference recording. A cross-lingual clone learns a speaker representation that can drive synthesis in many languages, preserving the timbre, cadence, and stylistic signature of the original voice while the underlying model handles target-language phonology.

The practical impact: instead of recording separate reference samples in English, Spanish, Japanese, French, and German for the same brand voice, you record once and synthesize in every supported language. For multilingual products like Talkpal, Bible Chat, and consumer companion apps, this collapses what used to be a per-locale data-collection project into a single onboarding step.

How does cross-lingual cloning work under the hood?

Modern cross-lingual TTS factorizes speech into two latent representations: a speaker embedding (who is talking) and a content embedding (what is being said, in which language). The speaker embedding is extracted from short reference audio and stored against a voiceId. At synthesis time, the model conditions on the speaker embedding plus the target-language text, generating audio that pronounces the target language natively while staying inside the speaker's acoustic space.

Realtime TTS-2 extends this with a closed-loop architecture that conditions on prior audio output, not just the transcript. The result is that voice identity holds across utterances inside a session and across language switches inside an utterance. The same voiceId can switch languages mid-sentence with no audible identity drift.

What does voice identity preservation actually mean?

"Voice identity preservation" is the property that a listener who knows the cloned speaker hears the cloned speaker in any target language. Concretely, that means the synthesis preserves:

Timbre. The harmonic signature that makes one human voice distinguishable from another.
Cadence and rhythm. The speaker's pacing pattern, after adjusting for language-specific prosody.
Vocal style. Warmth, energy, articulation habits, and emotional baseline.
Pitch range and contour. Within what the target language allows, the speaker's natural register is held.

What it does not mean: accent transfer from the source language. A native English clone speaking Japanese should sound like a competent Japanese-speaking version of that person, not an English speaker mispronouncing Japanese. The model substitutes target-language phonology while preserving the speaker.
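
One common way to put a number on identity preservation is to compare speaker embeddings of the reference recording and of the synthesized target-language audio. The sketch below assumes you already have fixed-length embeddings from some speaker encoder of your choice. The function names and the 0.75 threshold are illustrative assumptions, not part of any Inworld AI API, and the threshold should be calibrated on your own data.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero embeddings."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_preserved(
    reference: Sequence[float],
    synthesized: Sequence[float],
    threshold: float = 0.75,  # illustrative value only; calibrate per encoder
) -> bool:
    """True when the synthesized voice stays close to the reference speaker."""
    return cosine_similarity(reference, synthesized) >= threshold
```

A score like this is a complement to listening tests with people who know the speaker, not a replacement for them.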

What is the Inworld AI voice cloning API contract?

Voice cloning on Inworld AI is a 2-step API. There is no referenceAudio field on the TTS endpoint, and the old /tts/v1/voices listing endpoint is deprecated effective July 1, 2026 (use /voices/v1/voices for listing).

Authentication is HTTP Basic with a base64-encoded key:secret pair: Authorization: Basic <base64(key:secret)>. The same auth header applies to both endpoints. Account limits range from 5 to 1,000 cloned voices depending on tier, with custom limits available for enterprise.
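
As a minimal sketch of the header format described above, the helper below builds the Authorization header from a raw key and secret. The function name is hypothetical, and the credentials in the usage line are placeholders.

```python
import base64


def basic_auth_header(key: str, secret: str) -> dict:
    """Build an HTTP Basic Authorization header from a key:secret pair."""
    token = base64.b64encode(f"{key}:{secret}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    print(basic_auth_header("key", "secret"))
```

If your environment variable already holds the base64-encoded pair, as in the examples that follow, skip the encoding step and prefix the value with "Basic " directly.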

How do I clone a voice and use it across languages? (Python)

The example below uses requests directly. The Python SDK inworld-framework-py is dormant, so calling the API directly is the recommended path. Step 1 clones the voice from one reference sample. Step 2 uses the resulting voiceId to synthesize in four languages.

Step 1: clone the voice

import requests
import base64
import os

# pip install requests
INWORLD_API_KEY = os.environ["INWORLD_API_KEY"]  # Pre-encoded base64(key:secret) credential for Basic auth

# Step 1: clone a voice from 5-60 seconds of original human-recorded audio.
# Use WAV/MP3/WEBM up to ~4MB. Multiple samples are supported.
with open("reference_voice_english.wav", "rb") as f:
    sample_b64 = base64.b64encode(f.read()).decode("utf-8")

clone_response = requests.post(
    "https://api.inworld.ai/voices/v1/voices:clone",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "displayName": "BrandVoice_EN",
        "langCode": "EN_US",            # Language of the reference sample
        "voiceSamples": [
            {"audioData": sample_b64}
        ],
        "description": "Calm, friendly product voice",
        "audioProcessingConfig": {
            "removeBackgroundNoise": True
        }
    },
    timeout=60,
)
clone_response.raise_for_status()

cloned_voice_id = clone_response.json()["voice"]["voiceId"]
print(f"Cloned voiceId: {cloned_voice_id}")

The reference sample should be 5 to 60 seconds of clean human speech. Multiple samples can be passed in the voiceSamples array for higher-fidelity clones. The May 2026 release raised the reference cap from the original 15 seconds to 60 seconds, which materially improves clone quality on edge accents and lower-resource languages.
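
Checking a sample locally before upload avoids a wasted clone request. The sketch below covers WAV files only, using the standard-library wave module, and applies the 5-to-60-second window and the size limit mentioned above. The function name is hypothetical, and reading "~4MB" as 4 MiB is an assumption.

```python
import os
import wave

MIN_SECONDS = 5.0
MAX_SECONDS = 60.0
MAX_BYTES = 4 * 1024 * 1024  # assumption: "~4MB" read as 4 MiB


def check_reference_wav(path: str) -> list:
    """Return a list of problems; an empty list means the file passes."""
    problems = []

    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, above the {MAX_BYTES}-byte limit")

    # Duration = number of frames / frames per second.
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()

    if duration < MIN_SECONDS:
        problems.append(f"sample is {duration:.1f}s, shorter than {MIN_SECONDS:.0f}s")
    elif duration > MAX_SECONDS:
        problems.append(f"sample is {duration:.1f}s, longer than {MAX_SECONDS:.0f}s")

    return problems
```

This only checks length and size. It says nothing about noise, clipping, or whether the recording is original human speech, which still need a human listen.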

Step 2: synthesize across languages

import requests
import base64
import os

INWORLD_API_KEY = os.environ["INWORLD_API_KEY"]

# Step 2: use the cloned voiceId in TTS-2 across multiple languages.
# The same voiceId preserves identity in every target language.
TARGETS = [
    ("en", "Welcome back. Here is a summary of your account."),
    ("es", "Bienvenido de nuevo. Aquí tienes un resumen de tu cuenta."),
    ("ja", "お帰りなさいませ。アカウントの要約をお伝えします。"),
    ("fr", "Bon retour. Voici un résumé de votre compte."),
]

for lang, text in TARGETS:
    response = requests.post(
        "https://api.inworld.ai/tts/v1/voice",
        headers={
            "Authorization": f"Basic {INWORLD_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "voiceId": "YOUR_CLONED_VOICE_ID",   # From Step 1
            "modelId": "inworld-tts-2",          # Research preview
            "text": text,
            "language": lang,                     # BCP-47 hint
            "deliveryMode": "BALANCED",          # STABLE / BALANCED / CREATIVE
            "audioConfig": {
                "audioEncoding": "MP3",
                "sampleRateHertz": 24000
            }
        },
        timeout=30,
    )
    response.raise_for_status()
    audio = base64.b64decode(response.json()["audioContent"])
    with open(f"out_{lang}.mp3", "wb") as f:
        f.write(audio)
    print(f"Wrote out_{lang}.mp3")

The same voiceId drives synthesis in English, Spanish, Japanese, and French. The optional language field is a BCP-47 hint that helps the model select the right phonological track. The deliveryMode parameter is TTS-2 only: use BALANCED as the default, STABLE for consistent broadcast-style delivery, and CREATIVE for maximum emotional range.
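
When many call sites build TTS requests, a small payload builder keeps the optional language hint and the deliveryMode values in one place. This is a sketch that reuses only the request fields shown in the example above. The helper name and its validation rules are assumptions, not part of the API.

```python
from typing import Optional

# The three delivery modes named in the example above.
DELIVERY_MODES = {"STABLE", "BALANCED", "CREATIVE"}


def build_tts_payload(
    voice_id: str,
    text: str,
    language: Optional[str] = None,
    delivery_mode: str = "BALANCED",
) -> dict:
    """Assemble a TTS-2 request body, rejecting obviously bad input early."""
    if delivery_mode not in DELIVERY_MODES:
        raise ValueError(f"delivery_mode must be one of {sorted(DELIVERY_MODES)}")
    if not text.strip():
        raise ValueError("text must not be empty")

    payload = {
        "voiceId": voice_id,
        "modelId": "inworld-tts-2",
        "text": text,
        "deliveryMode": delivery_mode,
        "audioConfig": {"audioEncoding": "MP3", "sampleRateHertz": 24000},
    }
    # The language hint is optional, so include it only when provided.
    if language is not None:
        payload["language"] = language
    return payload
```

Pass the result as the json argument of requests.post, exactly where the inline dictionary sits in the example above.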

Streaming: same flow, NDJSON response

import requests
import base64
import json
import os

INWORLD_API_KEY = os.environ["INWORLD_API_KEY"]

# Streaming returns NDJSON: each line is {"result": {"audioContent": "base64..."}}
# Parse line by line and decode base64 before writing audio.
response = requests.post(
    "https://api.inworld.ai/tts/v1/voice:stream",
    headers={
        "Authorization": f"Basic {INWORLD_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "voiceId": "YOUR_CLONED_VOICE_ID",
        "modelId": "inworld-tts-2",
        "text": "こんにちは。今日は言語を超えて同じ声で話しています。",
        "language": "ja",
        "deliveryMode": "BALANCED",
        "audioConfig": {
            "audioEncoding": "MP3",
            "sampleRateHertz": 24000
        }
    },
    stream=True,
    timeout=60,
)
response.raise_for_status()

with open("streamed_ja.mp3", "wb") as out:
    for line in response.iter_lines():
        if not line:
            continue
        data = json.loads(line)
        chunk = base64.b64decode(data["result"]["audioContent"])
        out.write(chunk)
print("Wrote streamed_ja.mp3")

The streaming endpoint returns newline-delimited JSON. Each line is a JSON object with result.audioContent containing base64-encoded audio for that chunk. Parse line by line, decode base64, and concatenate or pipe to a player.
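
Pulling the parsing step into a generator makes it testable without a network call. The sketch below assumes only the line shape described above, {"result": {"audioContent": "..."}}. The function name is hypothetical, and raising ValueError on a line without audio is a design choice of this sketch, not documented API behavior.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_audio_chunks(lines: Iterable[bytes]) -> Iterator[bytes]:
    """Yield decoded audio bytes from NDJSON lines, skipping blank lines."""
    for line in lines:
        if not line.strip():
            continue  # keep-alive or trailing blank line
        message = json.loads(line)
        encoded = message.get("result", {}).get("audioContent")
        if encoded is None:
            raise ValueError(f"line has no result.audioContent: {line!r}")
        yield base64.b64decode(encoded)
```

With a live response, call it as iter_audio_chunks(response.iter_lines()) and write each chunk to a file or pipe it to a player.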

What languages does a single cloned voice cover?

Inworld AI splits coverage into native-speaker GA quality and experimental cross-lingual coverage.

Two cautions are worth surfacing for production planners. First, "100+ languages" is the cross-lingual reach of the model, not 100+ at native-speaker quality. Always test in your top target locales before committing. Second, TTS-2 is a research preview as of May 28, 2026, so language coverage and per-language quality are actively evolving. The 15 GA languages are the stable baseline.

How does Inworld compare to ElevenLabs, Cartesia, and Resemble on cloning?

Each provider takes a different approach. The table below is a fair side-by-side on the dimensions that matter for cross-lingual production work: sample requirement, language coverage, identity preservation across languages, full-stack integration, and special considerations.

A few honest notes:

ElevenLabs Professional Voice Cloning (PVC) remains the quality benchmark for long-form, non-realtime English content. If your workload is audiobook production or studio dubbing and latency does not matter, that is where to test first.
Cartesia is genuinely fast at streaming latency. For applications where streaming TTFB is the primary constraint and the language list is in scope, it competes well.
Resemble's repositioning around deepfake detection is a market signal worth taking seriously. Detection and provenance are becoming part of the production cloning conversation, not a separate compliance track. We address consent and detection in the ethics section below.
PlayHT also offers cloning. Their site was not consistently reachable during the May 2026 audit, so we are omitting specific claims rather than guessing. Re-verify before any provider-specific decision.

Should I use AI-generated audio or original human recordings for cloning?

Always clone from the original human-recorded audio. Cloning on top of synthetic audio compounds artifacts: the second clone learns the imperfections of the first generator, layered with the imperfections of the second, and quality degrades. The degradation is most audible on cross-lingual synthesis, where phonetic detail under the speaker embedding matters most.

This shows up most often when teams migrate from one TTS provider to another. The instinct is to capture the existing synthetic voice from the old provider and clone that. Do not do it. Pull the original source recordings, even if they are years old, even if they need cleanup, and clone from those. The reference cap on Inworld AI cloning was extended to 60 seconds in part to give teams more room for clean source-audio capture.

What are the consent and ethics guardrails?

Cross-lingual voice cloning sharpens an ethical surface that single-language cloning already had. A consented English clone can now produce a recording in Japanese, a use that falls outside the scope the original speaker authorized. Operationalize the guardrails:

Documented consent with explicit scope. Languages, use cases, duration, and revocation terms in writing before reference audio is uploaded. For employees, branded voices, and any public-figure recording, this is non-negotiable.
Original audio only. Cloning on top of someone else's generated voice is both a quality problem and a consent problem.
Provenance and detection. Watermark synthetic audio when the use case warrants. Resemble AI has built DETECT-3B Omni around deepfake detection precisely because the production-cloning industry needs detection plus provenance as standard infrastructure, not an afterthought.
Data residency and ownership. Read the provider terms before uploading. Some providers claim broad, perpetual rights to uploaded voice data. For enterprise customers with branded or celebrity voices, negotiate data terms first.
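
Documented scope is easier to enforce when it is machine-checkable before each synthesis call. The sketch below models a consent record with languages, use cases, a validity window, and an optional revocation date. Every name in it is hypothetical, and it illustrates the bookkeeping only. It is not a legal template.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ConsentRecord:
    speaker: str
    languages: FrozenSet[str]   # e.g. BCP-47 codes the speaker agreed to
    use_cases: FrozenSet[str]   # e.g. "support_agent", "marketing"
    valid_from: date
    valid_until: date
    revoked_on: Optional[date] = None

    def permits(self, language: str, use_case: str, on: date) -> bool:
        """True only if the request falls inside the consented scope."""
        if self.revoked_on is not None and on >= self.revoked_on:
            return False
        if not (self.valid_from <= on <= self.valid_until):
            return False
        return language in self.languages and use_case in self.use_cases
```

Calling permits() immediately before each TTS request means a Japanese synthesis request against an English-and-Spanish consent record is rejected in code instead of being caught in review.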

How do I integrate a cloned voice into a realtime voice agent?

A cloned voiceId is just another voice. In the Realtime API, set audio.output.voice to the cloned voice and audio.output.model to inworld-tts-2. The same identity carries through full-duplex conversation. In the model-agnostic Realtime API, you can switch the underlying LLM through the Router (220+ LLMs from OpenAI, Anthropic, Google, Groq, Fireworks, Mistral, DeepSeek, and others) without changing the TTS voice, so the agent persona is decoupled from the reasoning model.
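
As a rough sketch of the two settings named above, the helper below builds the audio output portion of a session configuration. It assumes the dotted paths audio.output.voice and audio.output.model map to nested JSON objects. The helper name and the surrounding envelope are assumptions, so check the Realtime API reference for the exact session message.

```python
def realtime_audio_output(voice_id: str, model: str = "inworld-tts-2") -> dict:
    """Nested config fragment for audio.output.voice and audio.output.model.

    Assumption: dotted field paths correspond to nested JSON objects.
    """
    if not voice_id:
        raise ValueError("voice_id must be a non-empty string")
    return {"audio": {"output": {"voice": voice_id, "model": model}}}
```

Because the voice lives in this fragment and nowhere else, swapping the LLM behind the agent leaves the fragment unchanged.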

For batch and asynchronous workflows (notifications, dubbing pipelines, content production), call /tts/v1/voice non-streaming with the cloned voiceId. For interactive sessions, call /tts/v1/voice:stream and consume the NDJSON line by line. Median streaming time to first audio byte (TTFB) on Realtime TTS-2 is under 200 ms; for end-to-end realtime conversational latency, use the Realtime API rather than chaining the underlying APIs manually.

When should I pick which provider?

A short decision guide for the most common production scenarios.

Realtime voice agents and companions across multiple languages. Inworld AI. One clone, 15 GA + 90+ experimental languages, full-stack integration (TTS, STT, Realtime API, Router), 60-second instant voice cloning (IVC) reference window.
Long-form English audiobooks and studio dubbing where latency does not matter. ElevenLabs PVC. The professional service is still the benchmark for top-end English fidelity.
Streaming-latency-sensitive single-language consumer apps. Cartesia is competitive on streaming TTFB and worth side-by-side testing.
Regulated industries that require provenance, watermarking, and on-premise deployment as primary requirements. Resemble AI, with detection front-and-center.
Multilingual product where consent and language coverage are both load-bearing. Inworld AI. Use the 2-step API, capture documented consent with scope per language, retain original audio in your own storage, and consider watermarking sensitive synthetic output.

FAQ

What is cross-lingual voice cloning?

Cross-lingual voice cloning is the ability to clone a speaker once, in a single language, and then synthesize speech in that same voice across other languages without re-recording reference audio per locale. The cloned voice keeps the speaker's identity, timbre, and style while pronouncing target-language phonemes natively. Inworld AI Realtime TTS-2 (research preview) clones from 5 to 60 seconds of reference audio and supports 15 languages at native-speaker quality plus 90+ additional languages with cross-lingual coverage.

How many languages does a single cloned voice cover?

With Realtime TTS-2 (research preview), one clone covers 15 GA languages at native-speaker quality plus 90+ experimental languages with cross-lingual identity, for over 100 languages total. ElevenLabs Multilingual v2 and Eleven v3 cover roughly 70 languages from a single clone. Cartesia Sonic 3.5 covers a smaller set of supported locales. Per-language quality varies across all providers, so test with your actual locales.

How do I clone a voice on Inworld AI?

Voice cloning on Inworld AI is a 2-step API. Step 1: POST to /voices/v1/voices:clone with a displayName, langCode, and voiceSamples array containing base64-encoded reference audio. The response returns a voiceId. Step 2: pass that voiceId into /tts/v1/voice or /tts/v1/voice:stream in subsequent TTS requests. There is no referenceAudio field on the TTS endpoint, and the older /tts/v1/voices listing endpoint is deprecated effective July 1, 2026.

Should I use AI-generated audio or original human recordings for cloning?

Always use original human recordings. Cloning on top of synthetic audio compounds artifacts and degrades the resulting clone, especially on cross-lingual synthesis where phonetic detail matters most. If you are migrating from another provider, re-clone from your original source files, not from the output of the previous TTS system.

What are the ethical guardrails for cross-lingual voice cloning?

Always obtain explicit, documented consent from the voice owner before cloning, with clear scope (languages, use cases, duration). Watermark synthetic audio where the use case warrants it. Resemble AI has repositioned around DETECT-3B Omni deepfake detection, a signal that the industry is converging on detection plus provenance as part of any production cloning workflow. For branded or celebrity voices, negotiate data ownership terms before uploading reference audio.

Published by Inworld AI. Voice cloning API contracts, language coverage, and competitor product lists verified May 28, 2026 against docs.inworld.ai, the live API reference, and competitor websites. Realtime TTS-2 is in research preview and details may evolve.
