JavaScript TTS API Tutorial: Text-to-Speech in 5 Lines (2026)

Q: How do I add text-to-speech to a JavaScript app?

Use the built-in fetch API to POST to https://api.inworld.ai/tts/v1/voice with your text, voiceId, and modelId in the JSON body. Decode the base64 audioContent from the response and write it to a file with fs.writeFileSync.

Q: Does Realtime TTS support streaming in Node.js?

Yes. Use the /tts/v1/voice:stream endpoint. The response is NDJSON where each line contains a JSON object with a base64-encoded audio chunk. Parse line-by-line and decode each chunk with Buffer.from(chunk, 'base64').

Q: How do I clone a voice with the Inworld JavaScript API?

POST to /voices/v1/voices:clone with a displayName, langCode, and voiceSamples array containing base64-encoded audio. The API returns a voice object with a custom voiceId you can use in subsequent TTS calls.

Last updated: April 13, 2026

Inworld AI's Realtime TTS-2 (research preview) is the #1 realtime TTS on the Artificial Analysis Realtime TTS Arena. This tutorial shows how to call the Inworld AI TTS API from JavaScript and Node.js, starting with a 5-line example and building up to streaming, voice cloning, and a complete voice pipeline. Every code block is copy-paste ready. You need a free API key from platform.inworld.ai and Node.js 18+ (for native fetch and top-level await).

const API_KEY = process.env.INWORLD_API_KEY;
const response = await fetch("https://api.inworld.ai/tts/v1/voice", {
  method: "POST",
  headers: { "Authorization": `Basic ${API_KEY}`, "Content-Type": "application/json" },
  body: JSON.stringify({ voiceId: "Sarah", modelId: "inworld-tts-1.5-max", text: "Hello world",
    audioConfig: { audioEncoding: "MP3", sampleRateHertz: 24000 } })
});
if (!response.ok) throw new Error(`TTS request failed: ${response.status}`);
const { audioContent } = await response.json();
require("fs").writeFileSync("output.mp3", Buffer.from(audioContent, "base64"));

That is all it takes to generate speech. The rest of this guide covers streaming with NDJSON parsing, voice cloning, long-text chunking, and a full STT-to-LLM-to-TTS pipeline.

How Do I Call the Realtime TTS API from JavaScript?

The synchronous endpoint accepts a JSON payload with three required fields (voiceId, modelId, text) plus an audioConfig object that sets the audio encoding and sample rate. Authentication uses Basic auth with your API key. The response returns a JSON object with base64-encoded audio in the audioContent field. You must decode it before writing to a file.

import fs from "node:fs";

// No SDK required - just fetch
const API_KEY = "your_api_key_here"; // From https://platform.inworld.ai

const response = await fetch("https://api.inworld.ai/tts/v1/voice", {
  method: "POST",
  headers: {
    "Authorization": `Basic ${API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    voiceId: "Sarah",
    modelId: "inworld-tts-1.5-max",
    text: "Welcome to our application. This audio was generated with the Realtime TTS API.",
    audioConfig: {
      audioEncoding: "MP3",
      sampleRateHertz: 24000
    }
  })
});

if (!response.ok) {
  throw new Error(`TTS request failed: ${response.status} ${response.statusText}`);
}

const result = await response.json();
const audioBytes = Buffer.from(result.audioContent, "base64");

fs.writeFileSync("output.mp3", audioBytes);
console.log(`Saved ${audioBytes.length} bytes to output.mp3`);

A few details:

Node.js 18+ includes native fetch and top-level await in .mjs files. No external HTTP library needed.
voiceId selects the voice. The default is "Sarah." Browse the full voice library across 15 GA languages (TTS 1.5) or 100+ languages (TTS-2 cross-lingual) via GET /voices/v1/voices.
modelId selects the model. inworld-tts-1.5-max optimizes for quality. inworld-tts-1.5-mini optimizes for realtime latency if speed matters more than fidelity.
audioConfig sets the encoding and sample rate. The default is MP3 at 24kHz. Set audioEncoding to LINEAR16, WAV, OGG_OPUS, MULAW, ALAW, or FLAC, and sampleRateHertz to your preferred rate (8000-48000).
Max input is 2,000 characters per request. For longer text, chunk at sentence boundaries (see the chunking section below).

const API_KEY = "your_api_key_here";

const response = await fetch("https://api.inworld.ai/voices/v1/voices", {
  headers: { "Authorization": `Basic ${API_KEY}` }
});

const { voices } = await response.json();
console.log(`Available voices: ${voices.length}`);
voices.slice(0, 10).forEach(v => {
  console.log(`  ${v.voiceId}: ${v.displayName}`);
});

How Do I Stream TTS Audio in Node.js?

Streaming is the recommended approach for any interactive application. Instead of waiting for the entire audio file to be generated, the streaming endpoint (/tts/v1/voice:stream) returns audio chunks as they are synthesized at realtime latency.

The response format is NDJSON (newline-delimited JSON). Each line is a standalone JSON object containing a base64-encoded audio chunk. This is not raw binary. You must parse each line with JSON.parse and then decode the base64 audioContent with Buffer.from.

import fs from "node:fs";

const API_KEY = "your_api_key_here";

const text = `This is a longer passage that benefits from streaming.
The API returns audio chunks as they are generated,
so playback can start before the full synthesis is complete.`;

const response = await fetch("https://api.inworld.ai/tts/v1/voice:stream", {
  method: "POST",
  headers: {
    "Authorization": `Basic ${API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    voiceId: "Sarah",
    modelId: "inworld-tts-1.5-max",
    text,
    audioConfig: {
      audioEncoding: "MP3",
      sampleRateHertz: 24000
    }
  })
});

if (!response.ok) {
  throw new Error(`Stream request failed: ${response.status}`);
}

// Parse NDJSON: each line is a JSON object with base64-encoded audio
const decoder = new TextDecoder();
const reader = response.body.getReader();
const audioChunks = [];
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop(); // Keep incomplete line in buffer

  for (const line of lines) {
    if (line.trim()) {
      const data = JSON.parse(line);
      const chunk = Buffer.from(data.result.audioContent, "base64");
      audioChunks.push(chunk);
    }
  }
}

// Process any remaining data in buffer
if (buffer.trim()) {
  const data = JSON.parse(buffer);
  audioChunks.push(Buffer.from(data.result.audioContent, "base64"));
}

const audio = Buffer.concat(audioChunks);
fs.writeFileSync("streamed_output.mp3", audio);
console.log(`Received ${audioChunks.length} chunks, ${audio.length} bytes total`);

The key to correct NDJSON parsing in JavaScript: use a TextDecoder to accumulate the response stream into text, split on newlines, and keep any incomplete trailing line in a buffer for the next iteration. Each complete line is a valid JSON object.

Realtime Playback with Speaker

For applications that need to play audio as it arrives (voice agents, chatbots, accessibility tools), combine streaming with the speaker npm package to write PCM chunks directly to the audio output:

// Realtime playback with speaker (requires: npm install speaker)
import Speaker from "speaker";

const API_KEY = "your_api_key_here";

const response = await fetch("https://api.inworld.ai/tts/v1/voice:stream", {
  method: "POST",
  headers: {
    "Authorization": `Basic ${API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    voiceId: "Sarah",
    modelId: "inworld-tts-1.5-max",
    text: "Streaming audio plays back in realtime as chunks arrive.",
    audioConfig: {
      audioEncoding: "LINEAR16",
      sampleRateHertz: 24000
    }
  })
});

if (!response.ok) {
  const errorBody = await response.text();
  throw new Error(`TTS streaming failed: ${response.status} ${errorBody}`);
}

// Play each chunk as it arrives
const speaker = new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 });

const decoder = new TextDecoder();
const reader = response.body.getReader();
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop();

  for (const line of lines) {
    if (line.trim()) {
      const data = JSON.parse(line);
      const chunk = Buffer.from(data.result.audioContent, "base64");
      speaker.write(chunk);
    }
  }
}

// Flush any remaining buffered data
if (buffer.trim()) {
  const data = JSON.parse(buffer);
  const chunk = Buffer.from(data.result.audioContent, "base64");
  speaker.write(chunk);
}

speaker.end();

The key difference from the file-saving example: set audioEncoding to LINEAR16 and sampleRateHertz to 24000, then write raw PCM bytes to the speaker. Each chunk plays the moment it arrives. Users hear the first word at realtime latency while the rest of the sentence is still being generated.

Common Streaming Mistakes

Avoid these patterns that look correct but produce broken audio:

Reading response.arrayBuffer() directly treats the response as raw binary. The stream is NDJSON (JSON text), not binary audio. You will get corrupted output.
Writing response bytes directly to a file without base64 decoding. The response body is JSON text, not audio bytes.
Using voice instead of voiceId in the request body. The REST TTS API uses voiceId. The voice field is for the Realtime API WebSocket protocol.
Using model instead of modelId. The REST TTS API uses modelId. The model field is for the Router API.

How Do I Clone a Voice with JavaScript?

Voice cloning creates a custom voice from a 5-15 second audio sample. The cloned voice can then be used in any TTS call. Samples longer than 15 seconds are automatically trimmed. Supported formats: wav, mp3, webm. Maximum file size: 4MB.

This is a 2-step process: first clone the voice to get a voiceId, then use that ID in regular TTS calls.

import fs from "node:fs";

const API_KEY = "your_api_key_here";

// Step 1: Clone a voice from an audio sample (5-15 seconds, wav/mp3/webm, max 4MB)
const audioData = fs.readFileSync("voice_sample.wav").toString("base64");

const cloneResponse = await fetch("https://api.inworld.ai/voices/v1/voices:clone", {
  method: "POST",
  headers: {
    "Authorization": `Basic ${API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    displayName: "my-custom-voice",
    langCode: "EN_US",
    voiceSamples: [{ audioData }]
  })
});

if (!cloneResponse.ok) {
  throw new Error(`Clone failed: ${cloneResponse.status} ${await cloneResponse.text()}`);
}

const clonedVoice = await cloneResponse.json();
const customVoiceId = clonedVoice.voice.voiceId;
console.log(`Cloned voice ID: ${customVoiceId}`);

// Step 2: Use the cloned voice for TTS
const ttsResponse = await fetch("https://api.inworld.ai/tts/v1/voice", {
  method: "POST",
  headers: {
    "Authorization": `Basic ${API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    voiceId: customVoiceId,
    modelId: "inworld-tts-1.5-max",
    text: "This speech uses my cloned voice.",
    audioConfig: {
      audioEncoding: "MP3",
      sampleRateHertz: 24000
    }
  })
});

if (!ttsResponse.ok) {
  throw new Error(`TTS failed: ${ttsResponse.status} ${await ttsResponse.text()}`);
}

const { audioContent } = await ttsResponse.json();
fs.writeFileSync("cloned_voice_output.mp3", Buffer.from(audioContent, "base64"));

The endpoint for cloning is POST /voices/v1/voices:clone, which is separate from the TTS endpoint. The cloned voiceId works exactly like any built-in voice. Pass it to the synchronous or streaming endpoint. Each account can create up to 1,000 cloned voices.

How Do I Build a Complete Voice Pipeline in JavaScript?

A voice pipeline chains three APIs into one flow: STT transcribes audio input, an LLM generates a response, and TTS converts that response back to speech. With Inworld, all three steps use the same API key and authentication.

import fs from "node:fs";

const API_KEY = "your_api_key_here";

const headers = {
  "Authorization": `Basic ${API_KEY}`,
  "Content-Type": "application/json"
};

// Step 1: Transcribe audio with Realtime STT
const audioInput = fs.readFileSync("user_audio.wav").toString("base64");

const sttResponse = await fetch("https://api.inworld.ai/stt/v1/transcribe", {
  method: "POST",
  headers,
  body: JSON.stringify({
    transcribeConfig: {
      modelId: "inworld/inworld-stt-1",
      audioEncoding: "AUTO_DETECT",
      language: "en-US"
    },
    audioData: {
      content: audioInput
    }
  })
});

if (!sttResponse.ok) throw new Error(`STT failed: ${sttResponse.status}`);
const sttResult = await sttResponse.json();
const transcript = sttResult.transcription.transcript;
console.log(`User said: ${transcript}`);

// Step 2: Send transcript to an LLM via Inworld Router
const llmResponse = await fetch("https://api.inworld.ai/v1/chat/completions", {
  method: "POST",
  headers,
  body: JSON.stringify({
    model: "openai/gpt-5.5",
    messages: [
      { role: "system", content: "You are a helpful voice assistant. Keep responses under 200 words." },
      { role: "user", content: transcript }
    ]
  })
});

if (!llmResponse.ok) throw new Error(`Router failed: ${llmResponse.status}`);
const llmResult = await llmResponse.json();
const reply = llmResult.choices[0].message.content;
console.log(`Assistant: ${reply}`);

// Step 3: Convert the LLM reply to speech with Realtime TTS (streaming)
const ttsResponse = await fetch("https://api.inworld.ai/tts/v1/voice:stream", {
  method: "POST",
  headers,
  body: JSON.stringify({
    voiceId: "Sarah",
    modelId: "inworld-tts-1.5-max",
    text: reply,
    audioConfig: {
      audioEncoding: "MP3",
      sampleRateHertz: 24000
    }
  })
});

if (!ttsResponse.ok) throw new Error(`TTS failed: ${ttsResponse.status}`);

const decoder = new TextDecoder();
const reader = ttsResponse.body.getReader();
const audioChunks = [];
let lineBuffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  lineBuffer += decoder.decode(value, { stream: true });
  const lines = lineBuffer.split("\n");
  lineBuffer = lines.pop();

  for (const line of lines) {
    if (line.trim()) {
      const data = JSON.parse(line);
      audioChunks.push(Buffer.from(data.result.audioContent, "base64"));
    }
  }
}

if (lineBuffer.trim()) {
  const data = JSON.parse(lineBuffer);
  audioChunks.push(Buffer.from(data.result.audioContent, "base64"));
}

const audio = Buffer.concat(audioChunks);
fs.writeFileSync("response.mp3", audio);
console.log("Full pipeline complete: transcription -> reasoning -> speech");

This pipeline uses three Inworld APIs:

STT (/stt/v1/transcribe) converts user audio to text using transcribeConfig and audioData fields
Router (/v1/chat/completions) sends the transcript to any LLM (the Router routes to 200+ models from major providers)
TTS (/tts/v1/voice:stream) converts the LLM reply back to speech with streaming

Note how the Router uses model (not modelId) because it follows the OpenAI Chat Completions format. The TTS endpoint uses voiceId and modelId. Different APIs, different field names.

For production voice agents that need lower latency and bidirectional audio, the Inworld Realtime API handles all three steps over a single WebSocket connection with built-in turn detection and barge-in support.

How Do I Handle Long Text in JavaScript?

The TTS API accepts a maximum of 2,000 characters per request. For longer content (articles, documentation, email bodies), split the text at sentence boundaries and synthesize each chunk separately:

function chunkText(text, maxChars = 1500) {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks = [];
  let current = "";

  for (const sentence of sentences) {
    if (current.length + sentence.length + 1 > maxChars) {
      if (current) chunks.push(current.trim());
      current = sentence;
    } else {
      current = current ? `${current} ${sentence}` : sentence;
    }
  }
  if (current) chunks.push(current.trim());
  return chunks;
}

// Usage: synthesize long text in chunks
const longText = "..."; // Any length
for (const [i, chunk] of chunkText(longText).entries()) {
  const response = await fetch("https://api.inworld.ai/tts/v1/voice:stream", {
    method: "POST",
    headers: {
      "Authorization": `Basic ${API_KEY}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({
      voiceId: "Sarah",
      modelId: "inworld-tts-1.5-max",
      text: chunk,
      audioConfig: { audioEncoding: "MP3", sampleRateHertz: 24000 }
    })
  });
  // Process each chunk's streaming NDJSON response...
}

Keep chunks between 500 and 1,600 characters. Splitting mid-sentence creates unnatural pauses. Splitting at paragraph or sentence boundaries preserves natural prosody.

How Does Inworld Compare to Other JavaScript TTS Options?

Sources: Artificial Analysis TTS leaderboard (May 2026), ElevenLabs docs, OpenAI TTS docs.

The biggest differentiator for JavaScript developers: Inworld requires zero SDK installation. The entire API surface is accessible with native fetch, which ships with Node.js 18+ and every modern browser. No proprietary client, no version conflicts, no dependency tree.

Frequently Asked Questions

How do I add text-to-speech to a JavaScript app?

Use the built-in fetch API to POST to the /tts/v1/voice endpoint at api.inworld.ai with your text, voiceId, modelId, and audioConfig in the JSON body. Decode the base64 audioContent from the response with Buffer.from(audioContent, "base64") and write it to a file. Five lines of code, no SDK required. Works in Node.js 18+ and modern browsers.

Does Realtime TTS support streaming in Node.js?

Yes. Use the /tts/v1/voice:stream endpoint. The response is NDJSON where each line contains a JSON object with a base64-encoded audio chunk. Read the stream with response.body.getReader(), accumulate text with TextDecoder, split on newlines, and decode each chunk. First audio arrives at realtime latency.

How do I clone a voice with the Inworld JavaScript API?

POST to the /voices/v1/voices:clone endpoint at api.inworld.ai with a displayName, langCode, and voiceSamples array containing base64-encoded audio (5-15 seconds, wav/mp3/webm, max 4MB). The API returns a voice object with a custom voiceId you can use in any subsequent TTS call. Up to 1,000 cloned voices per account.

What is the best TTS API for JavaScript developers?

Inworld AI's Realtime TTS-2 (research preview) is the #1 realtime TTS on the Artificial Analysis Realtime TTS Arena, based on thousands of blind comparisons. It delivers sub-200ms TTFT with cross-lingual voice identity across 100+ languages (15 GA + 90+ experimental). For the lowest latency, choose inworld-tts-1.5-mini.

Can I build a full voice pipeline in JavaScript?

Yes. Combine Realtime STT (transcription), Router (LLM reasoning across 200+ models from major providers), and TTS (speech output) in a single Node.js script. All three APIs share the same API key and Basic auth. For realtime bidirectional voice, the Inworld Realtime API handles the full pipeline over WebSocket.

What is the maximum text length for Realtime TTS?

2,000 characters per request. For longer text, chunk at sentence boundaries (500-1,600 characters per chunk) and make multiple streaming requests. The chunking example above shows how to split text cleanly without breaking mid-sentence.

What audio formats does Realtime TTS support?

MP3 (default), LINEAR16, WAV, OGG_OPUS, MULAW, ALAW, and FLAC. Sample rates from 8kHz to 48kHz. Set these via the audioConfig object in your request. Use LINEAR16 at 24kHz for realtime playback with the speaker npm package.

Do I need an SDK to use Realtime TTS in Node.js?

No. The API is a standard REST endpoint. Node.js 18+ includes native fetch, so you need zero dependencies to call the API. This makes Realtime TTS straightforward to integrate into any Node.js project, Express server, Next.js API route, or serverless function.

How to Add Text-to-Speech to a JavaScript App with Inworld AI