How to Create Multilingual Audiobooks with ElevenLabs API: Voice Cloning, Text Splitting & Chapter Automation in Python
Build a Complete Multilingual Audiobook Pipeline with ElevenLabs API
ElevenLabs provides one of the most advanced text-to-speech APIs available, offering voice cloning, multilingual synthesis, and high-fidelity audio generation. In this guide, you’ll build a complete Python workflow that takes a book manuscript, splits it into chapters, clones a voice, and generates professional-quality audiobook files in multiple languages — all automated.
Prerequisites and Installation
Before starting, ensure you have Python 3.9+ installed and an ElevenLabs account with API access. A Pro or Scale plan is recommended for voice cloning and higher character limits.
Step 1: Install Dependencies
```shell
pip install elevenlabs pydub requests python-dotenv
```
Step 2: Configure Your API Key
Create a .env file in your project root:
```
ELEVENLABS_API_KEY=YOUR_API_KEY
```
Then load it in your Python script:
```python
import os
from dotenv import load_dotenv

load_dotenv()
API_KEY = os.getenv("ELEVENLABS_API_KEY")
if not API_KEY:
    raise RuntimeError("ELEVENLABS_API_KEY is not set")
```
Voice Cloning Setup
Step 3: Clone a Voice from Audio Samples
Instant Voice Cloning (IVC) requires at least one clean audio sample. For best results, provide 3–5 minutes of clear speech with minimal background noise.
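Before uploading, it helps to confirm you actually have enough material. Here is a stdlib-only sketch for WAV samples (`wav_minutes` is a hypothetical helper name; for MP3 samples you could measure durations with pydub's `AudioSegment` instead):

```python
import wave

def wav_minutes(paths: list[str]) -> float:
    """Total duration of the given WAV samples, in minutes."""
    total_seconds = 0.0
    for p in paths:
        with wave.open(p, "rb") as w:
            # frames / sample rate = duration in seconds
            total_seconds += w.getnframes() / w.getframerate()
    return total_seconds / 60
```

If the total falls short of about three minutes, record more material before cloning.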
```python
import os
import requests

def clone_voice(name: str, sample_paths: list[str]) -> str:
    """Create an Instant Voice Clone from local audio samples."""
    url = "https://api.elevenlabs.io/v1/voices/add"
    headers = {"xi-api-key": API_KEY}
    data = {
        "name": name,
        "description": f"Cloned voice for audiobook narration - {name}",
        "labels": '{"use_case": "audiobook", "language": "multilingual"}',
    }
    # Open the sample files for the multipart upload, and close them afterwards
    handles = [open(p, "rb") for p in sample_paths]
    try:
        files = [
            ("files", (os.path.basename(p), h, "audio/mpeg"))
            for p, h in zip(sample_paths, handles)
        ]
        response = requests.post(url, headers=headers, data=data, files=files)
        response.raise_for_status()
    finally:
        for h in handles:
            h.close()
    voice_id = response.json()["voice_id"]
    print(f"Voice cloned successfully. Voice ID: {voice_id}")
    return voice_id

# Usage
voice_id = clone_voice("Narrator_EN", [
    "samples/narrator_sample1.mp3",
    "samples/narrator_sample2.mp3",
])
```
Text Splitting and Chapter Detection
Step 4: Parse and Split Book Text into Chapters
ElevenLabs has a 5,000-character limit per API request. The functions below split your manuscript on chapter headings, then chunk long chapters into API-safe segments.
```python
import re

def split_into_chapters(text: str) -> list[dict]:
    """Split a manuscript on 'Chapter N ...' headings."""
    pattern = r"(Chapter\s+\d+[^\n]*)"
    # The capture group keeps each matched heading in the split result,
    # so headings sit at odd indices with their bodies right after them.
    parts = re.split(pattern, text, flags=re.IGNORECASE)
    chapters = []
    for i in range(1, len(parts), 2):
        title = parts[i].strip()
        body = parts[i + 1].strip() if i + 1 < len(parts) else ""
        chapters.append({"title": title, "body": body})
    if not chapters:
        chapters.append({"title": "Full Text", "body": text.strip()})
    return chapters

def chunk_text(text: str, max_chars: int = 4500) -> list[str]:
    """Group sentences into chunks no longer than max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        # The `current and` guard avoids emitting an empty first chunk
        # when a single sentence is longer than max_chars.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current.strip())
            current = sentence
        else:
            current += " " + sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks
```
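To see why `split_into_chapters` iterates over the odd indices: because the pattern contains a capture group, `re.split` keeps every matched heading in its output, alternating pre-heading text, heading, body. A quick self-contained check:

```python
import re

sample = "Intro text.\nChapter 1 The Door\nIt was dark.\nChapter 2 The Key\nIt was bright."
parts = re.split(r"(Chapter\s+\d+[^\n]*)", sample, flags=re.IGNORECASE)
print(parts)
# ['Intro text.\n', 'Chapter 1 The Door', '\nIt was dark.\n', 'Chapter 2 The Key', '\nIt was bright.']
```

Index 0 is whatever precedes the first heading (front matter, in a real manuscript), which is why the loop starts at 1.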
Multilingual Audio Generation
Step 5: Generate Audio for Each Chapter and Language
ElevenLabs' eleven_multilingual_v2 model supports 29 languages and detects the target language automatically from the input text, so supplying text written in that language is sufficient.
```python
import requests
from pathlib import Path

TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

def generate_audio(text: str, voice_id: str, output_path: str,
                   model: str = "eleven_multilingual_v2",
                   stability: float = 0.5,
                   similarity: float = 0.75) -> None:
    """Synthesize one text chunk and write the MP3 to output_path."""
    url = TTS_URL.format(voice_id=voice_id)
    headers = {
        "xi-api-key": API_KEY,
        "Content-Type": "application/json",
        "Accept": "audio/mpeg",
    }
    payload = {
        "text": text,
        "model_id": model,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity,
            "style": 0.0,
            "use_speaker_boost": True,
        },
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"Saved: {output_path}")
```
Step 6: Orchestrate the Full Pipeline
```python
import os
import time
from pydub import AudioSegment

def create_audiobook(manuscript_path: str, voice_id: str,
                     languages: list[str], output_dir: str = "output") -> None:
    with open(manuscript_path, "r", encoding="utf-8") as f:
        text = f.read()
    chapters = split_into_chapters(text)
    print(f"Found {len(chapters)} chapters")
    for lang in languages:
        lang_dir = os.path.join(output_dir, lang)
        for ch_idx, chapter in enumerate(chapters, 1):
            chunks = chunk_text(chapter["body"])
            if not chunks:  # skip chapters with empty bodies
                continue
            audio_segments = []
            for chunk_idx, chunk in enumerate(chunks, 1):
                chunk_file = os.path.join(lang_dir, f"ch{ch_idx}_part{chunk_idx}.mp3")
                generate_audio(chunk, voice_id, chunk_file)
                audio_segments.append(AudioSegment.from_mp3(chunk_file))
                time.sleep(1)  # rate-limit buffer between requests
            # Merge chunk audio files into a single chapter file
            combined = sum(audio_segments[1:], audio_segments[0])
            chapter_file = os.path.join(lang_dir, f"chapter_{ch_idx:02d}.mp3")
            combined.export(chapter_file, format="mp3", bitrate="192k")
            print(f"[{lang}] {chapter['title']} -> {chapter_file}")
            # Clean up chunk files
            for chunk_idx in range(1, len(chunks) + 1):
                os.remove(os.path.join(lang_dir, f"ch{ch_idx}_part{chunk_idx}.mp3"))

# Run the pipeline
create_audiobook(
    manuscript_path="book.txt",
    voice_id=voice_id,
    languages=["en", "ko", "ja", "es"]
)
```

Note: For true multilingual output, provide translated text for each language. The eleven_multilingual_v2 model handles pronunciation natively per language but does not perform translation itself. Pair this pipeline with a translation API such as Google Translate or DeepL for end-to-end multilingual production.
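As a sketch of that translation step, here is one way to pre-translate chapters with the DeepL Python SDK (`pip install deepl`). `translate_chapters`, the `DEEPL_TARGETS` mapping, and the `DEEPL_API_KEY` environment variable are illustrative names, not part of the pipeline above:

```python
import os

# Map this pipeline's language codes to DeepL target codes (assumed mapping).
DEEPL_TARGETS = {"en": "EN-US", "ko": "KO", "ja": "JA", "es": "ES"}

def translate_chapters(chapters: list[dict], lang: str) -> list[dict]:
    """Return a copy of `chapters` with each body translated to `lang`."""
    if lang == "en":
        return chapters  # assume the manuscript is already in English
    import deepl  # imported lazily so the English path needs no SDK
    translator = deepl.Translator(os.environ["DEEPL_API_KEY"])
    return [
        {
            "title": ch["title"],
            "body": translator.translate_text(
                ch["body"], target_lang=DEEPL_TARGETS[lang]
            ).text,
        }
        for ch in chapters
    ]
```

Calling `translate_chapters(chapters, lang)` inside the per-language loop, before chunking, would make the pipeline truly multilingual end to end.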
Pro Tips
- Use voice settings strategically: Lower `stability` (0.3–0.4) adds expressiveness for fiction. Higher values (0.7–0.8) suit non-fiction and technical content.
- Add SSML-like pauses: Insert "…" or "—" in text to create natural pauses between paragraphs and scene changes.
- Batch with Projects API: For books over 100,000 characters, use the ElevenLabs Projects API (`/v1/projects`), which handles chunking, stitching, and chapter metadata automatically.
- Monitor usage: Call `GET /v1/user/subscription` to check remaining character quota before long runs.
- Cache voice IDs: Store cloned voice IDs in a config file. Re-cloning the same samples wastes quota and may produce slightly different voice profiles.
- Post-process with FFmpeg: Normalize loudness across chapters: `ffmpeg -i chapter_01.mp3 -af loudnorm -ar 44100 chapter_01_normalized.mp3`
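The usage-monitoring tip can be sketched as a pre-flight check. `remaining_characters` and `quota_sufficient` are hypothetical helpers, and the response fields (`character_count` used so far, `character_limit` per billing period) are assumed from the subscription endpoint:

```python
import requests

def remaining_characters(api_key: str) -> int:
    """Query the subscription endpoint and return the unused character quota."""
    resp = requests.get(
        "https://api.elevenlabs.io/v1/user/subscription",
        headers={"xi-api-key": api_key},
        timeout=10,
    )
    resp.raise_for_status()
    sub = resp.json()
    return sub["character_limit"] - sub["character_count"]

def quota_sufficient(remaining: int, text: str, n_languages: int) -> bool:
    """Rough pre-flight check: every target language synthesizes the full text."""
    return remaining >= len(text) * n_languages
```

Running this before `create_audiobook` avoids burning quota on a run that cannot finish.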
Troubleshooting
| Error | Cause | Solution |
|---|---|---|
| `401 Unauthorized` | Invalid or missing API key | Verify `ELEVENLABS_API_KEY` in your `.env` file. Regenerate the key from the ElevenLabs dashboard if needed. |
| `422 text_too_long` | Text exceeds 5,000-character limit | Ensure `chunk_text()` uses a `max_chars` value below 5,000. The default of 4,500 provides a safe buffer. |
| `429 Too Many Requests` | Rate limit exceeded | Increase the `time.sleep()` delay between requests. Pro plans allow higher concurrency; check your plan limits. |
| Audio sounds robotic or distorted | Poor voice clone samples | Use clean, studio-quality recordings. Remove background noise. Provide at least 3 minutes of varied speech. |
| `pydub` import error | FFmpeg not installed | Install FFmpeg: `sudo apt install ffmpeg` (Linux), `brew install ffmpeg` (macOS), or download from ffmpeg.org (Windows). |
| Wrong language pronunciation | Model mismatch | Use `eleven_multilingual_v2` for non-English text. The `eleven_monolingual_v1` model only supports English. |
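For the `429` case, a retry wrapper with exponential backoff is a common fix. `post_with_retry` is a sketch you could drop in front of the `requests.post` call in `generate_audio`:

```python
import time
import requests

def post_with_retry(url: str, *, headers: dict, json: dict,
                    max_retries: int = 5) -> requests.Response:
    """POST with exponential backoff on 429 responses. Honors a Retry-After
    header when the server sends one; otherwise waits 2, 4, 8, ... seconds."""
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json=json)
        if resp.status_code != 429:
            resp.raise_for_status()  # surface other errors immediately
            return resp
        wait = float(resp.headers.get("Retry-After", 2 ** (attempt + 1)))
        time.sleep(wait)
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts")
```

Honoring `Retry-After` keeps the client polite when the server states exactly how long to back off.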
How many languages does ElevenLabs multilingual model support?
The eleven_multilingual_v2 model supports 29 languages including English, Korean, Japanese, Spanish, French, German, Chinese, Arabic, Hindi, and more. The cloned voice adapts its pronunciation to each target language automatically, though the quality is highest for languages with Latin and CJK scripts.
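If you want to check the supported languages programmatically, the models endpoint lists them per model. `model_languages` is a hypothetical helper, and the response shape (a list of models, each with `model_id` and a `languages` array) is an assumption about `/v1/models`:

```python
import requests

def model_languages(api_key: str,
                    model_id: str = "eleven_multilingual_v2") -> list[str]:
    """Fetch the model catalog and return the language names for one model."""
    resp = requests.get(
        "https://api.elevenlabs.io/v1/models",
        headers={"xi-api-key": api_key},
        timeout=10,
    )
    resp.raise_for_status()
    for model in resp.json():
        if model["model_id"] == model_id:
            return [lang["name"] for lang in model["languages"]]
    raise ValueError(f"Model {model_id!r} not found")
```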
Can I use a cloned voice commercially for audiobooks?
Yes, but only if you have legal rights to the voice. You must own the voice (i.e., it is your own) or have explicit written consent from the voice owner. ElevenLabs’ terms require that cloned voices are used ethically and legally. Commercial audiobook distribution with a cloned voice requires at minimum a Pro plan.
What is the maximum book length this pipeline can handle?
There is no hard limit on book length in the code itself. The practical limit comes from your ElevenLabs character quota. A Pro plan provides approximately 500,000 characters per month. A typical 80,000-word novel is roughly 450,000 characters. For longer works or multiple languages, consider the Scale or Enterprise plan, or spread generation across billing cycles. The Projects API is recommended for books exceeding 100,000 characters as it provides built-in chunking and reliability features.
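That quota arithmetic can be wrapped in a small planning helper (`estimate_quota` is an illustrative name; the 5.6 characters-per-word ratio matches the 80,000-word ≈ 450,000-character figure above):

```python
def estimate_quota(text: str, languages: list[str],
                   chars_per_word: float = 5.6) -> dict:
    """Rough quota math: each target language synthesizes the full text,
    so total billed characters scale linearly with the language count.
    (Translated text varies somewhat in length; this ignores that.)"""
    n_chars = len(text)
    return {
        "approx_words": round(n_chars / chars_per_word),
        "characters_per_language": n_chars,
        "total_characters": n_chars * len(languages),
    }
```

Comparing `total_characters` against your plan's monthly allowance tells you whether a run fits in one billing cycle.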