How to Create Multilingual Audiobooks with ElevenLabs API: Voice Cloning, Text Splitting & Chapter Automation in Python

Build a Complete Multilingual Audiobook Pipeline with ElevenLabs API

ElevenLabs provides one of the most advanced text-to-speech APIs available, offering voice cloning, multilingual synthesis, and high-fidelity audio generation. In this guide, you’ll build a complete Python workflow that takes a book manuscript, splits it into chapters, clones a voice, and generates professional-quality audiobook files in multiple languages — all automated.

Prerequisites and Installation

Before starting, ensure you have Python 3.9+ installed and an ElevenLabs account with API access. A Pro or Scale plan is recommended for voice cloning and higher character limits.

Step 1: Install Dependencies

pip install elevenlabs pydub requests python-dotenv

Step 2: Configure Your API Key

Create a .env file in your project root:

ELEVENLABS_API_KEY=YOUR_API_KEY

Then load it in your Python script:

import os
from dotenv import load_dotenv

load_dotenv()
API_KEY = os.getenv("ELEVENLABS_API_KEY")

Voice Cloning Setup

Step 3: Clone a Voice from Audio Samples

Instant Voice Cloning (IVC) requires at least one clean audio sample. For best results, provide 3–5 minutes of clear speech with minimal background noise.

import requests

def clone_voice(name: str, sample_paths: list[str]) -> str:
    url = "https://api.elevenlabs.io/v1/voices/add"
    headers = {"xi-api-key": API_KEY}
    data = {
        "name": name,
        "description": f"Cloned voice for audiobook narration - {name}",
        "labels": '{"use_case": "audiobook", "language": "multilingual"}'
    }
    files = [
        ("files", (os.path.basename(p), open(p, "rb"), "audio/mpeg"))
        for p in sample_paths
    ]
    response = requests.post(url, headers=headers, data=data, files=files)
    response.raise_for_status()
    voice_id = response.json()["voice_id"]
    print(f"Voice cloned successfully. Voice ID: {voice_id}")
    return voice_id

Usage

voice_id = clone_voice("Narrator_EN", [
    "samples/narrator_sample1.mp3",
    "samples/narrator_sample2.mp3"
])

Text Splitting and Chapter Detection

Step 4: Parse and Split Book Text into Chapters

ElevenLabs has a 5,000-character limit per API request. These functions split your manuscript by chapters and further chunk long chapters into API-safe segments.

import re

def split_into_chapters(text: str) -> list[dict]:
    pattern = r"(Chapter\s+\d+[^\n]*)"
    parts = re.split(pattern, text, flags=re.IGNORECASE)
    chapters = []
    for i in range(1, len(parts), 2):
        title = parts[i].strip()
        body = parts[i + 1].strip() if i + 1 < len(parts) else ""
        chapters.append({"title": title, "body": body})
    if not chapters:
        chapters.append({"title": "Full Text", "body": text.strip()})
    return chapters

def chunk_text(text: str, max_chars: int = 4500) -> list[str]:
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current.strip())
            current = sentence
        else:
            current += " " + sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks

Multilingual Audio Generation

Step 5: Generate Audio for Each Chapter and Language

ElevenLabs' eleven_multilingual_v2 model supports 29 languages. Provide text written in the target language; the model detects the language automatically and adapts pronunciation.

import time
from pathlib import Path

TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

def generate_audio(text: str, voice_id: str, output_path: str,
                   model: str = "eleven_multilingual_v2",
                   stability: float = 0.5, similarity: float = 0.75) -> None:
    url = TTS_URL.format(voice_id=voice_id)
    headers = {
        "xi-api-key": API_KEY,
        "Content-Type": "application/json",
        "Accept": "audio/mpeg"
    }
    payload = {
        "text": text,
        "model_id": model,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity,
            "style": 0.0,
            "use_speaker_boost": True
        }
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"Saved: {output_path}")

Step 6: Orchestrate the Full Pipeline

from pydub import AudioSegment

def create_audiobook(manuscript_path: str, voice_id: str,
                     languages: list[str], output_dir: str = "output") -> None:
    with open(manuscript_path, "r", encoding="utf-8") as f:
        text = f.read()

    chapters = split_into_chapters(text)
    print(f"Found {len(chapters)} chapters")

    for lang in languages:
        lang_dir = os.path.join(output_dir, lang)
        for ch_idx, chapter in enumerate(chapters, 1):
            chunks = chunk_text(chapter["body"])
            if not chunks:
                continue  # Skip chapters with no body text
            audio_segments = []

            for chunk_idx, chunk in enumerate(chunks, 1):
                chunk_file = os.path.join(lang_dir, f"ch{ch_idx}_part{chunk_idx}.mp3")
                generate_audio(chunk, voice_id, chunk_file)
                audio_segments.append(AudioSegment.from_mp3(chunk_file))
                time.sleep(1)  # Rate limit buffer

            # Merge chunk audio files into a single chapter file
            combined = sum(audio_segments[1:], audio_segments[0])
            chapter_file = os.path.join(lang_dir, f"chapter_{ch_idx:02d}.mp3")
            combined.export(chapter_file, format="mp3", bitrate="192k")
            print(f"[{lang}] {chapter['title']} -> {chapter_file}")

            # Clean up chunk files
            for chunk_idx in range(1, len(chunks) + 1):
                os.remove(os.path.join(lang_dir, f"ch{ch_idx}_part{chunk_idx}.mp3"))

# Run the pipeline
create_audiobook(
    manuscript_path="book.txt",
    voice_id=voice_id,
    languages=["en", "ko", "ja", "es"]
)

Note: For true multilingual output, you should provide translated text for each language. The eleven_multilingual_v2 model handles pronunciation natively per language but does not perform translation itself. Pair this pipeline with a translation API such as Google Translate or DeepL for end-to-end multilingual production.
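As a minimal sketch of that pairing, the translation step can be kept behind a pluggable function so any provider (DeepL, Google Translate, or a local model) can be swapped in. The translate_fn interface and the fake_translate helper below are illustrative assumptions, not part of the ElevenLabs API:

```python
from typing import Callable

def translate_chapters(chapters: list[dict],
                       translate_fn: Callable[[str, str], str],
                       target_lang: str) -> list[dict]:
    """Return a new chapter list with title and body translated."""
    return [
        {
            "title": translate_fn(ch["title"], target_lang),
            "body": translate_fn(ch["body"], target_lang),
        }
        for ch in chapters
    ]

# Toy translator for demonstration; replace with a real API call,
# e.g. a wrapper around the DeepL or Google Translate client.
def fake_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"

chapters = [{"title": "Chapter 1", "body": "It was a dark night."}]
print(translate_chapters(chapters, fake_translate, "ko"))
```

With this in place, create_audiobook() can translate each chapter list per language before calling chunk_text(), so every language receives genuinely translated text rather than the original manuscript.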

Pro Tips

  • Use voice settings strategically: Lower stability (0.3–0.4) adds expressiveness for fiction. Higher values (0.7–0.8) suit non-fiction and technical content.
  • Add SSML-like pauses: Insert "…" or "—" in text to create natural pauses between paragraphs and scene changes.
  • Batch with Projects API: For books over 100,000 characters, use the ElevenLabs Projects API (/v1/projects), which handles chunking, stitching, and chapter metadata automatically.
  • Monitor usage: Call GET /v1/user/subscription to check remaining character quota before long runs.
  • Cache voice IDs: Store cloned voice IDs in a config file. Re-cloning the same samples wastes quota and may produce slightly different voice profiles.
  • Post-process with FFmpeg: Normalize loudness across chapters: ffmpeg -i chapter_01.mp3 -af loudnorm -ar 44100 chapter_01_normalized.mp3
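The usage-monitoring tip can be sketched as follows. The character_count and character_limit field names reflect the /v1/user/subscription response at the time of writing; verify them against the current ElevenLabs API reference:

```python
import requests

def remaining_characters(subscription: dict) -> int:
    """Remaining quota given a parsed subscription payload."""
    return subscription["character_limit"] - subscription["character_count"]

def check_quota(api_key: str) -> int:
    """Fetch the current subscription and return remaining characters."""
    resp = requests.get(
        "https://api.elevenlabs.io/v1/user/subscription",
        headers={"xi-api-key": api_key},
    )
    resp.raise_for_status()
    return remaining_characters(resp.json())
```

Call check_quota(API_KEY) before create_audiobook() and abort the run if the remaining quota is below your estimated need.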

Troubleshooting

| Error | Cause | Solution |
| --- | --- | --- |
| 401 Unauthorized | Invalid or missing API key | Verify ELEVENLABS_API_KEY in your .env file. Regenerate the key from the ElevenLabs dashboard if needed. |
| 422 text_too_long | Text exceeds 5,000-character limit | Ensure chunk_text() uses a max_chars value below 5,000. The default of 4,500 provides a safe buffer. |
| 429 Too Many Requests | Rate limit exceeded | Increase the time.sleep() delay between requests. Pro plans allow higher concurrency; check your plan limits. |
| Audio sounds robotic or distorted | Poor voice clone samples | Use clean, studio-quality recordings. Remove background noise. Provide at least 3 minutes of varied speech. |
| pydub import error | FFmpeg not installed | Install FFmpeg: sudo apt install ffmpeg (Linux), brew install ffmpeg (macOS), or download from ffmpeg.org (Windows). |
| Wrong language pronunciation | Model mismatch | Use eleven_multilingual_v2 for non-English text. The eleven_monolingual_v1 model supports English only. |
FAQ

How many languages does the ElevenLabs multilingual model support?

The eleven_multilingual_v2 model supports 29 languages including English, Korean, Japanese, Spanish, French, German, Chinese, Arabic, Hindi, and more. The cloned voice adapts its pronunciation to each target language automatically, though the quality is highest for languages with Latin and CJK scripts.

Can I use a cloned voice commercially for audiobooks?

Yes, but only if you have legal rights to the voice. You must own the voice (i.e., it is your own) or have explicit written consent from the voice owner. ElevenLabs’ terms require that cloned voices are used ethically and legally. Commercial audiobook distribution with a cloned voice requires at minimum a Pro plan.

What is the maximum book length this pipeline can handle?

There is no hard limit on book length in the code itself. The practical limit comes from your ElevenLabs character quota. A Pro plan provides approximately 500,000 characters per month. A typical 80,000-word novel is roughly 450,000 characters. For longer works or multiple languages, consider the Scale or Enterprise plan, or spread generation across billing cycles. The Projects API is recommended for books exceeding 100,000 characters as it provides built-in chunking and reliability features.
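The arithmetic above can be wrapped in a small planning helper. The ~5.6 characters-per-word average (spaces included) is a rough assumption for English prose; actual counts vary by language and style:

```python
def estimate_characters(word_count: int, languages: int = 1,
                        chars_per_word: float = 5.6) -> int:
    """Rough estimate of ElevenLabs characters a run will consume."""
    return round(word_count * chars_per_word * languages)

# An 80,000-word novel in one language: ~448,000 characters,
# close to a Pro plan's full monthly quota.
print(estimate_characters(80_000))
# The same novel in four languages needs roughly 1.8M characters.
print(estimate_characters(80_000, languages=4))
```

Comparing this estimate against the quota reported by GET /v1/user/subscription before starting tells you whether a run fits in the current billing cycle.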
