How to Create Multilingual Audiobooks with ElevenLabs API: Voice Cloning, Text Splitting & Chapter Automation in Python

Build a Complete Multilingual Audiobook Pipeline with ElevenLabs API

ElevenLabs provides one of the most advanced text-to-speech APIs available, offering voice cloning, multilingual synthesis, and high-fidelity audio generation. In this guide, you’ll build a complete Python workflow that takes a book manuscript, splits it into chapters, clones a voice, and generates professional-quality audiobook files in multiple languages — all automated.

Prerequisites and Installation

Before starting, ensure you have Python 3.9+ installed and an ElevenLabs account with API access. A Pro or Scale plan is recommended for voice cloning and higher character limits.

Step 1: Install Dependencies

pip install elevenlabs pydub requests python-dotenv

Step 2: Configure Your API Key

Create a .env file in your project root:

ELEVENLABS_API_KEY=YOUR_API_KEY

Then load it in your Python script:

import os
from dotenv import load_dotenv

load_dotenv()
API_KEY = os.getenv("ELEVENLABS_API_KEY")

Voice Cloning Setup

Step 3: Clone a Voice from Audio Samples

Instant Voice Cloning (IVC) requires at least one clean audio sample. For best results, provide 3–5 minutes of clear speech with minimal background noise.

import requests

def clone_voice(name: str, sample_paths: list[str]) -> str:
    url = "https://api.elevenlabs.io/v1/voices/add"
    headers = {"xi-api-key": API_KEY}
    data = {
        "name": name,
        "description": f"Cloned voice for audiobook narration - {name}",
        "labels": '{"use_case": "audiobook", "language": "multilingual"}'
    }
    files = [
        ("files", (os.path.basename(p), open(p, "rb"), "audio/mpeg"))
        for p in sample_paths
    ]
    response = requests.post(url, headers=headers, data=data, files=files)
    response.raise_for_status()
    voice_id = response.json()["voice_id"]
    print(f"Voice cloned successfully. Voice ID: {voice_id}")
    return voice_id

Usage

voice_id = clone_voice("Narrator_EN", [
    "samples/narrator_sample1.mp3",
    "samples/narrator_sample2.mp3"
])

Text Splitting and Chapter Detection

Step 4: Parse and Split Book Text into Chapters

ElevenLabs has a 5,000-character limit per API request. These functions split your manuscript by chapters and further chunk long chapters into API-safe segments.

import re

def split_into_chapters(text: str) -> list[dict]:
    pattern = r"(Chapter\s+\d+[^\n]*)"
    parts = re.split(pattern, text, flags=re.IGNORECASE)
    chapters = []
    for i in range(1, len(parts), 2):
        title = parts[i].strip()
        body = parts[i + 1].strip() if i + 1 < len(parts) else ""
        chapters.append({"title": title, "body": body})
    if not chapters:
        chapters.append({"title": "Full Text", "body": text.strip()})
    return chapters

def chunk_text(text: str, max_chars: int = 4500) -> list[str]:
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current.strip())
            current = sentence
        else:
            current += " " + sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks

Multilingual Audio Generation

Step 5: Generate Audio for Each Chapter and Language

ElevenLabs' eleven_multilingual_v2 model supports 29 languages. Provide text written in the target language; the model detects the language automatically and adapts pronunciation.

import time
from pathlib import Path

TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

def generate_audio(text: str, voice_id: str, output_path: str,
                   model: str = "eleven_multilingual_v2",
                   stability: float = 0.5, similarity: float = 0.75) -> None:
    url = TTS_URL.format(voice_id=voice_id)
    headers = {
        "xi-api-key": API_KEY,
        "Content-Type": "application/json",
        "Accept": "audio/mpeg"
    }
    payload = {
        "text": text,
        "model_id": model,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity,
            "style": 0.0,
            "use_speaker_boost": True
        }
    }
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"Saved: {output_path}")

Step 6: Orchestrate the Full Pipeline

from pydub import AudioSegment

def create_audiobook(manuscript_path: str, voice_id: str,
                     languages: list[str], output_dir: str = "output") -> None:
    with open(manuscript_path, "r", encoding="utf-8") as f:
        text = f.read()

    chapters = split_into_chapters(text)
    print(f"Found {len(chapters)} chapters")

    for lang in languages:
        lang_dir = os.path.join(output_dir, lang)
        for ch_idx, chapter in enumerate(chapters, 1):
            chunks = chunk_text(chapter["body"])
            if not chunks:
                continue  # Skip chapters with no body text
            audio_segments = []

            for chunk_idx, chunk in enumerate(chunks, 1):
                chunk_file = os.path.join(lang_dir, f"ch{ch_idx}_part{chunk_idx}.mp3")
                generate_audio(chunk, voice_id, chunk_file)
                audio_segments.append(AudioSegment.from_mp3(chunk_file))
                time.sleep(1)  # Rate limit buffer

            # Merge chunk audio files into a single chapter file
            combined = sum(audio_segments[1:], audio_segments[0])
            chapter_file = os.path.join(lang_dir, f"chapter_{ch_idx:02d}.mp3")
            combined.export(chapter_file, format="mp3", bitrate="192k")
            print(f"[{lang}] {chapter['title']} -> {chapter_file}")

            # Clean up chunk files
            for chunk_idx in range(1, len(chunks) + 1):
                os.remove(os.path.join(lang_dir, f"ch{ch_idx}_part{chunk_idx}.mp3"))

# Run the pipeline
create_audiobook(
    manuscript_path="book.txt",
    voice_id=voice_id,
    languages=["en", "ko", "ja", "es"]
)

Note: For true multilingual output, you should provide translated text for each language. The eleven_multilingual_v2 model handles pronunciation natively per language but does not perform translation itself. Pair this pipeline with a translation API such as Google Translate or DeepL for end-to-end multilingual production.
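As a minimal sketch of that pairing, the translation step can be kept behind a pluggable function so any provider (DeepL, Google Translate, or a local model) can be swapped in. The translate_fn interface and the fake_translate helper below are illustrative assumptions, not part of the ElevenLabs API:

```python
from typing import Callable

def translate_chapters(chapters: list[dict],
                       translate_fn: Callable[[str, str], str],
                       target_lang: str) -> list[dict]:
    """Return a new chapter list with title and body translated."""
    return [
        {
            "title": translate_fn(ch["title"], target_lang),
            "body": translate_fn(ch["body"], target_lang),
        }
        for ch in chapters
    ]

# Toy translator for demonstration; replace with a real API call,
# e.g. a wrapper around the DeepL or Google Translate client.
def fake_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"

chapters = [{"title": "Chapter 1", "body": "It was a dark night."}]
print(translate_chapters(chapters, fake_translate, "ko"))
```

With this in place, create_audiobook() can translate each chapter list per language before calling chunk_text(), so every language receives genuinely translated text rather than the original manuscript.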

Pro Tips

  • Use voice settings strategically: Lower stability (0.3–0.4) adds expressiveness for fiction. Higher values (0.7–0.8) suit non-fiction and technical content.
  • Add SSML-like pauses: Insert "…" or "—" in text to create natural pauses between paragraphs and scene changes.
  • Batch with Projects API: For books over 100,000 characters, use the ElevenLabs Projects API (/v1/projects), which handles chunking, stitching, and chapter metadata automatically.
  • Monitor usage: Call GET /v1/user/subscription to check remaining character quota before long runs.
  • Cache voice IDs: Store cloned voice IDs in a config file. Re-cloning the same samples wastes quota and may produce slightly different voice profiles.
  • Post-process with FFmpeg: Normalize loudness across chapters: ffmpeg -i chapter_01.mp3 -af loudnorm -ar 44100 chapter_01_normalized.mp3
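The usage-monitoring tip can be sketched as follows. The character_count and character_limit field names reflect the /v1/user/subscription response at the time of writing; verify them against the current ElevenLabs API reference:

```python
import requests

def remaining_characters(subscription: dict) -> int:
    """Remaining quota given a parsed subscription payload."""
    return subscription["character_limit"] - subscription["character_count"]

def check_quota(api_key: str) -> int:
    """Fetch the current subscription and return remaining characters."""
    resp = requests.get(
        "https://api.elevenlabs.io/v1/user/subscription",
        headers={"xi-api-key": api_key},
    )
    resp.raise_for_status()
    return remaining_characters(resp.json())
```

Call check_quota(API_KEY) before create_audiobook() and abort the run if the remaining quota is below your estimated need.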

Troubleshooting

| Error | Cause | Solution |
| --- | --- | --- |
| 401 Unauthorized | Invalid or missing API key | Verify ELEVENLABS_API_KEY in your .env file. Regenerate the key from the ElevenLabs dashboard if needed. |
| 422 text_too_long | Text exceeds 5,000-character limit | Ensure chunk_text() uses a max_chars value below 5,000. The default of 4,500 provides a safe buffer. |
| 429 Too Many Requests | Rate limit exceeded | Increase the time.sleep() delay between requests. Pro plans allow higher concurrency; check your plan limits. |
| Audio sounds robotic or distorted | Poor voice clone samples | Use clean, studio-quality recordings. Remove background noise. Provide at least 3 minutes of varied speech. |
| pydub import error | FFmpeg not installed | Install FFmpeg: sudo apt install ffmpeg (Linux), brew install ffmpeg (macOS), or download from ffmpeg.org (Windows). |
| Wrong language pronunciation | Model mismatch | Use eleven_multilingual_v2 for non-English text. The eleven_monolingual_v1 model supports English only. |
FAQ

How many languages does the ElevenLabs multilingual model support?

The eleven_multilingual_v2 model supports 29 languages including English, Korean, Japanese, Spanish, French, German, Chinese, Arabic, Hindi, and more. The cloned voice adapts its pronunciation to each target language automatically, though the quality is highest for languages with Latin and CJK scripts.

Can I use a cloned voice commercially for audiobooks?

Yes, but only if you have legal rights to the voice. You must own the voice (i.e., it is your own) or have explicit written consent from the voice owner. ElevenLabs’ terms require that cloned voices are used ethically and legally. Commercial audiobook distribution with a cloned voice requires at minimum a Pro plan.

What is the maximum book length this pipeline can handle?

There is no hard limit on book length in the code itself. The practical limit comes from your ElevenLabs character quota. A Pro plan provides approximately 500,000 characters per month. A typical 80,000-word novel is roughly 450,000 characters. For longer works or multiple languages, consider the Scale or Enterprise plan, or spread generation across billing cycles. The Projects API is recommended for books exceeding 100,000 characters as it provides built-in chunking and reliability features.
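The arithmetic above can be wrapped in a small planning helper. The ~5.6 characters-per-word average (spaces included) is a rough assumption for English prose; actual counts vary by language and style:

```python
def estimate_characters(word_count: int, languages: int = 1,
                        chars_per_word: float = 5.6) -> int:
    """Rough estimate of ElevenLabs characters a run will consume."""
    return round(word_count * chars_per_word * languages)

# An 80,000-word novel in one language: ~448,000 characters,
# close to a Pro plan's full monthly quota.
print(estimate_characters(80_000))
# The same novel in four languages needs roughly 1.8M characters.
print(estimate_characters(80_000, languages=4))
```

Comparing this estimate against the quota reported by GET /v1/user/subscription before starting tells you whether a run fits in the current billing cycle.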
