manim-mcp / ELEVENLABS_SETUP.md
bhaveshgoel07's picture
Deploy code fixes (clean history)
fff13d1

A newer version of the Gradio SDK is available: 6.1.0

Upgrade

ElevenLabs TTS Setup Guide

Overview

ElevenLabs provides high-quality, natural-sounding text-to-speech (TTS) that significantly improves the audio quality of your animations compared to free alternatives.

Why ElevenLabs?

  • βœ… Superior Quality: Most natural-sounding AI voices available
  • βœ… Fast Generation: Typically < 5 seconds for narration
  • βœ… Reliable: Consistent output, no blank audio issues
  • βœ… Multiple Voices: Wide selection of voices for different styles
  • βœ… Emotional Range: Voices can convey emotion and emphasis

Getting Started

Step 1: Create an ElevenLabs Account

  1. Go to elevenlabs.io
  2. Click "Sign Up" (top right)
  3. Choose a plan:
    • Free Tier: 10,000 characters/month (~10 animations)
    • Starter: $5/month for 30,000 characters
    • Creator: $22/month for 100,000 characters
    • Pro: $99/month for 500,000 characters

Step 2: Get Your API Key

  1. Log in to your ElevenLabs account
  2. Click your profile icon (top right)
  3. Select "Profile"
  4. Find the "API Key" section
  5. Click "Copy" to copy your API key
    • It looks like: sk_abc123def456...

Step 3: Configure the Project

Option A: Environment Variable (Recommended)

Create or edit .env file in the project root:

# ElevenLabs Configuration
ELEVENLABS_API_KEY=sk_your_actual_api_key_here

# Optional: Hugging Face as fallback
HUGGINGFACE_API_KEY=hf_your_huggingface_key_here

Option B: Command Line Argument

python orchestrator.py "photosynthesis" --elevenlabs-key sk_your_api_key_here

Option C: Programmatic

from orchestrator import NeuroAnimOrchestrator

orchestrator = NeuroAnimOrchestrator(
    elevenlabs_api_key="sk_your_api_key_here",
    hf_api_key="hf_your_fallback_key_here"
)

Step 4: Install Dependencies

# Activate your virtual environment
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate  # Windows

# Install required packages
pip install httpx gtts pydub

Available Voices

The system comes with 9 pre-configured professional voices:

Voice Name ID Description Best For
rachel 21m00Tcm4TlvDq8ikWAM Clear, neutral female Educational content, narration
adam pNInz6obpgDQGcFmaJgB Deep, confident male Documentary, serious topics
antoni ErXwobaYiN019PkySvjV Well-rounded male General narration
arnold VR6AewLTigWG4xSOukaG Crisp, articulate male Technical content
bella EXAVITQu4vr4xnSDxMaL Soft, gentle female Children's content
domi AZnzlk1XvdvUeBnXmlld Strong female Assertive narration
elli MF3mGyEYCl7XYWbV9V6O Emotional, expressive female Storytelling
josh TxGEqnHWrfWFTfGW9XjX Young, energetic male Youth content
sam yoZ06aMxZJJ28mfd3POQ Raspy male Character voices

Using a Specific Voice

# In your code
tts_result = await tts_generator.generate_speech(
    text="Your narration text",
    output_path=audio_file,
    voice="adam"  # Change to any voice name
)

Using Custom Voices

If you've created custom voices in ElevenLabs:

# Use the voice ID directly
tts_result = await tts_generator.generate_speech(
    text="Your narration text",
    output_path=audio_file,
    voice="your_custom_voice_id_here"
)

Advanced Configuration

Voice Settings

You can fine-tune voice characteristics:

tts_result = await tts_generator.generate_speech(
    text="Your narration text",
    output_path=audio_file,
    voice="rachel",
    stability=0.5,           # 0.0-1.0: Lower = more expressive, Higher = more stable
    similarity_boost=0.75,   # 0.0-1.0: Higher = more similar to original voice
    style=0.0,              # 0.0-1.0: Style exaggeration
    use_speaker_boost=True  # Enhance clarity
)

Stability

  • Low (0.0-0.3): More expressive and variable, good for storytelling
  • Medium (0.4-0.6): Balanced, good for most content (default: 0.5)
  • High (0.7-1.0): Very consistent, good for audiobooks

Similarity Boost

  • Low (0.0-0.4): More creative interpretation
  • Medium (0.5-0.7): Balanced (default: 0.75)
  • High (0.8-1.0): Closest to the original voice

Model Selection

ElevenLabs offers different models:

tts_result = await tts_generator.generate_speech(
    text="Your narration text",
    output_path=audio_file,
    voice="rachel",
    model_id="eleven_monolingual_v1"  # Default, English only, fastest
    # model_id="eleven_multilingual_v2"  # Supports multiple languages
    # model_id="eleven_turbo_v2"  # Faster, slightly lower quality
)

Testing Your Setup

Quick Test Script

Create test_tts.py:

import asyncio
from pathlib import Path
from utils.tts import generate_speech_elevenlabs

async def test_elevenlabs():
    """Test ElevenLabs TTS."""
    text = "Hello! This is a test of ElevenLabs text to speech."
    output = Path("test_audio.mp3")
    
    try:
        result = await generate_speech_elevenlabs(
            text=text,
            output_path=output,
            voice="rachel"
        )
        print(f"βœ… Success! Audio saved to: {output}")
        print(f"Provider: {result['provider']}")
        print(f"File size: {result['file_size_bytes']} bytes")
        
    except Exception as e:
        print(f"❌ Error: {e}")

if __name__ == "__main__":
    asyncio.run(test_elevenlabs())

Run it:

python test_tts.py

Test All Voices

import asyncio
from pathlib import Path
from utils.tts import TTSGenerator

async def test_all_voices():
    """Generate samples of all available voices."""
    tts = TTSGenerator()
    voices = await tts.get_available_voices()
    
    text = "This is a sample of my voice for educational animations."
    
    for voice_name in ["rachel", "adam", "bella"]:
        output = Path(f"voice_sample_{voice_name}.mp3")
        print(f"Generating {voice_name}...")
        
        result = await tts.generate_speech(
            text=text,
            output_path=output,
            voice=voice_name
        )
        print(f"βœ… {voice_name}: {output}")

if __name__ == "__main__":
    asyncio.run(test_all_voices())

How the Fallback System Works

The TTS system has automatic fallback:

1. Try ElevenLabs (if API key available)
   ↓ (if fails)
2. Try Hugging Face TTS (if API key available)
   ↓ (if fails)
3. Try Google TTS (free, always available)

You can disable fallback:

tts_generator = TTSGenerator(
    elevenlabs_api_key="your_key",
    fallback_enabled=False  # Fail immediately if ElevenLabs fails
)

Monitoring Usage

Check Your Usage

  1. Go to elevenlabs.io
  2. Log in
  3. Click "Usage" in the sidebar
  4. View your character usage and remaining quota

Estimate Costs

Rule of thumb: 1 minute of narration β‰ˆ 150-200 words β‰ˆ 900-1200 characters

Free Tier (10,000 chars/month):

  • ~8-10 minutes of narration
  • ~8-10 animations (assuming 1 min each)

Starter ($5/month, 30,000 chars):

  • ~25-30 minutes of narration
  • ~25-30 animations

Creator ($22/month, 100,000 chars):

  • ~80-100 minutes of narration
  • ~80-100 animations

Troubleshooting

Problem: "ElevenLabs API key not provided"

Solution:

  1. Check your .env file exists
  2. Verify ELEVENLABS_API_KEY=sk_... is set correctly
  3. No quotes around the key
  4. No spaces around the =

Problem: "401 Unauthorized"

Solutions:

  1. API key is invalid
  2. API key has expired
  3. Account has been suspended
  4. Check your key at elevenlabs.io/profile

Problem: "429 Too Many Requests"

Solutions:

  1. You've exceeded your quota
  2. Wait for quota to reset (monthly)
  3. Upgrade your plan
  4. Enable fallback to HuggingFace/gTTS

Problem: "Audio file is blank/silent"

Solutions:

  1. Check the output file size (should be > 10KB)
  2. Try a different voice
  3. Check if text is too short (< 10 chars)
  4. Verify audio format is compatible

Problem: "Slow generation"

Solutions:

  1. Use eleven_turbo_v2 model
  2. Check your internet connection
  3. Reduce text length (split long narrations)
  4. Consider caching commonly used phrases

Problem: "Import Error: No module named 'httpx'"

Solution:

pip install httpx gtts pydub

Best Practices

1. Text Preparation

  • Use proper punctuation: Helps with natural pauses
  • Avoid special characters: Stick to alphanumeric and basic punctuation
  • Break long text: Split into shorter segments for better pacing
  • Add pauses: Use ... for longer pauses

Example:

text = """
Photosynthesis is the process by which plants create energy.
It happens in the chloroplasts... using sunlight, water, and carbon dioxide.
The result? Glucose and oxygen!
"""

2. Voice Selection

  • Educational content: Rachel, Arnold
  • Storytelling: Elli, Antoni
  • Technical topics: Adam, Arnold
  • Children's content: Bella, Josh

3. Caching

For repeated phrases, cache the audio:

import hashlib
from pathlib import Path

def get_cached_audio(text: str, voice: str) -> Path:
    """Get cached audio or generate if not exists."""
    text_hash = hashlib.md5(f"{text}:{voice}".encode()).hexdigest()
    cache_path = Path(f"audio_cache/{text_hash}.mp3")
    
    if cache_path.exists():
        return cache_path
    
    # Generate and cache
    cache_path.parent.mkdir(exist_ok=True)
    # ... generate audio ...
    return cache_path

4. Error Handling

Always handle TTS errors gracefully:

try:
    audio = await tts_generator.generate_speech(...)
except Exception as e:
    logger.error(f"TTS failed: {e}")
    # Use fallback or text overlay instead
    return None

Security Best Practices

βœ… DO:

  • Store API keys in .env file
  • Add .env to .gitignore
  • Use environment variables in production
  • Rotate keys periodically
  • Use separate keys for dev/prod

❌ DON'T:

  • Commit API keys to git
  • Share keys in public forums
  • Hard-code keys in source files
  • Use production keys for testing
  • Share keys between team members

Cost Optimization Tips

  1. Use Free Tier First: Test with 10k chars/month
  2. Enable Fallback: Save quota by using free alternatives when needed
  3. Cache Audio: Don't regenerate same narration
  4. Optimize Text: Remove unnecessary words
  5. Batch Processing: Generate multiple animations in one session
  6. Monitor Usage: Set alerts in ElevenLabs dashboard

Getting Help

ElevenLabs Support

Project Issues

  • GitHub Issues: [Your repo URL]
  • Documentation: See README.md
  • Examples: See example.py

Alternative TTS Providers

If ElevenLabs doesn't work for you:

Hugging Face (Free)

HUGGINGFACE_API_KEY=hf_your_key_here
  • Pros: Free, open source
  • Cons: Lower quality, slower

Google TTS (Free)

# No API key needed, automatic fallback
  • Pros: Free, reliable, fast
  • Cons: Robotic voice, limited customization

AWS Polly

# Requires AWS credentials
  • Pros: Good quality, many voices
  • Cons: AWS complexity, pay-per-use

Azure TTS

# Requires Azure subscription
  • Pros: Good quality, multilingual
  • Cons: Microsoft ecosystem, pricing

Next Steps

  1. βœ… Set up your API key
  2. βœ… Test with test_tts.py
  3. βœ… Generate your first animation
  4. βœ… Experiment with different voices
  5. βœ… Optimize settings for your content

Happy animating! πŸŽ¬πŸŽ™οΈ