YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

VibeVoice Long-Form TTS Endpoint

Optimized for 10-30 minute audio generation with modern Python tooling.

πŸ“ Files Overview

  • handler.py - Production handler for HuggingFace Inference Endpoints (requires NVIDIA GPU + Flash Attention 2)
  • handler_dev.py - Development handler for Mac/local testing (CPU/MPS compatible)
  • pyproject.toml - Modern dependency management for development
  • requirements.txt - Production dependencies for HuggingFace deployment

πŸš€ Quick Start

For Mac/Local Development

# Install development dependencies (Mac-compatible, no Flash Attention)
uv sync

# Test API structure locally
python handler_dev.py

# Or test with custom input
python -c "
from handler_dev import DevelopmentVibeVoiceHandler
handler = DevelopmentVibeVoiceHandler()
result = handler({'inputs': 'Speaker 1: Hello world!'})
print(f'Success: {result[\"duration\"]}s mock audio generated')
"

For Production Deployment

  1. Deploy handler.py + requirements.txt to HuggingFace Inference Endpoints
  2. Requires NVIDIA GPU with 20GB+ VRAM and Flash Attention 2

πŸ› οΈ Development Setup (Mac-Compatible)

Prerequisites

  • Python: 3.9+ (3.11 recommended)
  • macOS: Apple Silicon (M1/M2/M3) or Intel
  • uv: Fast Python package manager

Installation

# Clone/create project
mkdir vibevoice-longform-endpoint
cd vibevoice-longform-endpoint

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Set up development environment
uv venv
source .venv/bin/activate
uv sync

# Test development handler
python handler_dev.py

Development Commands

# Install dependencies
uv sync

# Add new dependency
uv add package-name

# Add development dependency
uv add --dev package-name

# Run code formatting
uv run black .
uv run isort .
uv run ruff check .

# Test locally
python handler_dev.py

🎯 Production Deployment

Hardware Requirements

  • GPU: NVIDIA GPU with 20GB+ VRAM (A10G, A100, RTX 4090)
  • Compute: CUDA Compute Capability 7.5+ (RTX 2080+, Tesla V100+)
  • Software: CUDA 11.6+ or 12.x, Flash Attention 2

Deployment Steps

  1. Upload to HuggingFace Hub:

    git init
    git add handler.py requirements.txt README.md
    git commit -m "VibeVoice production handler"
    git remote add origin https://huggingface.co/YOUR_USERNAME/vibevoice-longform
    git push -u origin main
    
  2. Create Inference Endpoint:

  3. Test Production Endpoint:

    import requests
    
    response = requests.post(
        "https://YOUR_ENDPOINT.endpoints.huggingface.cloud",
        headers={"Authorization": "Bearer YOUR_TOKEN"},
        json={
            "inputs": "Speaker 1: Hello! Speaker 2: Hi there!",
            "parameters": {"cfg_scale": 1.2, "ddmp_steps": 6}
        }
    )
    

πŸŽ›οΈ API Parameters

Core Parameters

  • cfg_scale (float, 1.0-1.5): Voice matching strength

    • 1.0-1.1: Faster, more natural speech
    • 1.2-1.3: Balanced (recommended)
    • 1.4-1.5: Better voice cloning, slower
  • ddmp_steps (int, 4-12): Quality vs speed trade-off

    • 4-5: Fastest generation
    • 6-8: Balanced (recommended for long content)
    • 10-12: Highest quality, slower
  • max_new_tokens (int): Maximum generation length

    • 4096: ~5-10 minutes
    • 8192: ~10-20 minutes (default)
    • 12288+: 20+ minutes

Input Formats

{
  "inputs": "Speaker 1: Hello! Speaker 2: Hi there!",
  "parameters": {
    "voice_samples": ["base64_audio_1", "base64_audio_2"],
    "speaker_names": ["Alice", "Bob"],
    "cfg_scale": 1.2,
    "ddmp_steps": 6,
    "output_format": "wav"
  }
}

πŸ“Š Expected Performance

Content Length Generation Time Memory Usage RTF
10 minutes 60-120s 15-18GB 0.1-0.2x
20 minutes 120-240s 18-22GB 0.1-0.2x
30 minutes 180-360s 20-24GB 0.1-0.2x

RTF = Real-Time Factor (generation_time / audio_duration)

🚨 Troubleshooting

Mac Development Issues

Problem: flash-attn build fails on Mac Solution: Use handler_dev.py for local development - it uses SDPA instead of Flash Attention

Problem: VibeVoice model not available Solution: handler_dev.py includes mock mode for API structure testing

Production Issues

Problem: Flash Attention 2 not available Solution: Ensure NVIDIA GPU with CUDA 11.6+ and compute capability 7.5+

Problem: Out of memory Solution: Use A100 (40GB) or reduce content length

πŸ”„ Development Workflow

  1. Local Development: Use handler_dev.py on Mac to test API logic
  2. API Testing: Test request/response structure without full model
  3. Production Deploy: Use handler.py on HuggingFace with NVIDIA GPU
  4. Production Test: Validate with actual Flash Attention 2 and full model

πŸ’‘ Key Differences

Aspect Development (handler_dev.py) Production (handler.py)
Platform Mac/CPU/MPS compatible NVIDIA CUDA only
Attention SDPA (compatible) Flash Attention 2 (required)
Memory Low requirements 20GB+ VRAM needed
Purpose API testing, logic validation Actual audio generation
Output Mock audio for testing High-quality TTS audio

⚠️ Important Notes

  • Flash Attention 2 only works on NVIDIA GPUs - not compatible with Mac/Apple Silicon
  • Development handler is for testing API structure only - use production handler for real audio
  • VibeVoice model downloads automatically on HuggingFace Inference Endpoints
  • Generated audio includes automatic watermarking and AI disclaimers
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support