VibeVoice Long-Form TTS Endpoint

Optimized for 10-30 minute audio generation with modern Python tooling.

📁 Files Overview

handler.py - Production handler for HuggingFace Inference Endpoints (requires NVIDIA GPU + Flash Attention 2)
handler_dev.py - Development handler for Mac/local testing (CPU/MPS compatible)
pyproject.toml - Modern dependency management for development
requirements.txt - Production dependencies for HuggingFace deployment

🚀 Quick Start

For Mac/Local Development

# Install development dependencies (Mac-compatible, no Flash Attention)
uv sync

# Test API structure locally (uv run uses the project's .venv without activating it)
uv run python handler_dev.py

# Or test with custom input
uv run python -c "
from handler_dev import DevelopmentVibeVoiceHandler
handler = DevelopmentVibeVoiceHandler()
result = handler({'inputs': 'Speaker 1: Hello world!'})
print(f'Success: {result[\"duration\"]}s mock audio generated')
"

For Production Deployment

Deploy handler.py + requirements.txt to HuggingFace Inference Endpoints
Requires NVIDIA GPU with 20GB+ VRAM and Flash Attention 2

🛠️ Development Setup (Mac-Compatible)

Prerequisites

Python: 3.9+ (3.11 recommended)
macOS: Apple Silicon (M1/M2/M3) or Intel
uv: Fast Python package manager

Installation

# Clone/create project
mkdir vibevoice-longform-endpoint
cd vibevoice-longform-endpoint

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Set up development environment
uv venv
source .venv/bin/activate
uv sync

# Test development handler
python handler_dev.py

Development Commands

# Install dependencies
uv sync

# Add new dependency
uv add package-name

# Add development dependency
uv add --dev package-name

# Run code formatting
uv run black .
uv run isort .
uv run ruff check .

# Test locally
uv run python handler_dev.py

🎯 Production Deployment

Hardware Requirements

GPU: NVIDIA GPU with 20GB+ VRAM (A10G, A100, RTX 4090)
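Compute: CUDA Compute Capability 8.0+ (Ampere or newer, e.g. A10G, A100, RTX 3090/4090); Flash Attention 2 does not support Turing (T4, RTX 2080) or Volta (V100) GPUs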
Software: CUDA 11.6+ or 12.x, Flash Attention 2
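
The compute-capability requirement can be checked before deploying. The following is a minimal sketch, not code taken from handler.py: `pick_attention_backend` and `detect_capability` are hypothetical helper names, and the default (8, 0) threshold is an assumption based on FlashAttention-2's published support for Ampere and newer GPUs. The returned strings are the `attn_implementation` values that Transformers accepts.

```python
from typing import Optional, Tuple


def pick_attention_backend(
    capability: Optional[Tuple[int, int]],
    min_capability: Tuple[int, int] = (8, 0),  # assumed Flash Attention 2 minimum
) -> str:
    """Return "flash_attention_2" if the GPU is new enough, otherwise "sdpa"."""
    if capability is None:
        # No CUDA device (e.g. Mac/CPU/MPS): fall back to SDPA
        return "sdpa"
    # Tuples compare element-wise, so (8, 6) >= (8, 0) and (7, 5) < (8, 0)
    return "flash_attention_2" if capability >= min_capability else "sdpa"


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the (major, minor) capability of the current CUDA device, if any."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()
```

Calling `pick_attention_backend(detect_capability())` on the target machine shows which backend it could run; a result of "sdpa" on a production box would suggest the hardware does not meet the requirements above.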

Deployment Steps

Upload to HuggingFace Hub:

git init
git add handler.py requirements.txt README.md
git commit -m "VibeVoice production handler"
git remote add origin https://huggingface.co/YOUR_USERNAME/vibevoice-longform
git push -u origin main

Create Inference Endpoint:
- Go to https://ui.endpoints.huggingface.co
- Select your repository
- Hardware: NVIDIA A10G (24GB) minimum
- Container: Default (PyTorch)
- Security: Protected

Test Production Endpoint:

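import requests

# Long-form generation can take several minutes, so allow a generous timeout
response = requests.post(
    "https://YOUR_ENDPOINT.endpoints.huggingface.cloud",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={
        "inputs": "Speaker 1: Hello! Speaker 2: Hi there!",
        "parameters": {"cfg_scale": 1.2, "ddmp_steps": 6}
    },
    timeout=600,
)
response.raise_for_status()  # Fail loudly on 4xx/5xx responses
print(response.status_code)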

🎛️ API Parameters

Core Parameters

cfg_scale (float, 1.0-1.5): Voice matching strength
- 1.0-1.1: Faster, more natural speech
- 1.2-1.3: Balanced (recommended)
- 1.4-1.5: Better voice cloning, slower
ddmp_steps (int, 4-12): Quality vs speed trade-off
- 4-5: Fastest generation
- 6-8: Balanced (recommended for long content)
- 10-12: Highest quality, slower
max_new_tokens (int): Maximum generation length
- 4096: ~5-10 minutes
- 8192: ~10-20 minutes (default)
- 12288+: 20+ minutes
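
Client code may want to check these ranges before sending a request. The sketch below is one way to do that and is not taken from the handler: `validate_parameters` is a hypothetical helper, and the `cfg_scale` and `ddmp_steps` defaults are assumptions chosen from the recommended ranges above (only the 8192 default for `max_new_tokens` is documented).

```python
from typing import Any, Dict

# Inclusive ranges from the parameter list above
PARAM_RANGES = {
    "cfg_scale": (1.0, 1.5),
    "ddmp_steps": (4, 12),
}

# max_new_tokens default is documented; the other two are assumed defaults
DEFAULTS = {"cfg_scale": 1.2, "ddmp_steps": 6, "max_new_tokens": 8192}


def validate_parameters(params: Dict[str, Any]) -> Dict[str, Any]:
    """Merge params over the defaults and reject out-of-range values."""
    merged = {**DEFAULTS, **params}
    for name, (low, high) in PARAM_RANGES.items():
        value = merged[name]
        if not low <= value <= high:
            raise ValueError(
                f"{name}={value} is outside the documented range {low}-{high}"
            )
    if not isinstance(merged["ddmp_steps"], int):
        raise ValueError("ddmp_steps must be an integer")
    if not isinstance(merged["max_new_tokens"], int) or merged["max_new_tokens"] <= 0:
        raise ValueError("max_new_tokens must be a positive integer")
    return merged
```

For example, `validate_parameters({"cfg_scale": 1.4})` returns the full parameter dict with the defaults filled in, while `validate_parameters({"ddmp_steps": 20})` raises `ValueError`.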

Input Formats

{
  "inputs": "Speaker 1: Hello! Speaker 2: Hi there!",
  "parameters": {
    "voice_samples": ["base64_audio_1", "base64_audio_2"],
    "speaker_names": ["Alice", "Bob"],
    "cfg_scale": 1.2,
    "ddmp_steps": 6,
    "output_format": "wav"
  }
}
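
A request body in this shape can be assembled programmatically. The sketch below is illustrative only: `build_payload` is a hypothetical helper, and it assumes that each `voice_samples` entry is the base64 encoding of a raw audio file's bytes and that one sample is supplied per speaker name; neither detail is confirmed by the handler.

```python
import base64
from typing import Any, Dict, List


def build_payload(
    script: str,
    voice_clips: List[bytes],
    speaker_names: List[str],
    cfg_scale: float = 1.2,
    ddmp_steps: int = 6,
    output_format: str = "wav",
) -> Dict[str, Any]:
    """Build a JSON-serializable request body in the format shown above."""
    # Assumption: one reference clip per named speaker
    if len(voice_clips) != len(speaker_names):
        raise ValueError("voice_clips and speaker_names must have the same length")
    return {
        "inputs": script,
        "parameters": {
            # Assumption: samples are base64-encoded audio file bytes
            "voice_samples": [
                base64.b64encode(clip).decode("ascii") for clip in voice_clips
            ],
            "speaker_names": speaker_names,
            "cfg_scale": cfg_scale,
            "ddmp_steps": ddmp_steps,
            "output_format": output_format,
        },
    }
```

The clips would typically be read with `open(path, "rb").read()`, and the resulting dict can be passed directly as the `json=` argument of `requests.post`.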

📊 Expected Performance

Content Length	Generation Time	Memory Usage	RTF
10 minutes	60-120s	15-18GB	0.1-0.2x
20 minutes	120-240s	18-22GB	0.1-0.2x
30 minutes	180-360s	20-24GB	0.1-0.2x

RTF = Real-Time Factor (generation_time / audio_duration)
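
The RTF definition translates directly into code, which can help when comparing measured timings against the table. This is a minimal sketch; the function names are hypothetical, and the 0.1-0.2 bounds are simply the values from the table above, not a guarantee.

```python
from typing import Tuple


def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = generation_time / audio_duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def estimate_generation_seconds(
    audio_minutes: float, rtf_low: float = 0.1, rtf_high: float = 0.2
) -> Tuple[float, float]:
    """Estimate a (best, worst) generation time in seconds for a target length."""
    audio_seconds = audio_minutes * 60
    return audio_seconds * rtf_low, audio_seconds * rtf_high
```

For instance, 10 minutes of audio generated in 90 seconds corresponds to an RTF of about 0.15, and a 20-minute target gives an estimate of roughly 120 to 240 seconds, matching the second table row.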

🚨 Troubleshooting

Mac Development Issues

Problem: flash-attn build fails on Mac
Solution: Use handler_dev.py for local development - it uses SDPA instead of Flash Attention

Problem: VibeVoice model not available
Solution: handler_dev.py includes mock mode for API structure testing

Production Issues

Problem: Flash Attention 2 not available
Solution: Ensure an NVIDIA GPU with CUDA 11.6+ and compute capability 8.0+ (Ampere or newer); Flash Attention 2 does not run on Turing or Volta GPUs

Problem: Out of memory
Solution: Use A100 (40GB) or reduce content length

🔄 Development Workflow

Local Development: Use handler_dev.py on Mac to test API logic
API Testing: Test request/response structure without full model
Production Deploy: Use handler.py on HuggingFace with NVIDIA GPU
Production Test: Validate with actual Flash Attention 2 and full model

💡 Key Differences

Aspect	Development (`handler_dev.py`)	Production (`handler.py`)
Platform	Mac/CPU/MPS compatible	NVIDIA CUDA only
Attention	SDPA (compatible)	Flash Attention 2 (required)
Memory	Low requirements	20GB+ VRAM needed
Purpose	API testing, logic validation	Actual audio generation
Output	Mock audio for testing	High-quality TTS audio

⚠️ Important Notes

Flash Attention 2 only works on NVIDIA GPUs - not compatible with Mac/Apple Silicon
Development handler is for testing API structure only - use production handler for real audio
VibeVoice model downloads automatically on HuggingFace Inference Endpoints
Generated audio includes automatic watermarking and AI disclaimers
