VibeVoice Long-Form TTS Endpoint
Optimized for 10-30 minute audio generation with modern Python tooling.
Files Overview
- handler.py - Production handler for HuggingFace Inference Endpoints (requires NVIDIA GPU + Flash Attention 2)
- handler_dev.py - Development handler for Mac/local testing (CPU/MPS compatible)
- pyproject.toml - Modern dependency management for development
- requirements.txt - Production dependencies for HuggingFace deployment
Quick Start
For Mac/Local Development
# Install development dependencies (Mac-compatible, no Flash Attention)
uv sync
# Test API structure locally
python handler_dev.py
# Or test with custom input
python -c "
from handler_dev import DevelopmentVibeVoiceHandler
handler = DevelopmentVibeVoiceHandler()
result = handler({'inputs': 'Speaker 1: Hello world!'})
print(f'Success: {result[\"duration\"]}s mock audio generated')
"
For Production Deployment
- Deploy handler.py + requirements.txt to HuggingFace Inference Endpoints
- Requires NVIDIA GPU with 20GB+ VRAM and Flash Attention 2
Development Setup (Mac-Compatible)
Prerequisites
- Python: 3.9+ (3.11 recommended)
- macOS: Apple Silicon (M1/M2/M3) or Intel
- uv: Fast Python package manager
Installation
# Clone/create project
mkdir vibevoice-longform-endpoint
cd vibevoice-longform-endpoint
# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Set up development environment
uv venv
source .venv/bin/activate
uv sync
# Test development handler
python handler_dev.py
Development Commands
# Install dependencies
uv sync
# Add new dependency
uv add package-name
# Add development dependency
uv add --dev package-name
# Run code formatting
uv run black .
uv run isort .
uv run ruff check .
# Test locally
python handler_dev.py
Production Deployment
Hardware Requirements
- GPU: NVIDIA GPU with 20GB+ VRAM (A10G, A100, RTX 4090)
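- Compute: CUDA Compute Capability 8.0+ (Ampere or newer, e.g. A10G, A100, RTX 3090/4090). The flash-attn project documents Flash Attention 2 support for Ampere, Ada, and Hopper GPUs only, so Turing (7.5, e.g. RTX 2080) and Volta (7.0, e.g. Tesla V100) cards do not qualify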
- Software: CUDA 11.6+ or 12.x, Flash Attention 2
Deployment Steps
Upload to HuggingFace Hub:
git init
git add handler.py requirements.txt README.md
git commit -m "VibeVoice production handler"
git remote add origin https://huggingface.co/YOUR_USERNAME/vibevoice-longform
git push -u origin main
Create Inference Endpoint:
- Go to https://ui.endpoints.huggingface.co
- Select your repository
- Hardware: NVIDIA A10G (24GB) minimum
- Container: Default (PyTorch)
- Security: Protected
Test Production Endpoint:
import requests

response = requests.post(
    "https://YOUR_ENDPOINT.endpoints.huggingface.cloud",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={
        "inputs": "Speaker 1: Hello! Speaker 2: Hi there!",
        "parameters": {"cfg_scale": 1.2, "ddmp_steps": 6},
    },
)
API Parameters
Core Parameters
- cfg_scale (float, 1.0-1.5): Voice matching strength
  - 1.0-1.1: Faster, more natural speech
  - 1.2-1.3: Balanced (recommended)
  - 1.4-1.5: Better voice cloning, slower
- ddmp_steps (int, 4-12): Quality vs speed trade-off
  - 4-5: Fastest generation
  - 6-8: Balanced (recommended for long content)
  - 10-12: Highest quality, slower
- max_new_tokens (int): Maximum generation length
  - 4096: ~5-10 minutes
  - 8192: ~10-20 minutes (default)
  - 12288+: 20+ minutes
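The ranges listed above can be checked on the client before a request is sent, which avoids paying for a long generation that fails late. The following is a minimal sketch, not the handler's own validation: `validate_parameters` and `suggest_max_new_tokens` are hypothetical helper names, the fallback values 1.2 and 6 are taken from the examples in this README rather than from confirmed handler defaults, and the minute-to-token mapping is one rough reading of the overlapping tiers above.

```python
from typing import Any, Dict

# Ranges copied from the parameter list in this README. The handler may
# enforce different limits, so treat these as documentation-derived assumptions.
CFG_SCALE_RANGE = (1.0, 1.5)
DDMP_STEPS_RANGE = (4, 12)


def validate_parameters(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of params, raising ValueError for out-of-range values."""
    checked = dict(params)
    # Fallbacks (1.2 and 6) mirror the README examples, not confirmed defaults.
    cfg = float(checked.get("cfg_scale", 1.2))
    steps = int(checked.get("ddmp_steps", 6))
    if not CFG_SCALE_RANGE[0] <= cfg <= CFG_SCALE_RANGE[1]:
        raise ValueError(f"cfg_scale {cfg} is outside {CFG_SCALE_RANGE}")
    if not DDMP_STEPS_RANGE[0] <= steps <= DDMP_STEPS_RANGE[1]:
        raise ValueError(f"ddmp_steps {steps} is outside {DDMP_STEPS_RANGE}")
    checked["cfg_scale"] = cfg
    checked["ddmp_steps"] = steps
    return checked


def suggest_max_new_tokens(target_minutes: float) -> int:
    """Map a target audio length to the nearest documented token tier."""
    if target_minutes <= 0:
        raise ValueError("target_minutes must be positive")
    if target_minutes <= 10:
        return 4096
    if target_minutes <= 20:
        return 8192
    return 12288


params = validate_parameters({"cfg_scale": 1.2, "ddmp_steps": 6})
params["max_new_tokens"] = suggest_max_new_tokens(15)
print(params)
```

Because the documented tiers overlap at their edges (10 minutes appears under both 4096 and 8192), content near a boundary may need the next tier up.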
Input Formats
{
"inputs": "Speaker 1: Hello! Speaker 2: Hi there!",
"parameters": {
"voice_samples": ["base64_audio_1", "base64_audio_2"],
"speaker_names": ["Alice", "Bob"],
"cfg_scale": 1.2,
"ddmp_steps": 6,
"output_format": "wav"
}
}
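The request body above can be assembled programmatically. This sketch assumes that each entry in voice_samples is the base64 encoding of a raw audio file and that samples pair up with speaker_names by position. The placeholder strings suggest both, but neither is confirmed here. `build_payload` is a hypothetical helper name.

```python
import base64
import json
from typing import Any, Dict, List


def build_payload(
    script: str,
    voice_samples: List[bytes],
    speaker_names: List[str],
    **parameters: Any,
) -> Dict[str, Any]:
    """Build a request body shaped like the JSON example in this README."""
    # Assumption: one voice sample per speaker name, matched by position.
    if len(voice_samples) != len(speaker_names):
        raise ValueError("voice_samples and speaker_names must have equal length")
    encoded = [base64.b64encode(sample).decode("ascii") for sample in voice_samples]
    params: Dict[str, Any] = {
        "voice_samples": encoded,
        "speaker_names": list(speaker_names),
    }
    params.update(parameters)
    return {"inputs": script, "parameters": params}


# Placeholder bytes stand in for real audio. In practice each sample would
# come from something like pathlib.Path("alice.wav").read_bytes().
payload = build_payload(
    "Speaker 1: Hello! Speaker 2: Hi there!",
    [b"fake-audio-1", b"fake-audio-2"],
    ["Alice", "Bob"],
    cfg_scale=1.2,
    ddmp_steps=6,
    output_format="wav",
)
body = json.dumps(payload)  # serializable, ready to send as the request body
print(len(body), "bytes of JSON")
```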
Expected Performance
| Content Length | Generation Time | Memory Usage | RTF |
|---|---|---|---|
| 10 minutes | 60-120s | 15-18GB | 0.1-0.2x |
| 20 minutes | 120-240s | 18-22GB | 0.1-0.2x |
| 30 minutes | 180-360s | 20-24GB | 0.1-0.2x |
RTF = Real-Time Factor (generation_time / audio_duration)
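As a worked check of the RTF definition, the snippet below recomputes the first table row (10 minutes of audio generated in 60-120 seconds). The function name `real_time_factor` is illustrative. Values below 1.0 mean generation runs faster than playback.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = generation_time / audio_duration, both in seconds."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# First table row: 10 minutes of audio, generated in 60 to 120 seconds.
audio_seconds = 10 * 60
for generation_seconds in (60, 120):
    rtf = real_time_factor(generation_seconds, audio_seconds)
    print(f"{generation_seconds}s / {audio_seconds}s = RTF {rtf:.2f}")
# prints:
# 60s / 600s = RTF 0.10
# 120s / 600s = RTF 0.20
```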
Troubleshooting
Mac Development Issues
Problem: flash-attn build fails on Mac
Solution: Use handler_dev.py for local development - it uses SDPA instead of Flash Attention
Problem: VibeVoice model not available
Solution: handler_dev.py includes mock mode for API structure testing
Production Issues
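Problem: Flash Attention 2 not available
Solution: Ensure an NVIDIA GPU with CUDA 11.6+ and compute capability 8.0+ (Ampere or newer); Flash Attention 2 does not support older architectures such as Turing or Volta
Problem: Out of memory
Solution: Use A100 (40GB) or reduce content length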
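A preflight check can report which attention backend an environment is able to support before the model loads. This is a sketch under assumptions, not code from handler.py or handler_dev.py: the strings "flash_attention_2" and "sdpa" are the values accepted by the transformers `attn_implementation` argument, the 8.0 minimum reflects the Ampere-or-newer support documented by the flash-attn project, and `choose_attention` and `probe_environment` are hypothetical helper names.

```python
import importlib.util
from typing import Optional, Tuple

# Assumption: Flash Attention 2 needs compute capability 8.0 (Ampere) or newer.
MIN_FLASH_CAPABILITY: Tuple[int, int] = (8, 0)


def choose_attention(
    cuda_capability: Optional[Tuple[int, int]],
    flash_attn_installed: bool,
    min_capability: Tuple[int, int] = MIN_FLASH_CAPABILITY,
) -> str:
    """Return "flash_attention_2" when it looks usable, otherwise "sdpa"."""
    if cuda_capability is None:  # no CUDA device (e.g. Mac, CPU-only host)
        return "sdpa"
    if not flash_attn_installed:
        return "sdpa"
    if cuda_capability < min_capability:  # tuple comparison: (7, 5) < (8, 0)
        return "sdpa"
    return "flash_attention_2"


def probe_environment() -> Tuple[Optional[Tuple[int, int]], bool]:
    """Detect the CUDA compute capability and whether flash_attn is installed."""
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    if importlib.util.find_spec("torch") is None:
        return None, flash_installed
    import torch

    if not torch.cuda.is_available():
        return None, flash_installed
    major, minor = torch.cuda.get_device_capability()
    return (major, minor), flash_installed


capability, has_flash = probe_environment()
print("attention backend:", choose_attention(capability, has_flash))
```

Keeping the decision in a pure function makes it testable on a machine without a GPU, since only `probe_environment` touches the hardware.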
Development Workflow
- Local Development: Use handler_dev.py on Mac to test API logic
- API Testing: Test request/response structure without full model
- Production Deploy: Use handler.py on HuggingFace with NVIDIA GPU
- Production Test: Validate with actual Flash Attention 2 and full model
Key Differences
| Aspect | Development (handler_dev.py) | Production (handler.py) |
|---|---|---|
| Platform | Mac/CPU/MPS compatible | NVIDIA CUDA only |
| Attention | SDPA (compatible) | Flash Attention 2 (required) |
| Memory | Low requirements | 20GB+ VRAM needed |
| Purpose | API testing, logic validation | Actual audio generation |
| Output | Mock audio for testing | High-quality TTS audio |
Important Notes
- Flash Attention 2 only works on NVIDIA GPUs - not compatible with Mac/Apple Silicon
- Development handler is for testing API structure only - use production handler for real audio
- VibeVoice model downloads automatically on HuggingFace Inference Endpoints
- Generated audio includes automatic watermarking and AI disclaimers