marcos (Claude Opus 4.5) committed
Commit ba6f7d2 · 1 parent: d805b79

Add comprehensive README with documentation


- Complete project documentation with features and architecture
- Performance benchmarks (RTF) for StyleTTS2 and MuseTalk
- Quick start guide and usage examples
- Critical package versions and troubleshooting guide
- Video input recommendations for optimal lip sync

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Files changed (1): README.md (+139 −0, new file)
# AI Video Avatar Setup

Complete setup for real-time AI video avatars with voice cloning and lip sync.

## Features

- **Voice Cloning**: Clone any voice from a short audio sample using StyleTTS2
- **Lip Sync**: High-quality lip synchronization using MuseTalk V1.5
- **Real-Time Capable**: RTF < 1 (faster than real-time) on RTX 3090+
- **One-Click Install**: Automated setup script for cloud GPUs

## Performance (RTX 3090)

| Component | RTF | Speed |
|-----------|-----|-------|
| StyleTTS2 (steps=5) | 0.04 | 22x real-time |
| MuseTalk V1.5 | 0.25-0.67 | 1.5-4x real-time |

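RTF (real-time factor) is processing time divided by the duration of the media produced, so RTF < 1 means faster than real-time and the speed column is its inverse. A minimal sketch of that relationship, using the figures from the table above:

```python
def rtf(processing_seconds: float, media_seconds: float) -> float:
    """Real-time factor: processing time per second of output media."""
    return processing_seconds / media_seconds

def speedup(rtf_value: float) -> float:
    """How many times faster than real-time a given RTF is."""
    return 1.0 / rtf_value

# MuseTalk at RTF 0.25 renders one minute of video in 15 seconds:
print(rtf(15.0, 60.0))  # 0.25
print(speedup(0.25))    # 4.0
```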
## Requirements

- NVIDIA GPU with CUDA 12.x (tested on RTX 3090, V100)
- Ubuntu 22.04+ or similar
- Python 3.10+
- ~20 GB disk space for models

## Quick Start

```bash
# Clone and install
git clone https://github.com/yourusername/ai-video-setup.git
cd ai-video-setup
chmod +x install.sh
./install.sh

# Generate audio with a cloned voice
python scripts/generate_audio.py --text "Hello world" --voice voice_ref.wav -o output.wav

# Run lip sync
python scripts/run_lipsync.py --video avatar.mp4 --audio output.wav -o ./output

# Or run the full pipeline
python scripts/full_pipeline.py --youtube-url "https://..." --text "Your text" -o ./output
```

## Critical Package Versions

These versions are **required** and tested to work together:

```
accelerate==0.25.0
diffusers==0.21.0
huggingface-hub==0.25.0
```

> **Warning**: Newer versions cause a `cannot import clear_device_cache` error.

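To confirm the pins are actually what is installed, a small check using only the standard library (package names taken from the list above):

```python
from importlib.metadata import version, PackageNotFoundError

PINS = {
    "accelerate": "0.25.0",
    "diffusers": "0.21.0",
    "huggingface-hub": "0.25.0",
}

def check_pins(pins: dict) -> dict:
    """Return {package: (expected, found)} for every pin that is missing or mismatched."""
    problems = {}
    for name, expected in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # package not installed at all
        if found != expected:
            problems[name] = (expected, found)
    return problems

if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```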
## Scripts

| Script | Description |
|--------|-------------|
| `install.sh` | Complete installation (PyTorch, MuseTalk, StyleTTS2) |
| `scripts/generate_audio.py` | Generate audio with voice cloning |
| `scripts/run_lipsync.py` | Run lip sync on a video |
| `scripts/extract_voice_ref.py` | Extract a voice reference from YouTube or a video file |
| `scripts/full_pipeline.py` | Complete pipeline (YouTube -> lip-synced video) |
| `scripts/realtime_avatar.py` | Real-time avatar with pre-loaded models |

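The first two scripts can also be chained from your own code. A sketch that builds the two commands exactly as shown in Quick Start (the scripts' internals are not shown here; only the command-line flags from the examples above are assumed):

```python
import subprocess
from pathlib import Path

def build_pipeline(text: str, voice_ref: str, avatar_video: str, out_dir: str):
    """Build the generate-audio and lip-sync commands as argument lists."""
    out = Path(out_dir)
    audio = out / "output.wav"
    return [
        ["python", "scripts/generate_audio.py",
         "--text", text, "--voice", voice_ref, "-o", str(audio)],
        ["python", "scripts/run_lipsync.py",
         "--video", avatar_video, "--audio", str(audio), "-o", str(out)],
    ]

if __name__ == "__main__":
    for cmd in build_pipeline("Hello world", "voice_ref.wav", "avatar.mp4", "./output"):
        subprocess.run(cmd, check=True)  # stop immediately if a step fails
```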
## Video Input Recommendations

For best lip sync results:

- **Duration**: 15-30 seconds (more frames = better variety)
- **Resolution**: 640x360 to 720p (larger is slower, not better)
- **FPS**: 24-30 fps
- **Content**: Face centered, good lighting, neutral expression
- **Format**: MP4 with H.264 codec

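A quick sanity check against these recommendations; this is a sketch, and the metadata dict is assumed to come from a probe tool such as `ffprobe` (not invoked here):

```python
def check_avatar_video(meta: dict) -> list:
    """Return warnings for metadata outside the recommendations above."""
    warnings = []
    if not 15 <= meta["duration_s"] <= 30:
        warnings.append(f"duration {meta['duration_s']}s outside 15-30s")
    if meta["height"] > 720:
        warnings.append(f"height {meta['height']}px above 720p; downscale first")
    if not 24 <= meta["fps"] <= 30:
        warnings.append(f"{meta['fps']} fps outside 24-30 fps")
    if meta["codec"] != "h264":
        warnings.append(f"codec {meta['codec']} is not H.264")
    return warnings

# Example: a 4K 60fps source trips the resolution and fps checks
print(check_avatar_video(
    {"duration_s": 20, "height": 2160, "fps": 60, "codec": "h264"}
))
```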
### Preprocessing Example

```bash
# Convert a 4K video to the recommended format
ffmpeg -i input_4k.mp4 -vf "scale=640:-2,fps=24" -c:a copy avatar.mp4
```

## Architecture

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Input Text    │────▶│    StyleTTS2    │────▶│    Audio WAV    │
└─────────────────┘     │  (Voice Clone)  │     └────────┬────────┘
                        └─────────────────┘              │
┌─────────────────┐                                      │
│  Avatar Video   │──────────────┐                       │
└─────────────────┘              ▼                       │
                        ┌─────────────────┐              │
                        │  MuseTalk V1.5  │◀─────────────┘
                        │   (Lip Sync)    │
                        └────────┬────────┘
                                 │
                        ┌────────▼────────┐
                        │  Output Video   │
                        │  (with audio)   │
                        └─────────────────┘
```

## Troubleshooting

### "cannot import clear_device_cache"

```bash
pip install accelerate==0.25.0 diffusers==0.21.0 huggingface-hub==0.25.0
```

### PyTorch 2.6 pickle error

PyTorch 2.6 changed the default of `torch.load` to `weights_only=True`, which breaks loading older checkpoints. The scripts include a fix for this. If you see pickle errors, ensure the patch is applied:

```python
import torch

# Restore the pre-2.6 default so legacy checkpoints load.
# Only do this for checkpoints from trusted sources.
original_load = torch.load

def patched_load(*args, **kwargs):
    kwargs.setdefault("weights_only", False)
    return original_load(*args, **kwargs)

torch.load = patched_load
```

### NLTK punkt error

```bash
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
```

## License

This project integrates:

- [StyleTTS2](https://github.com/yl4579/StyleTTS2) - MIT License
- [MuseTalk](https://github.com/TMElyralab/MuseTalk) - Custom License

## Acknowledgments

- StyleTTS2 team for the amazing TTS model
- TMElyralab for MuseTalk V1.5
- Tested on vast.ai GPU instances