marcos (Claude Opus 4.5) committed
Commit ba6f7d2 · 1 parent: d805b79

Add comprehensive README with documentation


- Complete project documentation with features and architecture
- Performance benchmarks (RTF) for StyleTTS2 and MuseTalk
- Quick start guide and usage examples
- Critical package versions and troubleshooting guide
- Video input recommendations for optimal lip sync

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Files changed (1): README.md (+139 −0, new file)
# AI Video Avatar Setup

Complete setup for real-time AI video avatars with voice cloning and lip sync.

## Features

- **Voice Cloning**: Clone any voice from a short audio sample using StyleTTS2
- **Lip Sync**: High-quality lip synchronization using MuseTalk V1.5
- **Real-Time Capable**: RTF < 1 (faster than real-time) on RTX 3090+
- **One-Click Install**: Automated setup script for cloud GPUs

## Performance (RTX 3090)

| Component | RTF | Speed |
|-----------|-----|-------|
| StyleTTS2 (steps=5) | 0.04 | 22x real-time |
| MuseTalk V1.5 | 0.25-0.67 | 1.5-4x real-time |

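RTF (real-time factor) is processing time divided by the duration of the media produced, so RTF < 1 means faster than real-time and the speed column is its inverse. A minimal sketch of that relationship, using the figures from the table above:

```python
def rtf(processing_seconds: float, media_seconds: float) -> float:
    """Real-time factor: processing time per second of output media."""
    return processing_seconds / media_seconds

def speedup(rtf_value: float) -> float:
    """How many times faster than real-time a given RTF is."""
    return 1.0 / rtf_value

# MuseTalk at RTF 0.25 renders one minute of video in 15 seconds:
print(rtf(15.0, 60.0))  # 0.25
print(speedup(0.25))    # 4.0
```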
## Requirements

- NVIDIA GPU with CUDA 12.x (tested on RTX 3090, V100)
- Ubuntu 22.04+ or similar
- Python 3.10+
- ~20 GB disk space for models

## Quick Start

```bash
# Clone and install
git clone https://github.com/yourusername/ai-video-setup.git
cd ai-video-setup
chmod +x install.sh
./install.sh

# Generate audio with a cloned voice
python scripts/generate_audio.py --text "Hello world" --voice voice_ref.wav -o output.wav

# Run lip sync
python scripts/run_lipsync.py --video avatar.mp4 --audio output.wav -o ./output

# Or run the full pipeline
python scripts/full_pipeline.py --youtube-url "https://..." --text "Your text" -o ./output
```

## Critical Package Versions

These versions are **required** and tested to work together:

```
accelerate==0.25.0
diffusers==0.21.0
huggingface-hub==0.25.0
```

> **Warning**: Newer versions cause a `cannot import clear_device_cache` error.

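To confirm the pins are actually what is installed, a small check using only the standard library (package names taken from the list above):

```python
from importlib.metadata import version, PackageNotFoundError

PINS = {
    "accelerate": "0.25.0",
    "diffusers": "0.21.0",
    "huggingface-hub": "0.25.0",
}

def check_pins(pins: dict) -> dict:
    """Return {package: (expected, found)} for every pin that is missing or mismatched."""
    problems = {}
    for name, expected in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # package not installed at all
        if found != expected:
            problems[name] = (expected, found)
    return problems

if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```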
## Scripts

| Script | Description |
|--------|-------------|
| `install.sh` | Complete installation (PyTorch, MuseTalk, StyleTTS2) |
| `scripts/generate_audio.py` | Generate audio with voice cloning |
| `scripts/run_lipsync.py` | Run lip sync on a video |
| `scripts/extract_voice_ref.py` | Extract a voice reference from YouTube or a video file |
| `scripts/full_pipeline.py` | Complete pipeline (YouTube -> lip-synced video) |
| `scripts/realtime_avatar.py` | Real-time avatar with pre-loaded models |

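The first two scripts can also be chained from your own code. A sketch that builds the two commands exactly as shown in Quick Start (the scripts' internals are not shown here; only the command-line flags from the examples above are assumed):

```python
import subprocess
from pathlib import Path

def build_pipeline(text: str, voice_ref: str, avatar_video: str, out_dir: str):
    """Build the generate-audio and lip-sync commands as argument lists."""
    out = Path(out_dir)
    audio = out / "output.wav"
    return [
        ["python", "scripts/generate_audio.py",
         "--text", text, "--voice", voice_ref, "-o", str(audio)],
        ["python", "scripts/run_lipsync.py",
         "--video", avatar_video, "--audio", str(audio), "-o", str(out)],
    ]

if __name__ == "__main__":
    for cmd in build_pipeline("Hello world", "voice_ref.wav", "avatar.mp4", "./output"):
        subprocess.run(cmd, check=True)  # stop immediately if a step fails
```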
## Video Input Recommendations

For best lip sync results:

- **Duration**: 15-30 seconds (more frames = better variety)
- **Resolution**: 640x360 to 720p (larger is slower, not better)
- **FPS**: 24-30 fps
- **Content**: Face centered, good lighting, neutral expression
- **Format**: MP4 with H.264 codec

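A quick sanity check against these recommendations; this is a sketch, and the metadata dict is assumed to come from a probe tool such as `ffprobe` (not invoked here):

```python
def check_avatar_video(meta: dict) -> list:
    """Return warnings for metadata outside the recommendations above."""
    warnings = []
    if not 15 <= meta["duration_s"] <= 30:
        warnings.append(f"duration {meta['duration_s']}s outside 15-30s")
    if meta["height"] > 720:
        warnings.append(f"height {meta['height']}px above 720p; downscale first")
    if not 24 <= meta["fps"] <= 30:
        warnings.append(f"{meta['fps']} fps outside 24-30 fps")
    if meta["codec"] != "h264":
        warnings.append(f"codec {meta['codec']} is not H.264")
    return warnings

# Example: a 4K 60fps source trips the resolution and fps checks
print(check_avatar_video(
    {"duration_s": 20, "height": 2160, "fps": 60, "codec": "h264"}
))
```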
### Preprocessing Example

```bash
# Convert a 4K video to the recommended format
ffmpeg -i input_4k.mp4 -vf "scale=640:-2,fps=24" -c:a copy avatar.mp4
```

## Architecture

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Input Text    │────▶│    StyleTTS2    │────▶│    Audio WAV    │
└─────────────────┘     │  (Voice Clone)  │     └────────┬────────┘
                        └─────────────────┘              │
┌─────────────────┐                                      │
│  Avatar Video   │──────────────┐                       │
└─────────────────┘              ▼                       │
                        ┌─────────────────┐              │
                        │  MuseTalk V1.5  │◀─────────────┘
                        │   (Lip Sync)    │
                        └────────┬────────┘
                                 │
                        ┌────────▼────────┐
                        │  Output Video   │
                        │  (with audio)   │
                        └─────────────────┘
```

## Troubleshooting

### "cannot import clear_device_cache"

```bash
pip install accelerate==0.25.0 diffusers==0.21.0 huggingface-hub==0.25.0
```

### PyTorch 2.6 pickle error

PyTorch 2.6 changed the default of `torch.load` to `weights_only=True`, which breaks loading older checkpoints. The scripts include a fix for this. If you see pickle errors, ensure the patch is applied:

```python
import torch

# Restore the pre-2.6 default so legacy checkpoints load.
# Only do this for checkpoints from trusted sources.
original_load = torch.load

def patched_load(*args, **kwargs):
    kwargs.setdefault("weights_only", False)
    return original_load(*args, **kwargs)

torch.load = patched_load
```

### NLTK punkt error

```bash
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
```

## License

This project integrates:

- [StyleTTS2](https://github.com/yl4579/StyleTTS2) - MIT License
- [MuseTalk](https://github.com/TMElyralab/MuseTalk) - Custom License

## Acknowledgments

- StyleTTS2 team for the amazing TTS model
- TMElyralab for MuseTalk V1.5
- Tested on vast.ai GPU instances