ShalomKing committed on
Commit
f076b1f
·
verified ·
1 Parent(s): 4eab2ed

Upload folder using huggingface_hub

Files changed (5)
  1. AGENTS.md +38 -0
  2. CLAUDE.md +175 -0
  3. README.md +4 -2
  4. app.py +6 -3
  5. requirements.txt +17 -8
AGENTS.md ADDED
@@ -0,0 +1,38 @@
+ # Repository Guidelines
+
+ ## Project Structure & Module Organization
+ - `app.py`: Gradio entrypoint orchestrating model initialization, audio preprocessing, and inference flow.
+ - `src/`: Supporting modules (`audio_analysis/` for wav2vec2 utilities, `vram_management/` for GPU-safe layers, `utils.py` helpers).
+ - `utils/`: Infrastructure helpers (`model_loader.py` for WAN/InfiniteTalk weights, `gpu_manager.py` for memory checks/cleanup).
+ - `wan/`: Upstream InfiniteTalk model code; treat as vendor code when updating.
+ - `assets/` and `examples/`: UI assets and sample media for quick demos; safe to extend.
+ - `requirements.txt`, `packages.txt`, `Dockerfile`: Deployment dependencies (note: PyTorch + flash-attn installed via Dockerfile/HF build, not from requirements).
+
+ ## Setup, Build, and Local Run
+ - Create an isolated env: `python -m venv .venv && source .venv/bin/activate`.
+ - Install Python deps: `pip install -r requirements.txt` (PyTorch/flash-attn come from the base image or HuggingFace Space build).
+ - Launch UI locally: `python app.py` (Gradio on port 7860 by default).
+ - Quick sanity check: `python -m py_compile app.py` to catch syntax errors before pushing.
+ - Docker-based run (mirrors HF build): `docker build -t infinitetalk . && docker run -p 7860:7860 infinitetalk`.
+
+ ## Coding Style & Naming Conventions
+ - Python 3.10+, PEP 8 with 4-space indentation; favor type hints where practical.
+ - Functions/variables: `snake_case`; classes: `PascalCase`; constants: `UPPER_SNAKE_CASE`.
+ - Prefer `logging` over `print` (consistent with existing modules); keep log level INFO for user-facing runs.
+ - Add concise docstrings for public functions; keep module-level comments minimal and purposeful.
+
+ ## Testing Guidelines
+ - No automated test suite yet; aim to add `pytest`-style tests under `tests/` mirroring `src/` modules.
+ - Until then, validate with: (1) `python -m py_compile` for syntax, (2) short inference smoke test using `examples/` media at 480p/30–40 steps.
+ - When adding tests, name files `test_<module>.py` and target functional paths (audio preprocessing, GPU guardrails, model loader paths).
+
+ ## Commit & Pull Request Guidelines
+ - Repository has no historical git log; use Conventional Commits (`feat:`, `fix:`, `docs:`, `chore:`) for clarity.
+ - One topic per commit; keep messages imperative and ≤72 chars in the subject.
+ - PRs should include: brief summary of behavior change, commands run (tests or smoke steps), any new dependencies, and before/after screenshots or sample outputs if UI/inference is affected.
+ - Avoid committing large model weights or cached downloads; rely on `ModelManager` to fetch at runtime and `.gitignore` caches.
+
+ ## Security & Configuration Tips
+ - For private models, set `HF_TOKEN` in the environment/Space secrets; do not hardcode secrets.
+ - Respect GPU limits in `gpu_manager.py` when adjusting defaults; keep ZeroGPU duration estimates in mind.
+ - Large files: keep under repo size limits; store extra assets in external storage or release artifacts.
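The commit-message rules above (Conventional Commit type prefix, imperative mood, ≤72-character subject) can be captured in a small checker. This is an editor's sketch, not part of the repository; the helper name and the accepted type list are assumptions:

```python
import re

# Hypothetical helper enforcing the commit guidelines above:
# a Conventional Commit prefix and a subject of at most 72 characters.
SUBJECT_RE = re.compile(r"^(feat|fix|docs|chore|refactor|test)(\([\w-]+\))?: .+")

def subject_ok(subject: str) -> bool:
    return bool(SUBJECT_RE.match(subject)) and len(subject) <= 72

print(subject_ok("feat: add 720p resolution toggle"))  # True
print(subject_ok("Updated stuff"))                     # False (no type prefix)
```

A hook like this could run from `pre-commit` or CI, but neither is configured in this repo.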
CLAUDE.md ADDED
@@ -0,0 +1,175 @@
+ # CLAUDE.md
+
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+ ## Project Overview
+
+ InfiniteTalk is a talking video generator that creates realistic talking head videos with accurate lip-sync. It supports two modes:
+ - **Image-to-Video**: Transform static portraits into talking videos using audio input
+ - **Video Dubbing**: Re-sync existing videos with new audio while maintaining natural movements
+
+ Built on the Wan2.1 diffusion model with specialized audio conditioning for photorealistic results.
+
+ ## Architecture
+
+ ### Core Components
+
+ **Main Application** (`app.py`)
+ - Gradio interface with ZeroGPU support via `@spaces.GPU(duration=180)` decorator
+ - Two-tab interface: Image-to-Video and Video Dubbing
+ - Lazy model loading on first inference to minimize startup time
+ - Global `ModelManager` and `GPUManager` instances for resource management
+
+ **Model Pipeline** (`wan/multitalk.py`)
+ - `InfiniteTalkPipeline`: Main generation pipeline using Wan2.1-I2V-14B model
+ - Supports two resolutions: 480p (640x640) and 720p (960x960)
+ - Uses diffusion-based generation with audio conditioning
+ - Implements chunked processing for long videos to manage memory
+
+ **Audio Processing** (`src/audio_analysis/wav2vec2.py`)
+ - Custom `Wav2Vec2Model` extending HuggingFace's implementation
+ - Extracts audio embeddings with temporal interpolation via `linear_interpolation`
+ - Processes audio at 16kHz with loudness normalization (pyloudnorm)
+ - Stacks hidden states from all encoder layers for rich audio representation
+
+ **Model Management** (`utils/model_loader.py`)
+ - `ModelManager`: Handles lazy loading and caching of models from HuggingFace Hub
+ - Downloads three model types:
+   - Wan2.1-I2V-14B: Main video generation model (Kijai/WanVideo_comfy)
+   - InfiniteTalk weights: Specialized talking head weights (MeiGen-AI/InfiniteTalk)
+   - Wav2Vec2: Audio encoder (TencentGameMate/chinese-wav2vec2-base)
+ - Models cached in `HF_HOME` or `/data/.huggingface`
+
+ **GPU Management** (`utils/gpu_manager.py`)
+ - `GPUManager`: Monitors memory usage and performs cleanup
+ - Calculates ZeroGPU duration based on video length and resolution
+ - Memory estimation: ~20GB base + 0.8GB/s (480p) or 1.5GB/s (720p)
+ - Recommends chunking for videos requiring >50GB memory
+
+ **Configuration** (`wan/configs/__init__.py`)
+ - `WAN_CONFIGS`: Model configurations for different tasks (t2v, i2v, infinitetalk)
+ - `SIZE_CONFIGS`: Resolution mappings (infinitetalk-480: 640x640, infinitetalk-720: 960x960)
+ - `SUPPORTED_SIZES`: Valid resolution options per model type
+
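The memory-estimation rule quoted above (~20GB base plus a per-second rate, chunking recommended past 50GB) amounts to a simple formula. The sketch below is an assumed reconstruction of that rule, not the actual `GPUManager` code:

```python
def estimate_vram_gb(video_seconds: float, resolution: str = "480p") -> float:
    # ~20GB base + 0.8GB/s at 480p or 1.5GB/s at 720p, per the notes above
    rate = 0.8 if resolution == "480p" else 1.5
    return 20.0 + rate * video_seconds

def should_chunk(video_seconds: float, resolution: str = "480p") -> bool:
    # Chunking is recommended once the estimate exceeds 50GB
    return estimate_vram_gb(video_seconds, resolution) > 50.0

print(estimate_vram_gb(10))      # 28.0
print(should_chunk(40, "480p"))  # True (20 + 0.8*40 = 52GB)
```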
+ ### Data Flow
+
+ 1. **Audio Processing**: Audio file → librosa load → loudness normalization → Wav2Vec2 feature extraction → audio embeddings (shape: [seq_len, batch, dim])
+ 2. **Input Processing**: Image/video → PIL/cache_video → frame extraction → resize and center crop to target resolution
+ 3. **Generation**: InfiniteTalk pipeline combines visual input + audio embeddings → diffusion sampling → video tensor
+ 4. **Output**: Video tensor → save_video_ffmpeg with audio track → MP4 file
+
+ ### Key Design Patterns
+
+ - **Lazy Loading**: Models only loaded on first inference to reduce cold start time
+ - **Memory Management**: Aggressive cleanup with `torch.cuda.empty_cache()` and `gc.collect()` after generation
+ - **ZeroGPU Integration**: `@spaces.GPU` decorator with calculated duration based on video length
+ - **Offloading**: Models can be offloaded to CPU between forward passes to save VRAM
+
+ ## Development Commands
+
+ ### Docker Build and Run
+ ```bash
+ # Build Docker image
+ docker build -t infinitetalk .
+
+ # Run locally
+ docker run -p 7860:7860 --gpus all infinitetalk
+ ```
+
+ ### Python Environment
+ ```bash
+ # Install dependencies (requires PyTorch 2.5.1+ for xfuser compatibility)
+ pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
+ pip install flash-attn==2.7.4.post1 --no-build-isolation  # Optional, may fail on some systems
+ pip install -r requirements.txt
+
+ # Run application
+ python app.py
+ ```
+
+ ### System Dependencies
+ Required packages (see `packages.txt`):
+ - ffmpeg (video processing)
+ - build-essential (compilation)
+ - libsndfile1 (audio I/O)
+ - git (model downloads)
+
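The lazy-loading pattern listed under Key Design Patterns can be sketched in a few lines. This is an illustrative stand-in, not the repository's actual `ModelManager`; the class and method names are assumptions:

```python
class LazyLoader:
    """Hypothetical sketch of the lazy-loading pattern described above."""

    def __init__(self):
        self._model = None
        self.load_count = 0  # track how often the expensive load runs

    def _load(self):
        # Stand-in for the expensive model download/initialization
        self.load_count += 1
        return {"name": "wan2.1-i2v-14b"}

    def get(self):
        if self._model is None:  # load only on first use
            self._model = self._load()
        return self._model

loader = LazyLoader()
loader.get()
loader.get()
print(loader.load_count)  # 1 — the model was loaded exactly once
```

The same idea keeps Space cold starts fast: import time pays nothing, and the first inference pays the full load once.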
+ ## Important Implementation Details
+
+ ### Resolution Handling
+ - User selects "480p" or "720p" in UI
+ - Internally mapped to `infinitetalk-480` (640x640) or `infinitetalk-720` (960x960)
+ - `sample_shift` parameter: 7 for 480p, 11 for 720p (controls diffusion sampling)
+
+ ### Audio Embedding Format
+ Audio embeddings must be saved as `.pt` files in the format expected by the pipeline:
+ ```python
+ audio_embeddings = torch.stack(embeddings.hidden_states[1:], dim=1).squeeze(0)
+ audio_embeddings = rearrange(audio_embeddings, "b s d -> s b d")  # Shape: [seq_len, batch, dim]
+ torch.save(audio_embeddings, emb_path)
+ ```
+
+ ### Pipeline Input Format
+ The `generate_infinitetalk` method expects:
+ ```python
+ input_clip = {
+     "prompt": "",  # Empty for talking head
+     "cond_video": image_or_video_path,
+     "cond_audio": {"person1": embedding_path},
+     "video_audio": audio_wav_path
+ }
+ ```
+
+ ### ZeroGPU Duration Calculation
+ ```python
+ base_time = 60  # Model loading
+ processing_rate = 2.5 if resolution == "480p" else 3.5  # seconds per video second
+ duration = int((base_time + video_duration * processing_rate) * 1.2)  # 20% safety margin
+ duration = min(duration, 300)  # Cap at 300s for free tier
+ ```
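Wrapped as a function, the duration calculation above behaves like this (sketch only; the function name and signature are assumptions, not the repository's API):

```python
def estimate_zerogpu_duration(video_duration: float, resolution: str = "480p") -> int:
    base_time = 60                                          # model loading overhead
    processing_rate = 2.5 if resolution == "480p" else 3.5  # s of compute per s of video
    duration = int((base_time + video_duration * processing_rate) * 1.2)  # 20% margin
    return min(duration, 300)                               # free-tier cap

print(estimate_zerogpu_duration(10))          # 102
print(estimate_zerogpu_duration(60, "720p"))  # 300 (capped)
```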
+
+ ### Memory Optimization
+ - Use `offload_model=True` in pipeline to offload between forwards
+ - Enable VRAM management for low-memory scenarios: `pipeline.enable_vram_management()`
+ - Flash-attention (if available) reduces memory usage significantly
+ - Chunked processing for videos >15s (480p) or >10s (720p)
+
+ ## HuggingFace Space Deployment
+
+ This project is designed for HuggingFace Spaces with ZeroGPU:
+ - SDK: `docker` (specified in README.md frontmatter)
+ - Hardware: `zero-gpu` (H200 with 70GB VRAM)
+ - Port: `7860` (Gradio default)
+ - First generation downloads ~15GB of models (2-3 minutes)
+ - Subsequent generations: ~40s for 10s video at 480p
+
+ See `DEPLOYMENT.md` for detailed deployment instructions and troubleshooting.
+
+ ## Common Pitfalls
+
+ 1. **Flash-attn compilation**: May fail on some systems. The Dockerfile handles this gracefully with `|| echo "Warning..."` fallback
+ 2. **PyTorch version**: Must use 2.5.1+ for xfuser's `torch.distributed.tensor.experimental` support
+ 3. **Audio sample rate**: Must be 16kHz for Wav2Vec2 model
+ 4. **Frame format**: Pipeline expects 4n+1 frames (e.g., 81 frames) for proper temporal modeling
+ 5. **Model paths**: InfiniteTalk weights must be loaded separately from base Wan model
+ 6. **TOKENIZERS_PARALLELISM**: Set to 'false' to avoid deadlocks in multi-threaded environments
+
+ ## File Structure
+
+ ```
+ ├── app.py               # Main Gradio application
+ ├── Dockerfile           # Docker build configuration
+ ├── requirements.txt     # Python dependencies
+ ├── packages.txt         # System dependencies
+ ├── utils/
+ │   ├── model_loader.py  # Model download and loading
+ │   └── gpu_manager.py   # GPU memory management
+ ├── wan/
+ │   ├── multitalk.py     # InfiniteTalk pipeline
+ │   ├── configs/         # Model configurations
+ │   ├── modules/         # Model architecture (VAE, DiT, etc.)
+ │   └── utils/           # Video/audio utilities
+ └── src/
+     └── audio_analysis/
+         └── wav2vec2.py  # Audio encoder with interpolation
+ ```
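Pitfall 4 above (the pipeline expects 4n+1 frames, e.g. 81) suggests a small snapping helper. This is a hypothetical illustration of the constraint; the pipeline's own frame handling may differ:

```python
def snap_frames(num_frames: int) -> int:
    # Round down to the nearest 4n+1 frame count (1, 5, ..., 77, 81, ...),
    # per the temporal-modeling constraint noted above.
    return max(1, ((num_frames - 1) // 4) * 4 + 1)

print(snap_frames(81))  # 81 (already 4*20 + 1)
print(snap_frames(84))  # 81
```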
README.md CHANGED
@@ -3,8 +3,10 @@ title: InfiniteTalk - Talking Video Generator
  emoji: 🎬
  colorFrom: blue
  colorTo: purple
- sdk: docker
- app_port: 7860
+ sdk: gradio
+ sdk_version: 5.6.0
+ python_version: "3.10"
+ app_file: app.py
  pinned: false
  license: apache-2.0
  hardware: zero-gpu
app.py CHANGED
@@ -5,14 +5,17 @@ Gradio Space with ZeroGPU support
 
  import os
  import sys
+
+ # CRITICAL: Set environment variables BEFORE any torch/torchvision imports
+ # This prevents torchvision from registering CUDA ops that don't exist on ZeroGPU at import time
+ os.environ["TORCHVISION_DISABLE_META_REGISTRATIONS"] = "1"
+ os.environ["TORCH_LOGS"] = "-all"  # Reduce torch logging noise
+
  import random
  import logging
  import warnings
  from pathlib import Path
 
- # Prevent torchvision from registering optional CUDA/Meta ops (nms) that may be missing on ZeroGPU
- os.environ.setdefault("TORCHVISION_DISABLE_META_REGISTRATIONS", "1")
-
  import gradio as gr
  import torch
  import numpy as np
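The ordering constraint in the app.py diff above is a general Python pattern: libraries read configuration environment variables at import time, so the variables must be in `os.environ` before the import statement runs. A minimal, torch-free illustration of the same pattern:

```python
import os

# Mirror of the pattern in the diff above: set config env vars first...
os.environ["TORCHVISION_DISABLE_META_REGISTRATIONS"] = "1"
os.environ["TORCH_LOGS"] = "-all"

# ...and only then import the libraries that read them at import time
# (imports commented out so this sketch runs without torch installed):
# import torch
# import torchvision

print(os.environ["TORCHVISION_DISABLE_META_REGISTRATIONS"])  # 1
```

Setting the variables after the import would have no effect, which is why the diff moves them ahead of every torch-related import.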
requirements.txt CHANGED
@@ -1,19 +1,26 @@
- # NOTE: PyTorch and flash-attn are installed via Dockerfile to avoid build issues
- # See Dockerfile for proper installation order
+ # PyTorch - must be installed first (HuggingFace Spaces handles CUDA)
+ --extra-index-url https://download.pytorch.org/whl/cu121
+ torch==2.5.1
+ torchvision==0.20.1
+ torchaudio==2.5.1
 
+ # Flash attention (optional - may fail on some systems)
+ # flash-attn
+
- # 1. Core ML libraries
- xformers==0.0.28
+ # Core ML libraries
+ xformers==0.0.28.post3
  transformers>=4.49.0
  tokenizers>=0.20.3
  diffusers>=0.31.0
  accelerate>=1.1.1
  einops
+ safetensors
 
- # 4. Gradio and Spaces
+ # Gradio and Spaces
  gradio>=5.0.0
  spaces
 
- # 5. Video/Image processing
+ # Video/Image processing
  opencv-python-headless>=4.9.0.80
  moviepy==1.0.3
  imageio
@@ -21,13 +28,14 @@ imageio-ffmpeg
  scikit-image
  decord
  scenedetect
+ pillow
 
- # 6. Audio processing
+ # Audio processing
  librosa
  soundfile
  pyloudnorm
 
- # 7. Utilities
+ # Utilities
  tqdm
  numpy>=1.23.5,<2
  easydict
@@ -35,3 +43,4 @@ ftfy
  loguru
  optimum-quanto==0.2.6
  xfuser>=0.4.1
+ huggingface_hub