ShalomKing committed on
Commit
f076b1f
·
verified ·
1 Parent(s): 4eab2ed

Upload folder using huggingface_hub

Files changed (5)
  1. AGENTS.md +38 -0
  2. CLAUDE.md +175 -0
  3. README.md +4 -2
  4. app.py +6 -3
  5. requirements.txt +17 -8
AGENTS.md ADDED
@@ -0,0 +1,38 @@
+ # Repository Guidelines
+
+ ## Project Structure & Module Organization
+ - `app.py`: Gradio entrypoint orchestrating model initialization, audio preprocessing, and inference flow.
+ - `src/`: Supporting modules (`audio_analysis/` for wav2vec2 utilities, `vram_management/` for GPU-safe layers, `utils.py` helpers).
+ - `utils/`: Infrastructure helpers (`model_loader.py` for WAN/InfiniteTalk weights, `gpu_manager.py` for memory checks/cleanup).
+ - `wan/`: Upstream InfiniteTalk model code; treat as vendor code when updating.
+ - `assets/` and `examples/`: UI assets and sample media for quick demos; safe to extend.
+ - `requirements.txt`, `packages.txt`, `Dockerfile`: Deployment dependencies (note: PyTorch + flash-attn installed via Dockerfile/HF build, not from requirements).
+
+ ## Setup, Build, and Local Run
+ - Create an isolated env: `python -m venv .venv && source .venv/bin/activate`.
+ - Install Python deps: `pip install -r requirements.txt` (PyTorch/flash-attn come from the base image or HuggingFace Space build).
+ - Launch UI locally: `python app.py` (Gradio on port 7860 by default).
+ - Quick sanity check: `python -m py_compile app.py` to catch syntax errors before pushing.
+ - Docker-based run (mirrors HF build): `docker build -t infinitetalk . && docker run -p 7860:7860 infinitetalk`.
+
+ ## Coding Style & Naming Conventions
+ - Python 3.10+, PEP 8 with 4-space indentation; favor type hints where practical.
+ - Functions/variables: `snake_case`; classes: `PascalCase`; constants: `UPPER_SNAKE_CASE`.
+ - Prefer `logging` over `print` (consistent with existing modules); keep log level INFO for user-facing runs.
+ - Add concise docstrings for public functions; keep module-level comments minimal and purposeful.
+
+ ## Testing Guidelines
+ - No automated test suite yet; aim to add `pytest`-style tests under `tests/` mirroring `src/` modules.
+ - Until then, validate with: (1) `python -m py_compile` for syntax, (2) short inference smoke test using `examples/` media at 480p/30–40 steps.
+ - When adding tests, name files `test_<module>.py` and target functional paths (audio preprocessing, GPU guardrails, model loader paths).
+
+ ## Commit & Pull Request Guidelines
+ - Repository has no historical git log; use Conventional Commits (`feat:`, `fix:`, `docs:`, `chore:`) for clarity.
+ - One topic per commit; keep messages imperative and ≤72 chars in the subject.
+ - PRs should include: brief summary of behavior change, commands run (tests or smoke steps), any new dependencies, and before/after screenshots or sample outputs if UI/inference is affected.
+ - Avoid committing large model weights or cached downloads; rely on `ModelManager` to fetch at runtime and `.gitignore` caches.
+
+ ## Security & Configuration Tips
+ - For private models, set `HF_TOKEN` in the environment/Space secrets; do not hardcode secrets.
+ - Respect GPU limits in `gpu_manager.py` when adjusting defaults; keep ZeroGPU duration estimates in mind.
+ - Large files: keep under repo size limits; store extra assets in external storage or release artifacts.
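The commit-message rules above (Conventional Commit type prefix, imperative mood, ≤72-character subject) can be captured in a small checker. This is an editor's sketch, not part of the repository; the helper name and the accepted type list are assumptions:

```python
import re

# Hypothetical helper enforcing the commit guidelines above:
# a Conventional Commit prefix and a subject of at most 72 characters.
SUBJECT_RE = re.compile(r"^(feat|fix|docs|chore|refactor|test)(\([\w-]+\))?: .+")

def subject_ok(subject: str) -> bool:
    return bool(SUBJECT_RE.match(subject)) and len(subject) <= 72

print(subject_ok("feat: add 720p resolution toggle"))  # True
print(subject_ok("Updated stuff"))                     # False (no type prefix)
```

A hook like this could run from `pre-commit` or CI, but neither is configured in this repo.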
CLAUDE.md ADDED
@@ -0,0 +1,175 @@
+ # CLAUDE.md
+
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+ ## Project Overview
+
+ InfiniteTalk is a talking video generator that creates realistic talking head videos with accurate lip-sync. It supports two modes:
+ - **Image-to-Video**: Transform static portraits into talking videos using audio input
+ - **Video Dubbing**: Re-sync existing videos with new audio while maintaining natural movements
+
+ Built on the Wan2.1 diffusion model with specialized audio conditioning for photorealistic results.
+
+ ## Architecture
+
+ ### Core Components
+
+ **Main Application** (`app.py`)
+ - Gradio interface with ZeroGPU support via `@spaces.GPU(duration=180)` decorator
+ - Two-tab interface: Image-to-Video and Video Dubbing
+ - Lazy model loading on first inference to minimize startup time
+ - Global `ModelManager` and `GPUManager` instances for resource management
+
+ **Model Pipeline** (`wan/multitalk.py`)
+ - `InfiniteTalkPipeline`: Main generation pipeline using Wan2.1-I2V-14B model
+ - Supports two resolutions: 480p (640x640) and 720p (960x960)
+ - Uses diffusion-based generation with audio conditioning
+ - Implements chunked processing for long videos to manage memory
+
+ **Audio Processing** (`src/audio_analysis/wav2vec2.py`)
+ - Custom `Wav2Vec2Model` extending HuggingFace's implementation
+ - Extracts audio embeddings with temporal interpolation via `linear_interpolation`
+ - Processes audio at 16kHz with loudness normalization (pyloudnorm)
+ - Stacks hidden states from all encoder layers for rich audio representation
+
+ **Model Management** (`utils/model_loader.py`)
+ - `ModelManager`: Handles lazy loading and caching of models from HuggingFace Hub
+ - Downloads three model types:
+   - Wan2.1-I2V-14B: Main video generation model (Kijai/WanVideo_comfy)
+   - InfiniteTalk weights: Specialized talking head weights (MeiGen-AI/InfiniteTalk)
+   - Wav2Vec2: Audio encoder (TencentGameMate/chinese-wav2vec2-base)
+ - Models cached in `HF_HOME` or `/data/.huggingface`
+
+ **GPU Management** (`utils/gpu_manager.py`)
+ - `GPUManager`: Monitors memory usage and performs cleanup
+ - Calculates ZeroGPU duration based on video length and resolution
+ - Memory estimation: ~20GB base + 0.8GB/s (480p) or 1.5GB/s (720p)
+ - Recommends chunking for videos requiring >50GB memory
+
+ **Configuration** (`wan/configs/__init__.py`)
+ - `WAN_CONFIGS`: Model configurations for different tasks (t2v, i2v, infinitetalk)
+ - `SIZE_CONFIGS`: Resolution mappings (infinitetalk-480: 640x640, infinitetalk-720: 960x960)
+ - `SUPPORTED_SIZES`: Valid resolution options per model type
+
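The memory-estimation rule quoted above (~20GB base plus a per-second rate, chunking recommended past 50GB) amounts to a simple formula. The sketch below is an assumed reconstruction of that rule, not the actual `GPUManager` code:

```python
def estimate_vram_gb(video_seconds: float, resolution: str = "480p") -> float:
    # ~20GB base + 0.8GB/s at 480p or 1.5GB/s at 720p, per the notes above
    rate = 0.8 if resolution == "480p" else 1.5
    return 20.0 + rate * video_seconds

def should_chunk(video_seconds: float, resolution: str = "480p") -> bool:
    # Chunking is recommended once the estimate exceeds 50GB
    return estimate_vram_gb(video_seconds, resolution) > 50.0

print(estimate_vram_gb(10))      # 28.0
print(should_chunk(40, "480p"))  # True (20 + 0.8*40 = 52GB)
```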
+ ### Data Flow
+
+ 1. **Audio Processing**: Audio file → librosa load → loudness normalization → Wav2Vec2 feature extraction → audio embeddings (shape: [seq_len, batch, dim])
+ 2. **Input Processing**: Image/video → PIL/cache_video → frame extraction → resize and center crop to target resolution
+ 3. **Generation**: InfiniteTalk pipeline combines visual input + audio embeddings → diffusion sampling → video tensor
+ 4. **Output**: Video tensor → save_video_ffmpeg with audio track → MP4 file
+
+ ### Key Design Patterns
+
+ - **Lazy Loading**: Models only loaded on first inference to reduce cold start time
+ - **Memory Management**: Aggressive cleanup with `torch.cuda.empty_cache()` and `gc.collect()` after generation
+ - **ZeroGPU Integration**: `@spaces.GPU` decorator with calculated duration based on video length
+ - **Offloading**: Models can be offloaded to CPU between forward passes to save VRAM
+
+ ## Development Commands
+
+ ### Docker Build and Run
+ ```bash
+ # Build Docker image
+ docker build -t infinitetalk .
+
+ # Run locally
+ docker run -p 7860:7860 --gpus all infinitetalk
+ ```
+
+ ### Python Environment
+ ```bash
+ # Install dependencies (requires PyTorch 2.5.1+ for xfuser compatibility)
+ pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
+ pip install flash-attn==2.7.4.post1 --no-build-isolation  # Optional, may fail on some systems
+ pip install -r requirements.txt
+
+ # Run application
+ python app.py
+ ```
+
+ ### System Dependencies
+ Required packages (see `packages.txt`):
+ - ffmpeg (video processing)
+ - build-essential (compilation)
+ - libsndfile1 (audio I/O)
+ - git (model downloads)
+
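The lazy-loading pattern listed under Key Design Patterns can be sketched in a few lines. This is an illustrative stand-in, not the repository's actual `ModelManager`; the class and method names are assumptions:

```python
class LazyLoader:
    """Hypothetical sketch of the lazy-loading pattern described above."""

    def __init__(self):
        self._model = None
        self.load_count = 0  # track how often the expensive load runs

    def _load(self):
        # Stand-in for the expensive model download/initialization
        self.load_count += 1
        return {"name": "wan2.1-i2v-14b"}

    def get(self):
        if self._model is None:  # load only on first use
            self._model = self._load()
        return self._model

loader = LazyLoader()
loader.get()
loader.get()
print(loader.load_count)  # 1 — the model was loaded exactly once
```

The same idea keeps Space cold starts fast: import time pays nothing, and the first inference pays the full load once.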
+ ## Important Implementation Details
+
+ ### Resolution Handling
+ - User selects "480p" or "720p" in UI
+ - Internally mapped to `infinitetalk-480` (640x640) or `infinitetalk-720` (960x960)
+ - `sample_shift` parameter: 7 for 480p, 11 for 720p (controls diffusion sampling)
+
+ ### Audio Embedding Format
+ Audio embeddings must be saved as `.pt` files in the format expected by the pipeline:
+ ```python
+ audio_embeddings = torch.stack(embeddings.hidden_states[1:], dim=1).squeeze(0)
+ audio_embeddings = rearrange(audio_embeddings, "b s d -> s b d")  # Shape: [seq_len, batch, dim]
+ torch.save(audio_embeddings, emb_path)
+ ```
+
+ ### Pipeline Input Format
+ The `generate_infinitetalk` method expects:
+ ```python
+ input_clip = {
+     "prompt": "",  # Empty for talking head
+     "cond_video": image_or_video_path,
+     "cond_audio": {"person1": embedding_path},
+     "video_audio": audio_wav_path
+ }
+ ```
+
+ ### ZeroGPU Duration Calculation
+ ```python
+ base_time = 60  # Model loading
+ processing_rate = 2.5 if resolution == "480p" else 3.5  # seconds per video second
+ duration = int((base_time + video_duration * processing_rate) * 1.2)  # 20% safety margin
+ duration = min(duration, 300)  # Cap at 300s for free tier
+ ```
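Wrapped as a function, the duration calculation above behaves like this (sketch only; the function name and signature are assumptions, not the repository's API):

```python
def estimate_zerogpu_duration(video_duration: float, resolution: str = "480p") -> int:
    base_time = 60                                          # model loading overhead
    processing_rate = 2.5 if resolution == "480p" else 3.5  # s of compute per s of video
    duration = int((base_time + video_duration * processing_rate) * 1.2)  # 20% margin
    return min(duration, 300)                               # free-tier cap

print(estimate_zerogpu_duration(10))          # 102
print(estimate_zerogpu_duration(60, "720p"))  # 300 (capped)
```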
+
+ ### Memory Optimization
+ - Use `offload_model=True` in pipeline to offload between forwards
+ - Enable VRAM management for low-memory scenarios: `pipeline.enable_vram_management()`
+ - Flash-attention (if available) reduces memory usage significantly
+ - Chunked processing for videos >15s (480p) or >10s (720p)
+
+ ## HuggingFace Space Deployment
+
+ This project is designed for HuggingFace Spaces with ZeroGPU:
+ - SDK: `docker` (specified in README.md frontmatter)
+ - Hardware: `zero-gpu` (H200 with 70GB VRAM)
+ - Port: `7860` (Gradio default)
+ - First generation downloads ~15GB of models (2-3 minutes)
+ - Subsequent generations: ~40s for 10s video at 480p
+
+ See `DEPLOYMENT.md` for detailed deployment instructions and troubleshooting.
+
+ ## Common Pitfalls
+
+ 1. **Flash-attn compilation**: May fail on some systems. The Dockerfile handles this gracefully with `|| echo "Warning..."` fallback
+ 2. **PyTorch version**: Must use 2.5.1+ for xfuser's `torch.distributed.tensor.experimental` support
+ 3. **Audio sample rate**: Must be 16kHz for Wav2Vec2 model
+ 4. **Frame format**: Pipeline expects 4n+1 frames (e.g., 81 frames) for proper temporal modeling
+ 5. **Model paths**: InfiniteTalk weights must be loaded separately from base Wan model
+ 6. **TOKENIZERS_PARALLELISM**: Set to 'false' to avoid deadlocks in multi-threaded environments
+
+ ## File Structure
+
+ ```
+ ├── app.py               # Main Gradio application
+ ├── Dockerfile           # Docker build configuration
+ ├── requirements.txt     # Python dependencies
+ ├── packages.txt         # System dependencies
+ ├── utils/
+ │   ├── model_loader.py  # Model download and loading
+ │   └── gpu_manager.py   # GPU memory management
+ ├── wan/
+ │   ├── multitalk.py     # InfiniteTalk pipeline
+ │   ├── configs/         # Model configurations
+ │   ├── modules/         # Model architecture (VAE, DiT, etc.)
+ │   └── utils/           # Video/audio utilities
+ └── src/
+     └── audio_analysis/
+         └── wav2vec2.py  # Audio encoder with interpolation
+ ```
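Pitfall 4 above (the pipeline expects 4n+1 frames, e.g. 81) suggests a small snapping helper. This is a hypothetical illustration of the constraint; the pipeline's own frame handling may differ:

```python
def snap_frames(num_frames: int) -> int:
    # Round down to the nearest 4n+1 frame count (1, 5, ..., 77, 81, ...),
    # per the temporal-modeling constraint noted above.
    return max(1, ((num_frames - 1) // 4) * 4 + 1)

print(snap_frames(81))  # 81 (already 4*20 + 1)
print(snap_frames(84))  # 81
```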
README.md CHANGED
@@ -3,8 +3,10 @@ title: InfiniteTalk - Talking Video Generator
  emoji: 🎬
  colorFrom: blue
  colorTo: purple
- sdk: docker
- app_port: 7860
+ sdk: gradio
+ sdk_version: 5.6.0
+ python_version: "3.10"
+ app_file: app.py
  pinned: false
  license: apache-2.0
  hardware: zero-gpu
app.py CHANGED
@@ -5,14 +5,17 @@ Gradio Space with ZeroGPU support
 
  import os
  import sys
+
+ # CRITICAL: Set environment variables BEFORE any torch/torchvision imports
+ # This prevents torchvision from registering CUDA ops that don't exist on ZeroGPU at import time
+ os.environ["TORCHVISION_DISABLE_META_REGISTRATIONS"] = "1"
+ os.environ["TORCH_LOGS"] = "-all"  # Reduce torch logging noise
+
  import random
  import logging
  import warnings
  from pathlib import Path
 
- # Prevent torchvision from registering optional CUDA/Meta ops (nms) that may be missing on ZeroGPU
- os.environ.setdefault("TORCHVISION_DISABLE_META_REGISTRATIONS", "1")
-
  import gradio as gr
  import torch
  import numpy as np
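The ordering constraint in the app.py diff above is a general Python pattern: libraries read configuration environment variables at import time, so the variables must be in `os.environ` before the import statement runs. A minimal, torch-free illustration of the same pattern:

```python
import os

# Mirror of the pattern in the diff above: set config env vars first...
os.environ["TORCHVISION_DISABLE_META_REGISTRATIONS"] = "1"
os.environ["TORCH_LOGS"] = "-all"

# ...and only then import the libraries that read them at import time
# (imports commented out so this sketch runs without torch installed):
# import torch
# import torchvision

print(os.environ["TORCHVISION_DISABLE_META_REGISTRATIONS"])  # 1
```

Setting the variables after the import would have no effect, which is why the diff moves them ahead of every torch-related import.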
requirements.txt CHANGED
@@ -1,19 +1,26 @@
- # NOTE: PyTorch and flash-attn are installed via Dockerfile to avoid build issues
- # See Dockerfile for proper installation order
+ # PyTorch - must be installed first (HuggingFace Spaces handles CUDA)
+ --extra-index-url https://download.pytorch.org/whl/cu121
+ torch==2.5.1
+ torchvision==0.20.1
+ torchaudio==2.5.1
 
+ # Flash attention (optional - may fail on some systems)
+ # flash-attn
+
- # 1. Core ML libraries
- xformers==0.0.28
+ # Core ML libraries
+ xformers==0.0.28.post3
  transformers>=4.49.0
  tokenizers>=0.20.3
  diffusers>=0.31.0
  accelerate>=1.1.1
  einops
+ safetensors
 
- # 4. Gradio and Spaces
+ # Gradio and Spaces
  gradio>=5.0.0
  spaces
 
- # 5. Video/Image processing
+ # Video/Image processing
  opencv-python-headless>=4.9.0.80
  moviepy==1.0.3
  imageio
@@ -21,13 +28,14 @@ imageio-ffmpeg
  scikit-image
  decord
  scenedetect
+ pillow
 
- # 6. Audio processing
+ # Audio processing
  librosa
  soundfile
  pyloudnorm
 
- # 7. Utilities
+ # Utilities
  tqdm
  numpy>=1.23.5,<2
  easydict
@@ -35,3 +43,4 @@ ftfy
  loguru
  optimum-quanto==0.2.6
  xfuser>=0.4.1
+ huggingface_hub