CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
InfiniteTalk is a talking video generator that creates realistic talking head videos with accurate lip-sync. It supports two modes:
- Image-to-Video: Transform static portraits into talking videos using audio input
- Video Dubbing: Re-sync existing videos with new audio while maintaining natural movements
Built on the Wan2.1 diffusion model with specialized audio conditioning for photorealistic results.
Architecture
Core Components
Main Application (app.py)
- Gradio interface with ZeroGPU support via the @spaces.GPU(duration=180) decorator
- Two-tab interface: Image-to-Video and Video Dubbing
- Lazy model loading on first inference to minimize startup time
- Global ModelManager and GPUManager instances for resource management
Model Pipeline (wan/multitalk.py)
- InfiniteTalkPipeline: Main generation pipeline using the Wan2.1-I2V-14B model
- Supports two resolutions: 480p (640x640) and 720p (960x960)
- Uses diffusion-based generation with audio conditioning
- Implements chunked processing for long videos to manage memory
Audio Processing (src/audio_analysis/wav2vec2.py)
- Custom Wav2Vec2Model extending HuggingFace's implementation
- Extracts audio embeddings with temporal interpolation via linear_interpolation
- Processes audio at 16kHz with loudness normalization (pyloudnorm)
- Stacks hidden states from all encoder layers for rich audio representation
Model Management (utils/model_loader.py)
- ModelManager: Handles lazy loading and caching of models from HuggingFace Hub
- Downloads three model types:
  - Wan2.1-I2V-14B: Main video generation model (Kijai/WanVideo_comfy)
  - InfiniteTalk weights: Specialized talking head weights (MeiGen-AI/InfiniteTalk)
  - Wav2Vec2: Audio encoder (TencentGameMate/chinese-wav2vec2-base)
- Models cached in HF_HOME or /data/.huggingface
GPU Management (utils/gpu_manager.py)
- GPUManager: Monitors memory usage and performs cleanup
- Calculates ZeroGPU duration based on video length and resolution
- Memory estimation: ~20GB base + 0.8GB per second of video (480p) or 1.5GB per second of video (720p)
- Recommends chunking for videos requiring >50GB memory
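The memory estimate above is simple arithmetic, so it can be sketched in a few lines. This is an illustration built from the quoted figures only; the function and constant names are hypothetical and are not taken from utils/gpu_manager.py.

```python
# Illustrative sketch: constants come from the figures quoted above,
# function and constant names are hypothetical.
BASE_GB = 20.0
PER_SECOND_GB = {"480p": 0.8, "720p": 1.5}
CHUNKING_THRESHOLD_GB = 50.0


def estimate_memory_gb(video_seconds: float, resolution: str) -> float:
    """Estimate memory as a fixed base plus a cost per second of video."""
    if resolution not in PER_SECOND_GB:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    return BASE_GB + PER_SECOND_GB[resolution] * video_seconds


def should_chunk(video_seconds: float, resolution: str) -> bool:
    """Recommend chunking once the estimate exceeds the 50GB threshold."""
    return estimate_memory_gb(video_seconds, resolution) > CHUNKING_THRESHOLD_GB
```

For example, a 10-second clip at 480p is estimated at 20 + 0.8 × 10 = 28GB, while a 30-second clip at 720p is estimated at 20 + 1.5 × 30 = 65GB and would be flagged for chunking.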
Configuration (wan/configs/__init__.py)
- WAN_CONFIGS: Model configurations for different tasks (t2v, i2v, infinitetalk)
- SIZE_CONFIGS: Resolution mappings (infinitetalk-480: 640x640, infinitetalk-720: 960x960)
- SUPPORTED_SIZES: Valid resolution options per model type
Data Flow
- Audio Processing: Audio file → librosa load → loudness normalization → Wav2Vec2 feature extraction → audio embeddings (shape: [seq_len, batch, dim])
- Input Processing: Image/video → PIL/cache_video → frame extraction → resize and center crop to target resolution
- Generation: InfiniteTalk pipeline combines visual input + audio embeddings → diffusion sampling → video tensor
- Output: Video tensor → save_video_ffmpeg with audio track → MP4 file
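The "resize and center crop" step in the input-processing stage can be illustrated with plain geometry. The sketch below shows one common way to do it (scale until the frame covers the target, then crop the middle); it is an assumption about the approach, not the pipeline's actual code, and the function name is hypothetical.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]


def resize_and_center_crop_box(
    src_w: int, src_h: int, dst_w: int, dst_h: int
) -> Tuple[Tuple[int, int], Box]:
    """Return the resized size and the (left, top, right, bottom) crop box."""
    # Scale by the larger ratio so the resized frame fully covers the target.
    scale = max(dst_w / src_w, dst_h / src_h)
    resized_w = max(dst_w, round(src_w * scale))
    resized_h = max(dst_h, round(src_h * scale))
    # Center the crop window inside the resized frame.
    left = (resized_w - dst_w) // 2
    top = (resized_h - dst_h) // 2
    return (resized_w, resized_h), (left, top, left + dst_w, top + dst_h)
```

With PIL, the two results would typically be passed to Image.resize and Image.crop in that order.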
Key Design Patterns
- Lazy Loading: Models only loaded on first inference to reduce cold start time
- Memory Management: Aggressive cleanup with torch.cuda.empty_cache() and gc.collect() after generation
- ZeroGPU Integration: @spaces.GPU decorator with calculated duration based on video length
- Offloading: Models can be offloaded to CPU between forward passes to save VRAM
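The lazy-loading and cleanup patterns can be sketched without any GPU dependencies. The class below is a hypothetical stand-in, not the repository's ModelManager: it builds a model only on first request and runs gc.collect() on release.

```python
import gc
from typing import Any, Callable, Dict, Optional


class LazyModelCache:
    """Hypothetical cache that builds each model on first request only."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._models: Dict[str, Any] = {}

    def register(self, name: str, loader: Callable[[], Any]) -> None:
        """Record how to build a model without building it yet."""
        self._loaders[name] = loader

    def get(self, name: str) -> Any:
        """Return the cached model, invoking its loader on first use."""
        if name not in self._models:
            self._models[name] = self._loaders[name]()
        return self._models[name]

    def release(self, name: Optional[str] = None) -> None:
        """Drop one cached model (or all of them) and run the garbage collector."""
        if name is None:
            self._models.clear()
        else:
            self._models.pop(name, None)
        gc.collect()
        # On a CUDA machine, torch.cuda.empty_cache() would follow here.
```

Registering loaders at startup keeps the cold start cheap, because the expensive download and load only happen inside the first get() call.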
Development Commands
Docker Build and Run
# Build Docker image
docker build -t infinitetalk .
# Run locally
docker run -p 7860:7860 --gpus all infinitetalk
Python Environment
# Install dependencies (requires PyTorch 2.5.1+ for xfuser compatibility)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install flash-attn==2.7.4.post1 --no-build-isolation # Optional, may fail on some systems
pip install -r requirements.txt
# Run application
python app.py
System Dependencies
Required packages (see packages.txt):
- ffmpeg (video processing)
- build-essential (compilation)
- libsndfile1 (audio I/O)
- git (model downloads)
Important Implementation Details
Resolution Handling
- User selects "480p" or "720p" in UI
- Internally mapped to infinitetalk-480 (640x640) or infinitetalk-720 (960x960)
- sample_shift parameter: 7 for 480p, 11 for 720p (controls diffusion sampling)
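The mapping from the UI choice to internal settings can be written as a small lookup table. The values are the ones listed above; the table and function names are hypothetical and are not the names used in app.py.

```python
from typing import Dict

# Values copied from the list above; names are illustrative only.
RESOLUTION_SETTINGS: Dict[str, dict] = {
    "480p": {"size_key": "infinitetalk-480", "size": (640, 640), "sample_shift": 7},
    "720p": {"size_key": "infinitetalk-720", "size": (960, 960), "sample_shift": 11},
}


def resolve_resolution(choice: str) -> dict:
    """Translate the UI label into the size key, pixel size, and sample_shift."""
    try:
        return RESOLUTION_SETTINGS[choice]
    except KeyError:
        raise ValueError(f"unsupported resolution: {choice!r}") from None
```

Keeping the three values in one table avoids the case where the size key is updated but sample_shift is left at the value for the other resolution.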
Audio Embedding Format
Audio embeddings must be saved as .pt files in the format expected by the pipeline:
audio_embeddings = torch.stack(embeddings.hidden_states[1:], dim=1).squeeze(0)
audio_embeddings = rearrange(audio_embeddings, "b s d -> s b d") # Shape: [seq_len, batch, dim]
torch.save(audio_embeddings, emb_path)
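It can help to trace the tensor shapes through these three lines. The sketch below does so with plain tuples, assuming each entry of hidden_states has shape [1, seq_len, dim] (a single audio clip). Under that assumption, the axis labeled b in the rearrange pattern ends up holding one entry per stacked encoder layer. The function name and the example sizes are illustrative.

```python
from typing import Tuple


def stacked_embedding_shape(
    num_hidden_states: int, seq_len: int, dim: int, batch: int = 1
) -> Tuple[int, int, int]:
    """Trace shapes through stack(dim=1), squeeze(0), and 'b s d -> s b d'."""
    layers = num_hidden_states - 1  # hidden_states[1:] drops the first entry
    stacked = (batch, layers, seq_len, dim)  # result of torch.stack(..., dim=1)
    if batch != 1:
        # squeeze(0) is a no-op unless the leading axis has size 1, and the
        # three-axis rearrange pattern would then fail on a 4-D tensor.
        raise ValueError("this trace assumes a single clip (batch == 1)")
    b, s, d = stacked[1:]  # after squeeze(0): (layers, seq_len, dim)
    return (s, b, d)  # after rearrange: "b s d -> s b d"
```

For example, with 13 hidden states, 250 time steps, and 768 features, the saved tensor would have shape (250, 12, 768).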
Pipeline Input Format
The generate_infinitetalk method expects:
input_clip = {
    "prompt": "",  # Empty for talking head
    "cond_video": image_or_video_path,
    "cond_audio": {"person1": embedding_path},
    "video_audio": audio_wav_path,
}
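A small helper can assemble this dictionary and catch the most likely mistake, which is passing a raw audio file where the .pt embedding is expected. The helper is a hypothetical convenience, not part of the repository; only the dictionary keys come from the format above.

```python
from pathlib import Path
from typing import Dict, Union

PathLike = Union[str, Path]


def build_input_clip(
    cond_video: PathLike,
    embedding_path: PathLike,
    audio_wav_path: PathLike,
    speaker: str = "person1",
) -> Dict[str, object]:
    """Assemble the input dictionary in the layout shown above."""
    if Path(embedding_path).suffix != ".pt":
        raise ValueError("audio embeddings are expected as a .pt file")
    return {
        "prompt": "",  # Empty for talking head
        "cond_video": str(cond_video),
        "cond_audio": {speaker: str(embedding_path)},
        "video_audio": str(audio_wav_path),
    }
```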
ZeroGPU Duration Calculation
base_time = 60 # Model loading
processing_rate = 2.5 if resolution == "480p" else 3.5 # Seconds of processing per second of video; resolution is "480p" or "720p"
duration = int((base_time + video_duration * processing_rate) * 1.2) # 20% safety margin
duration = min(duration, 300) # Cap at 300s for free tier
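Wrapping the formula in a function makes it easy to check a few cases by hand. This is a sketch using the constants above; the function name and signature are hypothetical.

```python
def zerogpu_duration(video_seconds: float, resolution: str) -> int:
    """Return the requested ZeroGPU duration in seconds, capped at 300."""
    base_time = 60  # Model loading
    rates = {"480p": 2.5, "720p": 3.5}  # Seconds of processing per second of video
    if resolution not in rates:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    duration = int((base_time + video_seconds * rates[resolution]) * 1.2)
    return min(duration, 300)
```

A 10-second clip at 480p gives (60 + 10 × 2.5) × 1.2 = 102 seconds. A 60-second clip at 720p gives (60 + 60 × 3.5) × 1.2 = 324 seconds, which the cap reduces to 300. With these constants the cap is reached at roughly 76 seconds of 480p video or about 54 seconds of 720p video.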
Memory Optimization
- Use offload_model=True in the pipeline to offload models between forward passes
- Enable VRAM management for low-memory scenarios: pipeline.enable_vram_management()
- Flash-attention (if available) reduces memory usage significantly
- Chunked processing for videos >15s (480p) or >10s (720p)
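The chunking thresholds above can be turned into a simple planner. The sketch assumes that each chunk is at most as long as the threshold for its resolution; the source does not state the actual chunk length, so treat that choice and the function name as hypothetical.

```python
from typing import List, Tuple

# Thresholds from the list above; using them as the chunk length is an assumption.
CHUNK_LIMIT_SECONDS = {"480p": 15.0, "720p": 10.0}


def plan_chunks(video_seconds: float, resolution: str) -> List[Tuple[float, float]]:
    """Split [0, video_seconds] into consecutive (start, end) chunks."""
    limit = CHUNK_LIMIT_SECONDS[resolution]
    total = float(video_seconds)
    if total <= limit:
        return [(0.0, total)]  # Short enough to process in one pass
    chunks = []
    start = 0.0
    while start < total:
        end = min(start + limit, total)
        chunks.append((start, end))
        start = end
    return chunks
```

A 40-second clip at 480p would be planned as three chunks covering 0 to 15, 15 to 30, and 30 to 40 seconds.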
HuggingFace Space Deployment
This project is designed for HuggingFace Spaces with ZeroGPU:
- SDK: docker (specified in README.md frontmatter)
- Hardware: zero-gpu (H200 with 70GB VRAM)
- Port: 7860 (Gradio default)
- First generation downloads ~15GB of models (2-3 minutes)
- Subsequent generations: ~40s for 10s video at 480p
See DEPLOYMENT.md for detailed deployment instructions and troubleshooting.
Common Pitfalls
- Flash-attn compilation: May fail on some systems. The Dockerfile handles this gracefully with a || echo "Warning..." fallback
- PyTorch version: Must use 2.5.1+ for xfuser's torch.distributed.tensor.experimental support
- Audio sample rate: Must be 16kHz for Wav2Vec2 model
- Frame format: Pipeline expects 4n+1 frames (e.g., 81 frames) for proper temporal modeling
- Model paths: InfiniteTalk weights must be loaded separately from base Wan model
- TOKENIZERS_PARALLELISM: Set to 'false' to avoid deadlocks in multi-threaded environments
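Two of these pitfalls are easy to guard against in code. The sketch below sets TOKENIZERS_PARALLELISM early and rounds a frame count down to the nearest value of the form 4n+1. The helper name is hypothetical, and rounding down (rather than up or padding) is an assumption about what the caller wants.

```python
import os

# Set early, before any tokenizer is created; setdefault keeps an existing value.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")


def nearest_valid_frame_count(frames: int) -> int:
    """Round down to the largest frame count of the form 4n+1."""
    if frames < 1:
        raise ValueError("at least one frame is required")
    return ((frames - 1) // 4) * 4 + 1
```

For example, 81 frames are already valid, while 84 frames would be trimmed to 81 and 100 frames to 97.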
File Structure
├── app.py                  # Main Gradio application
├── Dockerfile              # Docker build configuration
├── requirements.txt        # Python dependencies
├── packages.txt            # System dependencies
├── utils/
│   ├── model_loader.py     # Model download and loading
│   └── gpu_manager.py      # GPU memory management
├── wan/
│   ├── multitalk.py        # InfiniteTalk pipeline
│   ├── configs/            # Model configurations
│   ├── modules/            # Model architecture (VAE, DiT, etc.)
│   └── utils/              # Video/audio utilities
└── src/
    └── audio_analysis/
        └── wav2vec2.py     # Audio encoder with interpolation