CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

InfiniteTalk generates realistic talking-head videos with accurate lip-sync. It supports two modes:

Image-to-Video: Transform static portraits into talking videos using audio input
Video Dubbing: Re-sync existing videos with new audio while maintaining natural movements

It is built on the Wan2.1 diffusion model, with specialized audio conditioning for photorealistic results.

Architecture

Core Components

Main Application (app.py)

Gradio interface with ZeroGPU support via @spaces.GPU(duration=180) decorator
Two-tab interface: Image-to-Video and Video Dubbing
Lazy model loading on first inference to minimize startup time
Global ModelManager and GPUManager instances for resource management

Model Pipeline (wan/multitalk.py)

InfiniteTalkPipeline: Main generation pipeline using Wan2.1-I2V-14B model
Supports two resolutions: 480p (640x640) and 720p (960x960)
Uses diffusion-based generation with audio conditioning
Implements chunked processing for long videos to manage memory

Audio Processing (src/audio_analysis/wav2vec2.py)

Custom Wav2Vec2Model extending HuggingFace's implementation
Extracts audio embeddings with temporal interpolation via linear_interpolation
Processes audio at 16kHz with loudness normalization (pyloudnorm)
Stacks hidden states from all encoder layers for rich audio representation
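The temporal interpolation step above can be pictured as resampling per-step audio features to a target number of video frames. The following is a minimal pure-Python sketch of that idea, not the repository's linear_interpolation; the name resample_linear and the endpoint-aligned sampling are assumptions made for illustration.

```python
from typing import List, Sequence


def resample_linear(
    features: Sequence[Sequence[float]], target_len: int
) -> List[List[float]]:
    """Linearly resample a sequence of feature vectors to target_len steps.

    Hypothetical helper: the first and last input vectors map exactly to
    the first and last output vectors (endpoint-aligned sampling).
    """
    if not features:
        raise ValueError("features must not be empty")
    if target_len < 1:
        raise ValueError("target_len must be at least 1")

    src_len = len(features)
    if target_len == 1 or src_len == 1:
        # Nothing to interpolate between; repeat the first vector.
        return [list(features[0]) for _ in range(target_len)]

    scale = (src_len - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale                      # fractional position in the source
        lo = min(int(pos), src_len - 2)      # left neighbour index
        frac = pos - lo                      # weight of the right neighbour
        left, right = features[lo], features[lo + 1]
        out.append([(1 - frac) * a + frac * b for a, b in zip(left, right)])
    return out
```

For instance, resampling three one-dimensional features to five steps inserts the midpoints between neighbouring inputs.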

Model Management (utils/model_loader.py)

ModelManager: Handles lazy loading and caching of models from HuggingFace Hub
Downloads three model types:
- Wan2.1-I2V-14B: Main video generation model (Kijai/WanVideo_comfy)
- InfiniteTalk weights: Specialized talking head weights (MeiGen-AI/InfiniteTalk)
- Wav2Vec2: Audio encoder (TencentGameMate/chinese-wav2vec2-base)
Models cached in HF_HOME or /data/.huggingface

GPU Management (utils/gpu_manager.py)

GPUManager: Monitors memory usage and performs cleanup
Calculates ZeroGPU duration based on video length and resolution
Memory estimation: ~20GB base + 0.8GB (480p) or 1.5GB (720p) per second of video
Recommends chunking for videos estimated to need more than 50GB of memory
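The memory estimate and chunking recommendation above amount to a linear formula and a threshold check. A minimal sketch, assuming the per-second figures are per second of video; the function and constant names are illustrative and not taken from utils/gpu_manager.py.

```python
BASE_MEMORY_GB = 20.0                            # fixed cost quoted above
PER_VIDEO_SECOND_GB = {"480p": 0.8, "720p": 1.5}  # growth per second of video
CHUNKING_THRESHOLD_GB = 50.0


def estimate_memory_gb(video_seconds: float, resolution: str) -> float:
    """Estimated memory for a clip, using the linear model described above."""
    if resolution not in PER_VIDEO_SECOND_GB:
        raise ValueError(f"Unsupported resolution: {resolution!r}")
    return BASE_MEMORY_GB + PER_VIDEO_SECOND_GB[resolution] * video_seconds


def should_chunk(video_seconds: float, resolution: str) -> bool:
    """True when the estimate exceeds the chunking threshold."""
    return estimate_memory_gb(video_seconds, resolution) > CHUNKING_THRESHOLD_GB
```

Under this model the 50GB threshold is crossed beyond 37.5s of video at 480p and beyond 20s at 720p.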

Configuration (wan/configs/__init__.py)

WAN_CONFIGS: Model configurations for different tasks (t2v, i2v, infinitetalk)
SIZE_CONFIGS: Resolution mappings (infinitetalk-480: 640x640, infinitetalk-720: 960x960)
SUPPORTED_SIZES: Valid resolution options per model type

Data Flow

Audio Processing: Audio file → librosa load → loudness normalization → Wav2Vec2 feature extraction → audio embeddings (shape: [seq_len, num_layers, dim], with one entry per stacked encoder layer)
Input Processing: Image/video → PIL/cache_video → frame extraction → resize and center crop to target resolution
Generation: InfiniteTalk pipeline combines visual input + audio embeddings → diffusion sampling → video tensor
Output: Video tensor → save_video_ffmpeg with audio track → MP4 file
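The "resize and center crop" step in the input-processing stage can be sketched as plain geometry. This assumes the common cover-then-crop approach (scale the frame until it covers the target, then crop the centered region); the function name is hypothetical and the repository's actual helper may differ.

```python
from typing import Tuple

Size = Tuple[int, int]
Box = Tuple[int, int, int, int]


def cover_resize_and_crop_box(
    src_w: int, src_h: int, dst_w: int, dst_h: int
) -> Tuple[Size, Box]:
    """Return the resized size and the centered crop box for a target size.

    The frame is scaled uniformly so that both sides are at least as large
    as the target, then the centered dst_w x dst_h region is selected.
    """
    if min(src_w, src_h, dst_w, dst_h) <= 0:
        raise ValueError("all dimensions must be positive")

    scale = max(dst_w / src_w, dst_h / src_h)
    # max() guards against rounding leaving a side one pixel short.
    new_w = max(dst_w, round(src_w * scale))
    new_h = max(dst_h, round(src_h * scale))

    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)
```

The returned box uses the (left, upper, right, lower) order that PIL's Image.crop accepts.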

Key Design Patterns

Lazy Loading: Models are loaded only on first inference to reduce cold-start time
Memory Management: Aggressive cleanup with torch.cuda.empty_cache() and gc.collect() after generation
ZeroGPU Integration: @spaces.GPU decorator with calculated duration based on video length
Offloading: Models can be offloaded to CPU between forward passes to save VRAM
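The lazy-loading and cleanup patterns above can be illustrated with a small wrapper. This is a hypothetical stand-in, not the repository's ModelManager: the LazyResource class and its method names are assumptions, and the loader is any zero-argument callable.

```python
import gc
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Load an expensive object on first use and allow explicit release."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self.load_count = 0  # exposed so the behaviour is easy to observe

    def get(self) -> T:
        # The loader runs only when nothing is cached yet.
        if self._value is None:
            self._value = self._loader()
            self.load_count += 1
        return self._value

    def release(self) -> None:
        # Drop the reference and ask Python to collect garbage. With PyTorch
        # on a CUDA device, torch.cuda.empty_cache() would typically follow.
        self._value = None
        gc.collect()
```

Constructing the wrapper costs nothing; the loader runs on the first get() and again only after release().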

Development Commands

Docker Build and Run

# Build Docker image
docker build -t infinitetalk .

# Run locally
docker run -p 7860:7860 --gpus all infinitetalk

Python Environment

# Install dependencies (requires PyTorch 2.5.1+ for xfuser compatibility)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install flash-attn==2.7.4.post1 --no-build-isolation  # Optional, may fail on some systems
pip install -r requirements.txt

# Run application
python app.py

System Dependencies

Required packages (see packages.txt):

ffmpeg (video processing)
build-essential (compilation)
libsndfile1 (audio I/O)
git (model downloads)

Important Implementation Details

Resolution Handling

User selects "480p" or "720p" in UI
Internally mapped to infinitetalk-480 (640x640) or infinitetalk-720 (960x960)
sample_shift parameter: 7 for 480p, 11 for 720p (the shift factor applied to the diffusion sampling schedule)
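The mapping above can be captured in a small lookup table. A sketch, assuming a hypothetical ResolutionSettings tuple and resolve_resolution helper; the real values live in wan/configs/__init__.py and may be structured differently.

```python
from typing import NamedTuple


class ResolutionSettings(NamedTuple):
    size_key: str      # internal key, e.g. "infinitetalk-480"
    width: int
    height: int
    sample_shift: int


# Values copied from the list above; the container itself is illustrative.
RESOLUTIONS = {
    "480p": ResolutionSettings("infinitetalk-480", 640, 640, 7),
    "720p": ResolutionSettings("infinitetalk-720", 960, 960, 11),
}


def resolve_resolution(ui_choice: str) -> ResolutionSettings:
    """Translate the UI label into internal settings, rejecting unknown labels."""
    try:
        return RESOLUTIONS[ui_choice]
    except KeyError:
        raise ValueError(
            f"Unsupported resolution {ui_choice!r}; expected one of {sorted(RESOLUTIONS)}"
        ) from None
```

Keeping the size key and sample_shift in one record avoids the two drifting apart when a resolution is added.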

Audio Embedding Format

Audio embeddings must be saved as .pt files in the format expected by the pipeline:

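import torch
from einops import rearrange
# Stack every encoder layer's output (index 0, the pre-encoder embedding, is skipped); assumes batch size 1
audio_embeddings = torch.stack(embeddings.hidden_states[1:], dim=1).squeeze(0)
audio_embeddings = rearrange(audio_embeddings, "b s d -> s b d")  # Shape: [seq_len, num_layers, dim]; "b" here is the layer axis
torch.save(audio_embeddings, emb_path)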

Pipeline Input Format

The generate_infinitetalk method expects:

input_clip = {
    "prompt": "",  # Empty for talking head
    "cond_video": image_or_video_path,
    "cond_audio": {"person1": embedding_path},
    "video_audio": audio_wav_path
}

ZeroGPU Duration Calculation

base_time = 60  # Seconds reserved for model loading
processing_rate = 2.5 if resolution == "480p" else 3.5  # Seconds of compute per second of video
duration = int((base_time + video_duration * processing_rate) * 1.2)  # 20% safety margin
duration = min(duration, 300)  # Cap at 300s for free tier
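Wrapped as a function, the calculation above might look like the following sketch. The function and constant names are illustrative and not taken from utils/gpu_manager.py.

```python
BASE_TIME_S = 60                                 # model loading
PROCESSING_RATE = {"480p": 2.5, "720p": 3.5}     # compute seconds per video second
SAFETY_MARGIN = 1.2                              # 20% headroom
MAX_DURATION_S = 300                             # free-tier cap


def zerogpu_duration(video_duration_s: float, resolution: str) -> int:
    """Seconds of GPU time to request for one generation."""
    if resolution not in PROCESSING_RATE:
        raise ValueError(f"Unsupported resolution: {resolution!r}")
    rate = PROCESSING_RATE[resolution]
    duration = int((BASE_TIME_S + video_duration_s * rate) * SAFETY_MARGIN)
    return min(duration, MAX_DURATION_S)
```

Under these numbers a 10s clip at 480p requests 102s, while a 120s clip at 720p hits the 300s cap.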

Memory Optimization

Pass offload_model=True to the pipeline to offload models to CPU between forward passes
Enable VRAM management for low-memory scenarios: pipeline.enable_vram_management()
Flash-attention (if available) reduces memory usage significantly
Chunked processing for videos >15s (480p) or >10s (720p)

HuggingFace Space Deployment

This project is designed for HuggingFace Spaces with ZeroGPU:

SDK: docker (specified in README.md frontmatter)
Hardware: zero-gpu (H200 with 70GB VRAM)
Port: 7860 (Gradio default)
First generation downloads ~15GB of models (2-3 minutes)
Subsequent generations: ~40s for 10s video at 480p

See DEPLOYMENT.md for detailed deployment instructions and troubleshooting.

Common Pitfalls

Flash-attn compilation: May fail on some systems. The Dockerfile tolerates the failure with a || echo "Warning..." fallback, so the build continues without it
PyTorch version: Must use 2.5.1+ for xfuser's torch.distributed.tensor.experimental support
Audio sample rate: Must be 16kHz for Wav2Vec2 model
Frame format: Pipeline expects 4n+1 frames (e.g., 81 frames) for proper temporal modeling
Model paths: InfiniteTalk weights must be loaded separately from base Wan model
TOKENIZERS_PARALLELISM: Set to 'false' to avoid deadlocks in multi-threaded environments
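The 4n+1 frame constraint listed above is easy to enforce before calling the pipeline. A minimal sketch, assuming that rounding down to the nearest valid count is acceptable; the helper name is hypothetical and the pipeline may handle this differently.

```python
def nearest_valid_frame_count(requested: int) -> int:
    """Largest frame count of the form 4n + 1 that does not exceed requested."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    return ((requested - 1) // 4) * 4 + 1


def is_valid_frame_count(frames: int) -> bool:
    """True when frames can be written as 4n + 1 for some n >= 0."""
    return frames >= 1 and (frames - 1) % 4 == 0
```

For example, 81 is already valid (4 * 20 + 1), while a request for 84 frames rounds down to 81.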

File Structure

├── app.py                          # Main Gradio application
├── Dockerfile                      # Docker build configuration
├── requirements.txt                # Python dependencies
├── packages.txt                    # System dependencies
├── utils/
│   ├── model_loader.py            # Model download and loading
│   └── gpu_manager.py             # GPU memory management
├── wan/
│   ├── multitalk.py               # InfiniteTalk pipeline
│   ├── configs/                   # Model configurations
│   ├── modules/                   # Model architecture (VAE, DiT, etc.)
│   └── utils/                     # Video/audio utilities
└── src/
    └── audio_analysis/
        └── wav2vec2.py            # Audio encoder with interpolation