CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

InfiniteTalk is a talking video generator that creates realistic talking head videos with accurate lip-sync. It supports two modes:

  • Image-to-Video: Transform static portraits into talking videos using audio input
  • Video Dubbing: Re-sync existing videos with new audio while maintaining natural movements

Built on the Wan2.1 diffusion model with specialized audio conditioning for photorealistic results.

Architecture

Core Components

Main Application (app.py)

  • Gradio interface with ZeroGPU support via @spaces.GPU(duration=180) decorator
  • Two-tab interface: Image-to-Video and Video Dubbing
  • Lazy model loading on first inference to minimize startup time
  • Global ModelManager and GPUManager instances for resource management

Model Pipeline (wan/multitalk.py)

  • InfiniteTalkPipeline: Main generation pipeline using Wan2.1-I2V-14B model
  • Supports two resolutions: 480p (640x640) and 720p (960x960)
  • Uses diffusion-based generation with audio conditioning
  • Implements chunked processing for long videos to manage memory

Audio Processing (src/audio_analysis/wav2vec2.py)

  • Custom Wav2Vec2Model extending HuggingFace's implementation
  • Extracts audio embeddings with temporal interpolation via linear_interpolation
  • Processes audio at 16kHz with loudness normalization (pyloudnorm)
  • Stacks hidden states from all encoder layers for rich audio representation
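The temporal interpolation step can be illustrated with a small, dependency-free sketch. This is not the repository's `linear_interpolation` function (which presumably operates on tensors); `resample_features` is a hypothetical name, and the code only shows the idea of stretching or shrinking a feature sequence so that it has one feature vector per video frame.

```python
from typing import List


def resample_features(features: List[List[float]], target_len: int) -> List[List[float]]:
    """Linearly interpolate a [seq_len][dim] feature sequence to target_len steps.

    Hypothetical helper for illustration; endpoints of the source and target
    sequences are aligned with each other.
    """
    if not features or target_len < 1:
        raise ValueError("need a non-empty sequence and target_len >= 1")
    src_len = len(features)
    if src_len == 1 or target_len == 1:
        # Nothing to interpolate between: repeat the first feature vector.
        return [list(features[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (src_len - 1) / (target_len - 1)  # fractional source index
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        w = pos - lo  # weight of the right-hand neighbour
        out.append([(1 - w) * a + w * b for a, b in zip(features[lo], features[hi])])
    return out


# Stretch a 2-step, 1-dimensional sequence to 3 steps.
print(resample_features([[0.0], [2.0]], 3))
```

In practice the target length would be the number of video frames covered by the audio clip, so that each frame gets its own audio feature vector.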

Model Management (utils/model_loader.py)

  • ModelManager: Handles lazy loading and caching of models from HuggingFace Hub
  • Downloads three model types:
    • Wan2.1-I2V-14B: Main video generation model (Kijai/WanVideo_comfy)
    • InfiniteTalk weights: Specialized talking head weights (MeiGen-AI/InfiniteTalk)
    • Wav2Vec2: Audio encoder (TencentGameMate/chinese-wav2vec2-base)
  • Models cached in HF_HOME or /data/.huggingface

GPU Management (utils/gpu_manager.py)

  • GPUManager: Monitors memory usage and performs cleanup
  • Calculates ZeroGPU duration based on video length and resolution
  • Memory estimation: ~20GB base plus 0.8GB (480p) or 1.5GB (720p) per second of video
  • Recommends chunking for videos requiring >50GB memory
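As a rough sketch of the memory estimate described above, the following uses only the figures quoted in this section. The function names are hypothetical and do not necessarily match what `utils/gpu_manager.py` exposes.

```python
def estimate_memory_gb(video_seconds: float, resolution: str = "480p") -> float:
    """Rough VRAM estimate: ~20 GB base plus a per-second cost by resolution."""
    per_second_gb = {"480p": 0.8, "720p": 1.5}  # figures quoted in this section
    if resolution not in per_second_gb:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    return 20.0 + per_second_gb[resolution] * video_seconds


def should_chunk(video_seconds: float, resolution: str = "480p") -> bool:
    """Recommend chunking when the estimate exceeds 50 GB."""
    return estimate_memory_gb(video_seconds, resolution) > 50.0


print(estimate_memory_gb(10, "480p"))  # 20 GB base + 10 s * 0.8 GB/s
print(should_chunk(30, "720p"))        # 20 + 30 * 1.5 = 65 GB, above the threshold
```

With these numbers the 50GB threshold is crossed at roughly 37.5s of video at 480p and 20s at 720p.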

Configuration (wan/configs/__init__.py)

  • WAN_CONFIGS: Model configurations for different tasks (t2v, i2v, infinitetalk)
  • SIZE_CONFIGS: Resolution mappings (infinitetalk-480: 640x640, infinitetalk-720: 960x960)
  • SUPPORTED_SIZES: Valid resolution options per model type

Data Flow

  1. Audio Processing: Audio file → librosa load → loudness normalization → Wav2Vec2 feature extraction → audio embeddings (shape: [seq_len, batch, dim])
  2. Input Processing: Image/video → PIL/cache_video → frame extraction → resize and center crop to target resolution
  3. Generation: InfiniteTalk pipeline combines visual input + audio embeddings → diffusion sampling → video tensor
  4. Output: Video tensor → save_video_ffmpeg with audio track → MP4 file

Key Design Patterns

  • Lazy Loading: Models are loaded only on first inference to reduce cold-start time
  • Memory Management: Aggressive cleanup with torch.cuda.empty_cache() and gc.collect() after generation
  • ZeroGPU Integration: @spaces.GPU decorator with calculated duration based on video length
  • Offloading: Models can be offloaded to CPU between forward passes to save VRAM
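The lazy-loading and cleanup patterns above can be sketched with the standard library alone. `LazyModelCache` is a hypothetical stand-in, not the repository's `ModelManager`, and the loader here returns a placeholder dictionary instead of a real model.

```python
import gc
from typing import Any, Callable, Dict


class LazyModelCache:
    """Hypothetical stand-in for a model manager: build each model on first request."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._models: Dict[str, Any] = {}

    def register(self, name: str, loader: Callable[[], Any]) -> None:
        """Remember how to build a model without building it yet."""
        self._loaders[name] = loader

    def get(self, name: str) -> Any:
        """Return the cached model, invoking its loader only on first access."""
        if name not in self._models:
            self._models[name] = self._loaders[name]()
        return self._models[name]

    def release(self) -> None:
        """Drop cached models and run the garbage collector."""
        self._models.clear()
        gc.collect()
        # With PyTorch installed, torch.cuda.empty_cache() would follow here.


calls = []


def fake_loader() -> dict:
    calls.append("load")  # record each real load
    return {"name": "placeholder-model"}


cache = LazyModelCache()
cache.register("wav2vec2", fake_loader)
cache.get("wav2vec2")
cache.get("wav2vec2")  # served from the cache, loader not called again
print(len(calls))
```

Registering loaders up front keeps application startup cheap: nothing is downloaded or moved to the GPU until the first `get` call.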

Development Commands

Docker Build and Run

# Build Docker image
docker build -t infinitetalk .

# Run locally
docker run -p 7860:7860 --gpus all infinitetalk

Python Environment

# Install dependencies (requires PyTorch 2.5.1+ for xfuser compatibility)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install flash-attn==2.7.4.post1 --no-build-isolation  # Optional, may fail on some systems
pip install -r requirements.txt

# Run application
python app.py

System Dependencies

Required packages (see packages.txt):

  • ffmpeg (video processing)
  • build-essential (compilation)
  • libsndfile1 (audio I/O)
  • git (model downloads)

Important Implementation Details

Resolution Handling

  • User selects "480p" or "720p" in UI
  • Internally mapped to infinitetalk-480 (640x640) or infinitetalk-720 (960x960)
  • sample_shift parameter: 7 for 480p, 11 for 720p (controls diffusion sampling)
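A minimal sketch of this mapping, assuming the UI passes the label as a plain string. `RESOLUTION_PRESETS` and `resolve_resolution` are hypothetical names; the values are the ones listed above.

```python
RESOLUTION_PRESETS = {
    "480p": {"size_key": "infinitetalk-480", "size": (640, 640), "sample_shift": 7},
    "720p": {"size_key": "infinitetalk-720", "size": (960, 960), "sample_shift": 11},
}


def resolve_resolution(label: str) -> dict:
    """Translate a UI label into the internal size key, frame size and sample_shift."""
    try:
        return RESOLUTION_PRESETS[label]
    except KeyError:
        valid = ", ".join(sorted(RESOLUTION_PRESETS))
        raise ValueError(f"unknown resolution {label!r}; expected one of: {valid}") from None


preset = resolve_resolution("720p")
print(preset["size_key"], preset["size"], preset["sample_shift"])
```

Keeping the three values in one table avoids passing a 480p `sample_shift` together with a 720p size key.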

Audio Embedding Format

Audio embeddings must be saved as .pt files in the format expected by the pipeline:

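import torch
from einops import rearrange
# hidden_states[1:] skips the embedding output; after squeeze(0) the leading axis ("b" below) indexes the stacked encoder layers
audio_embeddings = torch.stack(embeddings.hidden_states[1:], dim=1).squeeze(0)
audio_embeddings = rearrange(audio_embeddings, "b s d -> s b d")  # Shape: [seq_len, batch, dim]
torch.save(audio_embeddings, emb_path)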

Pipeline Input Format

The generate_infinitetalk method expects:

input_clip = {
    "prompt": "",  # Empty for talking head
    "cond_video": image_or_video_path,
    "cond_audio": {"person1": embedding_path},
    "video_audio": audio_wav_path
}

ZeroGPU Duration Calculation

base_time = 60  # Seconds reserved for model loading
processing_rate = 2.5 if resolution == "480p" else 3.5  # Compute seconds per second of video (3.5 for 720p)
duration = int((base_time + video_duration * processing_rate) * 1.2)  # 20% safety margin
duration = min(duration, 300)  # Cap at 300s for free tier
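The calculation above can be wrapped in a function for experimentation. This is a sketch under the constants stated in this section; the name `zerogpu_duration` and its signature are hypothetical.

```python
def zerogpu_duration(video_duration: float, resolution: str = "480p") -> int:
    """Estimate the ZeroGPU time budget in seconds for one generation."""
    rates = {"480p": 2.5, "720p": 3.5}  # compute seconds per second of video
    if resolution not in rates:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    base_time = 60  # seconds reserved for model loading
    duration = int((base_time + video_duration * rates[resolution]) * 1.2)  # 20% margin
    return min(duration, 300)  # cap at 300s for the free tier


print(zerogpu_duration(10, "480p"))
print(zerogpu_duration(120, "720p"))
```

A 10s clip at 480p needs (60 + 25) × 1.2, or about 102 seconds. A 120s clip at 720p would need 576 seconds uncapped, so it hits the 300s cap, which suggests that long clips cannot finish within a single free-tier allocation.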

Memory Optimization

  • Use offload_model=True in pipeline to offload between forwards
  • Enable VRAM management for low-memory scenarios: pipeline.enable_vram_management()
  • Flash-attention (if available) reduces memory usage significantly
  • Chunked processing for videos >15s (480p) or >10s (720p)
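One way to picture the chunking thresholds is a planner that splits a clip into time spans. This is an illustrative sketch only: it assumes each chunk is as long as the threshold and that chunks do not overlap, neither of which is confirmed by this document, and `plan_chunks` is a hypothetical name.

```python
from typing import List, Tuple

# Thresholds quoted above: chunk when the video is longer than this many seconds.
CHUNK_THRESHOLD_S = {"480p": 15.0, "720p": 10.0}


def plan_chunks(video_seconds: float, resolution: str) -> List[Tuple[float, float]]:
    """Split [0, video_seconds] into consecutive (start, end) spans.

    Assumption for illustration: the chunk length equals the threshold and
    chunks do not overlap.
    """
    limit = CHUNK_THRESHOLD_S[resolution]
    if video_seconds <= limit:
        return [(0.0, video_seconds)]  # short enough to process in one pass
    chunks = []
    start = 0.0
    while start < video_seconds:
        end = min(start + limit, video_seconds)
        chunks.append((start, end))
        start = end
    return chunks


print(plan_chunks(25.0, "720p"))
print(plan_chunks(12.0, "480p"))
```

A real pipeline would likely carry some frames over between chunks to keep motion continuous, which this sketch omits.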

HuggingFace Space Deployment

This project is designed for HuggingFace Spaces with ZeroGPU:

  • SDK: docker (specified in README.md frontmatter)
  • Hardware: zero-gpu (H200 with 70GB VRAM)
  • Port: 7860 (Gradio default)
  • First generation downloads ~15GB of models (2-3 minutes)
  • Subsequent generations: ~40s for 10s video at 480p

See DEPLOYMENT.md for detailed deployment instructions and troubleshooting.

Common Pitfalls

  1. Flash-attn compilation: May fail on some systems. The Dockerfile handles this gracefully with || echo "Warning..." fallback
  2. PyTorch version: Must use 2.5.1+ for xfuser's torch.distributed.tensor.experimental support
  3. Audio sample rate: Must be 16kHz for Wav2Vec2 model
  4. Frame format: Pipeline expects 4n+1 frames (e.g., 81 frames) for proper temporal modeling
  5. Model paths: InfiniteTalk weights must be loaded separately from base Wan model
  6. TOKENIZERS_PARALLELISM: Set to 'false' to avoid deadlocks in multi-threaded environments

File Structure

├── app.py                          # Main Gradio application
├── Dockerfile                      # Docker build configuration
├── requirements.txt                # Python dependencies
├── packages.txt                    # System dependencies
├── utils/
│   ├── model_loader.py            # Model download and loading
│   └── gpu_manager.py             # GPU memory management
├── wan/
│   ├── multitalk.py               # InfiniteTalk pipeline
│   ├── configs/                   # Model configurations
│   ├── modules/                   # Model architecture (VAE, DiT, etc.)
│   └── utils/                     # Video/audio utilities
└── src/
    └── audio_analysis/
        └── wav2vec2.py            # Audio encoder with interpolation