
TTV-1B Setup Guide

Complete installation and setup instructions for the TTV-1B text-to-video model.

Prerequisites

Hardware Requirements

Minimum (Inference Only)

  • GPU: 8GB VRAM (RTX 3070, RTX 4060 Ti)
  • RAM: 16GB
  • Storage: 50GB
  • OS: Ubuntu 20.04+, Windows 10+, macOS 12+

Recommended (Training)

  • GPU: 24GB+ VRAM (RTX 4090, A5000, A100)
  • RAM: 64GB
  • Storage: 500GB SSD
  • OS: Ubuntu 22.04 LTS

Production (Full Training)

  • GPU: 8× A100 80GB
  • RAM: 512GB
  • Storage: 2TB NVMe SSD
  • Network: High-speed interconnect for multi-GPU

Software Requirements

  • Python 3.9, 3.10, or 3.11
  • CUDA 11.8+ (for GPU acceleration)
  • cuDNN 8.6+
  • Git

Installation

Step 1: Clone Repository

git clone https://github.com/yourusername/ttv-1b.git
cd ttv-1b

Step 2: Create Virtual Environment

# Using venv
python3 -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate  # Windows

# Using conda (alternative)
conda create -n ttv1b python=3.10
conda activate ttv1b

Step 3: Install PyTorch

Choose the appropriate command for your system from https://pytorch.org/get-started/locally/

# CUDA 11.8 (most common)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# CUDA 12.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# CPU only (not recommended)
pip install torch torchvision

Step 4: Install Dependencies

pip install -r requirements.txt

Step 5: Verify Installation

python -c "import torch; print(f'PyTorch {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"

Expected output (your versions may differ):

PyTorch 2.1.0
CUDA available: True

Quick Start

Test the Model

# Run evaluation script to verify everything works
python evaluate.py

This will:

  • Create the model
  • Count parameters (should be ~1.0B)
  • Test forward/backward passes
  • Measure inference speed
  • Check memory usage

Generate Your First Video (After Training)

python inference.py \
    --prompt "A beautiful sunset over mountains" \
    --checkpoint checkpoints/checkpoint_best.pt \
    --output my_first_video.mp4 \
    --steps 50

Preparing Data

Data Format

The model expects video-text pairs in the following format:

data/
├── videos/
│   ├── video_0001.mp4
│   ├── video_0002.mp4
│   └── ...
└── annotations.json

annotations.json:

{
  "video_0001": {
    "caption": "A cat playing with a ball of yarn",
    "duration": 2.0,
    "fps": 8
  },
  "video_0002": {
    "caption": "Sunset over the ocean with waves",
    "duration": 2.0,
    "fps": 8
  }
}
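
Before training, it can help to sanity-check the annotations file against the schema shown above. This is a minimal sketch (not part of the repository); it only checks for the `caption`, `duration`, and `fps` fields used in the example.

```python
import json

REQUIRED_KEYS = {"caption", "duration", "fps"}

def validate_annotations(path):
    """Return a list of problems; an empty list means the file is well-formed."""
    with open(path) as f:
        annotations = json.load(f)
    problems = []
    for video_id, entry in annotations.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"{video_id}: missing {sorted(missing)}")
    return problems
```

Run it on `data/annotations.json` and fix any reported entries before starting a training run.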

Video Specifications

  • Format: MP4, AVI, or MOV
  • Resolution: 256×256 (will be resized)
  • Frame rate: 8 FPS recommended
  • Duration: 2 seconds (16 frames at 8 FPS)
  • Codec: H.264 recommended

Converting Videos

# Using FFmpeg to convert videos
ffmpeg -i input.mp4 -vf "scale=256:256,fps=8" -t 2 -c:v libx264 output.mp4
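
For a whole directory of clips, the same FFmpeg settings can be applied in a loop. The sketch below is an illustration (it assumes `ffmpeg` is on your PATH and mirrors the one-liner above); the function names are ours, not part of the repository.

```python
import subprocess
from pathlib import Path

def ffmpeg_command(src, dst):
    """Build the same command as the one-liner above: 256x256, 8 FPS, 2 s, H.264."""
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-vf", "scale=256:256,fps=8",
        "-t", "2", "-c:v", "libx264",
        str(dst),
    ]

def convert_directory(src_dir, dst_dir):
    """Convert every .mp4 in src_dir, writing results into dst_dir."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.mp4")):
        subprocess.run(ffmpeg_command(src, dst_dir / src.name), check=True)
```

Note that `scale=256:256` forces a square output and will distort non-square sources; pad or crop first if aspect ratio matters for your data.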

Dataset Preparation Script

import json
from pathlib import Path

def create_annotations(video_dir, output_file):
    """Create annotations file from videos"""
    video_dir = Path(video_dir)
    annotations = {}
    
    for video_path in sorted(video_dir.glob("*.mp4")):  # sorted for deterministic IDs
        video_id = video_path.stem
        annotations[video_id] = {
            "caption": f"Video {video_id}",  # Add actual captions
            "duration": 2.0,
            "fps": 8
        }
    
    with open(output_file, 'w') as f:
        json.dump(annotations, f, indent=2)

# Usage
create_annotations("data/videos", "data/annotations.json")
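
Once videos and annotations exist, a quick consistency check catches clips without captions and captions without clips. This helper is a sketch of ours, assuming the `data/videos` + `annotations.json` layout described above.

```python
import json
from pathlib import Path

def check_dataset(video_dir, annotations_file):
    """Return (video IDs missing annotations, annotated IDs missing videos)."""
    video_ids = {p.stem for p in Path(video_dir).glob("*.mp4")}
    with open(annotations_file) as f:
        annotated_ids = set(json.load(f).keys())
    return sorted(video_ids - annotated_ids), sorted(annotated_ids - video_ids)
```

Both returned lists should be empty before you start training.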

Training

Single GPU Training

python train.py

Configuration in train.py:

config = {
    'batch_size': 2,
    'gradient_accumulation_steps': 8,  # Effective batch size = 16
    'learning_rate': 1e-4,
    'num_epochs': 100,
    'mixed_precision': True,
}
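
The `gradient_accumulation_steps` setting multiplies the effective batch size: gradients from several small batches are summed before a single optimizer step. A minimal sketch of the idea (with a toy model standing in for TTV-1B; the real loop lives in `train.py`):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model and data pipeline
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8  # gradient_accumulation_steps from the config above

optimizer.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(2, 8)            # batch_size = 2
    loss = model(x).pow(2).mean()
    (loss / accum_steps).backward()  # scale so gradients average over all 16 samples
optimizer.step()                     # one update for an effective batch of 2 * 8 = 16
optimizer.zero_grad()
```

Dividing the loss by `accum_steps` keeps gradient magnitudes comparable to a true batch of 16, so the learning rate does not need to change when you trade batch size for accumulation steps.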

Multi-GPU Training (Recommended)

# Using PyTorch DDP
torchrun --nproc_per_node=8 train.py

# Or using Hugging Face accelerate (handles multi-GPU configuration for you)
accelerate config  # First time setup
accelerate launch train.py

Monitoring Training

# Install tensorboard
pip install tensorboard

# Run tensorboard
tensorboard --logdir=./checkpoints/logs

Resume from Checkpoint

# In train.py, add:
trainer.load_checkpoint('checkpoints/checkpoint_step_10000.pt')
trainer.train()

Inference

Basic Inference

from inference import generate_video_from_prompt

video = generate_video_from_prompt(
    prompt="A serene lake with mountains",
    checkpoint_path="checkpoints/best.pt",
    output_path="output.mp4",
    num_steps=50,
    guidance_scale=7.5,
    seed=42  # For reproducibility
)

Batch Inference

from inference import batch_generate

prompts = [
    "A cat playing",
    "Ocean waves",
    "City at night"
]

batch_generate(
    prompts=prompts,
    checkpoint_path="checkpoints/best.pt",
    output_dir="./outputs",
    num_steps=50
)

Advanced Options

# Lower guidance for more creative results
video = generate_video_from_prompt(
    prompt="Abstract art in motion",
    guidance_scale=5.0,  # Lower = more creative
    num_steps=100,        # More steps = higher quality
)

# Fast generation (fewer steps)
video = generate_video_from_prompt(
    prompt="Quick test",
    num_steps=20,  # Faster but lower quality
)

Optimization Tips

Memory Optimization

  1. Reduce batch size

     config['batch_size'] = 1  # Minimum
     config['gradient_accumulation_steps'] = 16  # Maintain effective batch size

  2. Enable gradient checkpointing

     config['gradient_checkpointing'] = True

  3. Use mixed precision

     config['mixed_precision'] = True  # Always recommended

Speed Optimization

  1. Use torch.compile (PyTorch 2.0+)

     model = torch.compile(model)

  2. Enable cuDNN benchmarking

     torch.backends.cudnn.benchmark = True

  3. Pin memory in the DataLoader

     DataLoader(..., pin_memory=True)
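
The three tips above combine into a short setup snippet. This is an illustrative sketch (a toy dataset stands in for the video pipeline; on a CPU-only machine `pin_memory` has no effect and only emits a warning):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.backends.cudnn.benchmark = True  # let cuDNN pick the fastest kernels

dataset = TensorDataset(torch.randn(64, 8))  # stand-in for the video dataset
loader = DataLoader(dataset, batch_size=2, num_workers=0, pin_memory=True)

model = torch.nn.Linear(8, 1)
if hasattr(torch, "compile"):  # only on PyTorch 2.0+
    model = torch.compile(model)
```

`cudnn.benchmark` helps most when input shapes are fixed (as with 16-frame 256×256 clips); with varying shapes it can re-tune kernels and slow things down.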

Troubleshooting

CUDA Out of Memory

# Reduce batch size
config['batch_size'] = 1

# Enable gradient checkpointing
config['gradient_checkpointing'] = True

# Clear cache
torch.cuda.empty_cache()

Slow Training

# Check GPU utilization
nvidia-smi

# Increase num_workers
DataLoader(..., num_workers=8)

# Enable mixed precision
config['mixed_precision'] = True

NaN Loss

# Reduce learning rate
config['learning_rate'] = 5e-5

# Enable gradient clipping (already included)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

# Check for NaN in data
assert not torch.isnan(videos).any()

Model Not Learning

# Increase learning rate
config['learning_rate'] = 2e-4

# Check data quality
# Verify annotations are correct
# Ensure videos are properly normalized

# Reduce regularization
config['weight_decay'] = 0.001  # Lower weight decay

Performance Benchmarks

Training Speed (A100 80GB)

| Batch Size | Grad Accum | Eff. Batch | Sec/Batch | Hours/100K steps |
|-----------:|-----------:|-----------:|----------:|-----------------:|
| 1          | 16         | 16         | 2.5       | 69               |
| 2          | 8          | 16         | 2.5       | 69               |
| 4          | 4          | 16         | 2.7       | 75               |

Inference Speed

| GPU       | FP16 | Steps | Time/Video |
|-----------|------|------:|-----------:|
| A100 80GB | Yes  | 50    | 15s        |
| RTX 4090  | Yes  | 50    | 25s        |
| RTX 3090  | Yes  | 50    | 35s        |

Memory Usage

| Operation | Batch Size | Memory (GB) |
|-----------|-----------:|------------:|
| Inference | 1          | 6           |
| Training  | 1          | 12          |
| Training  | 2          | 24          |
| Training  | 4          | 48          |

Next Steps

  1. Prepare your dataset - Collect and annotate videos
  2. Start training - Begin with small dataset to verify
  3. Monitor progress - Check loss, sample generations
  4. Fine-tune - Adjust hyperparameters based on results
  5. Evaluate - Test on held-out validation set
  6. Deploy - Use for inference on new prompts

Getting Help

  • GitHub Issues: Report bugs and ask questions
  • Documentation: Check README.md and ARCHITECTURE.md
  • Examples: See example scripts in the repository

Additional Resources