# TTV-1B Setup Guide

Complete installation and setup instructions for the TTV-1B text-to-video model.

## Prerequisites

### Hardware Requirements

#### Minimum (Inference Only)

- GPU: 8GB VRAM (RTX 3070, RTX 4060 Ti)
- RAM: 16GB
- Storage: 50GB
- OS: Ubuntu 20.04+, Windows 10+, macOS 12+

#### Recommended (Training)

- GPU: 24GB+ VRAM (RTX 4090, A5000, A100)
- RAM: 64GB
- Storage: 500GB SSD
- OS: Ubuntu 22.04 LTS

#### Production (Full Training)

- GPU: 8× A100 80GB
- RAM: 512GB
- Storage: 2TB NVMe SSD
- Network: High-speed interconnect for multi-GPU

### Software Requirements

- Python 3.9, 3.10, or 3.11
- CUDA 11.8+ (for GPU acceleration)
- cuDNN 8.6+
- Git

## Installation

### Step 1: Clone Repository

```bash
git clone https://github.com/yourusername/ttv-1b.git
cd ttv-1b
```

### Step 2: Create Virtual Environment

```bash
# Using venv
python3 -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate     # Windows

# Using conda (alternative)
conda create -n ttv1b python=3.10
conda activate ttv1b
```

### Step 3: Install PyTorch

Choose the appropriate command for your system from https://pytorch.org/get-started/locally/

```bash
# CUDA 11.8 (most common)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# CUDA 12.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# CPU only (not recommended)
pip install torch torchvision
```

### Step 4: Install Dependencies

```bash
pip install -r requirements.txt
```

### Step 5: Verify Installation

```bash
python -c "import torch; print(f'PyTorch {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"
```

Expected output (your PyTorch version may differ):

```
PyTorch 2.1.0
CUDA available: True
```

## Quick Start

### Test the Model

```bash
# Run evaluation script to verify everything works
python evaluate.py
```

This will:

- Create the model
- Count parameters (should be ~1.0B)
- Test forward/backward passes
- Measure inference speed
- Check memory usage

### Generate Your First Video (After Training)

```bash
python inference.py \
    --prompt "A beautiful sunset over mountains" \
    --checkpoint checkpoints/checkpoint_best.pt \
    --output my_first_video.mp4 \
    --steps 50
```

## Preparing Data

### Data Format

The model expects video-text pairs in the following format:

```
data/
├── videos/
│   ├── video_0001.mp4
│   ├── video_0002.mp4
│   └── ...
└── annotations.json
```

annotations.json:

```json
{
  "video_0001": {
    "caption": "A cat playing with a ball of yarn",
    "duration": 2.0,
    "fps": 8
  },
  "video_0002": {
    "caption": "Sunset over the ocean with waves",
    "duration": 2.0,
    "fps": 8
  }
}
```

### Video Specifications

- Format: MP4, AVI, or MOV
- Resolution: 256×256 (will be resized)
- Frame rate: 8 FPS recommended
- Duration: 2 seconds (16 frames at 8 FPS)
- Codec: H.264 recommended

### Converting Videos

```bash
# Using FFmpeg to convert videos
ffmpeg -i input.mp4 -vf "scale=256:256,fps=8" -t 2 -c:v libx264 output.mp4
```

### Dataset Preparation Script

```python
import json
from pathlib import Path

def create_annotations(video_dir, output_file):
    """Create annotations file from videos"""
    video_dir = Path(video_dir)
    annotations = {}

    for video_path in video_dir.glob("*.mp4"):
        video_id = video_path.stem
        annotations[video_id] = {
            "caption": f"Video {video_id}",  # Add actual captions
            "duration": 2.0,
            "fps": 8
        }

    with open(output_file, 'w') as f:
        json.dump(annotations, f, indent=2)

# Usage
create_annotations("data/videos", "data/annotations.json")
```

## Training

### Single GPU Training

```bash
python train.py
```

Configuration in train.py:

```python
config = {
    'batch_size': 2,
    'gradient_accumulation_steps': 8,  # Effective batch size = 16
    'learning_rate': 1e-4,
    'num_epochs': 100,
    'mixed_precision': True,
}
```

### Multi-GPU Training (Recommended)

```bash
# Using PyTorch DDP
torchrun --nproc_per_node=8 train.py

# Or using accelerate (recommended)
accelerate config  # First-time setup
accelerate launch train.py
```

### Monitoring Training

```bash
# Install tensorboard
pip install tensorboard

# Run tensorboard
tensorboard --logdir=./checkpoints/logs
```

### Resume from Checkpoint

```python
# In train.py, add:
trainer.load_checkpoint('checkpoints/checkpoint_step_10000.pt')
trainer.train()
```

## Inference

### Basic Inference

```python
from inference import generate_video_from_prompt

video = generate_video_from_prompt(
    prompt="A serene lake with mountains",
    checkpoint_path="checkpoints/best.pt",
    output_path="output.mp4",
    num_steps=50,
    guidance_scale=7.5,
    seed=42  # For reproducibility
)
```

### Batch Inference

```python
from inference import batch_generate

prompts = [
    "A cat playing",
    "Ocean waves",
    "City at night"
]

batch_generate(
    prompts=prompts,
    checkpoint_path="checkpoints/best.pt",
    output_dir="./outputs",
    num_steps=50
)
```

### Advanced Options

```python
# Lower guidance for more creative results
video = generate_video_from_prompt(
    prompt="Abstract art in motion",
    guidance_scale=5.0,  # Lower = more creative
    num_steps=100,       # More steps = higher quality
)

# Fast generation (fewer steps)
video = generate_video_from_prompt(
    prompt="Quick test",
    num_steps=20,  # Faster but lower quality
)
```

## Optimization Tips

### Memory Optimization

1. **Reduce Batch Size**

   ```python
   config['batch_size'] = 1  # Minimum
   config['gradient_accumulation_steps'] = 16  # Maintain effective batch size
   ```

2. **Enable Gradient Checkpointing**

   ```python
   config['gradient_checkpointing'] = True
   ```

3. **Use Mixed Precision**

   ```python
   config['mixed_precision'] = True  # Always recommended
   ```

### Speed Optimization

1. **Use Torch Compile** (PyTorch 2.0+)

   ```python
   model = torch.compile(model)
   ```

2. **Enable cuDNN Benchmarking**

   ```python
   torch.backends.cudnn.benchmark = True
   ```

3. **Pin Memory**

   ```python
   DataLoader(..., pin_memory=True)
   ```

## Troubleshooting

### CUDA Out of Memory

```python
# Reduce batch size
config['batch_size'] = 1

# Enable gradient checkpointing
config['gradient_checkpointing'] = True

# Clear cache
torch.cuda.empty_cache()
```

### Slow Training

```bash
# Check GPU utilization
nvidia-smi
```

```python
# Increase num_workers
DataLoader(..., num_workers=8)

# Enable mixed precision
config['mixed_precision'] = True
```

### NaN Loss

```python
# Reduce learning rate
config['learning_rate'] = 5e-5

# Enable gradient clipping (already included)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

# Check for NaN in data
assert not torch.isnan(videos).any()
```

### Model Not Learning

```python
# Increase learning rate
config['learning_rate'] = 2e-4

# Check data quality:
# - Verify annotations are correct
# - Ensure videos are properly normalized

# Reduce regularization
config['weight_decay'] = 0.001  # Lower weight decay
```

## Performance Benchmarks

### Training Speed (A100 80GB)

| Batch Size | Grad Accum | Eff. Batch | Sec/Batch | Hours/100K Steps |
|------------|------------|------------|-----------|------------------|
| 1          | 16         | 16         | 2.5       | 69               |
| 2          | 8          | 16         | 2.5       | 69               |
| 4          | 4          | 16         | 2.7       | 75               |

### Inference Speed

| GPU       | FP16 | Steps | Time/Video |
|-----------|------|-------|------------|
| A100 80GB | Yes  | 50    | 15s        |
| RTX 4090  | Yes  | 50    | 25s        |
| RTX 3090  | Yes  | 50    | 35s        |

### Memory Usage

| Operation | Batch Size | Memory (GB) |
|-----------|------------|-------------|
| Inference | 1          | 6           |
| Training  | 1          | 12          |
| Training  | 2          | 24          |
| Training  | 4          | 48          |

## Next Steps

1. **Prepare your dataset** - Collect and annotate videos
2. **Start training** - Begin with a small dataset to verify the pipeline
3. **Monitor progress** - Check loss curves and sample generations
4. **Fine-tune** - Adjust hyperparameters based on results
5. **Evaluate** - Test on a held-out validation set
6. **Deploy** - Use for inference on new prompts

## Getting Help

- GitHub Issues: Report bugs and ask questions
- Documentation: Check README.md and ARCHITECTURE.md
- Examples: See example scripts in the repository

## Additional Resources

- [PyTorch Documentation](https://pytorch.org/docs/)
- [Diffusion Models Explained](https://lilianweng.github.io/posts/2021-07-11-diffusion-models/)
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [DiT Paper](https://arxiv.org/abs/2212.09748)
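As a companion to the dataset preparation script in the Preparing Data section, a quick validation pass can catch video/annotation mismatches before a long training run starts. This is a sketch, not part of the repository: `validate_annotations` and the exact field checks are illustrative, assuming the `videos/` + `annotations.json` layout described above.

```python
import json
from pathlib import Path

def validate_annotations(video_dir, annotations_file):
    """Cross-check videos against annotations.json.

    Returns:
      missing_captions: video files with no annotation entry
      orphan_entries:   annotation entries with no video file
      bad_fields:       entries missing caption/duration/fps
    """
    video_ids = {p.stem for p in Path(video_dir).glob("*.mp4")}
    with open(annotations_file) as f:
        annotations = json.load(f)

    missing_captions = video_ids - annotations.keys()
    orphan_entries = annotations.keys() - video_ids
    bad_fields = [
        vid for vid, entry in annotations.items()
        if not all(k in entry for k in ("caption", "duration", "fps"))
    ]
    return missing_captions, orphan_entries, bad_fields
```

Run it on `data/videos` and `data/annotations.json` before training; any non-empty result is worth fixing first.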
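The single-file FFmpeg command in the Converting Videos section extends naturally to a whole directory. The helper below only *builds* the argument lists (mirroring the guide's scale/fps/trim/codec settings), so it can be inspected or tested without FFmpeg installed; the function name and output layout are hypothetical, not repository API.

```python
from pathlib import Path

def build_ffmpeg_commands(input_dir, output_dir, size=256, fps=8, seconds=2):
    """Build one FFmpeg argv per input video, using the settings
    recommended in this guide (256x256, 8 FPS, 2 s, H.264)."""
    out = Path(output_dir)
    commands = []
    for src in sorted(Path(input_dir).glob("*.mp4")):
        commands.append([
            "ffmpeg", "-i", str(src),
            "-vf", f"scale={size}:{size},fps={fps}",
            "-t", str(seconds),
            "-c:v", "libx264",
            str(out / src.name),
        ])
    return commands
```

To actually convert, run each list with `subprocess.run(cmd, check=True)` after creating the output directory.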
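On the `batch_size` / `gradient_accumulation_steps` relationship used throughout the Training and Optimization Tips sections: the optimizer steps once every `gradient_accumulation_steps` micro-batches, so the effective batch size is `batch_size * gradient_accumulation_steps` (2 × 8 = 16 in the default config). A framework-free sketch of that schedule (the function name is illustrative, not from `train.py`):

```python
def accumulation_schedule(num_batches, accumulation_steps):
    """Yield (batch_index, do_optimizer_step) pairs: gradients are
    accumulated and the optimizer steps every `accumulation_steps`
    micro-batches."""
    for i in range(num_batches):
        yield i, (i + 1) % accumulation_steps == 0
```

This is why halving `batch_size` while doubling `gradient_accumulation_steps` (as in the memory tips) keeps the effective batch size, and therefore the optimization dynamics, roughly unchanged while lowering peak memory.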