# TTV-1B Setup Guide

Complete installation and setup instructions for the TTV-1B text-to-video model.

## Prerequisites

### Hardware Requirements

#### Minimum (Inference Only)
- GPU: 8GB VRAM (RTX 3070, RTX 4060 Ti)
- RAM: 16GB
- Storage: 50GB
- OS: Ubuntu 20.04+, Windows 10+, macOS 12+

#### Recommended (Training)
- GPU: 24GB+ VRAM (RTX 4090, A5000, A100)
- RAM: 64GB
- Storage: 500GB SSD
- OS: Ubuntu 22.04 LTS

#### Production (Full Training)
- GPU: 8× A100 80GB
- RAM: 512GB
- Storage: 2TB NVMe SSD
- Network: High-speed interconnect for multi-GPU

### Software Requirements
- Python 3.9, 3.10, or 3.11
- CUDA 11.8+ (for GPU acceleration)
- cuDNN 8.6+
- Git

## Installation

### Step 1: Clone Repository

```bash
git clone https://github.com/yourusername/ttv-1b.git
cd ttv-1b
```
### Step 2: Create Virtual Environment

```bash
# Using venv
python3 -m venv venv
source venv/bin/activate   # Linux/macOS
# or
venv\Scripts\activate      # Windows

# Using conda (alternative)
conda create -n ttv1b python=3.10
conda activate ttv1b
```

### Step 3: Install PyTorch

Choose the appropriate command for your system from https://pytorch.org/get-started/locally/

```bash
# CUDA 11.8 (most common)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# CUDA 12.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# CPU only (not recommended)
pip install torch torchvision
```

### Step 4: Install Dependencies

```bash
pip install -r requirements.txt
```

### Step 5: Verify Installation

```bash
python -c "import torch; print(f'PyTorch {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"
```
Expected output (your PyTorch version may differ):

```
PyTorch 2.1.0
CUDA available: True
```
## Quick Start

### Test the Model

```bash
# Run the evaluation script to verify everything works
python evaluate.py
```

This will:
- Create the model
- Count parameters (should be ~1.0B)
- Test forward/backward passes
- Measure inference speed
- Check memory usage

### Generate Your First Video (After Training)

```bash
python inference.py \
    --prompt "A beautiful sunset over mountains" \
    --checkpoint checkpoints/checkpoint_best.pt \
    --output my_first_video.mp4 \
    --steps 50
```

## Preparing Data

### Data Format

The model expects video-text pairs in the following layout:

```
data/
├── videos/
│   ├── video_0001.mp4
│   ├── video_0002.mp4
│   └── ...
└── annotations.json
```
annotations.json:

```json
{
  "video_0001": {
    "caption": "A cat playing with a ball of yarn",
    "duration": 2.0,
    "fps": 8
  },
  "video_0002": {
    "caption": "Sunset over the ocean with waves",
    "duration": 2.0,
    "fps": 8
  }
}
```
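Before training, it is worth checking the annotations file against this schema, since a missing caption or key will only surface mid-training. A minimal validation sketch using only the standard library (the helper name is our own, not part of the repository):

```python
import json

REQUIRED_KEYS = {"caption", "duration", "fps"}

def validate_annotations(path):
    """Return a list of problems found in an annotations.json file."""
    with open(path) as f:
        annotations = json.load(f)
    errors = []
    for video_id, entry in annotations.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            errors.append(f"{video_id}: missing keys {sorted(missing)}")
        elif not isinstance(entry["caption"], str) or not entry["caption"].strip():
            errors.append(f"{video_id}: empty caption")
    return errors

# Usage:
# for problem in validate_annotations("data/annotations.json"):
#     print(problem)
```

An empty return value means every entry has a non-empty caption plus duration and fps fields.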
### Video Specifications
- Format: MP4, AVI, or MOV
- Resolution: 256×256 (inputs will be resized)
- Frame rate: 8 FPS recommended
- Duration: 2 seconds (16 frames at 8 FPS)
- Codec: H.264 recommended

### Converting Videos

```bash
# Using FFmpeg to rescale, resample, and trim a clip
ffmpeg -i input.mp4 -vf "scale=256:256,fps=8" -t 2 -c:v libx264 output.mp4
```
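For a whole directory, the same FFmpeg invocation can be scripted. A minimal sketch using only the standard library (the helper names and output naming scheme are our own; the call that actually runs FFmpeg is left commented out so the function prints commands for review first):

```python
import subprocess  # needed if you uncomment the run call below
from pathlib import Path

def ffmpeg_command(src, dst, size=256, fps=8, seconds=2):
    """Build the FFmpeg argument list matching the command above."""
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-vf", f"scale={size}:{size},fps={fps}",
        "-t", str(seconds),
        "-c:v", "libx264",
        str(dst),
    ]

def convert_directory(src_dir, dst_dir):
    """Print (or run) one conversion command per .mp4 in src_dir."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.mp4")):
        cmd = ffmpeg_command(src, dst_dir / src.name)
        # subprocess.run(cmd, check=True)  # uncomment to execute
        print(" ".join(cmd))
```

The `-y` flag overwrites existing outputs; drop it if you want FFmpeg to prompt instead.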
### Dataset Preparation Script

```python
import json
from pathlib import Path

def create_annotations(video_dir, output_file):
    """Create an annotations file from a directory of videos."""
    video_dir = Path(video_dir)
    annotations = {}
    for video_path in sorted(video_dir.glob("*.mp4")):
        video_id = video_path.stem
        annotations[video_id] = {
            "caption": f"Video {video_id}",  # Placeholder -- replace with real captions
            "duration": 2.0,
            "fps": 8,
        }
    with open(output_file, "w") as f:
        json.dump(annotations, f, indent=2)

# Usage
create_annotations("data/videos", "data/annotations.json")
```
## Training

### Single GPU Training

```bash
python train.py
```

Configuration in train.py:

```python
config = {
    'batch_size': 2,
    'gradient_accumulation_steps': 8,  # Effective batch size = 2 * 8 = 16
    'learning_rate': 1e-4,
    'num_epochs': 100,
    'mixed_precision': True,
}
```
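The effective batch size is what the optimizer actually sees per step, and it grows again under data-parallel training. A one-line sketch of the arithmetic (the helper name is our own):

```python
def effective_batch_size(batch_size, grad_accum_steps, num_gpus=1):
    """Samples contributing to each optimizer step under DDP-style data parallelism."""
    return batch_size * grad_accum_steps * num_gpus
```

With the config above on a single GPU this gives 16; keeping it constant is why `batch_size` and `gradient_accumulation_steps` are traded against each other in the memory tips below.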
### Multi-GPU Training (Recommended)

```bash
# Using PyTorch DDP
torchrun --nproc_per_node=8 train.py

# Or using Hugging Face accelerate, which handles launch configuration for you
accelerate config  # First-time setup
accelerate launch train.py
```
### Monitoring Training

```bash
# Install TensorBoard
pip install tensorboard

# Run TensorBoard
tensorboard --logdir=./checkpoints/logs
```

### Resume from Checkpoint

```python
# In train.py, add:
trainer.load_checkpoint('checkpoints/checkpoint_step_10000.pt')
trainer.train()
```
## Inference

### Basic Inference

```python
from inference import generate_video_from_prompt

video = generate_video_from_prompt(
    prompt="A serene lake with mountains",
    checkpoint_path="checkpoints/best.pt",
    output_path="output.mp4",
    num_steps=50,
    guidance_scale=7.5,
    seed=42,  # For reproducibility
)
```

### Batch Inference

```python
from inference import batch_generate

prompts = [
    "A cat playing",
    "Ocean waves",
    "City at night",
]

batch_generate(
    prompts=prompts,
    checkpoint_path="checkpoints/best.pt",
    output_dir="./outputs",
    num_steps=50,
)
```

### Advanced Options

```python
# Lower guidance for more creative results
video = generate_video_from_prompt(
    prompt="Abstract art in motion",
    guidance_scale=5.0,  # Lower = more creative
    num_steps=100,       # More steps = higher quality
)

# Fast generation (fewer steps)
video = generate_video_from_prompt(
    prompt="Quick test",
    num_steps=20,  # Faster but lower quality
)
```
## Optimization Tips

### Memory Optimization

1. **Reduce Batch Size**
   ```python
   config['batch_size'] = 1  # Minimum
   config['gradient_accumulation_steps'] = 16  # Maintain effective batch size
   ```
2. **Enable Gradient Checkpointing**
   ```python
   config['gradient_checkpointing'] = True
   ```
3. **Use Mixed Precision**
   ```python
   config['mixed_precision'] = True  # Always recommended
   ```

### Speed Optimization

1. **Use Torch Compile** (PyTorch 2.0+)
   ```python
   model = torch.compile(model)
   ```
2. **Enable cuDNN Benchmarking**
   ```python
   torch.backends.cudnn.benchmark = True
   ```
3. **Pin Memory**
   ```python
   DataLoader(..., pin_memory=True)
   ```
## Troubleshooting

### CUDA Out of Memory

```python
# Reduce batch size
config['batch_size'] = 1

# Enable gradient checkpointing
config['gradient_checkpointing'] = True

# Clear the CUDA cache between runs
torch.cuda.empty_cache()
```

### Slow Training

Check GPU utilization first:

```bash
nvidia-smi
```

If the GPU is underutilized, try:

```python
# Increase DataLoader workers
DataLoader(..., num_workers=8)

# Enable mixed precision
config['mixed_precision'] = True
```
### NaN Loss

```python
# Reduce learning rate
config['learning_rate'] = 5e-5

# Enable gradient clipping (already included in the training loop)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

# Check for NaN in the data
assert not torch.isnan(videos).any()
```

### Model Not Learning

```python
# Increase learning rate
config['learning_rate'] = 2e-4

# Check data quality:
# - Verify annotations are correct
# - Ensure videos are properly normalized

# Reduce regularization
config['weight_decay'] = 0.001  # Lower weight decay
```
## Performance Benchmarks

### Training Speed (A100 80GB)

| Batch Size | Grad Accum | Eff. Batch | Sec/Batch | Hours/100K Steps |
|------------|------------|------------|-----------|------------------|
| 1          | 16         | 16         | 2.5       | 69               |
| 2          | 8          | 16         | 2.5       | 69               |
| 4          | 4          | 16         | 2.7       | 75               |

### Inference Speed

| GPU       | FP16 | Steps | Time/Video |
|-----------|------|-------|------------|
| A100 80GB | Yes  | 50    | 15s        |
| RTX 4090  | Yes  | 50    | 25s        |
| RTX 3090  | Yes  | 50    | 35s        |

### Memory Usage

| Operation | Batch Size | Memory (GB) |
|-----------|------------|-------------|
| Inference | 1          | 6           |
| Training  | 1          | 12          |
| Training  | 2          | 24          |
| Training  | 4          | 48          |
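The training rows above scale linearly at roughly 12 GB per sample, so a quick back-of-the-envelope check tells you whether a batch size will fit on your card. A sketch of that extrapolation (an estimate from the table, not a measurement; the helper name is our own):

```python
GB_PER_TRAINING_SAMPLE = 12  # from the Memory Usage table above

def fits_in_vram(batch_size, vram_gb):
    """Rough check: does a training batch fit, assuming linear scaling?"""
    return batch_size * GB_PER_TRAINING_SAMPLE <= vram_gb
```

By this estimate a 24GB card tops out at batch size 2, which is consistent with the recommended single-GPU config.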
## Next Steps

1. **Prepare your dataset** - Collect and annotate videos
2. **Start training** - Begin with a small dataset to verify the pipeline
3. **Monitor progress** - Check loss curves and sample generations
4. **Fine-tune** - Adjust hyperparameters based on results
5. **Evaluate** - Test on a held-out validation set
6. **Deploy** - Use for inference on new prompts

## Getting Help

- GitHub Issues: Report bugs and ask questions
- Documentation: Check README.md and ARCHITECTURE.md
- Examples: See the example scripts in the repository

## Additional Resources

- [PyTorch Documentation](https://pytorch.org/docs/)
- [Diffusion Models Explained](https://lilianweng.github.io/posts/2021-07-11-diffusion-models/)
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [DiT Paper](https://arxiv.org/abs/2212.09748)