TTV-1B Setup Guide
Complete installation and setup instructions for the TTV-1B text-to-video model.
Prerequisites
Hardware Requirements
Minimum (Inference Only)
- GPU: 8GB VRAM (RTX 3070, RTX 4060 Ti)
- RAM: 16GB
- Storage: 50GB
- OS: Ubuntu 20.04+, Windows 10+, macOS 12+
Recommended (Training)
- GPU: 24GB+ VRAM (RTX 4090, A5000, A100)
- RAM: 64GB
- Storage: 500GB SSD
- OS: Ubuntu 22.04 LTS
Production (Full Training)
- GPU: 8× A100 80GB
- RAM: 512GB
- Storage: 2TB NVMe SSD
- Network: High-speed interconnect for multi-GPU
Software Requirements
- Python 3.9, 3.10, or 3.11
- CUDA 11.8+ (for GPU acceleration)
- cuDNN 8.6+
- Git
Installation
Step 1: Clone Repository
git clone https://github.com/yourusername/ttv-1b.git
cd ttv-1b
Step 2: Create Virtual Environment
# Using venv
python3 -m venv venv
source venv/bin/activate # Linux/Mac
# or
venv\Scripts\activate # Windows
# Using conda (alternative)
conda create -n ttv1b python=3.10
conda activate ttv1b
Step 3: Install PyTorch
Choose the appropriate command for your system from https://pytorch.org/get-started/locally/
# CUDA 11.8 (most common)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# CPU only (not recommended)
pip install torch torchvision
Step 4: Install Dependencies
pip install -r requirements.txt
Step 5: Verify Installation
python -c "import torch; print(f'PyTorch {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"
Expected output:
PyTorch 2.1.0
CUDA available: True
Quick Start
Test the Model
# Run evaluation script to verify everything works
python evaluate.py
This will:
- Create the model
- Count parameters (should be ~1.0B)
- Test forward/backward passes
- Measure inference speed
- Check memory usage
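The ~1.0B figure is just the sum of all parameter tensor sizes. A stdlib sketch of that arithmetic (the shapes below are made up; the real check in evaluate.py would amount to `sum(p.numel() for p in model.parameters())` on the actual model):

```python
from math import prod

# Hypothetical parameter shapes for three layers; the real model exposes
# its shapes via model.parameters().
layer_shapes = [(4096, 4096), (4096,), (16384, 4096)]

# Total parameter count = sum over layers of the product of each shape
total = sum(prod(shape) for shape in layer_shapes)
print(total)  # 83890176
```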
Generate Your First Video (After Training)
python inference.py \
--prompt "A beautiful sunset over mountains" \
--checkpoint checkpoints/checkpoint_best.pt \
--output my_first_video.mp4 \
--steps 50
Preparing Data
Data Format
The model expects video-text pairs in the following format:
data/
├── videos/
│   ├── video_0001.mp4
│   ├── video_0002.mp4
│   └── ...
└── annotations.json
annotations.json:
{
  "video_0001": {
    "caption": "A cat playing with a ball of yarn",
    "duration": 2.0,
    "fps": 8
  },
  "video_0002": {
    "caption": "Sunset over the ocean with waves",
    "duration": 2.0,
    "fps": 8
  }
}
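Malformed annotations are a common source of silent training failures, so it is worth validating the file before training. A minimal checker for the schema above (hypothetical helper, not part of the repository):

```python
import json

REQUIRED_KEYS = {"caption", "duration", "fps"}

def validate_annotations(annotations):
    """Return a list of problems found in an annotations dict."""
    problems = []
    for video_id, meta in annotations.items():
        missing = REQUIRED_KEYS - meta.keys()
        if missing:
            problems.append(f"{video_id}: missing {sorted(missing)}")
        elif meta["duration"] <= 0 or meta["fps"] <= 0:
            problems.append(f"{video_id}: non-positive duration or fps")
    return problems

# Usage against the file above:
# with open("data/annotations.json") as f:
#     assert validate_annotations(json.load(f)) == []
```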
Video Specifications
- Format: MP4, AVI, or MOV
- Resolution: 256×256 (will be resized)
- Frame rate: 8 FPS recommended
- Duration: 2 seconds (16 frames at 8 FPS)
- Codec: H.264 recommended
Converting Videos
# Using FFmpeg to convert videos
ffmpeg -i input.mp4 -vf "scale=256:256,fps=8" -t 2 -c:v libx264 output.mp4
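For whole directories, the same command can be driven from Python. A small sketch that rebuilds the argument list above (the `raw_videos/` directory name is an assumption; batch conversion requires ffmpeg on PATH):

```python
def ffmpeg_command(src, dst, size=256, fps=8, seconds=2):
    """Build the ffmpeg argument list for the conversion shown above."""
    return [
        "ffmpeg", "-i", str(src),
        "-vf", f"scale={size}:{size},fps={fps}",
        "-t", str(seconds),
        "-c:v", "libx264",
        str(dst),
    ]

print(" ".join(ffmpeg_command("input.mp4", "output.mp4")))

# Batch conversion over a directory:
# import subprocess
# from pathlib import Path
# for src in Path("raw_videos").glob("*.mp4"):
#     subprocess.run(ffmpeg_command(src, Path("data/videos") / src.name), check=True)
```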
Dataset Preparation Script
import json
from pathlib import Path

def create_annotations(video_dir, output_file):
    """Create annotations file from videos"""
    video_dir = Path(video_dir)
    annotations = {}
    for video_path in video_dir.glob("*.mp4"):
        video_id = video_path.stem
        annotations[video_id] = {
            "caption": f"Video {video_id}",  # Add actual captions
            "duration": 2.0,
            "fps": 8
        }
    with open(output_file, 'w') as f:
        json.dump(annotations, f, indent=2)

# Usage
create_annotations("data/videos", "data/annotations.json")
Training
Single GPU Training
python train.py
Configuration in train.py:
config = {
    'batch_size': 2,
    'gradient_accumulation_steps': 8,  # Effective batch size = 16
    'learning_rate': 1e-4,
    'num_epochs': 100,
    'mixed_precision': True,
}
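The two settings multiply: gradients are accumulated over 8 micro-batches of 2 before each optimizer step, so the optimizer sees an effective batch of 16. A one-line sanity check of that arithmetic:

```python
config = {
    'batch_size': 2,
    'gradient_accumulation_steps': 8,
}

# Effective batch = micro-batch size x number of accumulation steps
effective_batch = config['batch_size'] * config['gradient_accumulation_steps']
print(effective_batch)  # 16
```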
Multi-GPU Training (Recommended)
# Using PyTorch DDP
torchrun --nproc_per_node=8 train.py
# Or using Hugging Face accelerate (recommended)
accelerate config # First time setup
accelerate launch train.py
Monitoring Training
# Install tensorboard
pip install tensorboard
# Run tensorboard
tensorboard --logdir=./checkpoints/logs
Resume from Checkpoint
# In train.py, add:
trainer.load_checkpoint('checkpoints/checkpoint_step_10000.pt')
trainer.train()
Inference
Basic Inference
from inference import generate_video_from_prompt
video = generate_video_from_prompt(
    prompt="A serene lake with mountains",
    checkpoint_path="checkpoints/best.pt",
    output_path="output.mp4",
    num_steps=50,
    guidance_scale=7.5,
    seed=42  # For reproducibility
)
Batch Inference
from inference import batch_generate
prompts = [
    "A cat playing",
    "Ocean waves",
    "City at night"
]

batch_generate(
    prompts=prompts,
    checkpoint_path="checkpoints/best.pt",
    output_dir="./outputs",
    num_steps=50
)
Advanced Options
# Lower guidance for more creative results
video = generate_video_from_prompt(
    prompt="Abstract art in motion",
    guidance_scale=5.0,  # Lower = more creative
    num_steps=100,       # More steps = higher quality
)

# Fast generation (fewer steps)
video = generate_video_from_prompt(
    prompt="Quick test",
    num_steps=20,  # Faster but lower quality
)
Optimization Tips
Memory Optimization
- Reduce Batch Size
config['batch_size'] = 1 # Minimum
config['gradient_accumulation_steps'] = 16 # Maintain effective batch size
- Enable Gradient Checkpointing
config['gradient_checkpointing'] = True
- Use Mixed Precision
config['mixed_precision'] = True # Always recommended
Speed Optimization
- Use Torch Compile (PyTorch 2.0+)
model = torch.compile(model)
- Enable cuDNN Benchmarking
torch.backends.cudnn.benchmark = True
- Pin Memory
DataLoader(..., pin_memory=True)
Troubleshooting
CUDA Out of Memory
# Reduce batch size
config['batch_size'] = 1
# Enable gradient checkpointing
config['gradient_checkpointing'] = True
# Clear cache
torch.cuda.empty_cache()
Slow Training
# Check GPU utilization
nvidia-smi
# Increase num_workers
DataLoader(..., num_workers=8)
# Enable mixed precision
config['mixed_precision'] = True
NaN Loss
# Reduce learning rate
config['learning_rate'] = 5e-5
# Enable gradient clipping (already included)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# Check for NaN in data
assert not torch.isnan(videos).any()
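When the assert above fires, a quick stdlib way to locate the offending values in a nested batch (hypothetical helper; with tensors you would stay with `torch.isnan`):

```python
import math

def find_nans(data, path="batch"):
    """Recursively report the index paths of NaN values in nested lists."""
    if isinstance(data, float):
        return [path] if math.isnan(data) else []
    return [hit for i, item in enumerate(data)
            for hit in find_nans(item, f"{path}[{i}]")]

print(find_nans([[0.1, 0.2], [0.3, float("nan")]]))  # ['batch[1][1]']
```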
Model Not Learning
# Increase learning rate
config['learning_rate'] = 2e-4
# Check data quality
# Verify annotations are correct
# Ensure videos are properly normalized
# Reduce regularization
config['weight_decay'] = 0.001 # Lower weight decay
Performance Benchmarks
Training Speed (A100 80GB)
| Batch Size | Grad Accum | Eff. Batch | Sec/Batch | Hours/100K steps |
|---|---|---|---|---|
| 1 | 16 | 16 | 2.5 | 69 |
| 2 | 8 | 16 | 2.5 | 69 |
| 4 | 4 | 16 | 2.7 | 75 |
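The last column follows directly from the per-batch latency. A sanity check of the arithmetic behind the table (wall-clock only, ignoring evaluation and checkpointing overhead):

```python
def hours_per_run(sec_per_batch, steps=100_000):
    """Convert per-batch latency into total training hours."""
    return sec_per_batch * steps / 3600

print(round(hours_per_run(2.5)))  # 69
print(round(hours_per_run(2.7)))  # 75
```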
Inference Speed
| GPU | FP16 | Steps | Time/Video |
|---|---|---|---|
| A100 80GB | Yes | 50 | 15s |
| RTX 4090 | Yes | 50 | 25s |
| RTX 3090 | Yes | 50 | 35s |
Memory Usage
| Operation | Batch Size | Memory (GB) |
|---|---|---|
| Inference | 1 | 6 |
| Training | 1 | 12 |
| Training | 2 | 24 |
| Training | 4 | 48 |
Next Steps
- Prepare your dataset - Collect and annotate videos
- Start training - Begin with a small dataset to verify the pipeline
- Monitor progress - Check loss, sample generations
- Fine-tune - Adjust hyperparameters based on results
- Evaluate - Test on held-out validation set
- Deploy - Use for inference on new prompts
Getting Help
- GitHub Issues: Report bugs and ask questions
- Documentation: Check README.md and ARCHITECTURE.md
- Examples: See example scripts in the repository