# TTV-1B: 1 Billion Parameter Text-to-Video Model A state-of-the-art text-to-video generation model with 1 billion parameters, built using Diffusion Transformer (DiT) architecture with 3D spatiotemporal attention. ## 🎯 Model Overview **TTV-1B** is a diffusion-based text-to-video model that generates high-quality 16-frame videos at 256x256 resolution from text prompts. ### Architecture Highlights - **Total Parameters**: ~1.0 Billion - **Architecture**: Diffusion Transformer (DiT) - **Text Encoder**: 6-layer transformer (50M params) - **Video Backbone**: 24 DiT blocks with 1536 hidden dimensions (950M params) - **Attention**: 3D Spatiotemporal attention with rotary embeddings - **Patch Size**: 2×16×16 (temporal × height × width) - **Output**: 16 frames @ 256×256 resolution ## 📋 Features ✅ **Spatiotemporal 3D Attention** - Captures both spatial and temporal dependencies ✅ **Rotary Position Embeddings** - Better positional encoding for sequences ✅ **Adaptive Layer Normalization (AdaLN)** - Conditional generation via modulation ✅ **DDPM Diffusion Scheduler** - Proven denoising approach ✅ **Mixed Precision Training** - Faster training with lower memory ✅ **Gradient Accumulation** - Train with large effective batch sizes ✅ **Classifier-Free Guidance** - Better prompt adherence during inference ## 🚀 Quick Start ### Installation ```bash # Clone the repository git clone https://github.com/yourusername/ttv-1b.git cd ttv-1b # Install dependencies pip install -r requirements.txt ``` ### Training ```python from train import Trainer from video_ttv_1b import create_model # Create model device = 'cuda' model = create_model(device) # Create datasets (replace with your data) train_dataset = YourVideoDataset(...) val_dataset = YourVideoDataset(...) # Initialize trainer trainer = Trainer( model=model, train_dataset=train_dataset, val_dataset=val_dataset, batch_size=2, gradient_accumulation_steps=8, mixed_precision=True, learning_rate=1e-4, num_epochs=100, ) # Start training trainer.train() ``` Or use the training script: ```bash python train.py ``` ### Inference ```python from inference import generate_video_from_prompt # Generate video video = generate_video_from_prompt( prompt="A cat playing with a ball of yarn", checkpoint_path="checkpoints/checkpoint_best.pt", output_path="output.mp4", num_steps=50, guidance_scale=7.5, ) ``` Or use the command line: ```bash python inference.py \ --prompt "A serene sunset over the ocean" \ --checkpoint checkpoints/checkpoint_best.pt \ --output generated_video.mp4 \ --steps 50 \ --guidance 7.5 ``` ## 🏗️ Model Architecture ``` Input: Text Prompt + Random Noise Video ↓ ┌─────────────────────────┐ │ Text Encoder (6L) │ │ 768d, 12 heads │ └─────────────────────────┘ ↓ ┌─────────────────────────┐ │ Text Projection │ │ 768d → 1536d │ └─────────────────────────┘ ↓ ┌─────────────────────────┐ │ 3D Patch Embedding │ │ (2,16,16) patches │ └─────────────────────────┘ ↓ ┌─────────────────────────┐ │ 24× DiT Blocks │ │ • 3D Spatio-Temporal │ │ Attention (24 heads)│ │ • Rotary Embeddings │ │ • AdaLN Modulation │ │ • Feed-Forward Net │ └─────────────────────────┘ ↓ ┌─────────────────────────┐ │ Final Layer + AdaLN │ └─────────────────────────┘ ↓ ┌─────────────────────────┐ │ Unpatchify to Video │ └─────────────────────────┘ ↓ Output: Predicted Noise / Denoised Video ``` ## 📊 Training Details ### Recommended Training Setup - **GPU**: 8× A100 80GB (or equivalent) - **Batch Size**: 2 per GPU - **Gradient Accumulation**: 8 steps - **Effective Batch Size**: 128 - **Learning Rate**: 1e-4 with cosine decay - **Optimizer**: AdamW (β1=0.9, β2=0.999) - **Weight Decay**: 0.01 - **Mixed Precision**: FP16 - **Training Steps**: ~500K ### Memory Requirements - **Model**: ~4GB (FP32), ~2GB (FP16) - **Activations**: ~8GB per sample (256×256×16) - **Total per GPU**: ~12-16GB with batch size 2 ### Training Time Estimates - **Single A100 80GB**: ~4-6 weeks for 500K steps - **8× A100 80GB**: ~4-7 days for 500K steps ## 🎨 Inference Examples ```python # Example 1: Basic generation from inference import VideoGenerator, load_model from video_ttv_1b import DDPMScheduler model = load_model("checkpoints/best.pt") scheduler = DDPMScheduler() generator = VideoGenerator(model, scheduler) video = generator.generate( prompt="A beautiful waterfall in a lush forest", num_inference_steps=50, ) # Example 2: Batch generation from inference import batch_generate prompts = [ "A dog running in a park", "Fireworks in the night sky", "Ocean waves crashing on rocks", ] batch_generate( prompts=prompts, checkpoint_path="checkpoints/best.pt", output_dir="./outputs", num_steps=50, ) ``` ## 📈 Performance Metrics | Metric | Value | |--------|-------| | Parameters | 1.0B | | FLOPs (per frame) | ~250 GFLOPs | | Inference Time (50 steps, A100) | ~15-20 seconds | | Training Loss (final) | ~0.05 MSE | | Video Quality (FVD) | TBD | ## 🔧 Hyperparameters ### Model Configuration ```python VideoTTV1B( img_size=(256, 256), # Output resolution num_frames=16, # Video length patch_size=(2, 16, 16), # Patch dimensions in_channels=3, # RGB hidden_dim=1536, # Model width depth=24, # Number of layers num_heads=24, # Attention heads mlp_ratio=4.0, # MLP expansion text_dim=768, # Text encoder dim vocab_size=50257, # Vocabulary size ) ``` ### Training Configuration ```python Trainer( batch_size=2, gradient_accumulation_steps=8, learning_rate=1e-4, weight_decay=0.01, num_epochs=100, mixed_precision=True, ) ``` ## 📁 Project Structure ``` ttv-1b/ ├── video_ttv_1b.py # Model architecture ├── train.py # Training script ├── inference.py # Inference & generation ├── requirements.txt # Dependencies ├── README.md # Documentation ├── checkpoints/ # Model checkpoints ├── data/ # Training data └── outputs/ # Generated videos ``` ## 🔬 Technical Details ### 3D Spatiotemporal Attention The model uses full 3D attention across time, height, and width dimensions: - Captures motion dynamics and spatial relationships - Rotary position embeddings for better sequence modeling - Efficient implementation with Flash Attention compatible design ### Diffusion Process 1. **Training**: Learn to predict noise added to videos 2. **Inference**: Iteratively denoise random noise → video 3. **Guidance**: Classifier-free guidance for better text alignment ### Adaptive Layer Normalization Each DiT block uses AdaLN-Zero for conditional generation: - Text and timestep embeddings modulate layer norm parameters - Allows model to adapt behavior based on conditioning ## 🎯 Use Cases - **Creative Content**: Generate videos for social media, marketing - **Prototyping**: Quick video mockups from descriptions - **Education**: Visualize concepts and scenarios - **Entertainment**: Generate animations and effects - **Research**: Study video generation and diffusion models ## ⚠️ Limitations - Maximum 16 frames (can be extended in future versions) - 256×256 resolution (trade-off for 1B parameters) - Requires significant compute for training - Text encoder is simple (can be replaced with CLIP/T5) - No temporal super-resolution (yet) ## 🚧 Future Improvements - [ ] Increase resolution to 512×512 - [ ] Extend to 64+ frames - [ ] Add temporal super-resolution - [ ] Integrate CLIP text encoder - [ ] Add motion control - [ ] Implement video editing capabilities - [ ] Optimize inference speed - [ ] Add LoRA fine-tuning support ## 📚 Citation If you use this model in your research, please cite: ```bibtex @misc{ttv1b2024, title={TTV-1B: A 1 Billion Parameter Text-to-Video Model}, author={Your Name}, year={2024}, url={https://github.com/yourusername/ttv-1b} } ``` ## 📄 License This project is licensed under the MIT License - see LICENSE file for details. ## 🤝 Contributing Contributions are welcome! Please feel free to submit a Pull Request. ## 💬 Contact For questions and feedback: - GitHub Issues: [github.com/yourusername/ttv-1b/issues](https://github.com/yourusername/ttv-1b/issues) - Email: your.email@example.com ## 🙏 Acknowledgments - Inspired by DiT (Diffusion Transformer) architecture - Built with PyTorch and modern deep learning practices - Thanks to the open-source ML community --- **Status**: Research/Educational Model | **Version**: 1.0.0 | **Last Updated**: 2024