# TTV-1B: 1 Billion Parameter Text-to-Video Model
A text-to-video generation model with 1 billion parameters, built on a Diffusion Transformer (DiT) architecture with 3D spatiotemporal attention.
## 🎯 Model Overview
**TTV-1B** is a diffusion-based text-to-video model that generates high-quality 16-frame videos at 256x256 resolution from text prompts.
### Architecture Highlights
- **Total Parameters**: ~1.0 Billion
- **Architecture**: Diffusion Transformer (DiT)
- **Text Encoder**: 6-layer transformer (50M params)
- **Video Backbone**: 24 DiT blocks with 1536 hidden dimensions (950M params)
- **Attention**: 3D Spatiotemporal attention with rotary embeddings
- **Patch Size**: 2×16×16 (temporal × height × width)
- **Output**: 16 frames @ 256×256 resolution
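
As a quick sanity check on the shapes above, the following sketch works out how many patch tokens one clip yields. It assumes non-overlapping patches and an RGB input (3 channels, as in the model configuration); the helper name `patch_grid` is illustrative and not part of the repository.

```python
from math import prod

def patch_grid(num_frames, height, width, patch_size):
    """Return the (time, height, width) patch grid for non-overlapping patches."""
    dims = (num_frames, height, width)
    # Each dimension must divide evenly for a non-overlapping tiling.
    if any(d % p for d, p in zip(dims, patch_size)):
        raise ValueError("video dimensions must be divisible by the patch size")
    return tuple(d // p for d, p in zip(dims, patch_size))

grid = patch_grid(16, 256, 256, (2, 16, 16))
num_tokens = prod(grid)                    # sequence length seen by the DiT blocks
values_per_patch = prod((2, 16, 16)) * 3   # pixels per patch times RGB channels

print(grid)              # (8, 16, 16)
print(num_tokens)        # 2048
print(values_per_patch)  # 1536
```

Full 3D attention therefore operates on 2,048 tokens per clip, and each patch holds 1,536 raw values, which happens to equal the 1536 hidden dimension.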
## 📋 Features
✅ **Spatiotemporal 3D Attention** - Captures both spatial and temporal dependencies
✅ **Rotary Position Embeddings** - Better positional encoding for sequences
✅ **Adaptive Layer Normalization (AdaLN)** - Conditional generation via modulation
✅ **DDPM Diffusion Scheduler** - Proven denoising approach
✅ **Mixed Precision Training** - Faster training with lower memory
✅ **Gradient Accumulation** - Train with large effective batch sizes
✅ **Classifier-Free Guidance** - Better prompt adherence during inference
## 🚀 Quick Start
### Installation
```bash
# Clone the repository
git clone https://github.com/yourusername/ttv-1b.git
cd ttv-1b
# Install dependencies
pip install -r requirements.txt
```
### Training
```python
from train import Trainer
from video_ttv_1b import create_model
# Create model
device = 'cuda'
model = create_model(device)
# Create datasets (replace with your data)
train_dataset = YourVideoDataset(...)
val_dataset = YourVideoDataset(...)
# Initialize trainer
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    batch_size=2,
    gradient_accumulation_steps=8,
    mixed_precision=True,
    learning_rate=1e-4,
    num_epochs=100,
)
# Start training
trainer.train()
```
Or use the training script:
```bash
python train.py
```
### Inference
```python
from inference import generate_video_from_prompt
# Generate video
video = generate_video_from_prompt(
    prompt="A cat playing with a ball of yarn",
    checkpoint_path="checkpoints/checkpoint_best.pt",
    output_path="output.mp4",
    num_steps=50,
    guidance_scale=7.5,
)
```
Or use the command line:
```bash
python inference.py \
    --prompt "A serene sunset over the ocean" \
    --checkpoint checkpoints/checkpoint_best.pt \
    --output generated_video.mp4 \
    --steps 50 \
    --guidance 7.5
```
## 🏗️ Model Architecture
```
Input: Text Prompt + Random Noise Video
             ↓
┌─────────────────────────┐
│  Text Encoder (6L)      │
│  768d, 12 heads         │
└─────────────────────────┘
             ↓
┌─────────────────────────┐
│  Text Projection        │
│  768d → 1536d           │
└─────────────────────────┘
             ↓
┌─────────────────────────┐
│  3D Patch Embedding     │
│  (2,16,16) patches      │
└─────────────────────────┘
             ↓
┌─────────────────────────┐
│  24× DiT Blocks         │
│  • 3D Spatio-Temporal   │
│    Attention (24 heads) │
│  • Rotary Embeddings    │
│  • AdaLN Modulation     │
│  • Feed-Forward Net     │
└─────────────────────────┘
             ↓
┌─────────────────────────┐
│  Final Layer + AdaLN    │
└─────────────────────────┘
             ↓
┌─────────────────────────┐
│  Unpatchify to Video    │
└─────────────────────────┘
             ↓
Output: Predicted Noise / Denoised Video
```
## 📊 Training Details
### Recommended Training Setup
- **GPU**: 8× A100 80GB (or equivalent)
- **Batch Size**: 2 per GPU
- **Gradient Accumulation**: 8 steps
- **Effective Batch Size**: 128
- **Learning Rate**: 1e-4 with cosine decay
- **Optimizer**: AdamW (β1=0.9, β2=0.999)
- **Weight Decay**: 0.01
- **Mixed Precision**: FP16
- **Training Steps**: ~500K
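
The effective batch size follows from the three numbers above, and the cosine decay can be sketched in a few lines. This is a minimal illustration, not the schedule in `train.py`: it assumes no warmup and a decay to zero over the ~500K steps, neither of which the list above specifies, and `cosine_lr` is a hypothetical helper.

```python
import math

NUM_GPUS = 8
PER_GPU_BATCH = 2
GRAD_ACCUM_STEPS = 8

# Samples contributing to one optimizer step across all GPUs.
effective_batch = NUM_GPUS * PER_GPU_BATCH * GRAD_ACCUM_STEPS

def cosine_lr(step, total_steps=500_000, base_lr=1e-4, min_lr=0.0):
    """Cosine decay from base_lr to min_lr (assumed: no warmup, min_lr=0)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(effective_batch)  # 128
for step in (0, 250_000, 500_000):
    # Starts at base_lr, passes through half of it at the midpoint, ends at min_lr.
    print(step, cosine_lr(step))
```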
### Memory Requirements
- **Model**: ~4GB (FP32), ~2GB (FP16)
- **Activations**: ~8GB per sample (256×256×16)
- **Total per GPU**: ~12-16GB with batch size 2
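
The model figures above follow directly from bytes per parameter. The sketch below reproduces them and, as a labeled assumption, also estimates FP32 gradients and AdamW optimizer state, which the list does not break out. Activation memory depends on implementation details such as checkpointing and is left out.

```python
PARAMS = 1_000_000_000  # ~1.0B parameters, as stated above
GB = 1e9                # decimal gigabytes, matching the rounded figures above

def tensor_gb(num_values, bytes_per_value):
    """Memory in GB for num_values stored at bytes_per_value each."""
    return num_values * bytes_per_value / GB

weights_fp32 = tensor_gb(PARAMS, 4)
weights_fp16 = tensor_gb(PARAMS, 2)

# Assumption: FP32 gradients plus AdamW's two FP32 moment buffers per parameter.
grads_fp32 = tensor_gb(PARAMS, 4)
adamw_state = tensor_gb(PARAMS, 2 * 4)

print(weights_fp32)              # 4.0
print(weights_fp16)              # 2.0
print(grads_fp32 + adamw_state)  # 12.0
```

Under these assumptions, training state beyond the weights adds roughly 12GB, so the per-GPU total depends on how gradients and optimizer state are stored or sharded.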
### Training Time Estimates
- **Single A100 80GB**: ~4-6 weeks for 500K steps
- **8× A100 80GB**: ~4-7 days for 500K steps
## 🎨 Inference Examples
```python
# Example 1: Basic generation
from inference import VideoGenerator, load_model
from video_ttv_1b import DDPMScheduler
model = load_model("checkpoints/best.pt")
scheduler = DDPMScheduler()
generator = VideoGenerator(model, scheduler)
video = generator.generate(
    prompt="A beautiful waterfall in a lush forest",
    num_inference_steps=50,
)

# Example 2: Batch generation
from inference import batch_generate
prompts = [
    "A dog running in a park",
    "Fireworks in the night sky",
    "Ocean waves crashing on rocks",
]
batch_generate(
    prompts=prompts,
    checkpoint_path="checkpoints/best.pt",
    output_dir="./outputs",
    num_steps=50,
)
```
## 📈 Performance Metrics
| Metric | Value |
|--------|-------|
| Parameters | 1.0B |
| FLOPs (per frame) | ~250 GFLOPs |
| Inference Time (50 steps, A100) | ~15-20 seconds |
| Training Loss (final) | ~0.05 MSE |
| Video Quality (FVD) | TBD |
## 🔧 Hyperparameters
### Model Configuration
```python
VideoTTV1B(
    img_size=(256, 256),     # Output resolution
    num_frames=16,           # Video length
    patch_size=(2, 16, 16),  # Patch dimensions
    in_channels=3,           # RGB
    hidden_dim=1536,         # Model width
    depth=24,                # Number of layers
    num_heads=24,            # Attention heads
    mlp_ratio=4.0,           # MLP expansion
    text_dim=768,            # Text encoder dim
    vocab_size=50257,        # Vocabulary size
)
```
### Training Configuration
```python
Trainer(
    batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    weight_decay=0.01,
    num_epochs=100,
    mixed_precision=True,
)
```
## 📁 Project Structure
```
ttv-1b/
├── video_ttv_1b.py       # Model architecture
├── train.py              # Training script
├── inference.py          # Inference & generation
├── requirements.txt      # Dependencies
├── README.md             # Documentation
├── checkpoints/          # Model checkpoints
├── data/                 # Training data
└── outputs/              # Generated videos
```
## 🔬 Technical Details
### 3D Spatiotemporal Attention
The model uses full 3D attention across time, height, and width dimensions:
- Captures motion dynamics and spatial relationships
- Rotary position embeddings for better sequence modeling
- Efficient implementation with a Flash Attention-compatible design
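
To illustrate why rotary embeddings suit attention, here is a minimal pure-Python sketch of 1D rotary position embeddings. It is not the repository's implementation: how the model splits rotations across the time, height, and width axes is not described here, so the example covers a single axis, and the `rope` helper and the base of 10000 are assumptions borrowed from the common formulation.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (1D RoPE)."""
    dim = len(vec)
    assert dim % 2 == 0, "rotary embeddings need an even dimension"
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower pairs rotate faster
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The attention score depends only on the offset between positions (here 2).
score_near = dot(rope(q, 3), rope(k, 5))
score_far = dot(rope(q, 10), rope(k, 12))
print(abs(score_near - score_far) < 1e-9)  # True
```

Because the query-key score depends only on relative position, the same motion pattern can be scored consistently wherever it occurs along the axis.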
### Diffusion Process
1. **Training**: Learn to predict noise added to videos
2. **Inference**: Iteratively denoise random noise → video
3. **Guidance**: Classifier-free guidance for better text alignment
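
The three steps above can be sketched on plain Python lists. This is a toy illustration under stated assumptions, not the repository's `DDPMScheduler`: it assumes a linear beta schedule with 1000 steps (the values from the original DDPM formulation), and `alpha_bar_schedule`, `add_noise`, and `guided_noise` are hypothetical helpers.

```python
import math
import random

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, noise, alpha_bar_t):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * n for x, n in zip(x0, noise)]

def guided_noise(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: push the prediction toward the text-conditioned one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

random.seed(0)
alpha_bars = alpha_bar_schedule()
x0 = [0.5, -0.25, 1.0]                        # stand-in for flattened video pixels
noise = [random.gauss(0.0, 1.0) for _ in x0]

x_t = add_noise(x0, noise, alpha_bars[500])   # training input at timestep 500
# The training target is `noise`; the loss would be the MSE against the model output.

print(guided_noise([0.0, 1.0], [1.0, 1.0], guidance_scale=7.5))  # [7.5, 1.0]
```

A guidance scale of 1.0 reduces to the conditional prediction, while larger values such as the 7.5 used in the Quick Start amplify the difference between the conditional and unconditional predictions.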
### Adaptive Layer Normalization
Each DiT block uses AdaLN-Zero for conditional generation:
- Text and timestep embeddings modulate layer norm parameters
- Allows model to adapt behavior based on conditioning
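
A minimal sketch of the modulation follows, on a single token vector in plain Python. It assumes the shift, scale, and gate form of AdaLN-Zero from the DiT formulation. In a real block these three vectors would be regressed from the conditioning embedding by a learned layer, which is omitted here, and the function names are illustrative.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_zero_residual(x, sublayer, shift, scale, gate):
    """x + gate * sublayer(LN(x) * (1 + scale) + shift), all element-wise."""
    modulated = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]
    update = sublayer(modulated)
    return [xi + g * u for xi, g, u in zip(x, gate, update)]

x = [1.0, 2.0, 3.0, 4.0]
double = lambda v: [2.0 * e for e in v]  # stand-in for attention or the MLP

# Zero-initialized conditioning output: the block starts as the identity.
zeros = [0.0] * len(x)
print(adaln_zero_residual(x, double, zeros, zeros, zeros) == x)  # True
```

Starting each block as the identity is the "Zero" in AdaLN-Zero: the gate is initialized to zero so the residual branch contributes nothing until training moves it.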
## 🎯 Use Cases
- **Creative Content**: Generate videos for social media, marketing
- **Prototyping**: Quick video mockups from descriptions
- **Education**: Visualize concepts and scenarios
- **Entertainment**: Generate animations and effects
- **Research**: Study video generation and diffusion models
## ⚠️ Limitations
- Maximum 16 frames (can be extended in future versions)
- 256×256 resolution (trade-off for 1B parameters)
- Requires significant compute for training
- Text encoder is simple (can be replaced with CLIP/T5)
- No temporal super-resolution (yet)
## 🚧 Future Improvements
- [ ] Increase resolution to 512×512
- [ ] Extend to 64+ frames
- [ ] Add temporal super-resolution
- [ ] Integrate CLIP text encoder
- [ ] Add motion control
- [ ] Implement video editing capabilities
- [ ] Optimize inference speed
- [ ] Add LoRA fine-tuning support
## 📚 Citation
If you use this model in your research, please cite:
```bibtex
@misc{ttv1b2024,
title={TTV-1B: A 1 Billion Parameter Text-to-Video Model},
author={Your Name},
year={2024},
url={https://github.com/yourusername/ttv-1b}
}
```
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## 💬 Contact
For questions and feedback:
- GitHub Issues: [github.com/yourusername/ttv-1b/issues](https://github.com/yourusername/ttv-1b/issues)
- Email: your.email@example.com
## 🙏 Acknowledgments
- Inspired by DiT (Diffusion Transformer) architecture
- Built with PyTorch and modern deep learning practices
- Thanks to the open-source ML community
---
**Status**: Research/Educational Model | **Version**: 1.0.0 | **Last Updated**: 2024