# TTV-1B: 1 Billion Parameter Text-to-Video Model
A state-of-the-art text-to-video generation model with 1 billion parameters, built on a Diffusion Transformer (DiT) architecture with 3D spatiotemporal attention.
## 🎯 Model Overview
**TTV-1B** is a diffusion-based text-to-video model that generates high-quality 16-frame videos at 256×256 resolution from text prompts.
### Architecture Highlights
- **Total Parameters**: ~1.0 Billion (a rough breakdown is sketched below)
- **Architecture**: Diffusion Transformer (DiT)
- **Text Encoder**: 6-layer transformer (50M params)
- **Video Backbone**: 24 DiT blocks with 1536 hidden dimensions (950M params)
- **Attention**: 3D Spatiotemporal attention with rotary embeddings
- **Patch Size**: 2×16×16 (temporal × height × width)
- **Output**: 16 frames @ 256×256 resolution
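
As a rough sanity check on the parameter count above, the backbone size can be estimated from the hidden dimension and depth alone. This is a back-of-the-envelope sketch; the exact figure depends on implementation details such as biases, whether AdaLN layers are shared, and the patch/unpatch projections.

```python
# Back-of-the-envelope estimate of the DiT backbone size (sanity check only).
hidden_dim = 1536   # model width
depth = 24          # number of DiT blocks
mlp_ratio = 4.0     # MLP expansion factor

attn = 4 * hidden_dim ** 2                   # QKV + output projections
mlp = int(2 * mlp_ratio * hidden_dim ** 2)   # two feed-forward linears
adaln = 6 * hidden_dim ** 2                  # per-block AdaLN modulation (if not shared)

per_block = attn + mlp + adaln
print(f"per block: {per_block / 1e6:.1f}M")          # ~42M
print(f"backbone:  {depth * per_block / 1e9:.2f}B")  # ~1.0B, same order as the total above
```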
## 🌟 Features
✅ **Spatiotemporal 3D Attention** - Captures both spatial and temporal dependencies

✅ **Rotary Position Embeddings** - Better positional encoding for sequences

✅ **Adaptive Layer Normalization (AdaLN)** - Conditional generation via modulation

✅ **DDPM Diffusion Scheduler** - Proven denoising approach

✅ **Mixed Precision Training** - Faster training with lower memory

✅ **Gradient Accumulation** - Train with large effective batch sizes

✅ **Classifier-Free Guidance** - Better prompt adherence during inference
## 🚀 Quick Start
### Installation
```bash
# Clone the repository
git clone https://github.com/yourusername/ttv-1b.git
cd ttv-1b
# Install dependencies
pip install -r requirements.txt
```
### Training
```python
from train import Trainer
from video_ttv_1b import create_model
# Create model
device = 'cuda'
model = create_model(device)
# Create datasets (replace with your data)
train_dataset = YourVideoDataset(...)
val_dataset = YourVideoDataset(...)
# Initialize trainer
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    batch_size=2,
    gradient_accumulation_steps=8,
    mixed_precision=True,
    learning_rate=1e-4,
    num_epochs=100,
)
# Start training
trainer.train()
```
Or use the training script:
```bash
python train.py
```
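
Internally, the trainer is expected to combine gradient accumulation with mixed precision roughly as in the sketch below. This is a minimal illustration, not the actual `train.py` loop; `dataloader` and `compute_loss` are placeholders.

```python
import torch

# Minimal sketch of one epoch with gradient accumulation + FP16 autocast.
# `model`, `optimizer`, `dataloader`, and `compute_loss` are placeholders.
scaler = torch.cuda.amp.GradScaler()
accum_steps = 8

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    with torch.cuda.amp.autocast():                       # FP16 forward pass
        loss = compute_loss(model, batch) / accum_steps   # normalize for accumulation
    scaler.scale(loss).backward()                         # scaled backward avoids FP16 underflow
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                            # unscales grads, then steps
        scaler.update()
        optimizer.zero_grad()
```

With `batch_size=2` and `gradient_accumulation_steps=8`, each optimizer step effectively sees 16 samples per GPU.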
### Inference
```python
from inference import generate_video_from_prompt
# Generate video
video = generate_video_from_prompt(
    prompt="A cat playing with a ball of yarn",
    checkpoint_path="checkpoints/checkpoint_best.pt",
    output_path="output.mp4",
    num_steps=50,
    guidance_scale=7.5,
)
```
Or use the command line:
```bash
python inference.py \
    --prompt "A serene sunset over the ocean" \
    --checkpoint checkpoints/checkpoint_best.pt \
    --output generated_video.mp4 \
    --steps 50 \
    --guidance 7.5
```
## 🏗️ Model Architecture
```
Input: Text Prompt + Random Noise Video
            │
 ┌─────────────────────────┐
 │   Text Encoder (6L)     │
 │   768d, 12 heads        │
 └─────────────────────────┘
            │
 ┌─────────────────────────┐
 │   Text Projection       │
 │   768d → 1536d          │
 └─────────────────────────┘
            │
 ┌─────────────────────────┐
 │   3D Patch Embedding    │
 │   (2,16,16) patches     │
 └─────────────────────────┘
            │
 ┌─────────────────────────┐
 │   24× DiT Blocks        │
 │   • 3D Spatio-Temporal  │
 │     Attention (24 heads)│
 │   • Rotary Embeddings   │
 │   • AdaLN Modulation    │
 │   • Feed-Forward Net    │
 └─────────────────────────┘
            │
 ┌─────────────────────────┐
 │   Final Layer + AdaLN   │
 └─────────────────────────┘
            │
 ┌─────────────────────────┐
 │   Unpatchify to Video   │
 └─────────────────────────┘
            │
Output: Predicted Noise / Denoised Video
```
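
With the (2, 16, 16) patch size, a 16-frame 256×256 clip becomes 8 × 16 × 16 = 2,048 tokens, which is the sequence length the DiT blocks attend over. The sketch below illustrates the patchify step with assumed shapes; the actual module in `video_ttv_1b.py` may differ in details.

```python
import torch

# Sketch of 3D patch embedding: (B, C, T, H, W) video -> (B, N, D) token sequence.
B, C, T, H, W = 1, 3, 16, 256, 256
pt, ph, pw = 2, 16, 16
hidden_dim = 1536

video = torch.randn(B, C, T, H, W)
patchify = torch.nn.Conv3d(C, hidden_dim, kernel_size=(pt, ph, pw), stride=(pt, ph, pw))

tokens = patchify(video)                    # (1, 1536, 8, 16, 16)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 2048, 1536)
print(tokens.shape)                         # torch.Size([1, 2048, 1536])
```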
## 📊 Training Details
### Recommended Training Setup
- **GPU**: 8Γ A100 80GB (or equivalent)
- **Batch Size**: 2 per GPU
- **Gradient Accumulation**: 8 steps
- **Effective Batch Size**: 128 (8 GPUs × 2 per GPU × 8 accumulation steps)
- **Learning Rate**: 1e-4 with cosine decay
- **Optimizer**: AdamW (β₁=0.9, β₂=0.999)
- **Weight Decay**: 0.01
- **Mixed Precision**: FP16
- **Training Steps**: ~500K
### Memory Requirements
- **Model**: ~4GB (FP32), ~2GB (FP16)
- **Activations**: ~8GB per sample (256×256×16)
- **Total per GPU**: ~12-16GB with batch size 2
### Training Time Estimates
- **Single A100 80GB**: ~4-6 weeks for 500K steps
- **8Γ A100 80GB**: ~4-7 days for 500K steps
## 🎨 Inference Examples
```python
# Example 1: Basic generation
from inference import VideoGenerator, load_model
from video_ttv_1b import DDPMScheduler
model = load_model("checkpoints/best.pt")
scheduler = DDPMScheduler()
generator = VideoGenerator(model, scheduler)
video = generator.generate(
    prompt="A beautiful waterfall in a lush forest",
    num_inference_steps=50,
)

# Example 2: Batch generation
from inference import batch_generate

prompts = [
    "A dog running in a park",
    "Fireworks in the night sky",
    "Ocean waves crashing on rocks",
]
batch_generate(
    prompts=prompts,
    checkpoint_path="checkpoints/best.pt",
    output_dir="./outputs",
    num_steps=50,
)
```
## 📈 Performance Metrics
| Metric | Value |
|--------|-------|
| Parameters | 1.0B |
| FLOPs (per frame) | ~250 GFLOPs |
| Inference Time (50 steps, A100) | ~15-20 seconds |
| Training Loss (final) | ~0.05 MSE |
| Video Quality (FVD) | TBD |
## 🔧 Hyperparameters
### Model Configuration
```python
VideoTTV1B(
    img_size=(256, 256),      # Output resolution
    num_frames=16,            # Video length
    patch_size=(2, 16, 16),   # Patch dimensions
    in_channels=3,            # RGB
    hidden_dim=1536,          # Model width
    depth=24,                 # Number of layers
    num_heads=24,             # Attention heads
    mlp_ratio=4.0,            # MLP expansion
    text_dim=768,             # Text encoder dim
    vocab_size=50257,         # Vocabulary size
)
```
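
For a quick shape check, the model maps a noisy video, a timestep, and tokenized text to a tensor with the same shape as the video. The forward signature below is an assumption for illustration; check `video_ttv_1b.py` for the actual argument order.

```python
import torch
from video_ttv_1b import create_model

# Shape sanity check. The forward signature (video, timesteps, text_ids) is assumed.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = create_model(device)

video = torch.randn(1, 3, 16, 256, 256, device=device)      # (B, C, T, H, W)
timesteps = torch.randint(0, 1000, (1,), device=device)     # one diffusion timestep per sample
text_ids = torch.randint(0, 50257, (1, 77), device=device)  # tokenized prompt (length assumed)

with torch.no_grad():
    noise_pred = model(video, timesteps, text_ids)
print(noise_pred.shape)  # expected: torch.Size([1, 3, 16, 256, 256])
```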
### Training Configuration
```python
Trainer(
    batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    weight_decay=0.01,
    num_epochs=100,
    mixed_precision=True,
)
```
## 📁 Project Structure
```
ttv-1b/
├── video_ttv_1b.py      # Model architecture
├── train.py             # Training script
├── inference.py         # Inference & generation
├── requirements.txt     # Dependencies
├── README.md            # Documentation
├── checkpoints/         # Model checkpoints
├── data/                # Training data
└── outputs/             # Generated videos
```
## 🔬 Technical Details
### 3D Spatiotemporal Attention
The model uses full 3D attention across time, height, and width dimensions:
- Captures motion dynamics and spatial relationships
- Rotary position embeddings for better sequence modeling
- Efficient, Flash-Attention-compatible implementation (see the sketch below)
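
A minimal sketch of what full 3D attention means in practice: the (T′, H′, W′) patch grid is flattened into a single sequence and standard multi-head attention runs over all tokens jointly. Rotary embeddings are omitted for brevity; the real block applies them to the queries and keys.

```python
import torch
import torch.nn.functional as F

# Sketch: full 3D spatiotemporal attention over a flattened (T', H', W') token grid.
B, Tp, Hp, Wp, D, heads = 1, 8, 16, 16, 1536, 24
head_dim = D // heads

x = torch.randn(B, Tp * Hp * Wp, D)                      # (B, N, D) with N = 2048 tokens
q, k, v = torch.nn.Linear(D, 3 * D)(x).chunk(3, dim=-1)  # joint QKV projection

def split_heads(t):
    return t.view(B, -1, heads, head_dim).transpose(1, 2)   # (B, heads, N, head_dim)

# Every token attends to every other token across time, height, and width.
out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
out = out.transpose(1, 2).reshape(B, -1, D)                  # back to (B, N, D)
```

Because attention runs over one flattened sequence, `scaled_dot_product_attention` can dispatch to Flash Attention kernels when available, which is what the Flash-Attention-compatible design above refers to.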
### Diffusion Process
1. **Training**: Learn to predict noise added to videos
2. **Inference**: Iteratively denoise random noise → video
3. **Guidance**: Classifier-free guidance for better text alignment (sketched below)
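
Classifier-free guidance combines a conditional and an unconditional noise prediction at each denoising step. The sketch below shows the standard formulation; `model`, `scheduler`, `latents`, `text_emb`, and `null_emb` are placeholders rather than the exact objects used in `inference.py`.

```python
# One denoising step with classifier-free guidance (sketch, placeholder names).
guidance_scale = 7.5

noise_cond = model(latents, t, text_emb)     # prediction conditioned on the prompt
noise_uncond = model(latents, t, null_emb)   # prediction with an empty/null prompt

# Push the prediction away from the unconditional direction.
noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
latents = scheduler.step(noise_pred, t, latents)
```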
### Adaptive Layer Normalization
Each DiT block uses AdaLN-Zero for conditional generation:
- Text and timestep embeddings modulate layer norm parameters
- Allows the model to adapt its behavior based on conditioning (see the sketch below)
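
A minimal sketch of AdaLN-Zero modulation: the conditioning vector (timestep plus text embedding) produces a shift, scale, and gate for each sub-layer, and the zero-initialized gate makes every block start as an identity mapping. Names and shapes here are illustrative, not the exact `video_ttv_1b.py` code.

```python
import torch
import torch.nn as nn

# Sketch of AdaLN-Zero: conditioning -> (shift, scale, gate) for a sub-layer.
hidden_dim = 1536
norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
to_modulation = nn.Linear(hidden_dim, 3 * hidden_dim)
nn.init.zeros_(to_modulation.weight)  # zero-init => block starts as the identity
nn.init.zeros_(to_modulation.bias)

def adaln_zero(x, cond, sublayer):
    # x: (B, N, D) tokens, cond: (B, D) timestep + text conditioning
    shift, scale, gate = to_modulation(cond).chunk(3, dim=-1)
    h = norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)  # modulated LayerNorm
    return x + gate.unsqueeze(1) * sublayer(h)                   # gated residual branch
```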
## 🎯 Use Cases
- **Creative Content**: Generate videos for social media, marketing
- **Prototyping**: Quick video mockups from descriptions
- **Education**: Visualize concepts and scenarios
- **Entertainment**: Generate animations and effects
- **Research**: Study video generation and diffusion models
## ⚠️ Limitations
- Maximum 16 frames (can be extended in future versions)
- 256×256 resolution (trade-off for 1B parameters)
- Requires significant compute for training
- Text encoder is simple (can be replaced with CLIP/T5)
- No temporal super-resolution (yet)
## 🚧 Future Improvements
- [ ] Increase resolution to 512×512
- [ ] Extend to 64+ frames
- [ ] Add temporal super-resolution
- [ ] Integrate CLIP text encoder
- [ ] Add motion control
- [ ] Implement video editing capabilities
- [ ] Optimize inference speed
- [ ] Add LoRA fine-tuning support
## 📝 Citation
If you use this model in your research, please cite:
```bibtex
@misc{ttv1b2024,
  title={TTV-1B: A 1 Billion Parameter Text-to-Video Model},
  author={Your Name},
  year={2024},
  url={https://github.com/yourusername/ttv-1b}
}
```
## 📄 License
This project is licensed under the MIT License - see LICENSE file for details.
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## 💬 Contact
For questions and feedback:
- GitHub Issues: [github.com/yourusername/ttv-1b/issues](https://github.com/yourusername/ttv-1b/issues)
- Email: your.email@example.com
## 🙏 Acknowledgments
- Inspired by DiT (Diffusion Transformer) architecture
- Built with PyTorch and modern deep learning practices
- Thanks to the open-source ML community
---
**Status**: Research/Educational Model | **Version**: 1.0.0 | **Last Updated**: 2024