# TTV-1B Model Architecture Specification

## Model Summary

**Name:** TTV-1B (Text-to-Video 1 Billion)
**Type:** Diffusion Transformer for Text-to-Video Generation
**Total Parameters:** 1,006,096,384 (~1.0 Billion)

## Architecture Components

### 1. Text Encoder (50M parameters)

```
Input: Text tokens (batch_size, 256)
Architecture:
- Token Embedding: 50,257 vocab → 768 dim
- Position Embedding: 256 positions → 768 dim
- 6 Transformer Layers:
  * Multi-head Attention (12 heads)
  * Feed-forward (768 → 3072 → 768)
  * Layer Normalization
Output: Text features (batch_size, 256, 768)
```

### 2. Text Projection Layer

```
Linear: 768 → 1536 dimensions
Purpose: Project text features to the model hidden dimension
```

### 3. 3D Patch Embedding

```
Input: Video (batch_size, 3, 16, 256, 256)
Patch size: (2, 16, 16) - temporal × height × width
Conv3D: 3 channels → 1536 channels
Output: (batch_size, 2048, 1536)
where 2048 = (16/2) × (256/16) × (256/16) = 8 × 16 × 16
```

### 4. Positional Embedding

```
Learnable position embeddings for 2048 patches
Shape: (1, 2048, 1536)
```

### 5. Timestep Embedding

```
Sinusoidal timestep encoding → Linear(1536, 6144) → SiLU → Linear(6144, 1536)
Output: Conditioning vector (batch_size, 1536)
```

### 6. DiT Blocks (24 layers, ~928M parameters)

Each block contains:

#### a) 3D Spatiotemporal Attention

```
- Query, Key, Value projections: Linear(1536, 4608)
- 24 attention heads (64 dimensions each)
- Rotary position embeddings on the temporal dimension
- Scaled dot-product attention
- Output projection: Linear(1536, 1536)
```

#### b) Feed-Forward Network

```
- Linear: 1536 → 6144 (4× expansion)
- GELU activation
- Linear: 6144 → 1536
```

#### c) Adaptive Layer Normalization (AdaLN)

```
- Modulation network: SiLU → Linear(1536, 9216)
- Generates 6 modulation parameters:
  * scale_msa, shift_msa, gate_msa (for attention)
  * scale_mlp, shift_mlp, gate_mlp (for FFN)
```

### 7. Final Layer

```
- Adaptive LayerNorm
- Linear: 1536 → 1536 (2 × 16 × 16 × 3)
Purpose: Map back to patch space
```

### 8. Unpatchify

```
Reshape patches back to video:
(batch_size, 2048, 1536) → (batch_size, 3, 16, 256, 256)
```

## Parameter Breakdown

| Component | Parameters | Percentage |
|-----------|------------|------------|
| Text Encoder | 50,331,648 | 5.0% |
| Text Projection | 1,180,416 | 0.1% |
| Patch Embedding | 589,824 | 0.1% |
| Position Embedding | 3,145,728 | 0.3% |
| Timestep Embedding | 14,157,312 | 1.4% |
| DiT Blocks (24×) | 927,711,744 | 92.2% |
| Final Layer | 8,979,712 | 0.9% |
| **Total** | **1,006,096,384** | **100%** |

## Per-Block Parameters (DiT)

Each of the 24 DiT blocks contains ~38.7M parameters:

| Sub-component | Parameters |
|---------------|------------|
| Attention QKV | 7,077,888 |
| Attention Proj | 2,362,368 |
| Rotary Embedding | 48 |
| FFN Layer 1 | 9,443,328 |
| FFN Layer 2 | 9,443,328 |
| AdaLN Modulation | 14,155,776 |
| Layer Norms | 0 (no learnable params) |
| **Per Block Total** | **38,654,656** |

## Data Flow

```
1. Text Input (batch, 256 tokens)
   ↓
2. Text Encoder (6 transformer layers)
   ↓
3. Text Features (batch, 256, 768) → Pool → (batch, 768)
   ↓
4. Project to 1536 dim → (batch, 1536)
   ↓
5. Add Timestep Embedding → Conditioning (batch, 1536)
   ↓
6. Video Input (batch, 3, 16, 256, 256)
   ↓
7. 3D Patch Embed → (batch, 2048, 1536)
   ↓
8. Add Position Embedding
   ↓
9. 24× DiT Blocks (with conditioning)
   ↓
10. Final Layer + AdaLN
   ↓
11. Unpatchify
   ↓
12. Output: Predicted Noise (batch, 3, 16, 256, 256)
```

## Memory Requirements

### Model Weights
- FP32: ~4.0 GB
- FP16: ~2.0 GB
- INT8: ~1.0 GB

### Activations (per sample, 256×256×16)
- Forward pass: ~8 GB (FP16)
- Backward pass: ~16 GB (FP16)

### Training (batch_size=2, FP16, gradient accumulation=8)
- Model: 2 GB
- Optimizer states (AdamW): 4 GB
- Gradients: 2 GB
- Activations: 16 GB
- **Total: ~24 GB per GPU**

### Inference (batch_size=1, FP16)
- Model: 2 GB
- Activations: 4 GB
- **Total: ~6 GB**

## Computational Complexity

### FLOPs per forward pass (approximate)
- Text Encoder: ~10 GFLOPs
- Patch Embedding: ~5 GFLOPs
- DiT Blocks (24×): ~4,800 GFLOPs
- Unpatchify: ~1 GFLOP
- **Total: ~4,816 GFLOPs per video**

### Training Speed Estimates
- Single A100 80GB: ~2-3 seconds per batch (batch_size=2)
- 8× A100 80GB: ~2-3 seconds per batch (batch_size=16)

### Inference Speed Estimates
- A100 80GB (50 denoising steps): ~15-20 seconds per video
- RTX 4090 (50 denoising steps): ~25-35 seconds per video

## Diffusion Scheduler

### DDPM (Denoising Diffusion Probabilistic Model)
- Training steps: 1000
- Beta schedule: Linear (0.0001 → 0.02)
- Loss: MSE between predicted and actual noise
- Sampling: Iterative denoising from T=999 to T=0

### Classifier-Free Guidance
- Unconditional dropout during training: 10%
- Guidance scale at inference: 7.5 (typical)
- Formula: `noise_pred = noise_uncond + guidance_scale × (noise_cond - noise_uncond)`

## Key Features

1. **3D Spatiotemporal Attention**
   - Full attention across time, height, and width
   - Captures motion dynamics and spatial relationships
2. **Rotary Position Embeddings**
   - Applied to the temporal dimension
   - Better sequence modeling than learned embeddings
3. **Adaptive Layer Normalization**
   - Conditions on text and timestep
   - Allows flexible control over generation
4. **Efficient Design**
   - Patch-based processing reduces sequence length
   - Mixed precision training support
   - Gradient checkpointing compatible

## Comparison with Other Models

| Model | Parameters | Resolution | Frames | Architecture |
|-------|------------|------------|--------|--------------|
| TTV-1B (ours) | 1.0B | 256×256 | 16 | DiT |
| Stable Video Diffusion | 1.7B | 512×512 | 25 | U-Net |
| Make-A-Video | 9.7B | 256×256 | 16 | U-Net |
| Imagen Video | 11B | 1280×768 | 128 | U-Net Cascade |

## Optimization Techniques

1. **Mixed Precision (FP16)**
   - Reduces memory by 50%
   - Faster computation on modern GPUs
2. **Gradient Accumulation**
   - Enables larger effective batch sizes
   - Improves training stability
3. **Gradient Checkpointing**
   - Trades computation for memory
   - Enables larger batch sizes
4. **Flash Attention**
   - O(N) memory instead of O(N²)
   - Faster attention computation

## Future Enhancements

1. **Higher Resolution**: 512×512 or 1024×1024
2. **Longer Videos**: 64 or 128 frames
3. **Better Text Encoding**: CLIP or T5
4. **Temporal Super-Resolution**: Increase frame rate
5. **Motion Control**: Add motion guidance
6. **Video Editing**: Inpainting, style transfer
7. **LoRA Fine-tuning**: Efficient adaptation
8. **Distillation**: Smaller, faster variants
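The patch-embedding and unpatchify arithmetic described above can be sanity-checked with a minimal NumPy sketch. Shapes and patch sizes are taken from the spec; the particular reshape/transpose ordering is one common convention, not necessarily the model's exact implementation, and the Conv3D projection to 1536 channels is omitted so the round-trip is exact:

```python
import numpy as np

# Shapes from the spec: 16-frame 256x256 RGB video, patch size (2, 16, 16).
C, T, H, W = 3, 16, 256, 256
pt, ph, pw = 2, 16, 16

video = np.random.rand(C, T, H, W).astype(np.float32)

# Patchify: split each spatiotemporal axis into (num_patches, patch_size)
# pairs, then flatten each patch's contents into the feature dimension.
x = video.reshape(C, T // pt, pt, H // ph, ph, W // pw, pw)
x = x.transpose(1, 3, 5, 2, 4, 6, 0)       # (8, 16, 16, 2, 16, 16, 3)
patches = x.reshape(-1, pt * ph * pw * C)  # (2048, 1536)

# Unpatchify: invert the two steps above to recover the video tensor.
y = patches.reshape(T // pt, H // ph, W // pw, pt, ph, pw, C)
y = y.transpose(6, 0, 3, 1, 4, 2, 5).reshape(C, T, H, W)

assert patches.shape == (2048, 1536)       # 8*16*16 patches, 2*16*16*3 values each
assert np.array_equal(video, y)            # exact round-trip
```

This makes the sequence-length math concrete: 2048 tokens of width 1536 carry exactly the 3×16×256×256 values of the input video.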
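The classifier-free guidance formula from the Diffusion Scheduler section translates directly into code. This is an illustrative sketch (the function name `cfg_noise` is ours, not part of the model's API):

```python
import numpy as np

def cfg_noise(noise_uncond, noise_cond, guidance_scale=7.5):
    """Combine unconditional and conditional noise predictions by
    extrapolating from the unconditional toward the conditional one."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# At guidance_scale = 1.0 the result is exactly the conditional prediction;
# larger scales push samples further toward the text condition.
uncond = np.zeros((3, 16, 8, 8), dtype=np.float32)  # toy shapes for brevity
cond = np.ones_like(uncond)
guided = cfg_noise(uncond, cond, guidance_scale=7.5)
```

In practice both predictions come from the same forward pass by batching the conditional and null-text inputs together, which is why the 10% unconditional dropout during training is needed.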
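The DDPM settings above (1000 steps, linear betas from 0.0001 to 0.02) fix the noise level at every timestep through the cumulative product of alphas. A short sketch, assuming the standard DDPM forward process (the helper `add_noise` is illustrative):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule, as specified
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)  # signal retained after t steps

def add_noise(x0, noise, t):
    """Forward process q(x_t | x_0): mix clean data and Gaussian noise
    according to the cumulative signal level at timestep t."""
    a = alphas_cumprod[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise
```

Training draws a random `t`, noises a clean sample with `add_noise`, and applies the MSE loss between the model's output and the injected noise; sampling runs the learned reverse steps from T=999 down to T=0.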