# TTV-1B Model Architecture Specification
## Model Summary
**Name:** TTV-1B (Text-to-Video 1 Billion)
**Type:** Diffusion Transformer for Text-to-Video Generation
**Total Parameters:** 1,003,147,264 (~1.0 billion)
## Architecture Components
### 1. Text Encoder (50M parameters)
```
Input: Text tokens (batch_size, 256)
Architecture:
- Token Embedding: 50,257 vocab → 768 dim
- Position Embedding: 256 positions → 768 dim
- 6 Transformer Layers:
  * Multi-head Attention (12 heads)
  * Feed-forward (768 → 3072 → 768)
  * Layer Normalization
Output: Text features (batch_size, 256, 768)
```
### 2. Text Projection Layer
```
Linear: 768 → 1536 dimensions
Purpose: Project text features to the model hidden dimension
```
### 3. 3D Patch Embedding
```
Input: Video (batch_size, 3, 16, 256, 256)
Patch size: (2, 16, 16) - temporal × height × width
Conv3D: 3 channels → 1536 channels
Output: (batch_size, 2048, 1536) where 2048 = (16/2) × (256/16) × (256/16)
                                              = 8 × 16 × 16
```
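The token count and per-patch dimensionality follow directly from the patch geometry; a quick pure-Python sanity check (variable names are illustrative):

```python
# Patch geometry of the 3D patch embedding.
frames, height, width = 16, 256, 256   # per-sample video, 3 channels
pt, ph, pw = 2, 16, 16                 # patch size: temporal, height, width

num_tokens = (frames // pt) * (height // ph) * (width // pw)
values_per_patch = 3 * pt * ph * pw    # RGB values covered by one patch

print(num_tokens)        # 2048 (= 8 * 16 * 16)
print(values_per_patch)  # 1536, matching the model hidden size
```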
### 4. Positional Embedding
```
Learnable position embeddings for the 2048 patch tokens
Shape: (1, 2048, 1536)
```
### 5. Timestep Embedding
```
Sinusoidal timestep encoding → Linear(1536, 6144) → SiLU → Linear(6144, 1536)
Output: Conditioning vector (batch_size, 1536)
```
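One common realization of the sinusoidal encoding feeding that MLP (a sketch; the spec does not fix the exact frequency convention, so `max_period` and the sin/cos layout here are assumptions):

```python
import math

def timestep_embedding(t: float, dim: int = 1536, max_period: float = 10000.0):
    """Sinusoidal timestep encoding: half sin, half cos over a geometric
    frequency ladder. The MLP (1536 -> 6144 -> SiLU -> 1536) consumes this."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = timestep_embedding(500)  # a 1536-dim vector, every entry in [-1, 1]
```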
### 6. DiT Blocks (24 layers, ~928M parameters)
Each block contains:
#### a) 3D Spatiotemporal Attention
```
- Query, Key, Value projections: Linear(1536, 4608)
- 24 attention heads (64 dimensions each)
- Rotary position embeddings on the temporal dimension
- Scaled dot-product attention
- Output projection: Linear(1536, 1536)
```
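The head arithmetic implied by those numbers (illustrative check):

```python
# Attention geometry: fused QKV width and per-head dimension.
hidden = 1536
num_heads = 24
head_dim = hidden // num_heads      # 64 dims per head, as stated
qkv_width = 3 * hidden              # 4608 = fused Q, K, V projection width
scale = head_dim ** -0.5            # scaled dot-product factor 1/sqrt(64)

qkv_params = hidden * qkv_width     # 7,077,888 weights (bias-free)
```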
#### b) Feed-Forward Network
```
- Linear: 1536 → 6144 (4× expansion)
- GELU activation
- Linear: 6144 → 1536
```
#### c) Adaptive Layer Normalization (AdaLN)
```
- Modulation network: SiLU → Linear(1536, 9216)
- Generates 6 modulation parameters:
  * scale_msa, shift_msa, gate_msa (for attention)
  * scale_mlp, shift_mlp, gate_mlp (for FFN)
```
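A minimal sketch of how the 9216-wide modulation output splits into six 1536-dim vectors and modulates the normalized activations (list-based pure Python; the chunk order is an assumption, DiT implementations vary):

```python
hidden = 1536
mod = [0.0] * (6 * hidden)  # output of SiLU -> Linear(1536, 9216), one sample

# Split into the six per-channel modulation vectors.
chunks = [mod[i * hidden:(i + 1) * hidden] for i in range(6)]
shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = chunks

def modulate(x, shift, scale):
    # AdaLN: scale and shift the (already normalized) activations per channel.
    return [xi * (1.0 + s) + sh for xi, s, sh in zip(x, scale, shift)]

x = [1.0] * hidden
y = modulate(x, shift_msa, scale_msa)  # zero-initialized modulation -> identity
```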
### 7. Final Layer
```
- Adaptive LayerNorm
- Linear: 1536 → 1536 (2 × 16 × 16 × 3 values per patch)
Purpose: Map back to patch space
```
### 8. Unpatchify
```
Reshape patches back to video
(batch_size, 2048, 1536) → (batch_size, 3, 16, 256, 256)
```
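A toy unpatchify on a deliberately tiny configuration, to make the index bookkeeping concrete (pure Python; the channel-major value layout inside each patch is an assumption, implementations differ):

```python
C, T, H, W = 3, 2, 4, 4        # tiny video: 3 channels, 2 frames, 4x4
pt, ph, pw = 1, 2, 2           # tiny patch; the real model uses (2, 16, 16)
nT, nH, nW = T // pt, H // ph, W // pw

def unpatchify(tokens):
    """tokens: list of nT*nH*nW patches, each a flat list of C*pt*ph*pw values."""
    video = [[[[0.0] * W for _ in range(H)] for _ in range(T)] for _ in range(C)]
    for idx, patch in enumerate(tokens):
        # Recover the patch's (temporal, row, column) grid position.
        it, ih, iw = idx // (nH * nW), (idx // nW) % nH, idx % nW
        k = 0
        for c in range(C):
            for dt in range(pt):
                for dh in range(ph):
                    for dw in range(pw):
                        video[c][it * pt + dt][ih * ph + dh][iw * pw + dw] = patch[k]
                        k += 1
    return video

# Filling patch i with the constant i makes the tiling visible.
video = unpatchify([[float(i)] * (C * pt * ph * pw) for i in range(nT * nH * nW)])
```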
## Parameter Breakdown
| Component | Parameters | Percentage |
|-----------|------------|------------|
| Text Encoder | 50,331,648 | 5.0% |
| Text Projection | 1,180,416 | 0.1% |
| Patch Embedding | 589,824 | 0.1% |
| Position Embedding | 196,608 | 0.02% |
| Timestep Embedding | 14,157,312 | 1.4% |
| DiT Blocks (24×) | 927,711,744 | 92.5% |
| Final Layer | 8,979,712 | 0.9% |
| **Total** | **1,003,147,264** | **100%** |
## Per-Block Parameters (DiT)
Each of the 24 DiT blocks contains ~38.7M parameters:
| Sub-component | Parameters |
|---------------|------------|
| Attention QKV | 7,077,888 |
| Attention Proj | 2,362,368 |
| Rotary Embedding | 48 |
| FFN Layer 1 | 9,443,328 |
| FFN Layer 2 | 9,443,328 |
| AdaLN Modulation | 14,155,776 |
| Layer Norms | 0 (no learnable params) |
| **Per Block Total** | **38,654,656** |
## Data Flow
```
1. Text Input (batch, 256 tokens)
        ↓
2. Text Encoder (6 transformer layers)
        ↓
3. Text Features (batch, 256, 768) → Pool → (batch, 768)
        ↓
4. Project to 1536 dim → (batch, 1536)
        ↓
5. Add Timestep Embedding → Conditioning (batch, 1536)
        ↓
6. Video Input (batch, 3, 16, 256, 256)
        ↓
7. 3D Patch Embed → (batch, 2048, 1536)
        ↓
8. Add Position Embedding
        ↓
9. 24× DiT Blocks (with conditioning)
        ↓
10. Final Layer + AdaLN
        ↓
11. Unpatchify
        ↓
12. Output: Predicted Noise (batch, 3, 16, 256, 256)
```
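The same pipeline as a compact shape trace (tuples only, no tensors; `b` is the batch size):

```python
b = 2  # example batch size
shapes = {
    "text_tokens":   (b, 256),
    "text_features": (b, 256, 768),
    "pooled_text":   (b, 768),
    "conditioning":  (b, 1536),
    "video_in":      (b, 3, 16, 256, 256),
    "patch_tokens":  (b, (16 // 2) * (256 // 16) * (256 // 16), 1536),
}
# The model predicts noise of the same shape as its video input.
shapes["noise_out"] = shapes["video_in"]
```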
## Memory Requirements
### Model Weights
- FP32: ~4.0 GB
- FP16: ~2.0 GB
- INT8: ~1.0 GB
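These weight figures are just parameter count × bytes per element:

```python
params = 1_003_147_264
for name, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    print(f"{name}: {params * nbytes / 1e9:.1f} GB")
# FP32: 4.0 GB / FP16: 2.0 GB / INT8: 1.0 GB
```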
### Activations (per sample, 256×256×16)
- Forward pass: ~8 GB (FP16)
- Backward pass: ~16 GB (FP16)
### Training (batch_size=2, FP16, gradient accumulation=8)
- Model: 2 GB
- Optimizer states (AdamW): 4 GB
- Gradients: 2 GB
- Activations: 16 GB
- **Total: ~24 GB per GPU**
### Inference (batch_size=1, FP16)
- Model: 2 GB
- Activations: 4 GB
- **Total: ~6 GB**
## Computational Complexity
### FLOPs per forward pass (approximate)
- Text Encoder: ~10 GFLOPs
- Patch Embedding: ~5 GFLOPs
- DiT Blocks (24×): ~4,800 GFLOPs
- Unpatchify: ~1 GFLOP
- **Total: ~4,816 GFLOPs per video**
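A rough matmul-only estimate of the DiT-block cost, counting 2 FLOPs per multiply-add over the token count and layer widths above; it ignores AdaLN, softmax, and normalization, so it lands somewhat below the table's ~4,800 GFLOPs but in the same ballpark:

```python
hidden, ffn, layers = 1536, 6144, 24
tokens = (16 // 2) * (256 // 16) * (256 // 16)      # 2048 patch tokens

weights_per_block = (hidden * 3 * hidden            # fused QKV projection
                     + hidden * hidden              # attention output projection
                     + 2 * hidden * ffn)            # two FFN matmuls
matmul_flops = 2 * tokens * weights_per_block       # 2 FLOPs per MAC
attn_flops = 4 * tokens * tokens * hidden           # QK^T and attention x V
gflops = layers * (matmul_flops + attn_flops) / 1e9
print(f"{gflops:.0f} GFLOPs")                       # prints 3402 GFLOPs
```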
### Training Speed Estimates
- Single A100 80GB: ~2-3 seconds per batch (batch_size=2)
- 8× A100 80GB: ~2-3 seconds per batch (batch_size=16)
### Inference Speed Estimates
- A100 80GB (50 denoising steps): ~15-20 seconds per video
- RTX 4090 (50 denoising steps): ~25-35 seconds per video
## Diffusion Scheduler
### DDPM (Denoising Diffusion Probabilistic Model)
- Training steps: 1000
- Beta schedule: Linear (0.0001 → 0.02)
- Loss: MSE between predicted and actual noise
- Sampling: Iterative denoising from T=999 to T=0
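The schedule and its cumulative product, which give the forward-process coefficients, in a few lines (pure Python; linear interpolation over `T - 1` steps is an assumption about endpoint handling):

```python
T = 1000
beta_start, beta_end = 1e-4, 0.02
betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

alpha_bar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b          # cumulative product of (1 - beta_t)
    alpha_bar.append(prod)

# Forward process behind the MSE noise target:
#   x_t = sqrt(alpha_bar[t]) * x_0 + sqrt(1 - alpha_bar[t]) * noise
```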
### Classifier-Free Guidance
- Unconditional dropout during training: 10%
- Guidance scale at inference: 7.5 (typical)
- Formula: `noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)`
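The guidance formula applied elementwise (sketch; real code operates on full noise tensors from two forward passes, one with the text condition and one with it dropped):

```python
def apply_cfg(noise_uncond, noise_cond, guidance_scale=7.5):
    # noise_pred = noise_uncond + s * (noise_cond - noise_uncond)
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_uncond, noise_cond)]

# guidance_scale = 1.0 recovers the plain conditional prediction;
# larger scales push the sample harder toward the text condition.
pred = apply_cfg([0.0, 1.0], [1.0, 1.0], guidance_scale=7.5)  # [7.5, 1.0]
```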
## Key Features
1. **3D Spatiotemporal Attention**
   - Full attention across time, height, and width
   - Captures motion dynamics and spatial relationships
2. **Rotary Position Embeddings**
   - Applied to the temporal dimension
   - Generalize to sequence length better than learned embeddings
3. **Adaptive Layer Normalization**
   - Conditions on text and timestep
   - Allows flexible control over generation
4. **Efficient Design**
   - Patch-based processing reduces sequence length
   - Mixed-precision training support
   - Gradient-checkpointing compatible
## Comparison with Other Models
| Model | Parameters | Resolution | Frames | Architecture |
|-------|------------|------------|--------|--------------|
| TTV-1B (ours) | 1.0B | 256×256 | 16 | DiT |
| Stable Video Diffusion | 1.7B | 512×512 | 25 | U-Net |
| Make-A-Video | 9.7B | 256×256 | 16 | U-Net |
| Imagen Video | 11B | 1280×768 | 128 | U-Net Cascade |
## Optimization Techniques
1. **Mixed Precision (FP16)**
   - Reduces memory by 50%
   - Faster computation on modern GPUs
2. **Gradient Accumulation**
   - Enables larger effective batch sizes
   - Improves training stability
3. **Gradient Checkpointing**
   - Trades computation for memory
   - Enables larger batch sizes
4. **Flash Attention**
   - O(N) memory instead of O(N²)
   - Faster attention computation
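Gradient accumulation from the list above, as a framework-agnostic toy with scalar "gradients" (in the training config this is 8 microbatches of 2 for an effective batch of 16):

```python
accum_steps = 8
micro_grads = [0.1, -0.2, 0.3, 0.05, 0.0, 0.1, -0.1, 0.15]  # toy per-microbatch grads

accum, updates = 0.0, []
for step, g in enumerate(micro_grads):
    accum += g / accum_steps                  # average to keep the LR comparable
    if (step + 1) % accum_steps == 0:         # one optimizer step per 8 microbatches
        updates.append(accum)                 # here: optimizer step, then zero grads
        accum = 0.0
```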
## Future Enhancements
1. **Higher Resolution**: 512×512 or 1024×1024
2. **Longer Videos**: 64 or 128 frames
3. **Better Text Encoding**: CLIP or T5
4. **Temporal Super-Resolution**: Increase frame rate
5. **Motion Control**: Add motion guidance
6. **Video Editing**: Inpainting, style transfer
7. **LoRA Fine-tuning**: Efficient adaptation
8. **Distillation**: Smaller, faster variants