TTV-1B Model Architecture Specification
Model Summary
Name: TTV-1B (Text-to-Video 1 Billion)
Type: Diffusion Transformer for Text-to-Video Generation
Total Parameters: 1,003,147,264 (~1.0 Billion)
Architecture Components
1. Text Encoder (50M parameters)
Input: Text tokens (batch_size, 256)
Architecture:
- Token Embedding: 50,257 vocab → 768 dim
- Position Embedding: 256 positions → 768 dim
- 6 Transformer Layers:
* Multi-head Attention (12 heads)
* Feed-forward (768 → 3072 → 768)
* Layer Normalization
Output: Text features (batch_size, 256, 768)
2. Text Projection Layer
Linear: 768 → 1536 dimensions
Purpose: Project text features to model hidden dimension
3. 3D Patch Embedding
Input: Video (batch_size, 3, 16, 256, 256)
Patch size: (2, 16, 16) - temporal × height × width
Conv3D: 3 channels → 1536 channels
Output: (batch_size, 2048, 1536) where 2048 = (16/2) × (256/16) × (256/16)
= 8 × 16 × 16
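The patch-grid arithmetic can be checked directly; a minimal sketch using the patch size and input shape stated above:

```python
# Sketch: token count implied by the 3D patch embedding
# (patch size (2, 16, 16) over a 3 x 16 x 256 x 256 video).
frames, height, width = 16, 256, 256
pt, ph, pw = 2, 16, 16
num_patches = (frames // pt) * (height // ph) * (width // pw)
print(num_patches)  # 8 * 16 * 16 = 2048
```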
4. Positional Embedding
Learnable position embeddings for 2048 patches
Shape: (1, 2048, 1536)
5. Timestep Embedding
Sinusoidal timestep encoding → Linear(1536, 6144) → SiLU → Linear(6144, 1536)
Output: Conditioning vector (batch_size, 1536)
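The sinusoidal encoding step can be sketched as follows; `timestep_embedding` is a hypothetical helper name, and the Linear → SiLU → Linear MLP that follows it is omitted:

```python
import math

def timestep_embedding(t, dim=1536, max_period=10000):
    # Standard sinusoidal encoding: half sines, half cosines over
    # geometrically spaced frequencies (sketch; the MLP head is omitted).
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = timestep_embedding(500)
print(len(emb))  # 1536
```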
6. DiT Blocks (24 layers, 950M parameters)
Each block contains:
a) 3D Spatiotemporal Attention
- Query, Key, Value projections: Linear(1536, 4608)
- 24 attention heads (64 dimensions each)
- Rotary position embeddings on temporal dimension
- Scaled dot-product attention
- Output projection: Linear(1536, 1536)
b) Feed-Forward Network
- Linear: 1536 → 6144 (4x expansion)
- GELU activation
- Linear: 6144 → 1536
c) Adaptive Layer Normalization (AdaLN)
- Modulation network: SiLU → Linear(1536, 9216)
- Generates six 1536-dim modulation vectors:
* scale_msa, shift_msa, gate_msa (for attention)
* scale_mlp, shift_mlp, gate_mlp (for FFN)
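The split of the 9216-dim modulation output into those six vectors, and the way they are typically applied, can be sketched in plain Python (values here are stand-ins for the real SiLU → Linear(1536, 9216) output):

```python
# Sketch: chunk the 6 * 1536 modulation output into six AdaLN vectors.
hidden = 1536
modulation = [0.0] * (6 * hidden)      # stand-in for the network output
chunks = [modulation[i * hidden:(i + 1) * hidden] for i in range(6)]
scale_msa, shift_msa, gate_msa, scale_mlp, shift_mlp, gate_mlp = chunks

# Typical use inside a block (elementwise, per token x):
#   h = norm(x) * (1 + scale_msa) + shift_msa
#   x = x + gate_msa * attention(h)
```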
7. Final Layer
- Adaptive LayerNorm
- Linear: 1536 → 1536 (= 2×16×16×3, the per-patch output size)
Purpose: Map back to patch space
8. Unpatchify
Reshape patches back to video
(batch_size, 2048, 1536) → (batch_size, 3, 16, 256, 256)
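As a quick consistency check (a sketch derived from the patch size and resolution stated earlier), the flattened patches contain exactly as many elements as the output video:

```python
# Elements per patch: 2 frames x 16 x 16 pixels x 3 channels.
patch_dim = 2 * 16 * 16 * 3                              # 1536
num_patches = (16 // 2) * (256 // 16) * (256 // 16)      # 2048
video_elems = 3 * 16 * 256 * 256
print(num_patches * patch_dim == video_elems)  # True
```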
Parameter Breakdown
| Component | Parameters | Percentage |
|---|---|---|
| Text Encoder | 50,331,648 | 5.0% |
| Text Projection | 1,180,416 | 0.1% |
| Patch Embedding | 589,824 | 0.1% |
| Position Embedding | 196,608 | 0.02% |
| Timestep Embedding | 14,157,312 | 1.4% |
| DiT Blocks (24×) | 927,711,744 | 92.5% |
| Final Layer | 8,979,712 | 0.9% |
| Total | 1,003,147,264 | 100% |
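The component counts above can be cross-checked; they sum exactly to the stated total:

```python
# Rows from the parameter breakdown table above.
components = {
    "Text Encoder": 50_331_648,
    "Text Projection": 1_180_416,
    "Patch Embedding": 589_824,
    "Position Embedding": 196_608,
    "Timestep Embedding": 14_157_312,
    "DiT Blocks (24x)": 927_711_744,
    "Final Layer": 8_979_712,
}
print(sum(components.values()))  # 1003147264
```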
Per-Block Parameters (DiT)
Each of 24 DiT blocks contains ~38.7M parameters:
| Sub-component | Parameters |
|---|---|
| Attention QKV | 7,077,888 |
| Attention Proj | 2,362,368 |
| Rotary Embedding | 48 |
| FFN Layer 1 | 9,443,328 |
| FFN Layer 2 | 9,443,328 |
| AdaLN Modulation | 14,155,776 |
| Layer Norms | 0 (no learnable params) |
| Per Block Total | 38,654,656 |
Data Flow
1. Text Input (batch, 256 tokens)
↓
2. Text Encoder (6 transformer layers)
↓
3. Text Features (batch, 256, 768) → Pool → (batch, 768)
↓
4. Project to 1536 dim → (batch, 1536)
↓
5. Add Timestep Embedding → Conditioning (batch, 1536)
↓
6. Video Input (batch, 3, 16, 256, 256)
↓
7. 3D Patch Embed → (batch, 2048, 1536)
↓
8. Add Position Embedding
↓
9. 24× DiT Blocks (with conditioning)
↓
10. Final Layer + AdaLN
↓
11. Unpatchify
↓
12. Output: Predicted Noise (batch, 3, 16, 256, 256)
Memory Requirements
Model Weights
- FP32: ~4.0 GB
- FP16: ~2.0 GB
- INT8: ~1.0 GB
Activations (per sample, 256×256×16)
- Forward pass: ~8 GB (FP16)
- Backward pass: ~16 GB (FP16)
Training (batch_size=2, FP16, gradient accumulation=8)
- Model: 2 GB
- Optimizer states (AdamW): 4 GB
- Gradients: 2 GB
- Activations: 16 GB
- Total: ~24 GB per GPU
Inference (batch_size=1, FP16)
- Model: 2 GB
- Activations: 4 GB
- Total: ~6 GB
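The weight-memory figures follow directly from the parameter count; a sketch using decimal gigabytes, as in the numbers above:

```python
# Model-weight memory at different precisions (decimal GB).
params = 1_003_147_264
for precision, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    gb = params * bytes_per_param / 1e9
    print(f"{precision}: ~{gb:.1f} GB")
```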
Computational Complexity
FLOPs per forward pass (approximate)
- Text Encoder: ~10 GFLOPs
- Patch Embedding: ~5 GFLOPs
- DiT Blocks (24×): ~4,800 GFLOPs
- Unpatchify: ~1 GFLOP
- Total: ~4,816 GFLOPs per video
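A back-of-envelope estimate for the DiT stack (a sketch, not a measurement): it assumes 2 FLOPs per weight per token for the matmuls, a quadratic attention term, the per-block parameter count from the table below, and a token count derived from the stated patch size. It lands in the same range as the figure quoted above:

```python
# Rough forward-pass FLOPs for the 24 DiT blocks.
layers, tokens, dim = 24, 2048, 1536
params_per_block = 38_654_656
matmul_flops = 2 * params_per_block * tokens * layers   # weight matmuls
attn_flops = 4 * tokens ** 2 * dim * layers             # QK^T and attn @ V
total_gflops = (matmul_flops + attn_flops) / 1e9
print(round(total_gflops))
```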
Training Speed Estimates
- Single A100 80GB: ~2-3 seconds per batch (batch_size=2)
- 8× A100 80GB: ~2-3 seconds per batch (batch_size=16)
Inference Speed Estimates
- A100 80GB (50 denoising steps): ~15-20 seconds per video
- RTX 4090 (50 denoising steps): ~25-35 seconds per video
Diffusion Scheduler
DDPM (Denoising Diffusion Probabilistic Model)
- Training steps: 1000
- Beta schedule: Linear (0.0001 → 0.02)
- Loss: MSE between predicted and actual noise
- Sampling: Iterative denoising from T=999 to T=0
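The linear schedule can be sketched directly from the stated endpoints (a dependency-free sketch; real implementations typically precompute these as tensors):

```python
# Linear beta schedule over 1000 training steps.
T = 1000
beta_start, beta_end = 1e-4, 0.02
betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

# Cumulative product of alphas, used to noise a clean sample x0 at step t:
#   x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise
alphas_cumprod, acc = [], 1.0
for b in betas:
    acc *= 1.0 - b
    alphas_cumprod.append(acc)
```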
Classifier-Free Guidance
- Unconditional dropout during training: 10%
- Guidance scale at inference: 7.5 (typical)
- Formula:
noise_pred = noise_uncond + guidance_scale × (noise_cond - noise_uncond)
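Applied elementwise over the two noise predictions, the formula looks like this (a sketch on plain lists; `apply_cfg` is a hypothetical helper name):

```python
def apply_cfg(noise_uncond, noise_cond, guidance_scale=7.5):
    # noise_uncond + s * (noise_cond - noise_uncond), elementwise.
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_uncond, noise_cond)]

print(apply_cfg([0.0, 1.0], [1.0, 1.0]))  # [7.5, 1.0]
```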
Key Features
3D Spatiotemporal Attention
- Full attention across time, height, and width
- Captures motion dynamics and spatial relationships
Rotary Position Embeddings
- Applied to temporal dimension
- Better sequence modeling than learned embeddings
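Rotary embeddings rotate (even, odd) feature pairs by position-dependent angles; a minimal sketch using the usual base-10000 frequency scheme (`rope_rotate` is a hypothetical helper, not part of the model code):

```python
import math

def rope_rotate(pairs, pos, base=10000.0):
    # Rotate each (even, odd) feature pair by angle pos * base^(-2i/d).
    d = 2 * len(pairs)
    out = []
    for i, (a, b) in enumerate(pairs):
        theta = pos * base ** (-2.0 * i / d)
        out.append((a * math.cos(theta) - b * math.sin(theta),
                    a * math.sin(theta) + b * math.cos(theta)))
    return out

rotated = rope_rotate([(1.0, 0.0), (0.5, 0.5)], pos=3)
```

Because each pair is only rotated, vector norms are preserved, and query-key dot products end up depending on relative rather than absolute position.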
Adaptive Layer Normalization
- Conditions on text and timestep
- Allows flexible control over generation
Efficient Design
- Patch-based processing reduces sequence length
- Mixed precision training support
- Gradient checkpointing compatible
Comparison with Other Models
| Model | Parameters | Resolution | Frames | Architecture |
|---|---|---|---|---|
| TTV-1B (ours) | 1.0B | 256×256 | 16 | DiT |
| Stable Video Diffusion | 1.7B | 512×512 | 25 | U-Net |
| Make-A-Video | 9.7B | 256×256 | 16 | U-Net |
| Imagen Video | 11B | 1280×768 | 128 | U-Net Cascade |
Optimization Techniques
Mixed Precision (FP16)
- Reduces memory by 50%
- Faster computation on modern GPUs
Gradient Accumulation
- Enables larger effective batch sizes
- Improves training stability
Gradient Checkpointing
- Trades computation for memory
- Enables larger batch sizes
Flash Attention
- O(N) memory instead of O(N²)
- Faster attention computation
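The O(N²) term is concrete here; a sketch of the memory that fully materialized attention maps would otherwise need (token count assumed from the patch math above, head and layer counts from the DiT block spec):

```python
# FP16 attention maps if materialized: tokens^2 per head, per layer.
tokens, heads, layers, bytes_fp16 = 2048, 24, 24, 2
attn_map_bytes = tokens * tokens * heads * layers * bytes_fp16
print(round(attn_map_bytes / 1e9, 1))  # GB avoided per forward pass
```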
Future Enhancements
- Higher Resolution: 512×512 or 1024×1024
- Longer Videos: 64 or 128 frames
- Better Text Encoding: CLIP or T5
- Temporal Super-Resolution: Increase frame rate
- Motion Control: Add motion guidance
- Video Editing: Inpainting, style transfer
- LoRA Fine-tuning: Efficient adaptation
- Distillation: Smaller, faster variants