# TTV-1B Model Architecture Specification
## Model Summary
**Name:** TTV-1B (Text-to-Video 1 Billion)
**Type:** Diffusion Transformer for Text-to-Video Generation
**Total Parameters:** 1,003,147,264 (~1.0 Billion)
## Architecture Components
### 1. Text Encoder (50M parameters)
```
Input: Text tokens (batch_size, 256)
Architecture:
- Token Embedding: 50,257 vocab β†’ 768 dim
- Position Embedding: 256 positions β†’ 768 dim
- 6 Transformer Layers:
* Multi-head Attention (12 heads)
* Feed-forward (768 β†’ 3072 β†’ 768)
* Layer Normalization
Output: Text features (batch_size, 256, 768)
```
### 2. Text Projection Layer
```
Linear: 768 β†’ 1536 dimensions
Purpose: Project text features to model hidden dimension
```
### 3. 3D Patch Embedding
```
Input: Video (batch_size, 3, 16, 256, 256)
Patch size: (2, 16, 16) - temporal Γ— height Γ— width
Conv3D: 3 channels β†’ 1536 channels
Output: (batch_size, 2048, 1536) where 2048 = (16/2) × (256/16) × (256/16)
                                             = 8 × 16 × 16
```
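The patch-grid arithmetic can be sanity-checked directly; with a (2, 16, 16) patch over a 16-frame 256×256 clip, the grid works out to 8 × 16 × 16 = 2048 tokens:

```python
# Token count produced by the 3D patch embedding for one video clip.
frames, height, width = 16, 256, 256   # input video shape (T, H, W)
pt, ph, pw = 2, 16, 16                 # patch size (temporal, height, width)

tokens_t = frames // pt                # 8 temporal patches
tokens_h = height // ph                # 16 patches along height
tokens_w = width // pw                 # 16 patches along width

num_tokens = tokens_t * tokens_h * tokens_w   # 8 * 16 * 16 = 2048
values_per_patch = pt * ph * pw * 3           # 2 * 16 * 16 * 3 = 1536 raw values per patch
```

Note that tokens × values-per-token equals the clip's total element count (3 × 16 × 256 × 256), so the patchify/unpatchify pair is lossless.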
### 4. Positional Embedding
```
Learnable position embeddings for all 2048 patch tokens
Shape: (1, 2048, 1536)
```
### 5. Timestep Embedding
```
Sinusoidal timestep encoding β†’ Linear(1536, 6144) β†’ SiLU β†’ Linear(6144, 1536)
Output: Conditioning vector (batch_size, 1536)
```
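A minimal sketch of the sinusoidal encoding that feeds the MLP above (the function name and `max_period` value are illustrative; the two Linear layers then map this 1536-dim vector as described):

```python
import math

def timestep_embedding(t, dim=1536, max_period=10000):
    """Standard sinusoidal encoding: first half sines, second half cosines,
    with log-spaced frequencies, as in DDPM-style timestep conditioning."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [t * f for f in freqs]
    return [math.sin(a) for a in args] + [math.cos(a) for a in args]

emb = timestep_embedding(500)   # 1536-dim encoding for diffusion step t=500
```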
### 6. DiT Blocks (24 layers, ~928M parameters)
Each block contains:
#### a) 3D Spatiotemporal Attention
```
- Query, Key, Value projections: Linear(1536, 4608)
- 24 attention heads (64 dimensions each)
- Rotary position embeddings on temporal dimension
- Scaled dot-product attention
- Output projection: Linear(1536, 1536)
```
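Rotary embeddings rotate consecutive feature pairs of each 64-dim head by a position-dependent angle. A minimal pure-Python sketch, assuming the standard RoPE frequency scheme (base 10000):

```python
import math

def apply_rope(vec, pos, base=10000):
    """Rotate consecutive (even, odd) feature pairs of one head's vector
    by an angle that grows with temporal position `pos`."""
    dim = len(vec)                     # head dimension, 64 in this model
    out = [0.0] * dim
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out

q = [1.0] * 64                 # toy 64-dim query for one attention head
q_rot = apply_rope(q, pos=3)   # query as seen from temporal position 3
```

Because each step is a pure rotation, vector norms are preserved, and the dot product between two rotated vectors depends only on their relative temporal offset, which is what makes RoPE attractive for sequence modeling.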
#### b) Feed-Forward Network
```
- Linear: 1536 β†’ 6144 (4x expansion)
- GELU activation
- Linear: 6144 β†’ 1536
```
#### c) Adaptive Layer Normalization (AdaLN)
```
- Modulation network: SiLU β†’ Linear(1536, 9216)
- Generates 6 modulation parameters:
* scale_msa, shift_msa, gate_msa (for attention)
* scale_mlp, shift_mlp, gate_mlp (for FFN)
```
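The 9216-dim output of the modulation network is just six 1536-dim chunks. A sketch of the split and the usual DiT-style modulation `x * (1 + scale) + shift` (the ordering of the six chunks is an assumption; it varies between implementations):

```python
HIDDEN = 1536

def split_modulation(mod):
    """Split the 6*1536 = 9216-dim modulation vector into six named chunks.
    Chunk order here is assumed, not taken from the reference code."""
    assert len(mod) == 6 * HIDDEN
    names = ["scale_msa", "shift_msa", "gate_msa",
             "scale_mlp", "shift_mlp", "gate_mlp"]
    chunks = [mod[i * HIDDEN:(i + 1) * HIDDEN] for i in range(6)]
    return dict(zip(names, chunks))

def modulate(x, shift, scale):
    # Elementwise x * (1 + scale) + shift, applied after LayerNorm
    return [xi * (1.0 + sc) + sh for xi, sc, sh in zip(x, scale, shift)]

mod = split_modulation([0.0] * (6 * HIDDEN))   # toy all-zero modulation
x = [1.0] * HIDDEN
y = modulate(x, mod["shift_msa"], mod["scale_msa"])   # zeros leave x unchanged
```

The gate chunks scale each branch's residual contribution, which lets the model smoothly switch sub-layers on or off as a function of the timestep and text conditioning.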
### 7. Final Layer
```
- Adaptive LayerNorm
- Linear: 1536 β†’ 1536 (2Γ—16Γ—16Γ—3 values per patch)
Purpose: Map back to patch space
```
### 8. Unpatchify
```
Reshape patches back to video
(batch_size, 2048, 1536) β†’ (batch_size, 3, 16, 256, 256)
```
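An illustrative nested-loop version of the unpatchify reshape on a tiny example (a real implementation is a single tensor reshape/permute; the value ordering within a token is an assumption):

```python
def unpatchify(tokens, grid, patch, channels=3):
    """Reassemble (num_tokens, patch_values) back into (C, T, H, W) nested lists.
    grid = (gt, gh, gw) patch grid; patch = (pt, ph, pw) patch size."""
    gt, gh, gw = grid
    pt, ph, pw = patch
    T, H, W = gt * pt, gh * ph, gw * pw
    video = [[[[0.0] * W for _ in range(H)] for _ in range(T)]
             for _ in range(channels)]
    for n, tok in enumerate(tokens):
        # Token index -> patch coordinates in the (gt, gh, gw) grid
        it, ih, iw = n // (gh * gw), (n // gw) % gh, n % gw
        k = 0   # read the token's values in (t, h, w, c) order (assumed)
        for dt in range(pt):
            for dh in range(ph):
                for dw in range(pw):
                    for c in range(channels):
                        video[c][it * pt + dt][ih * ph + dh][iw * pw + dw] = tok[k]
                        k += 1
    return video

# Toy example: a 2x2x2 grid of (1, 2, 2) patches, single channel.
# Each token is filled with its own index so placement is easy to inspect.
vid = unpatchify([[float(n)] * 4 for n in range(8)],
                 grid=(2, 2, 2), patch=(1, 2, 2), channels=1)
```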
## Parameter Breakdown
| Component | Parameters | Percentage |
|-----------|------------|------------|
| Text Encoder | 50,331,648 | 5.0% |
| Text Projection | 1,180,416 | 0.1% |
| Patch Embedding | 589,824 | 0.1% |
| Position Embedding | 196,608 | 0.02% |
| Timestep Embedding | 14,157,312 | 1.4% |
| DiT Blocks (24Γ—) | 927,711,744 | 92.5% |
| Final Layer | 8,979,712 | 0.9% |
| **Total** | **1,003,147,264** | **100%** |
## Per-Block Parameters (DiT)
Each of 24 DiT blocks contains ~38.7M parameters:
| Sub-component | Parameters |
|---------------|------------|
| Attention QKV | 7,077,888 |
| Attention Proj | 2,362,368 |
| Rotary Embedding | 48 |
| FFN Layer 1 | 9,443,328 |
| FFN Layer 2 | 9,443,328 |
| AdaLN Modulation | 14,155,776 |
| Layer Norms | 0 (no learnable params) |
| **Per Block Total** | **38,654,656** |
## Data Flow
```
1. Text Input (batch, 256 tokens)
↓
2. Text Encoder (6 transformer layers)
↓
3. Text Features (batch, 256, 768) β†’ Pool β†’ (batch, 768)
↓
4. Project to 1536 dim β†’ (batch, 1536)
↓
5. Add Timestep Embedding β†’ Conditioning (batch, 1536)
↓
6. Video Input (batch, 3, 16, 256, 256)
↓
7. 3D Patch Embed β†’ (batch, 2048, 1536)
↓
8. Add Position Embedding
↓
9. 24Γ— DiT Blocks (with conditioning)
↓
10. Final Layer + AdaLN
↓
11. Unpatchify
↓
12. Output: Predicted Noise (batch, 3, 16, 256, 256)
```
## Memory Requirements
### Model Weights
- FP32: ~4.0 GB
- FP16: ~2.0 GB
- INT8: ~1.0 GB
### Activations (per sample, 256Γ—256Γ—16)
- Forward pass: ~8 GB (FP16)
- Backward pass: ~16 GB (FP16)
### Training (batch_size=2, FP16, gradient accumulation=8)
- Model: 2 GB
- Optimizer states (AdamW): 4 GB
- Gradients: 2 GB
- Activations: 16 GB
- **Total: ~24 GB per GPU**
### Inference (batch_size=1, FP16)
- Model: 2 GB
- Activations: 4 GB
- **Total: ~6 GB**
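The weight-memory figures above follow directly from the parameter count (the rounding to "4 / 2 / 1 GB" is approximate; the exact GiB values are slightly lower):

```python
# Back-of-envelope weight memory from the stated parameter count.
params = 1_003_147_264

def weight_gib(n_params, bytes_per_param):
    """Raw weight storage in GiB for a given precision."""
    return n_params * bytes_per_param / 2**30

fp32 = weight_gib(params, 4)   # ~3.7 GiB, rounded to ~4 GB above
fp16 = weight_gib(params, 2)   # ~1.9 GiB
int8 = weight_gib(params, 1)   # ~0.9 GiB
```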
## Computational Complexity
### FLOPs per forward pass (approximate)
- Text Encoder: ~10 GFLOPs
- Patch Embedding: ~5 GFLOPs
- DiT Blocks (24Γ—): ~4,800 GFLOPs
- Unpatchify: ~1 GFLOP
- **Total: ~4,816 GFLOPs per video**
### Training Speed Estimates
- Single A100 80GB: ~2-3 seconds per batch (batch_size=2)
- 8Γ— A100 80GB: ~2-3 seconds per batch (batch_size=16)
### Inference Speed Estimates
- A100 80GB (50 denoising steps): ~15-20 seconds per video
- RTX 4090 (50 denoising steps): ~25-35 seconds per video
## Diffusion Scheduler
### DDPM (Denoising Diffusion Probabilistic Model)
- Training steps: 1000
- Beta schedule: Linear (0.0001 β†’ 0.02)
- Loss: MSE between predicted and actual noise
- Sampling: Iterative denoising from T=999 to T=0
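The schedule above can be written out directly; a sketch using the stated hyperparameters (1000 steps, linear betas from 0.0001 to 0.02) plus the derived cumulative-product terms used to noise samples during training:

```python
# Linear beta schedule and the derived alpha-bar terms.
T = 1000
beta_start, beta_end = 1e-4, 0.02
betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

# Cumulative product of alphas:
#   x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)
```

By t = 999 the cumulative product is nearly zero, i.e. the sample is almost pure noise, which is what lets sampling start from Gaussian noise.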
### Classifier-Free Guidance
- Unconditional dropout during training: 10%
- Guidance scale at inference: 7.5 (typical)
- Formula: `noise_pred = noise_uncond + guidance_scale Γ— (noise_cond - noise_uncond)`
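The guidance formula applied elementwise to the two noise predictions (toy vectors stand in for the real prediction tensors):

```python
# Classifier-free guidance: push the conditional prediction away from
# the unconditional one by the guidance scale.
def cfg(noise_uncond, noise_cond, guidance_scale=7.5):
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_uncond, noise_cond)]

out = cfg([0.0, 1.0], [1.0, 1.0])   # amplifies only where the predictions differ
```

At `guidance_scale=1.0` this reduces to the plain conditional prediction; larger scales trade diversity for stronger prompt adherence.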
## Key Features
1. **3D Spatiotemporal Attention**
- Full attention across time, height, and width
- Captures motion dynamics and spatial relationships
2. **Rotary Position Embeddings**
- Applied to temporal dimension
- Better sequence modeling than learned embeddings
3. **Adaptive Layer Normalization**
- Conditions on text and timestep
- Allows flexible control over generation
4. **Efficient Design**
- Patch-based processing reduces sequence length
- Mixed precision training support
- Gradient checkpointing compatible
## Comparison with Other Models
| Model | Parameters | Resolution | Frames | Architecture |
|-------|------------|------------|--------|--------------|
| TTV-1B (ours) | 1.0B | 256Γ—256 | 16 | DiT |
| Stable Video Diffusion | 1.7B | 512Γ—512 | 25 | U-Net |
| Make-A-Video | 9.7B | 256Γ—256 | 16 | U-Net |
| Imagen Video | 11B | 1280Γ—768 | 128 | U-Net Cascade |
## Optimization Techniques
1. **Mixed Precision (FP16)**
- Reduces memory by 50%
- Faster computation on modern GPUs
2. **Gradient Accumulation**
- Enables larger effective batch sizes
- Improves training stability
3. **Gradient Checkpointing**
- Trades computation for memory
- Enables larger batch sizes
4. **Flash Attention**
- O(N) memory instead of O(NΒ²)
- Faster attention computation
## Future Enhancements
1. **Higher Resolution**: 512Γ—512 or 1024Γ—1024
2. **Longer Videos**: 64 or 128 frames
3. **Better Text Encoding**: CLIP or T5
4. **Temporal Super-Resolution**: Increase frame rate
5. **Motion Control**: Add motion guidance
6. **Video Editing**: Inpainting, style transfer
7. **LoRA Fine-tuning**: Efficient adaptation
8. **Distillation**: Smaller, faster variants