TTV-1B Model Architecture Specification
Model Summary
Name: TTV-1B (Text-to-Video 1 Billion)
Type: Diffusion Transformer for Text-to-Video Generation
Total Parameters: 1,003,147,264 (~1.0 Billion)
Architecture Components
1. Text Encoder (50M parameters)
Input: Text tokens (batch_size, 256)
Architecture:
- Token Embedding: 50,257 vocab → 768 dim
- Position Embedding: 256 positions → 768 dim
- 6 Transformer Layers:
* Multi-head Attention (12 heads)
* Feed-forward (768 → 3072 → 768)
* Layer Normalization
Output: Text features (batch_size, 256, 768)
2. Text Projection Layer
Linear: 768 → 1536 dimensions
Purpose: Project text features to model hidden dimension
3. 3D Patch Embedding
Input: Video (batch_size, 3, 16, 256, 256)
Patch size: (2, 16, 16) - temporal × height × width
Conv3D: 3 channels → 1536 channels
Output: (batch_size, 2048, 1536) where 2048 = (16/2) × (256/16) × (256/16)
= 8 × 16 × 16
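The patch-grid arithmetic can be checked directly; a minimal sketch using the patch size and input shape stated above:

```python
# Sketch: token count implied by the 3D patch embedding
# (patch size (2, 16, 16) over a 3 x 16 x 256 x 256 video).
frames, height, width = 16, 256, 256
pt, ph, pw = 2, 16, 16
num_patches = (frames // pt) * (height // ph) * (width // pw)
print(num_patches)  # 8 * 16 * 16 = 2048
```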
4. Positional Embedding
Learnable position embeddings for 2048 patches
Shape: (1, 2048, 1536)
5. Timestep Embedding
Sinusoidal timestep encoding → Linear(1536, 6144) → SiLU → Linear(6144, 1536)
Output: Conditioning vector (batch_size, 1536)
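The sinusoidal encoding step can be sketched as follows; `timestep_embedding` is a hypothetical helper name, and the Linear → SiLU → Linear MLP that follows it is omitted:

```python
import math

def timestep_embedding(t, dim=1536, max_period=10000):
    # Standard sinusoidal encoding: half sines, half cosines over
    # geometrically spaced frequencies (sketch; the MLP head is omitted).
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = timestep_embedding(500)
print(len(emb))  # 1536
```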
6. DiT Blocks (24 layers, 950M parameters)
Each block contains:
a) 3D Spatiotemporal Attention
- Query, Key, Value projections: Linear(1536, 4608)
- 24 attention heads (64 dimensions each)
- Rotary position embeddings on temporal dimension
- Scaled dot-product attention
- Output projection: Linear(1536, 1536)
b) Feed-Forward Network
- Linear: 1536 → 6144 (4x expansion)
- GELU activation
- Linear: 6144 → 1536
c) Adaptive Layer Normalization (AdaLN)
- Modulation network: SiLU → Linear(1536, 9216)
- Generates six 1536-dim modulation vectors:
* scale_msa, shift_msa, gate_msa (for attention)
* scale_mlp, shift_mlp, gate_mlp (for FFN)
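The split of the 9216-dim modulation output into those six vectors, and the way they are typically applied, can be sketched in plain Python (values here are stand-ins for the real SiLU → Linear(1536, 9216) output):

```python
# Sketch: chunk the 6 * 1536 modulation output into six AdaLN vectors.
hidden = 1536
modulation = [0.0] * (6 * hidden)      # stand-in for the network output
chunks = [modulation[i * hidden:(i + 1) * hidden] for i in range(6)]
scale_msa, shift_msa, gate_msa, scale_mlp, shift_mlp, gate_mlp = chunks

# Typical use inside a block (elementwise, per token x):
#   h = norm(x) * (1 + scale_msa) + shift_msa
#   x = x + gate_msa * attention(h)
```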
7. Final Layer
- Adaptive LayerNorm
- Linear: 1536 → 1536 (= 2×16×16×3, the per-patch output size)
Purpose: Map back to patch space
8. Unpatchify
Reshape patches back to video
(batch_size, 2048, 1536) → (batch_size, 3, 16, 256, 256)
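As a quick consistency check (a sketch derived from the patch size and resolution stated earlier), the flattened patches contain exactly as many elements as the output video:

```python
# Elements per patch: 2 frames x 16 x 16 pixels x 3 channels.
patch_dim = 2 * 16 * 16 * 3                              # 1536
num_patches = (16 // 2) * (256 // 16) * (256 // 16)      # 2048
video_elems = 3 * 16 * 256 * 256
print(num_patches * patch_dim == video_elems)  # True
```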
Parameter Breakdown
| Component | Parameters | Percentage |
|---|---|---|
| Text Encoder | 50,331,648 | 5.0% |
| Text Projection | 1,180,416 | 0.1% |
| Patch Embedding | 589,824 | 0.1% |
| Position Embedding | 196,608 | 0.02% |
| Timestep Embedding | 14,157,312 | 1.4% |
| DiT Blocks (24×) | 927,711,744 | 92.5% |
| Final Layer | 8,979,712 | 0.9% |
| Total | 1,003,147,264 | 100% |
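The component counts above can be cross-checked; they sum exactly to the stated total:

```python
# Rows from the parameter breakdown table above.
components = {
    "Text Encoder": 50_331_648,
    "Text Projection": 1_180_416,
    "Patch Embedding": 589_824,
    "Position Embedding": 196_608,
    "Timestep Embedding": 14_157_312,
    "DiT Blocks (24x)": 927_711_744,
    "Final Layer": 8_979_712,
}
print(sum(components.values()))  # 1003147264
```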
Per-Block Parameters (DiT)
Each of 24 DiT blocks contains ~38.7M parameters:
| Sub-component | Parameters |
|---|---|
| Attention QKV | 7,077,888 |
| Attention Proj | 2,362,368 |
| Rotary Embedding | 48 |
| FFN Layer 1 | 9,443,328 |
| FFN Layer 2 | 9,443,328 |
| AdaLN Modulation | 14,155,776 |
| Layer Norms | 0 (no learnable params) |
| Per Block Total | 38,654,656 |
Data Flow
1. Text Input (batch, 256 tokens)
↓
2. Text Encoder (6 transformer layers)
↓
3. Text Features (batch, 256, 768) → Pool → (batch, 768)
↓
4. Project to 1536 dim → (batch, 1536)
↓
5. Add Timestep Embedding → Conditioning (batch, 1536)
↓
6. Video Input (batch, 3, 16, 256, 256)
↓
7. 3D Patch Embed → (batch, 2048, 1536)
↓
8. Add Position Embedding
↓
9. 24× DiT Blocks (with conditioning)
↓
10. Final Layer + AdaLN
↓
11. Unpatchify
↓
12. Output: Predicted Noise (batch, 3, 16, 256, 256)
Memory Requirements
Model Weights
- FP32: ~4.0 GB
- FP16: ~2.0 GB
- INT8: ~1.0 GB
Activations (per sample, 256×256×16)
- Forward pass: ~8 GB (FP16)
- Backward pass: ~16 GB (FP16)
Training (batch_size=2, FP16, gradient accumulation=8)
- Model: 2 GB
- Optimizer states (AdamW): 4 GB
- Gradients: 2 GB
- Activations: 16 GB
- Total: ~24 GB per GPU
Inference (batch_size=1, FP16)
- Model: 2 GB
- Activations: 4 GB
- Total: ~6 GB
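The weight-memory figures follow directly from the parameter count; a sketch using decimal gigabytes, as in the numbers above:

```python
# Model-weight memory at different precisions (decimal GB).
params = 1_003_147_264
for precision, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    gb = params * bytes_per_param / 1e9
    print(f"{precision}: ~{gb:.1f} GB")
```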
Computational Complexity
FLOPs per forward pass (approximate)
- Text Encoder: ~10 GFLOPs
- Patch Embedding: ~5 GFLOPs
- DiT Blocks (24×): ~4,800 GFLOPs
- Unpatchify: ~1 GFLOP
- Total: ~4,816 GFLOPs per video
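A back-of-envelope estimate for the DiT stack (a sketch, not a measurement): it assumes 2 FLOPs per weight per token for the matmuls, a quadratic attention term, the per-block parameter count from the table below, and a token count derived from the stated patch size. It lands in the same range as the figure quoted above:

```python
# Rough forward-pass FLOPs for the 24 DiT blocks.
layers, tokens, dim = 24, 2048, 1536
params_per_block = 38_654_656
matmul_flops = 2 * params_per_block * tokens * layers   # weight matmuls
attn_flops = 4 * tokens ** 2 * dim * layers             # QK^T and attn @ V
total_gflops = (matmul_flops + attn_flops) / 1e9
print(round(total_gflops))
```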
Training Speed Estimates
- Single A100 80GB: ~2-3 seconds per batch (batch_size=2)
- 8× A100 80GB: ~2-3 seconds per batch (batch_size=16)
Inference Speed Estimates
- A100 80GB (50 denoising steps): ~15-20 seconds per video
- RTX 4090 (50 denoising steps): ~25-35 seconds per video
Diffusion Scheduler
DDPM (Denoising Diffusion Probabilistic Model)
- Training steps: 1000
- Beta schedule: Linear (0.0001 → 0.02)
- Loss: MSE between predicted and actual noise
- Sampling: Iterative denoising from T=999 to T=0
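The linear schedule can be sketched directly from the stated endpoints (a dependency-free sketch; real implementations typically precompute these as tensors):

```python
# Linear beta schedule over 1000 training steps.
T = 1000
beta_start, beta_end = 1e-4, 0.02
betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

# Cumulative product of alphas, used to noise a clean sample x0 at step t:
#   x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise
alphas_cumprod, acc = [], 1.0
for b in betas:
    acc *= 1.0 - b
    alphas_cumprod.append(acc)
```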
Classifier-Free Guidance
- Unconditional dropout during training: 10%
- Guidance scale at inference: 7.5 (typical)
- Formula:
noise_pred = noise_uncond + guidance_scale × (noise_cond - noise_uncond)
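Applied elementwise over the two noise predictions, the formula looks like this (a sketch on plain lists; `apply_cfg` is a hypothetical helper name):

```python
def apply_cfg(noise_uncond, noise_cond, guidance_scale=7.5):
    # noise_uncond + s * (noise_cond - noise_uncond), elementwise.
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_uncond, noise_cond)]

print(apply_cfg([0.0, 1.0], [1.0, 1.0]))  # [7.5, 1.0]
```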
Key Features
3D Spatiotemporal Attention
- Full attention across time, height, and width
- Captures motion dynamics and spatial relationships
Rotary Position Embeddings
- Applied to temporal dimension
- Better sequence modeling than learned embeddings
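Rotary embeddings rotate (even, odd) feature pairs by position-dependent angles; a minimal sketch using the usual base-10000 frequency scheme (`rope_rotate` is a hypothetical helper, not part of the model code):

```python
import math

def rope_rotate(pairs, pos, base=10000.0):
    # Rotate each (even, odd) feature pair by angle pos * base^(-2i/d).
    d = 2 * len(pairs)
    out = []
    for i, (a, b) in enumerate(pairs):
        theta = pos * base ** (-2.0 * i / d)
        out.append((a * math.cos(theta) - b * math.sin(theta),
                    a * math.sin(theta) + b * math.cos(theta)))
    return out

rotated = rope_rotate([(1.0, 0.0), (0.5, 0.5)], pos=3)
```

Because each pair is only rotated, vector norms are preserved, and query-key dot products end up depending on relative rather than absolute position.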
Adaptive Layer Normalization
- Conditions on text and timestep
- Allows flexible control over generation
Efficient Design
- Patch-based processing reduces sequence length
- Mixed precision training support
- Gradient checkpointing compatible
Comparison with Other Models
| Model | Parameters | Resolution | Frames | Architecture |
|---|---|---|---|---|
| TTV-1B (ours) | 1.0B | 256×256 | 16 | DiT |
| Stable Video Diffusion | 1.7B | 512×512 | 25 | U-Net |
| Make-A-Video | 9.7B | 256×256 | 16 | U-Net |
| Imagen Video | 11B | 1280×768 | 128 | U-Net Cascade |
Optimization Techniques
Mixed Precision (FP16)
- Reduces memory by 50%
- Faster computation on modern GPUs
Gradient Accumulation
- Enables larger effective batch sizes
- Improves training stability
Gradient Checkpointing
- Trades computation for memory
- Enables larger batch sizes
Flash Attention
- O(N) memory instead of O(N²)
- Faster attention computation
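The O(N²) term is concrete here; a sketch of the memory that fully materialized attention maps would otherwise need (token count assumed from the patch math above, head and layer counts from the DiT block spec):

```python
# FP16 attention maps if materialized: tokens^2 per head, per layer.
tokens, heads, layers, bytes_fp16 = 2048, 24, 24, 2
attn_map_bytes = tokens * tokens * heads * layers * bytes_fp16
print(round(attn_map_bytes / 1e9, 1))  # GB avoided per forward pass
```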
Future Enhancements
- Higher Resolution: 512×512 or 1024×1024
- Longer Videos: 64 or 128 frames
- Better Text Encoding: CLIP or T5
- Temporal Super-Resolution: Increase frame rate
- Motion Control: Add motion guidance
- Video Editing: Inpainting, style transfer
- LoRA Fine-tuning: Efficient adaptation
- Distillation: Smaller, faster variants