# TTV-1B Model Architecture Specification

## Model Summary

**Name:** TTV-1B (Text-to-Video 1 Billion)
**Type:** Diffusion Transformer for Text-to-Video Generation
**Total Parameters:** 1,003,147,264 (~1.0 Billion)

## Architecture Components

### 1. Text Encoder (50M parameters)
```
Input: Text tokens (batch_size, 256)
Architecture:
  - Token Embedding: 50,257 vocab → 768 dim
  - Position Embedding: 256 positions → 768 dim
  - 6 Transformer Layers:
    * Multi-head Attention (12 heads)
    * Feed-forward (768 → 3072 → 768)
    * Layer Normalization
Output: Text features (batch_size, 256, 768)
```

### 2. Text Projection Layer
```
Linear: 768 → 1536 dimensions
Purpose: Project text features to model hidden dimension
```

### 3. 3D Patch Embedding
```
Input: Video (batch_size, 3, 16, 256, 256)
Patch size: (2, 16, 16) - temporal × height × width
Conv3D: 3 channels → 1536 channels
Output: (batch_size, 2048, 1536) where 2048 = (16/2) × (256/16) × (256/16)
                                            = 8 × 16 × 16
```
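
The patch arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper name `patch_grid` is hypothetical and not part of the model code.

```python
def patch_grid(frames, height, width, channels=3, patch=(2, 16, 16)):
    """Return the (t, h, w) patch grid, the token count, and values per patch."""
    pt, ph, pw = patch
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    grid = (frames // pt, height // ph, width // pw)
    num_patches = grid[0] * grid[1] * grid[2]
    patch_dim = pt * ph * pw * channels  # raw values covered by one patch
    return grid, num_patches, patch_dim


grid, num_patches, patch_dim = patch_grid(16, 256, 256)
print(grid, num_patches, patch_dim)  # (8, 16, 16) 2048 1536
```

The token count drives attention cost, so doubling the spatial resolution at the same patch size multiplies the sequence length by four.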

### 4. Positional Embedding
```
Learnable position embeddings for 2048 patches
Shape: (1, 2048, 1536)
```

### 5. Timestep Embedding
```
Sinusoidal timestep encoding → Linear(1536, 6144) → SiLU → Linear(6144, 1536)
Output: Conditioning vector (batch_size, 1536)
```
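
As an illustration of the sinusoidal encoding step, here is a minimal standard-library sketch. The frequency layout (`max_period=10000`, cosine half followed by sine half) is an assumption borrowed from common diffusion implementations; the specification above does not fix it, and the two linear layers are omitted.

```python
import math


def sinusoidal_embedding(timestep, dim=1536, max_period=10000.0):
    """Encode a scalar diffusion timestep as a vector of length `dim`."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [timestep * f for f in freqs]
    # Assumed layout: cosine terms first, then sine terms.
    return [math.cos(a) for a in args] + [math.sin(a) for a in args]


emb = sinusoidal_embedding(999)
print(len(emb))  # 1536
```

In the model this vector would then pass through `Linear(1536, 6144) → SiLU → Linear(6144, 1536)` to produce the conditioning vector.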

### 6. DiT Blocks (24 layers, ~928M parameters)

Each block contains:

#### a) 3D Spatiotemporal Attention
```
- Query, Key, Value projections: Linear(1536, 4608)
- 24 attention heads (64 dimensions each)
- Rotary position embeddings on temporal dimension
- Scaled dot-product attention
- Output projection: Linear(1536, 1536)
```
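
The temporal rotary embedding can be sketched as below. This is a hedged illustration, not the model's implementation: it assumes the "rotate-half" formulation, a base of 10000, and that tokens are ordered time-major so the temporal index of token `i` is `i // (16 * 16)`. The function name `temporal_rope` is hypothetical.

```python
import numpy as np


def temporal_rope(x, t_index, base=10000.0):
    """Rotate per-head features by an angle proportional to the temporal index.

    x:       (tokens, head_dim) query or key features for one head
    t_index: (tokens,) integer temporal patch index of each token
    """
    half = x.shape[-1] // 2
    inv_freq = base ** (-np.arange(half) / half)   # (half,)
    angles = t_index[:, None] * inv_freq[None, :]  # (tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Each (x1, x2) pair is rotated as a 2D vector.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)


rng = np.random.default_rng(0)
tokens, head_dim = 2048, 64
t_index = np.arange(tokens) // (16 * 16)  # assumed time-major token order
q = temporal_rope(rng.standard_normal((tokens, head_dim)), t_index)
print(q.shape)  # (2048, 64)
```

Because queries and keys are rotated by the same rule, their dot product depends only on the difference between temporal indices, not on the absolute positions.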

#### b) Feed-Forward Network
```
- Linear: 1536 → 6144 (4x expansion)
- GELU activation
- Linear: 6144 → 1536
```

#### c) Adaptive Layer Normalization (AdaLN)
```
- Modulation network: SiLU → Linear(1536, 9216)
- Generates 6 modulation parameters:
  * scale_msa, shift_msa, gate_msa (for attention)
  * scale_mlp, shift_mlp, gate_mlp (for FFN)
```
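
The sketch below shows one way the six modulation parameters could be applied. It assumes the DiT-style rule `x + gate * sublayer(norm(x) * (1 + scale) + shift)` and splits the 9216-wide modulation output in the order listed above; both the rule and the chunk order are assumptions, and toy sizes stand in for the real dimensions.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis without learnable parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def adaln_branch(x, scale, shift, gate, sublayer):
    """Apply one modulated, gated residual branch.

    x: (batch, tokens, hidden); scale, shift, gate: (batch, hidden).
    """
    h = layer_norm(x) * (1.0 + scale[:, None, :]) + shift[:, None, :]
    return x + gate[:, None, :] * sublayer(h)


rng = np.random.default_rng(0)
batch, tokens, hidden = 2, 8, 16  # toy sizes; the model uses hidden = 1536
x = rng.standard_normal((batch, tokens, hidden))
modulation = rng.standard_normal((batch, 6 * hidden))  # output of SiLU → Linear
scale_msa, shift_msa, gate_msa, scale_mlp, shift_mlp, gate_mlp = np.split(
    modulation, 6, axis=-1
)

identity = lambda h: h  # placeholder for attention / FFN
x = adaln_branch(x, scale_msa, shift_msa, gate_msa, identity)
x = adaln_branch(x, scale_mlp, shift_mlp, gate_mlp, identity)
print(x.shape)  # (2, 8, 16)
```

With a zero gate the branch reduces to the identity, which is why implementations of this style often initialize the modulation layer to zero.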

### 7. Final Layer
```
- Adaptive LayerNorm
- Linear: 1536 → 1536 (2×16×16×3)
Purpose: Map back to patch space
```

### 8. Unpatchify
```
Reshape patches back to video
(batch_size, 2048, 1536) → (batch_size, 3, 16, 256, 256)
```
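
A minimal NumPy sketch of the reshape, with a matching `patchify` used only to verify the round trip. The ordering of values inside a patch (temporal, height, width, then channel) is an assumption; the real layout has to match whatever the final linear layer was trained to produce.

```python
import numpy as np


def patchify(video, patch=(2, 16, 16)):
    """(b, c, t, h, w) -> (b, num_patches, pt*ph*pw*c) by pure reshaping."""
    b, c, t, h, w = video.shape
    pt, ph, pw = patch
    x = video.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
    x = x.transpose(0, 2, 4, 6, 3, 5, 7, 1)  # (b, T, H, W, pt, ph, pw, c)
    return x.reshape(b, (t // pt) * (h // ph) * (w // pw), pt * ph * pw * c)


def unpatchify(patches, video_shape, patch=(2, 16, 16)):
    """Inverse of patchify: (b, num_patches, patch_dim) -> (b, c, t, h, w)."""
    b, c, t, h, w = video_shape
    pt, ph, pw = patch
    x = patches.reshape(b, t // pt, h // ph, w // pw, pt, ph, pw, c)
    x = x.transpose(0, 7, 1, 4, 2, 5, 3, 6)  # (b, c, T, pt, H, ph, W, pw)
    return x.reshape(b, c, t, h, w)


video = np.zeros((1, 3, 16, 256, 256), dtype=np.float32)
patches = patchify(video)
print(patches.shape)                           # (1, 2048, 1536)
print(unpatchify(patches, video.shape).shape)  # (1, 3, 16, 256, 256)
```

Note that the model itself embeds patches with a Conv3D; `patchify` here is only the reshape-level inverse of `unpatchify`.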

## Parameter Breakdown

| Component | Parameters | Percentage |
|-----------|------------|------------|
| Text Encoder | 50,331,648 | 5.0% |
| Text Projection | 1,180,416 | 0.1% |
| Patch Embedding | 589,824 | 0.1% |
| Position Embedding | 196,608 | 0.02% |
| Timestep Embedding | 14,157,312 | 1.4% |
| DiT Blocks (24×) | 927,711,744 | 92.5% |
| Final Layer | 8,979,712 | 0.9% |
| **Total** | **1,003,147,264** | **100%** |

## Per-Block Parameters (DiT)

Each of the 24 DiT blocks contains ~38.7M parameters:

| Sub-component | Parameters |
|---------------|------------|
| Attention QKV | 7,077,888 |
| Attention Proj | 2,362,368 |
| Rotary Embedding | 48 |
| FFN Layer 1 | 9,443,328 |
| FFN Layer 2 | 9,443,328 |
| AdaLN Modulation | 14,155,776 |
| Layer Norms | 0 (no learnable params) |
| **Per Block Total** | **38,654,656** |
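
For cross-checking layer sizes against the shapes in the architecture section, a small helper can recompute them. Which layers carry bias terms is an assumption here (QKV and the AdaLN modulation without bias, the others with bias), so the output is a sanity check under those assumptions rather than an authoritative count.

```python
def linear_params(n_in, n_out, bias=True):
    """Parameter count of a dense layer: weights plus optional bias."""
    return n_in * n_out + (n_out if bias else 0)


hidden, ffn = 1536, 6144
block = {
    "attn_qkv": linear_params(hidden, 3 * hidden, bias=False),  # assumed no bias
    "attn_proj": linear_params(hidden, hidden),
    "ffn_1": linear_params(hidden, ffn),
    "ffn_2": linear_params(ffn, hidden),
    "adaln": linear_params(hidden, 6 * hidden, bias=False),     # assumed no bias
}

for name, count in block.items():
    print(f"{name:10s} {count:>12,}")
print(f"{'total':10s} {sum(block.values()):>12,}")
```

Changing a `bias` flag shows how sensitive each row is to that choice, which helps when reconciling a table like the one above with an actual checkpoint.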

## Data Flow

```
1. Text Input (batch, 256 tokens)
   ↓
2. Text Encoder (6 transformer layers)
   ↓
3. Text Features (batch, 256, 768) → Pool → (batch, 768)
   ↓
4. Project to 1536 dim → (batch, 1536)
   ↓
5. Add Timestep Embedding → Conditioning (batch, 1536)
   ↓
6. Video Input (batch, 3, 16, 256, 256)
   ↓
7. 3D Patch Embed → (batch, 2048, 1536)
   ↓
8. Add Position Embedding
   ↓
9. 24× DiT Blocks (with conditioning)
   ↓
10. Final Layer + AdaLN
    ↓
11. Unpatchify
    ↓
12. Output: Predicted Noise (batch, 3, 16, 256, 256)
```

## Memory Requirements

### Model Weights
- FP32: ~4.0 GB
- FP16: ~2.0 GB
- INT8: ~1.0 GB
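
These figures follow from the parameter count times the bytes per parameter. A minimal sketch, using decimal gigabytes (1 GB = 10^9 bytes) as an assumption:

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


total_params = 1_003_147_264
for name, nbytes in {"FP32": 4, "FP16": 2, "INT8": 1}.items():
    print(f"{name}: {weight_memory_gb(total_params, nbytes):.2f} GB")
# FP32: 4.01 GB
# FP16: 2.01 GB
# INT8: 1.00 GB
```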

### Activations (per sample, 256×256×16)
- Forward pass: ~8 GB (FP16)
- Backward pass: ~16 GB (FP16)

### Training (batch_size=2, FP16, gradient accumulation=8)
- Model: 2 GB
- Optimizer states (AdamW): 4 GB
- Gradients: 2 GB
- Activations: 16 GB
- **Total: ~24 GB per GPU**
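
The budget above is a straight sum, and gradient accumulation determines the effective batch size. A minimal sketch with hypothetical helper names, using the figures from this section as defaults:

```python
def training_memory_gb(model=2, optimizer=4, gradients=2, activations=16):
    """Sum the per-GPU training memory components (in GB)."""
    return model + optimizer + gradients + activations


def effective_batch_size(per_gpu_batch=2, accumulation_steps=8, num_gpus=1):
    """Samples contributing to each optimizer step."""
    return per_gpu_batch * accumulation_steps * num_gpus


print(training_memory_gb())                 # 24
print(effective_batch_size())               # 16
print(effective_batch_size(num_gpus=8))     # 128
```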

### Inference (batch_size=1, FP16)
- Model: 2 GB
- Activations: 4 GB
- **Total: ~6 GB**

## Computational Complexity

### FLOPs per forward pass (approximate)
- Text Encoder: ~10 GFLOPs
- Patch Embedding: ~5 GFLOPs
- DiT Blocks (24×): ~4,800 GFLOPs
- Unpatchify: ~1 GFLOPs
- **Total: ~4,816 GFLOPs per video**
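
A rough back-of-the-envelope check of the DiT figure can be sketched as follows. It assumes 2 FLOPs per multiply-accumulate, 2048 tokens from the 8 × 16 × 16 patch grid, the per-block parameter count from the table above, and an attention cost of about `4 * N^2 * d`. It ignores softmax and normalization, and it overcounts slightly because the AdaLN modulation runs once per sample rather than once per token.

```python
def dit_forward_gflops(tokens=2048, hidden=1536, layers=24,
                       params_per_block=38_654_656):
    """Approximate forward FLOPs of the DiT stack, in GFLOPs."""
    linear = 2 * params_per_block * tokens   # dense layers applied per token
    attention = 4 * tokens * tokens * hidden  # QK^T scores plus weighted sum of V
    return layers * (linear + attention) / 1e9


print(f"{dit_forward_gflops():.0f} GFLOPs")
```

Under these assumptions the estimate comes out at roughly 4.4 TFLOPs, the same order of magnitude as the ~4,800 GFLOPs quoted above.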

### Training Speed Estimates
- Single A100 80GB: ~2-3 seconds per batch (batch_size=2)
- 8× A100 80GB: ~2-3 seconds per batch (batch_size=16)

### Inference Speed Estimates
- A100 80GB (50 denoising steps): ~15-20 seconds per video
- RTX 4090 (50 denoising steps): ~25-35 seconds per video

## Diffusion Scheduler

### DDPM (Denoising Diffusion Probabilistic Model)
- Training steps: 1000
- Beta schedule: Linear (0.0001 → 0.02)
- Loss: MSE between predicted and actual noise
- Sampling: Iterative denoising from T=999 to T=0
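
The schedule and training objective can be sketched in NumPy as below. This is a minimal illustration of standard DDPM forward noising with the linear schedule listed above; the function names are hypothetical, and a placeholder stands in for the model's prediction.

```python
import numpy as np


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances for each training step."""
    return np.linspace(beta_start, beta_end, num_steps)


betas = linear_beta_schedule()
alphas_cumprod = np.cumprod(1.0 - betas)  # cumulative signal retention


def add_noise(x0, noise, t):
    """Sample x_t from q(x_t | x_0) in closed form."""
    a = alphas_cumprod[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise


def mse_loss(pred_noise, true_noise):
    """Training loss: mean squared error between predicted and actual noise."""
    return float(np.mean((pred_noise - true_noise) ** 2))


rng = np.random.default_rng(0)
x0 = rng.standard_normal((1, 3, 4, 8, 8))  # toy video tensor
noise = rng.standard_normal(x0.shape)
x_t = add_noise(x0, noise, t=500)
pred = np.zeros_like(noise)                # placeholder for the model output
print(x_t.shape, mse_loss(pred, noise) > 0)
```

At `t = 999` almost no signal remains, which is why sampling can start from pure Gaussian noise.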

### Classifier-Free Guidance
- Unconditional dropout during training: 10%
- Guidance scale at inference: 7.5 (typical)
- Formula: `noise_pred = noise_uncond + guidance_scale × (noise_cond - noise_uncond)`
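
The guidance formula translates directly into code. A minimal NumPy sketch, with a hypothetical function name and toy arrays in place of the two model outputs:

```python
import numpy as np


def classifier_free_guidance(noise_uncond, noise_cond, guidance_scale=7.5):
    """Extrapolate from the unconditional toward the conditional prediction."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)


noise_uncond = np.zeros((1, 3, 4, 8, 8))
noise_cond = np.ones((1, 3, 4, 8, 8))
guided = classifier_free_guidance(noise_uncond, noise_cond)
print(guided.mean())  # 7.5
```

A scale of 1.0 recovers the conditional prediction and 0.0 the unconditional one; values above 1.0 push the sample further toward the text condition.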

## Key Features

1. **3D Spatiotemporal Attention**
   - Full attention across time, height, and width
   - Captures motion dynamics and spatial relationships

2. **Rotary Position Embeddings**
   - Applied to temporal dimension
   - Encodes relative temporal offsets directly in the attention scores, which learned absolute embeddings do not

3. **Adaptive Layer Normalization**
   - Conditions on text and timestep
   - Allows flexible control over generation

4. **Efficient Design**
   - Patch-based processing reduces sequence length
   - Mixed precision training support
   - Gradient checkpointing compatible

## Comparison with Other Models

| Model | Parameters | Resolution | Frames | Architecture |
|-------|------------|------------|--------|--------------|
| TTV-1B (ours) | 1.0B | 256×256 | 16 | DiT |
| Stable Video Diffusion | 1.7B | 512×512 | 25 | U-Net |
| Make-A-Video | 9.7B | 256×256 | 16 | U-Net |
| Imagen Video | 11B | 1280×768 | 128 | U-Net Cascade |

## Optimization Techniques

1. **Mixed Precision (FP16)**
   - Halves weight and activation memory relative to FP32
   - Faster computation on modern GPUs

2. **Gradient Accumulation**
   - Enables larger effective batch sizes
   - Improves training stability

3. **Gradient Checkpointing**
   - Trades computation for memory
   - Enables larger batch sizes

4. **Flash Attention**
   - O(N) memory instead of O(N²)
   - Faster attention computation

## Future Enhancements

1. **Higher Resolution**: 512×512 or 1024×1024
2. **Longer Videos**: 64 or 128 frames
3. **Better Text Encoding**: CLIP or T5
4. **Temporal Super-Resolution**: Increase frame rate
5. **Motion Control**: Add motion guidance
6. **Video Editing**: Inpainting, style transfer
7. **LoRA Fine-tuning**: Efficient adaptation
8. **Distillation**: Smaller, faster variants