
TTV-1B: 1 Billion Parameter Text-to-Video Model

A state-of-the-art text-to-video generation model with 1 billion parameters, built using Diffusion Transformer (DiT) architecture with 3D spatiotemporal attention.

🎯 Model Overview

TTV-1B is a diffusion-based text-to-video model that generates high-quality 16-frame videos at 256×256 resolution from text prompts.

Architecture Highlights

  • Total Parameters: ~1.0 Billion
  • Architecture: Diffusion Transformer (DiT)
  • Text Encoder: 6-layer transformer (50M params)
  • Video Backbone: 24 DiT blocks with 1536 hidden dimensions (950M params)
  • Attention: 3D Spatiotemporal attention with rotary embeddings
  • Patch Size: 2Γ—16Γ—16 (temporal Γ— height Γ— width)
  • Output: 16 frames @ 256Γ—256 resolution

📋 Features

✅ Spatiotemporal 3D Attention - Captures both spatial and temporal dependencies
✅ Rotary Position Embeddings - Better positional encoding for sequences
✅ Adaptive Layer Normalization (AdaLN) - Conditional generation via modulation
✅ DDPM Diffusion Scheduler - Proven denoising approach
✅ Mixed Precision Training - Faster training with lower memory
✅ Gradient Accumulation - Train with large effective batch sizes
✅ Classifier-Free Guidance - Better prompt adherence during inference

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/yourusername/ttv-1b.git
cd ttv-1b

# Install dependencies
pip install -r requirements.txt

Training

from train import Trainer
from video_ttv_1b import create_model

# Create model
device = 'cuda'
model = create_model(device)

# Create datasets (replace with your data)
train_dataset = YourVideoDataset(...)
val_dataset = YourVideoDataset(...)

# Initialize trainer
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    batch_size=2,
    gradient_accumulation_steps=8,
    mixed_precision=True,
    learning_rate=1e-4,
    num_epochs=100,
)

# Start training
trainer.train()

Or use the training script:

python train.py
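Trainer internals aside, the gradient-accumulation setting above works by summing scaled micro-batch gradients and stepping once. Here is a hand-rolled sketch on a toy one-parameter model (illustrative only, not the repo's Trainer code):

```python
# Gradient accumulation on a toy model y = w * x with MSE loss.

def grad_mse(w, xs, ys):
    """Mean gradient of (w*x - y)^2 w.r.t. w over a micro-batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0
lr = 0.1
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x

# Accumulate over 2 micro-batches of size 2, then take one optimizer step.
accum_steps = 2
micro_batches = [data[:2], data[2:]]

acc_grad = 0.0
for micro in micro_batches:
    xs, ys = zip(*micro)
    # Divide by accum_steps so the accumulated gradient equals the
    # gradient of the full batch of 4 samples.
    acc_grad += grad_mse(w, xs, ys) / accum_steps
w -= lr * acc_grad

# Identical to one step on the full batch:
w_full = 0.0
xs, ys = zip(*data)
w_full -= lr * grad_mse(w_full, xs, ys)
assert abs(w - w_full) < 1e-12
```

This is why batch_size=2 with 8 accumulation steps behaves (gradient-wise) like a batch of 16 per GPU, at a fraction of the memory.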

Inference

from inference import generate_video_from_prompt

# Generate video
video = generate_video_from_prompt(
    prompt="A cat playing with a ball of yarn",
    checkpoint_path="checkpoints/checkpoint_best.pt",
    output_path="output.mp4",
    num_steps=50,
    guidance_scale=7.5,
)

Or use the command line:

python inference.py \
    --prompt "A serene sunset over the ocean" \
    --checkpoint checkpoints/checkpoint_best.pt \
    --output generated_video.mp4 \
    --steps 50 \
    --guidance 7.5
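The --guidance value is the classifier-free guidance scale: at each denoising step the model predicts noise with and without the text condition, and the two predictions are blended as eps = eps_uncond + s * (eps_cond - eps_uncond). A minimal sketch of that blend (apply_cfg is an illustrative helper, not a function from inference.py):

```python
def apply_cfg(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and conditional noise predictions
    elementwise, per classifier-free guidance."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]

# guidance_scale = 1.0 reproduces the conditional prediction exactly;
# larger values push the sample further toward the text condition.
eps_u = [0.0, 1.0, -1.0]
eps_c = [1.0, 1.0, 0.0]
print(apply_cfg(eps_u, eps_c, 1.0))  # [1.0, 1.0, 0.0]
print(apply_cfg(eps_u, eps_c, 7.5))  # [7.5, 1.0, 6.5]
```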

πŸ—οΈ Model Architecture

Input: Text Prompt + Random Noise Video
                ↓
    ┌─────────────────────────┐
    │   Text Encoder (6L)     │
    │   768d, 12 heads        │
    └─────────────────────────┘
                ↓
    ┌─────────────────────────┐
    │   Text Projection       │
    │   768d → 1536d          │
    └─────────────────────────┘
                ↓
    ┌─────────────────────────┐
    │   3D Patch Embedding    │
    │   (2,16,16) patches     │
    └─────────────────────────┘
                ↓
    ┌─────────────────────────┐
    │   24× DiT Blocks        │
    │   • 3D Spatio-Temporal  │
    │     Attention (24 heads)│
    │   • Rotary Embeddings   │
    │   • AdaLN Modulation    │
    │   • Feed-Forward Net    │
    └─────────────────────────┘
                ↓
    ┌─────────────────────────┐
    │   Final Layer + AdaLN   │
    └─────────────────────────┘
                ↓
    ┌─────────────────────────┐
    │   Unpatchify to Video   │
    └─────────────────────────┘
                ↓
Output: Predicted Noise / Denoised Video

📊 Training Details

Recommended Training Setup

  • GPU: 8Γ— A100 80GB (or equivalent)
  • Batch Size: 2 per GPU
  • Gradient Accumulation: 8 steps
  • Effective Batch Size: 128
  • Learning Rate: 1e-4 with cosine decay
  • Optimizer: AdamW (Ξ²1=0.9, Ξ²2=0.999)
  • Weight Decay: 0.01
  • Mixed Precision: FP16
  • Training Steps: ~500K

Memory Requirements

  • Model: ~4GB (FP32), ~2GB (FP16)
  • Activations: ~8GB per sample (256Γ—256Γ—16)
  • Total per GPU: ~12-16GB with batch size 2

Training Time Estimates

  • Single A100 80GB: ~4-6 weeks for 500K steps
  • 8Γ— A100 80GB: ~4-7 days for 500K steps

🎨 Inference Examples

# Example 1: Basic generation
from inference import VideoGenerator, load_model
from video_ttv_1b import DDPMScheduler

model = load_model("checkpoints/best.pt")
scheduler = DDPMScheduler()
generator = VideoGenerator(model, scheduler)

video = generator.generate(
    prompt="A beautiful waterfall in a lush forest",
    num_inference_steps=50,
)

# Example 2: Batch generation
from inference import batch_generate

prompts = [
    "A dog running in a park",
    "Fireworks in the night sky",
    "Ocean waves crashing on rocks",
]

batch_generate(
    prompts=prompts,
    checkpoint_path="checkpoints/best.pt",
    output_dir="./outputs",
    num_steps=50,
)

📈 Performance Metrics

Metric                             Value
------                             -----
Parameters                         1.0B
FLOPs (per frame)                  ~250 GFLOPs
Inference Time (50 steps, A100)    ~15-20 seconds
Training Loss (final)              ~0.05 MSE
Video Quality (FVD)                TBD

🔧 Hyperparameters

Model Configuration

VideoTTV1B(
    img_size=(256, 256),           # Output resolution
    num_frames=16,                 # Video length
    patch_size=(2, 16, 16),        # Patch dimensions
    in_channels=3,                 # RGB
    hidden_dim=1536,               # Model width
    depth=24,                      # Number of layers
    num_heads=24,                  # Attention heads
    mlp_ratio=4.0,                 # MLP expansion
    text_dim=768,                  # Text encoder dim
    vocab_size=50257,              # Vocabulary size
)

Training Configuration

Trainer(
    batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    weight_decay=0.01,
    num_epochs=100,
    mixed_precision=True,
)

πŸ“ Project Structure

ttv-1b/
├── video_ttv_1b.py      # Model architecture
├── train.py             # Training script
├── inference.py         # Inference & generation
├── requirements.txt     # Dependencies
├── README.md            # Documentation
├── checkpoints/         # Model checkpoints
├── data/                # Training data
└── outputs/             # Generated videos

🔬 Technical Details

3D Spatiotemporal Attention

The model uses full 3D attention across time, height, and width dimensions:

  • Captures motion dynamics and spatial relationships
  • Rotary position embeddings for better sequence modeling
  • Efficient implementation with Flash Attention compatible design

Diffusion Process

  1. Training: Learn to predict noise added to videos
  2. Inference: Iteratively denoise random noise β†’ video
  3. Guidance: Classifier-free guidance for better text alignment
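Step 1 corresponds to the DDPM forward process: given a noise schedule with cumulative products alpha_bar_t, a clean sample x0 is noised as x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, and the model learns to predict eps. A scalar sketch with a linear beta schedule (illustrative constants, not necessarily those of the repo's DDPMScheduler):

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # linear schedule

# Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

def add_noise(x0, t, eps):
    """Forward process q(x_t | x_0) for a scalar sample."""
    ab = alpha_bars[t]
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps

# At t=0 the sample is barely perturbed; near t=T-1 it is almost pure noise.
eps = random.gauss(0.0, 1.0)
print(add_noise(1.0, 0, eps))      # close to x0 = 1.0
print(add_noise(1.0, T - 1, eps))  # dominated by the noise term
```

Inference runs this process in reverse: starting from pure noise, the model's eps prediction is used at each step to move toward a clean video.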

Adaptive Layer Normalization

Each DiT block uses AdaLN-Zero for conditional generation:

  • Text and timestep embeddings modulate layer norm parameters
  • Allows model to adapt behavior based on conditioning

🎯 Use Cases

  • Creative Content: Generate videos for social media, marketing
  • Prototyping: Quick video mockups from descriptions
  • Education: Visualize concepts and scenarios
  • Entertainment: Generate animations and effects
  • Research: Study video generation and diffusion models

⚠️ Limitations

  • Maximum 16 frames (can be extended in future versions)
  • 256Γ—256 resolution (trade-off for 1B parameters)
  • Requires significant compute for training
  • Text encoder is simple (can be replaced with CLIP/T5)
  • No temporal super-resolution (yet)

🚧 Future Improvements

  • Increase resolution to 512Γ—512
  • Extend to 64+ frames
  • Add temporal super-resolution
  • Integrate CLIP text encoder
  • Add motion control
  • Implement video editing capabilities
  • Optimize inference speed
  • Add LoRA fine-tuning support

📚 Citation

If you use this model in your research, please cite:

@misc{ttv1b2024,
  title={TTV-1B: A 1 Billion Parameter Text-to-Video Model},
  author={Your Name},
  year={2024},
  url={https://github.com/yourusername/ttv-1b}
}

📄 License

This project is licensed under the MIT License - see LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

💬 Contact

For questions and feedback, please open an issue on the GitHub repository.

πŸ™ Acknowledgments

  • Inspired by DiT (Diffusion Transformer) architecture
  • Built with PyTorch and modern deep learning practices
  • Thanks to the open-source ML community

Status: Research/Educational Model | Version: 1.0.0 | Last Updated: 2024
