# Text2Sign: Lightweight Diffusion Model for Sign Language Video Generation
This repository contains the pretrained checkpoint and inference code for the Text2Sign model, a lightweight diffusion-based architecture for generating sign language videos from text prompts.
## Model Overview
- Architecture: 3D UNet backbone with DiT (Diffusion Transformer) blocks and a custom Transformer-based text encoder.
- Dataset: Trained on How2Sign (ASL) video-text pairs.
- Resolution: 64x64 RGB, 16 frames per clip.
- Checkpoint: Provided at epoch 70.
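Conceptually, a diffusion model like this generates a clip by starting from Gaussian noise shaped like a 16-frame 64x64 video and iteratively denoising it. The sketch below is a hypothetical, numpy-only illustration of a DDPM-style sampling loop with a stand-in denoiser; it is not this repository's actual model, schedule, or API.

```python
import numpy as np

def make_schedule(timesteps=50, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule: betas and cumulative alpha products (assumed values)."""
    betas = np.linspace(beta_start, beta_end, timesteps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def sample(denoise_fn, shape=(16, 3, 64, 64), timesteps=50, seed=0):
    """Iteratively denoise Gaussian noise into a (frames, C, H, W) clip.

    `denoise_fn(x, t)` predicts the noise in `x` at step `t`; in the real
    model this would be the text-conditioned 3D UNet forward pass.
    """
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(timesteps)
    x = rng.standard_normal(shape)  # start from pure noise
    for t in reversed(range(timesteps)):
        eps = denoise_fn(x, t)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:  # add fresh noise at every step except the last
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Stand-in denoiser that predicts zero noise, just to show the shapes.
clip = sample(lambda x, t: np.zeros_like(x))
print(clip.shape)  # (16, 3, 64, 64)
```

The returned array has one entry per frame, matching the 16-frame, 64x64 RGB clip format above.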
## Files
- `checkpoint_epoch_70.pt`: Pretrained model weights
- `config.py`: Model and generation configuration
- `inference.py`: Example script for generating sign language videos from text
## Usage
- Install dependencies:

  ```bash
  pip install torch torchvision pillow matplotlib
  ```

- Run the inference script:

  ```bash
  python inference.py --prompt "Hello world"
  ```

  This will generate a video for the given prompt and save a filmstrip image.
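A filmstrip image is simply the clip's frames laid out side by side. The snippet below is a minimal sketch of that idea, assuming frames arrive as a `(T, H, W, 3)` uint8 array; the repository's `inference.py` may assemble its output differently.

```python
import numpy as np
from PIL import Image

def filmstrip(frames):
    """Concatenate (T, H, W, 3) uint8 frames horizontally into one PIL image."""
    strip = np.concatenate(list(frames), axis=1)  # stack along width
    return Image.fromarray(strip)

# Placeholder black clip with the model's output shape: 16 frames of 64x64 RGB.
frames = np.zeros((16, 64, 64, 3), dtype=np.uint8)
img = filmstrip(frames)
print(img.size)  # (1024, 64): 16 frames of width 64, height 64
```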
## License
MIT