# Text2Sign: Lightweight Diffusion Model for Sign Language Video Generation
This repository contains the pretrained checkpoint and inference code for the Text2Sign model, a lightweight diffusion-based architecture for generating sign language videos from text prompts.
## Model Overview
- Architecture: 3D UNet backbone with DiT (Diffusion Transformer) blocks and a custom Transformer-based text encoder.
- Dataset: Trained on How2Sign (ASL) video-text pairs.
- Resolution: 64x64 RGB, 16 frames per clip.
- Checkpoint: Provided at epoch 70.
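Conceptually, a diffusion model like this generates a clip by starting from Gaussian noise shaped like a 16-frame 64x64 video and iteratively denoising it. The sketch below is a hypothetical, numpy-only illustration of a DDPM-style sampling loop with a stand-in denoiser; it is not this repository's actual model, schedule, or API.

```python
import numpy as np

def make_schedule(timesteps=50, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule: betas and cumulative alpha products (assumed values)."""
    betas = np.linspace(beta_start, beta_end, timesteps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def sample(denoise_fn, shape=(16, 3, 64, 64), timesteps=50, seed=0):
    """Iteratively denoise Gaussian noise into a (frames, C, H, W) clip.

    `denoise_fn(x, t)` predicts the noise in `x` at step `t`; in the real
    model this would be the text-conditioned 3D UNet forward pass.
    """
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(timesteps)
    x = rng.standard_normal(shape)  # start from pure noise
    for t in reversed(range(timesteps)):
        eps = denoise_fn(x, t)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:  # add fresh noise at every step except the last
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Stand-in denoiser that predicts zero noise, just to show the shapes.
clip = sample(lambda x, t: np.zeros_like(x))
print(clip.shape)  # (16, 3, 64, 64)
```

The returned array has one entry per frame, matching the 16-frame, 64x64 RGB clip format above.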
## Files
- `checkpoint_epoch_70.pt`: Pretrained model weights
- `config.py`: Model and generation configuration
- `inference.py`: Example script for generating sign language videos from text
## Usage
- Install dependencies:

  ```bash
  pip install torch torchvision pillow matplotlib
  ```

- Run the inference script:

  ```bash
  python inference.py --prompt "Hello world"
  ```

  This will generate a video for the given prompt and save a filmstrip image.
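A filmstrip image is simply the clip's frames laid out side by side. The snippet below is a minimal sketch of that idea, assuming frames arrive as a `(T, H, W, 3)` uint8 array; the repository's `inference.py` may assemble its output differently.

```python
import numpy as np
from PIL import Image

def filmstrip(frames):
    """Concatenate (T, H, W, 3) uint8 frames horizontally into one PIL image."""
    strip = np.concatenate(list(frames), axis=1)  # stack along width
    return Image.fromarray(strip)

# Placeholder black clip with the model's output shape: 16 frames of 64x64 RGB.
frames = np.zeros((16, 64, 64, 3), dtype=np.uint8)
img = filmstrip(frames)
print(img.size)  # (1024, 64): 16 frames of width 64, height 64
```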
## License
MIT