---
license: mit
tags:
- sign-language
- diffusion
- text-to-video
- asl
- how2sign
- lightweight
metrics:
- fvd
---
# Text2Sign: Lightweight Diffusion Model for Sign Language Video Generation
This repository contains the pretrained checkpoint and inference code for the Text2Sign model, a lightweight diffusion-based architecture for generating sign language videos from text prompts.
## Model Overview
- Architecture: 3D UNet backbone with DiT (Diffusion Transformer) blocks and a custom Transformer-based text encoder.
- Dataset: Trained on How2Sign (ASL) video-text pairs.
- Resolution: 64x64 RGB, 16 frames per clip.
- Checkpoint: Provided at epoch 70.
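For readers unfamiliar with how a diffusion backbone like this produces a sample, here is a minimal pure-Python sketch of DDPM-style ancestral sampling on a single scalar. The schedule length, beta range, and the dummy `denoise_fn` are illustrative assumptions, not values from this checkpoint; the real model denoises a full 16-frame 3x64x64 video tensor conditioned on the text embedding.

```python
import math
import random

def ddpm_sample(denoise_fn, timesteps=50, beta_start=1e-4, beta_end=0.02, seed=0):
    """Illustrative DDPM reverse process on one scalar value.

    denoise_fn(x, t) stands in for the noise-prediction network.
    All schedule values here are hypothetical, not from the checkpoint.
    """
    rng = random.Random(seed)
    # Linear beta schedule and the derived alpha / cumulative-alpha terms.
    betas = [beta_start + (beta_end - beta_start) * t / (timesteps - 1)
             for t in range(timesteps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars = []
    prod = 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)

    x = rng.gauss(0.0, 1.0)  # start from pure Gaussian noise
    for t in reversed(range(timesteps)):
        eps = denoise_fn(x, t)  # predicted noise at step t
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        x = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps) / math.sqrt(alphas[t])
        if t > 0:
            x += math.sqrt(betas[t]) * rng.gauss(0.0, 1.0)  # ancestral sampling noise
    return x

# Dummy predictor that always predicts zero noise (stand-in for the 3D UNet).
sample = ddpm_sample(lambda x, t: 0.0)
print(sample)
```

In the actual model, `denoise_fn` is the 3D UNet with DiT blocks, and the loop runs over video tensors rather than scalars.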
## Files

- `checkpoint_epoch_70.pt` — Pretrained model weights
- `config.py` — Model and generation configuration
- `inference.py` — Example script for generating sign language videos from text
## Usage

- Install dependencies:

  ```bash
  pip install torch torchvision pillow matplotlib
  ```

- Run the inference script:

  ```bash
  python inference.py --prompt "Hello world"
  ```

This will generate a video for the given prompt and save a filmstrip image.
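A filmstrip image is simply the generated frames tiled side by side into one wide image. The following pure-Python sketch shows that layout on tiny nested-list "frames"; it is independent of the actual `inference.py`, whose implementation may differ.

```python
def make_filmstrip(frames):
    """Tile same-height frames (2D row-major lists of pixel values) horizontally."""
    height = len(frames[0])
    strip = []
    for row in range(height):
        strip_row = []
        for frame in frames:
            strip_row.extend(frame[row])  # append this frame's row to the strip
        strip.append(strip_row)
    return strip

# Two 2x2 "frames" become one 2x4 strip.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(make_filmstrip([a, b]))  # [[1, 2, 5, 6], [3, 4, 7, 8]]
```

For this model, 16 frames at 64x64 would yield a 64x1024 strip per clip.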
## License
MIT