---
license: mit
tags:
- sign-language
- diffusion
- text-to-video
- asl
- how2sign
- lightweight
metrics:
- fvd
---

# Text2Sign: Lightweight Diffusion Model for Sign Language Video Generation

This repository contains the pretrained checkpoint and inference code for the Text2Sign model, a lightweight diffusion-based architecture for generating sign language videos from text prompts.

## Model Overview

- **Architecture:** 3D UNet backbone with DiT (Diffusion Transformer) blocks and a custom Transformer-based text encoder.
- **Dataset:** Trained on How2Sign (ASL) video-text pairs.
- **Resolution:** 64x64 RGB, 16 frames per clip.
- **Checkpoint:** Provided at epoch 70.

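
As a quick sanity check on the clip dimensions listed above, the following sketch computes the size of one generated clip. The `ClipSpec` class and the channels-first `(C, T, H, W)` layout are assumptions made for illustration; the tensor layout actually used by `inference.py` may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpec:
    """Hypothetical description of one generated clip (not part of the repo)."""

    frames: int = 16    # frames per clip, from the model overview
    height: int = 64    # pixels
    width: int = 64     # pixels
    channels: int = 3   # RGB

    @property
    def shape(self) -> tuple:
        # Assumed channels-first video layout: (C, T, H, W).
        return (self.channels, self.frames, self.height, self.width)

    @property
    def num_values(self) -> int:
        # Total number of scalar values in one clip.
        return self.channels * self.frames * self.height * self.width

    def size_bytes(self, bytes_per_value: int = 4) -> int:
        # 4 bytes per value corresponds to float32 storage.
        return self.num_values * bytes_per_value


spec = ClipSpec()
print(spec.shape)         # (3, 16, 64, 64)
print(spec.num_values)    # 196608
print(spec.size_bytes())  # 786432
```

At this resolution a single clip holds under 200,000 values, which is consistent with the model's lightweight design.
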
## Files

- `checkpoint_epoch_70.pt`: Pretrained model weights
- `config.py`: Model and generation configuration
- `inference.py`: Example script for generating sign language videos from text

## Usage

1. Install dependencies:

   ```bash
   pip install torch torchvision pillow matplotlib
   ```

2. Run the inference script:

   ```bash
   python inference.py --prompt "Hello world"
   ```

The script generates a video for the given prompt and saves a filmstrip image.

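
To generate clips for several prompts, the command above can be wrapped in a small driver. This is a minimal sketch that assumes only the `--prompt` flag shown above and that it is run from the repository root; `build_command` and `generate_all` are hypothetical helper names, not part of the repository.

```python
import subprocess
import sys


def build_command(prompt: str, script: str = "inference.py") -> list:
    """Build the argument list for one run of the inference script."""
    # Only the --prompt flag documented in this README is used.
    return [sys.executable, script, "--prompt", prompt]


def generate_all(prompts, dry_run: bool = False) -> list:
    """Run the inference script once per prompt and return the commands used."""
    commands = [build_command(p) for p in prompts]
    for cmd in commands:
        if dry_run:
            # Show the command without executing it.
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if the script exits non-zero.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `generate_all(["Hello world", "Thank you"])` runs the script once per prompt, in order; passing `dry_run=True` prints the commands without executing them.
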
## License

MIT