---
license: mit
tags:
- sign-language
- diffusion
- text-to-video
- asl
- how2sign
- lightweight
metrics:
- fvd
---

# Text2Sign: Lightweight Diffusion Model for Sign Language Video Generation

This repository contains the pretrained checkpoint and inference code for the Text2Sign model, a lightweight diffusion-based architecture for generating sign language videos from text prompts.

## Model Overview

- **Architecture:** 3D UNet backbone with DiT (Diffusion Transformer) blocks and a custom Transformer-based text encoder (see the sampling sketch after this list).
- **Dataset:** Trained on How2Sign (ASL) video-text pairs.
- **Resolution:** 64x64 RGB, 16 frames per clip.
- **Checkpoint:** Provided at epoch 70.
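
For orientation, the sketch below shows a generic DDPM-style sampling loop at the stated resolution (16 frames of 64x64 RGB). The `model` callable, its signature, and the noise schedule are placeholders standing in for the 3D UNet + DiT denoiser, not the actual API of this repository.

```python
import torch

def sample(model, text_emb, steps=1000, device="cpu"):
    # Standard DDPM linear noise schedule, kept deliberately generic.
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from pure Gaussian noise: 1 clip, 3 channels, 16 frames, 64x64 pixels.
    x = torch.randn(1, 3, 16, 64, 64, device=device)
    for t in reversed(range(steps)):
        t_batch = torch.full((1,), t, device=device, dtype=torch.long)
        eps = model(x, t_batch, text_emb)  # denoiser predicts the noise added at step t
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # tensor of shape (1, 3, 16, 64, 64)
```

The sampler actually used by `inference.py` may differ (for example, DDIM with far fewer steps).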

## Files

- `checkpoint_epoch_70.pt` – Pretrained model weights
- `config.py` – Model and generation configuration (a hypothetical sketch of its contents follows this list)
- `inference.py` – Example script for generating sign language videos from text
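
As a rough orientation, a configuration consistent with the specs above could look like the sketch below; every field name and default value here is an assumption, so treat `config.py` itself as the authoritative source.

```python
from dataclasses import dataclass

# Hypothetical configuration mirroring the specs stated in this card.
# Field names and defaults are illustrative; see config.py for the real ones.
@dataclass
class ModelConfig:
    image_size: int = 64        # 64x64 RGB frames
    num_frames: int = 16        # frames per generated clip
    in_channels: int = 3
    text_embed_dim: int = 512   # width of the Transformer text encoder (assumed)
    num_dit_blocks: int = 8     # DiT blocks inside the 3D UNet backbone (assumed)

@dataclass
class GenerationConfig:
    num_inference_steps: int = 1000  # denoising steps at sampling time (assumed)
    guidance_scale: float = 3.0      # classifier-free guidance weight (assumed)
```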

## Usage

1. Install dependencies:

   ```bash
   pip install torch torchvision pillow matplotlib
   ```

2. Run the inference script:

   ```bash
   python inference.py --prompt "Hello world"
   ```

This will generate a video for the given prompt and save a filmstrip image.
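
The filmstrip output mentioned above can also be reproduced directly with Pillow. The sketch below assumes a `(frames, 3, 64, 64)` float tensor in `[0, 1]`; the tensor layout and function name are illustrative and may not match what `inference.py` actually produces.

```python
import torch
from PIL import Image

def save_filmstrip(video: torch.Tensor, path: str = "filmstrip.png") -> None:
    # Convert to uint8 frames laid out as (frames, height, width, channels).
    frames = (video.clamp(0, 1) * 255).byte().permute(0, 2, 3, 1).cpu().numpy()
    n, h, w, _ = frames.shape
    # Paste the frames side by side into one wide image.
    strip = Image.new("RGB", (w * n, h))
    for i, frame in enumerate(frames):
        strip.paste(Image.fromarray(frame), (i * w, 0))
    strip.save(path)

# Example with random data: 16 frames at 64x64.
save_filmstrip(torch.rand(16, 3, 64, 64))
```

Laying the frames out side by side keeps the result viewable without a video codec.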

## License

MIT