---
license: mit
tags:
- sign-language
- diffusion
- text-to-video
- asl
- how2sign
- lightweight
metrics:
- fvd
---
# Text2Sign: Lightweight Diffusion Model for Sign Language Video Generation
This repository contains the pretrained checkpoint and inference code for the Text2Sign model, a lightweight diffusion-based architecture for generating sign language videos from text prompts.
## Model Overview
- **Architecture:** 3D UNet backbone with DiT (Diffusion Transformer) blocks and a custom Transformer-based text encoder.
- **Dataset:** Trained on How2Sign (ASL) video-text pairs.
- **Resolution:** 64x64 RGB, 16 frames per clip.
- **Checkpoint:** Provided at epoch 70.
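
For convenience, here is a minimal loading sketch. The actual model class and checkpoint layout are defined in `config.py` and `inference.py`; the `Text2SignModel` name and the `"model_state_dict"` key below are assumptions for illustration only.

```python
# Minimal checkpoint-loading sketch. The model class name and checkpoint
# layout are assumptions; see config.py / inference.py for the real ones.
import torch

checkpoint = torch.load("checkpoint_epoch_70.pt", map_location="cpu")

# Training checkpoints often wrap the weights in a dict; fall back to
# treating the file as a raw state dict otherwise.
state_dict = checkpoint.get("model_state_dict", checkpoint)

# model = Text2SignModel(**model_config)  # hypothetical class from config.py
# model.load_state_dict(state_dict)
# model.eval()
```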
## Files
- `checkpoint_epoch_70.pt` — Pretrained model weights
- `config.py` — Model and generation configuration
- `inference.py` — Example script for generating sign language videos from text
## Usage
1. Install dependencies:
```bash
pip install torch torchvision pillow matplotlib
```
2. Run the inference script:
```bash
python inference.py --prompt "Hello world"
```
This generates a 16-frame video for the given prompt and saves the frames as a filmstrip image.
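
If you want to assemble a filmstrip yourself, the sketch below tiles frames horizontally with Pillow. It assumes the generated frames arrive as a `(16, 64, 64, 3)` uint8 array, matching the clip length and resolution above; the script's internal format may differ.

```python
# Filmstrip assembly sketch. Assumes frames is a (16, 64, 64, 3) uint8
# NumPy array of generated RGB frames; inference.py's internals may differ.
import numpy as np
from PIL import Image

frames = np.zeros((16, 64, 64, 3), dtype=np.uint8)  # placeholder frames

# Tile the 16 frames left-to-right into a single 64 x (16*64) strip.
filmstrip = np.concatenate(list(frames), axis=1)
Image.fromarray(filmstrip).save("filmstrip.png")
```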
## License
MIT