ArSL-Models: Arabic Sign Language Recognition
Two PyTorch models for Arabic Sign Language (ArSL) recognition, used by the CSLR app (deployed as a Hugging Face Space):
| File | Task | Architecture | Classes |
|---|---|---|---|
| `improved_arsl_model.pth` | Alphabet (finger-spelling) | ResNet18 + BiLSTM + Attention | 29 |
| `sign_word_t5_classifier_best_3d.pth` | Word signs | T5-encoder over hand landmarks | 10 |
⚠️ Note: Quantitative results below (accuracy, F1, dataset sizes) are placeholders pending the accompanying paper. They will be updated with the exact reported figures.
1. Alphabet model: improved_arsl_model.pth
A spatial-sequential classifier for static Arabic letter signs.
Architecture (ArSLAttentionLSTM)
- Backbone: ResNet18 (ImageNet-pretrained), final pooling/fc removed → `512×7×7` feature map.
- Sequence: the `7×7` grid is flattened to a length-49 sequence of 512-d vectors.
- Recurrence: 2-layer bidirectional LSTM, hidden size 512 (→ 1024-d outputs).
- Attention: additive attention pools the LSTM outputs into one context vector.
- Head: `1024 → 512 → 256 → 29` MLP with BatchNorm, ReLU, dropout (0.5).
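The attention step above can be illustrated with a minimal pure-Python sketch of additive attention pooling. The parameter names `W` and `v`, the attention size `A`, and the toy feature size are assumptions made for illustration; the exact parameterization inside `ArSLAttentionLSTM` may differ.

```python
import math
import random

def additive_attention_pool(outputs, W, v):
    """Pool a sequence of vectors into one context vector.

    outputs: list of T vectors of length D (e.g. the BiLSTM outputs).
    W: A x D projection matrix, v: length-A scoring vector (hypothetical names).
    """
    # Score each timestep: e_t = v . tanh(W h_t)
    scores = []
    for h in outputs:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]
        scores.append(sum(vi * ui for vi, ui in zip(v, hidden)))
    # Softmax over timesteps (subtract the max for numerical stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the timestep vectors
    dim = len(outputs[0])
    context = [sum(a * h[d] for a, h in zip(weights, outputs)) for d in range(dim)]
    return context, weights

# Toy sizes for readability; the model card states T = 49 and D = 1024.
random.seed(0)
T, D, A = 49, 8, 4
outputs = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
W = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(A)]
v = [random.uniform(-1, 1) for _ in range(A)]
context, weights = additive_attention_pool(outputs, W, v)
```

The weights sum to 1 across the 49 grid positions, so the context vector is a convex combination of the per-position features.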
Input
- RGB image, resized to 224×224, normalized with ImageNet mean/std (`[0.485, 0.456, 0.406]` / `[0.229, 0.224, 0.225]`).
- Hand presence is verified with MediaPipe before classification.
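As a rough illustration of the normalization step, the sketch below applies the ImageNet mean/std to a single pixel. It assumes 8-bit channel values are first scaled to [0, 1], which is the usual convention but is not stated in this card; `normalize_pixel` is a hypothetical helper name.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to ImageNet-normalized channel values."""
    # Assumption: channels are scaled to [0, 1] before mean/std normalization.
    return tuple((c / 255.0 - m) / s
                 for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))

white = normalize_pixel((255, 255, 255))
black = normalize_pixel((0, 0, 0))
```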
Output: softmax over 29 classes.
Alphabet label map (index → letter)
| idx | letter | idx | letter | idx | letter | idx | letter | idx | letter |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ع | 6 | ف | 12 | ك | 18 | ر | 24 | ث |
| 1 | أ | 7 | ق | 13 | خ | 19 | ص | 25 | ذ |
| 2 | ب | 8 | غ | 14 | لا | 20 | س | 26 | و |
| 3 | د | 9 | ه | 15 | ل | 21 | ش | 27 | ي |
| 4 | ظ | 10 | ح | 16 | م | 22 | ط | 28 | ز |
| 5 | ض | 11 | ج | 17 | ن | 23 | ت | | |
Reported metrics (to update from paper)
| Metric | Value |
|---|---|
| Test accuracy | TBD |
| Macro F1 | TBD |
| Dataset / split | TBD |
2. Word model: sign_word_t5_classifier_best_3d.pth
A landmark-based classifier for dynamic word signs.
Architecture (T5EncoderClassifier)
- Base: encoder of `google-t5/t5-small` (d_model = 512).
- Input projection: `Linear(feature_dim → 512) → Dropout → LayerNorm → GELU`.
- Pooling: first-token (CLS-style) hidden state of the encoder.
- Head: `Dropout → Linear(512 → 10)`.
Input
- MediaPipe hand landmarks: 21 landmarks × 3 coords (x, y, z) for 1 hand → `feature_dim = 63`.
- Landmarks are wrist-centered and scaled by the wrist-to-middle-MCP distance.
- A single frame's landmarks are tiled to a sequence length of 100 with an all-ones attention mask.
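The preprocessing described above could look roughly like the following pure-Python sketch. It uses the MediaPipe Hands indices 0 (wrist) and 9 (middle-finger MCP); the helper names, the zero-distance guard, and the synthetic landmarks are assumptions for illustration and are not taken from the CSLR code.

```python
import math

WRIST, MIDDLE_MCP = 0, 9  # MediaPipe Hands landmark indices
SEQ_LEN = 100

def normalize_landmarks(landmarks):
    """Wrist-center 21 (x, y, z) landmarks and scale by the wrist-to-middle-MCP distance."""
    wx, wy, wz = landmarks[WRIST]
    centered = [(x - wx, y - wy, z - wz) for x, y, z in landmarks]
    scale = math.sqrt(sum(c * c for c in centered[MIDDLE_MCP]))
    scale = scale if scale > 1e-8 else 1.0  # guard for a degenerate hand (assumption)
    # Flatten to a 63-value feature vector: 21 landmarks x 3 coords
    return [c / scale for point in centered for c in point]

def tile_frame(features, seq_len=SEQ_LEN):
    """Repeat one frame's features seq_len times with an all-ones attention mask."""
    sequence = [list(features) for _ in range(seq_len)]
    attention_mask = [1] * seq_len
    return sequence, attention_mask

# Synthetic landmarks standing in for MediaPipe output
landmarks = [(0.5 + 0.01 * i, 0.5 - 0.02 * i, 0.001 * i) for i in range(21)]
features = normalize_landmarks(landmarks)
sequence, attention_mask = tile_frame(features)
```

After normalization the wrist sits at the origin and the middle-finger MCP lies at unit distance from it, which makes the features independent of hand position and apparent size in the frame.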
Output: softmax over 10 classes.
Word label map (index → Arabic → English)
| idx | Arabic | English |
|---|---|---|
| 0 | ينام | sleep |
| 1 | يسكت | be silent |
| 2 | حب | love |
| 3 | يدخن | smoke |
| 4 | دعم | support |
| 5 | مرتبك | confused |
| 6 | قلق | worried |
| 7 | هنا | here |
| 8 | السلام عليكم | greeting (peace be upon you) |
| 9 | شكرا | thanks |
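To show how a prediction maps back to this table, here is a small sketch that applies a softmax to a vector of 10 scores and looks up the winning label. It assumes the model returns raw logits; the logits below are made up, and `decode_prediction` is a hypothetical helper.

```python
import math

# English glosses from the label map above, in index order
WORD_LABELS = ["sleep", "be silent", "love", "smoke", "support",
               "confused", "worried", "here",
               "greeting (peace be upon you)", "thanks"]

def decode_prediction(logits, labels=WORD_LABELS):
    """Return (label, probability) for the highest-scoring class."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return labels[idx], probs[idx]

# Made-up logits for illustration; index 9 holds the largest value.
logits = [0.1, -1.2, 0.3, 0.0, -0.5, 0.2, -0.3, 0.4, 1.0, 2.5]
label, prob = decode_prediction(logits)
```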
Reported metrics (to update from paper)
| Metric | Value |
|---|---|
| Test accuracy | TBD |
| Macro F1 | TBD |
| Dataset / split | TBD |
Usage
```python
import torch
from huggingface_hub import hf_hub_download

# --- Alphabet model ---
from models.alphabet_model import ArSLAttentionLSTM  # from the CSLR repo

ckpt = hf_hub_download("FatimahEmadEldin/ArSL-Models", "improved_arsl_model.pth")
model = ArSLAttentionLSTM(num_classes=29, hidden_size=512, num_layers=2,
                          bidirectional=True, dropout_rate=0.5)

# The checkpoint may be a bare state dict or a dict wrapping one under "model_state_dict".
state = torch.load(ckpt, map_location="cpu")
state = state.get("model_state_dict", state) if isinstance(state, dict) else state
model.load_state_dict(state, strict=False)  # strict=False tolerates missing or unexpected keys
model.eval()
```
The full inference pipeline (MediaPipe hand detection, preprocessing, the T5 word model, and a web UI) is available in the CSLR repository.
Intended use & limitations
- Intended use: education, accessibility demos, and research on Arabic sign language recognition.
- Limitations: trained on a limited label set (29 letters / 10 words); accuracy depends on lighting, camera angle, hand visibility, and signing style. The word model classifies from a single tiled frame and is not a full continuous-sign sequence model. Not validated for clinical or safety-critical use.
Citation
```bibtex
@misc{arsl_models_2026,
  title        = {ArSL-Models: Arabic Sign Language Recognition},
  author       = {Fatimah Emad Eldin},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/FatimahEmadEldin/ArSL-Models}}
}
```
Paper details and full results to be added.
License
MIT