---
license: apache-2.0
---

## Ishara: ASL Fingerspelling Recognition

Ishara is a deep learning model for recognizing American Sign Language (ASL) fingerspelling. It uses a hybrid architecture that combines **Squeezeformer** and **Conformer** blocks with **Conv1D layers** to extract features from hand, face, and pose landmark data.

This model was built as a submission to the Google ASLFR Competition and predicts fingerspelled text at the character level; its measured scores are listed under Results.

---

## Model Description

Ishara processes sequences of normalized hand, face, and pose landmarks and predicts fingerspelled words one character at a time. To handle temporal variability and missing data, the architecture combines:

- **Conv1D layers**: For initial temporal feature extraction.
- **Squeezeformer blocks**: For efficient sequence modeling.
- **Conformer blocks**: For enhanced feature extraction.

The model outputs character-level sequences and is trained with **Connectionist Temporal Classification (CTC)** loss.

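
Because the model is trained with CTC, its per-frame outputs have to be collapsed into text. The sketch below shows the standard greedy CTC rule (take the best class per frame, merge repeats, drop blanks). The vocabulary, the blank index, and the function name are assumptions made for illustration; the character mapping used by Ishara is not documented here.

```python
from itertools import groupby

# Hypothetical vocabulary for illustration: index 0 is assumed to be the CTC blank.
BLANK = 0
VOCAB = {1: "a", 2: "b", 3: "c"}


def ctc_greedy_decode(frame_scores, blank=BLANK, vocab=VOCAB):
    """Collapse per-frame class scores into a string with greedy CTC decoding."""
    # 1. Pick the highest-scoring class index for every frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # 2. Merge runs of identical consecutive indices into one.
    collapsed = [idx for idx, _ in groupby(best)]
    # 3. Drop blanks and map the remaining indices to characters.
    return "".join(vocab[idx] for idx in collapsed if idx != blank)


frames = [
    [0.10, 0.80, 0.05, 0.05],  # "a"
    [0.10, 0.70, 0.10, 0.10],  # "a" again, merged with the previous frame
    [0.90, 0.05, 0.03, 0.02],  # blank, separates the two "a" characters
    [0.20, 0.60, 0.10, 0.10],  # "a"
    [0.10, 0.10, 0.70, 0.10],  # "b"
    [0.10, 0.10, 0.10, 0.70],  # "c"
]
print(ctc_greedy_decode(frames))  # aabc
```

The blank frame is what lets CTC emit a doubled letter: without it, the two runs of "a" would merge into one.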

---

## Dataset

The model was trained and evaluated on the dataset provided by the [Google ASLFR Competition](https://www.kaggle.com/competitions/asl-fingerspelling), which consists of:

- **Hand landmarks**: 21 points each for left and right hands.
- **Face landmarks**: 40 key points.
- **Pose landmarks**: 10 key points.
- **Labels**: Text sequences representing fingerspelled words.

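
The landmark counts above add up to 92 points per frame. The sketch below computes that total and shows one plausible way to normalize a frame while tolerating missing points. The 2-D `(x, y)` layout, the per-frame standardization, and the choice to replace missing points with zeros are all assumptions; the preprocessing actually used for Ishara is not specified in this card.

```python
import math

# Landmark counts taken from the list above.
N_HAND, N_FACE, N_POSE = 21, 40, 10
N_LANDMARKS = 2 * N_HAND + N_FACE + N_POSE  # 92 points per frame


def normalize_frame(points):
    """Standardize one frame of points; missing points (NaN) become zeros.

    Assumed scheme: subtract the per-axis mean and divide by the per-axis
    standard deviation, both computed over the points that are present.
    """
    def is_missing(p):
        return any(math.isnan(c) for c in p)

    valid = [p for p in points if not is_missing(p)]
    if not valid:
        return [(0.0,) * len(p) for p in points]

    dims = range(len(valid[0]))
    mean = [sum(p[d] for p in valid) / len(valid) for d in dims]
    # Fall back to 1.0 when an axis has zero spread, to avoid dividing by zero.
    std = [
        math.sqrt(sum((p[d] - mean[d]) ** 2 for p in valid) / len(valid)) or 1.0
        for d in dims
    ]
    return [
        (0.0,) * len(p) if is_missing(p)
        else tuple((p[d] - mean[d]) / std[d] for d in dims)
        for p in points
    ]


frame = [(0.0, 0.0), (2.0, 4.0), (math.nan, math.nan)]
print(N_LANDMARKS)             # 92
print(normalize_frame(frame))  # [(-1.0, -1.0), (1.0, 1.0), (0.0, 0.0)]
```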

---

## Usage

### Inference with TFLite

The model is available in TensorFlow Lite format for real-time inference. To use the model:

```python
import tensorflow as tf

# Load the TFLite model and allocate its tensors
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

# Look up the input and output tensor descriptions
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a preprocessed landmark sequence; its shape and dtype must match
# input_details[0]['shape'] and input_details[0]['dtype']
input_data = ...  # Preprocessed input sequence
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()

# Read the raw model output
output_data = interpreter.get_tensor(output_details[0]['index'])
print("Predicted Sequence:", output_data)
```
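
To check inference latency on your own hardware, you can time repeated calls to the interpreter. The helper below is a minimal sketch: `mean_latency_ms` and `fake_invoke` are hypothetical names, and the stand-in workload only exists so the example runs without TensorFlow. With the snippet above you would pass `interpreter.invoke` instead.

```python
import time


def mean_latency_ms(run_once, warmup=3, repeats=20):
    """Return the mean wall-clock time of run_once() in milliseconds."""
    # Warm-up calls are excluded so one-time setup costs do not skew the mean.
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(repeats):
        run_once()
    return (time.perf_counter() - start) * 1000.0 / repeats


def fake_invoke():
    """Stand-in workload; replace with interpreter.invoke for a real measurement."""
    sum(i * i for i in range(10_000))


print(f"{mean_latency_ms(fake_invoke):.3f} ms per call")
```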

---

### Training Workflow

You can replicate the training process using TensorFlow. The training loop is as follows:

```python
from model import get_model

# Define the model
model = get_model(
    dim=256,
    num_conv_squeeze_blocks=2,
    num_conv_conform_blocks=2,
    kernel_sizes=[11, 5, 3],
    num_conv_per_block=3,
    dropout_rate=0.2,
)

# Train the model.
# train_dataset, val_dataset, N_EPOCHS and the three callbacks are not
# defined in this snippet and must be set up before calling fit().
history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=N_EPOCHS,
    callbacks=[validation_callback, lr_callback, WeightDecayCallback()],
)
```
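
The training snippet refers to an `lr_callback` without defining it. As one possible shape for such a callback, the sketch below implements a linear warmup followed by cosine decay. The schedule itself and every value in it (`total_epochs`, `peak_lr`, `warmup_epochs`) are assumptions for illustration, not the settings used to train Ishara.

```python
import math


def lr_schedule(epoch, total_epochs=100, peak_lr=1e-3, warmup_epochs=5):
    """Hypothetical schedule: linear warmup to peak_lr, then cosine decay to zero."""
    if epoch < warmup_epochs:
        # Ramp up linearly so that the last warmup epoch reaches peak_lr.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warmup training that has elapsed, in [0, 1].
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 4, 5, 50, 100):
    print(epoch, lr_schedule(epoch))
```

In Keras, a function like this can be wrapped as `tf.keras.callbacks.LearningRateScheduler(lambda epoch, lr: lr_schedule(epoch))` and passed in the `callbacks` list.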

---

## Model Evaluation

The model's performance is evaluated using:

- **Levenshtein Distance**: The number of character insertions, deletions, and substitutions needed to turn a prediction into its label.
- **Normalized Character Error Rate (CER)**: The Levenshtein distance divided by the length of the label.
- **Real-Time Inference Speed**: Assessed on 1080p video inputs.

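
The first two metrics can be computed with a few lines of Python. The sketch below is a minimal reference implementation, not the evaluation code used for Ishara. The corpus-level score `(N - D) / N`, with `N` the total number of label characters and `D` the total Levenshtein distance, is assumed here to match the competition's normalized metric; the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def cer(prediction, reference):
    """Character error rate: edit distance divided by the reference length."""
    return levenshtein(prediction, reference) / len(reference)


def normalized_score(predictions, references):
    """Assumed corpus-level score (N - D) / N; higher is better."""
    n = sum(len(r) for r in references)
    d = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return (n - d) / n


print(levenshtein("kitten", "sitting"))                          # 3
print(cer("helo", "hello"))                                      # 0.2
print(normalized_score(["helo", "world"], ["hello", "world"]))   # 0.9
```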

---

## Results

- **Normalized Levenshtein Distance**: 0.728
- **Inference Speed**: 200 ms
- **Model Size**: 17.9 MB

---

## Deployment

The model is optimized for deployment in real-time systems using TensorFlow Lite. This makes it suitable for integration into mobile and embedded systems for ASL recognition tasks.

---

## License

This model is released under the [Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0).

---

## Acknowledgments

- **Google ASLFR Competition**: For providing the dataset.
- **TensorFlow Team**: For the deep learning framework.
- **Paper Authors**: For inspiring the architecture.
  - [Squeezeformer](https://arxiv.org/abs/2206.00888)
  - [Conformer](https://arxiv.org/abs/2005.08100)

---

## Citation

If you use this model, please consider citing:

```
@misc{ishara_asl,
  title={Ishara: ASL Fingerspelling Recognition},
  author={Niharika Gupta and Tanay Srinivasa and Tanmay Nanda and Zoya Ghoshal},
  year={2024},
  howpublished={\url{https://huggingface.co/ishara-asl}}
}
```

---

## Contact

For questions or collaboration, feel free to reach out:

- **Tanmay Nanda**: tanmaynanda360@gmail.com