|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- tunisian-arabic |
|
|
- nlp |
|
|
- transformers |
|
|
- bert |
|
|
- distillation |
|
|
- low-resource |
|
|
- open-source |
|
|
- sentiment-analysis |
|
|
- language-model |
|
|
license: mit |
|
|
datasets: |
|
|
- hamzabouajila/tunisian-derja-unified-raw-corpus |
|
|
language: |
|
|
- ar |
|
|
base_model: |
|
|
- tunis-ai/TunBERT |
|
|
--- |
|
|
|
|
|
|
|
|
# Distilled TunBERT |
|
|
|
|
|
A distilled, efficient version of **TunBERT** for Tunisian Arabic. |
|
|
This model is **faster, smaller, and fully reproducible** thanks to an **open Tunisian corpus** and a transparent distillation pipeline. |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
* **Developed by:** Hamza Bouajila |
|
|
* **Model type:** Distilled BERT (student: `distilbert-base-uncased`) |
|
|
* **Teacher model:** [TunBERT](https://huggingface.co/tunis-ai/TunBERT) (frozen) |
|
|
* **Language(s):** Tunisian Arabic (Darija) |
|
|
* **License:** MIT |
|
|
* **Finetuned from:** `distilbert-base-uncased` |
|
|
* **Status:** Research prototype (not production-ready) |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
* **Repository:** \[GitHub Link] |
|
|
* **Model weights:** [HuggingFace](https://huggingface.co/hamzabouajila/distilled_tunbert) |
|
|
* **Paper (draft):** Coming soon (arXiv) |
|
|
|
|
|
--- |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
* Text classification in Tunisian Arabic (e.g., sentiment analysis, topic classification). |
|
|
* Research on knowledge distillation for low-resource languages. |
|
|
* Educational use in model efficiency, open corpus training, and reproducibility. |
|
|
|
|
|
### Downstream Use |
|
|
|
|
|
* Fine-tuning for **NLP tasks in Tunisian Arabic**: NER, sentiment, intent detection, etc. (a fine-tuning sketch follows this list). |
|
|
* Embedding-based applications (with caution — embeddings not aligned to teacher). |
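
A minimal fine-tuning sketch for a classification task, assuming a labeled Tunisian Arabic dataset stored as CSV files with `text` and `label` columns. The file names, label count, and hyperparameters below are illustrative placeholders, not part of this release:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

model_id = "hamzabouajila/distilled_tunbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=2 assumes a binary task (e.g., positive/negative sentiment)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Hypothetical CSV files with "text" and "label" columns
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="distilled_tunbert_sentiment",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding per batch
)
trainer.train()
```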
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
* Not suitable for semantic search or cross-model embedding alignment. |
|
|
* Not recommended for critical applications (e.g., healthcare, law) without further evaluation. |
|
|
|
|
|
--- |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
* **Bias:** Model inherits cultural/linguistic biases present in the Tunisian corpus. |
|
|
* **Limitations:** |
|
|
|
|
|
* Embeddings show **near-zero similarity** with teacher (`cosine ≈ 0.02`) due to tokenizer mismatch and lack of hidden-state loss. |
|
|
* Teacher (TunBERT) itself may have limitations (training data not public). |
|
|
* **Risk:** Misuse in contexts requiring semantic alignment (e.g., search, embeddings). |
|
|
|
|
|
### Recommendations |
|
|
|
|
|
* Use for **classification/logit-based tasks**, not for embedding similarity. |
|
|
* Consider retraining with hidden-state alignment if embeddings are needed (a minimal loss sketch follows this list). |
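
For reference, a minimal sketch of what a hidden-state alignment term could look like. This is not part of the released training code; the optional `projection` layer is only needed when teacher and student hidden sizes differ:

```python
import torch
import torch.nn.functional as F

def hidden_state_alignment_loss(student_hidden, teacher_hidden, projection=None):
    """Cosine-embedding loss pushing student hidden states toward the teacher's."""
    if projection is not None:
        # Optional linear layer mapping student hidden size -> teacher hidden size
        student_hidden = projection(student_hidden)
    # Flatten (batch, seq_len, hidden) -> (batch * seq_len, hidden)
    s = student_hidden.reshape(-1, student_hidden.size(-1))
    t = teacher_hidden.reshape(-1, teacher_hidden.size(-1))
    target = torch.ones(s.size(0), device=s.device)  # +1 target: maximise cosine similarity
    return F.cosine_embedding_loss(s, t, target)
```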
|
|
|
|
|
--- |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModel

# Load the distilled student model and its Tunisian Arabic tokenizer
tokenizer = AutoTokenizer.from_pretrained("hamzabouajila/distilled_tunbert")
model = AutoModel.from_pretrained("hamzabouajila/distilled_tunbert")

# Example sentence in Tunisian Arabic ("I like this model, it works fast")
text = "نحب النموذج هذا يخدم بسرعه"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state: (batch, seq_len, hidden_size)
```
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
* **Source:** Curated open Tunisian Arabic corpus, released publicly as [hamzabouajila/tunisian-derja-unified-raw-corpus](https://huggingface.co/datasets/hamzabouajila/tunisian-derja-unified-raw-corpus). |
|
|
* **Transparency:** Fully documented and reproducible. |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
* **Teacher:** TunBERT (frozen) |
|
|
* **Student:** `distilbert-base-uncased` (English initialization) with a Tunisian Arabic tokenizer |
|
|
* **Loss:** KL-divergence on logits, with no hidden-state loss (see the sketch below) |
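
The distillation objective is a temperature-scaled KL divergence on logits. A minimal sketch of that term, assuming teacher and student produce logits over a shared output space; the temperature and shapes here are illustrative, not the released training configuration:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL divergence between teacher and student logits."""
    # Soften both distributions with the temperature
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean KL, rescaled by T^2 as in the standard distillation formulation
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Toy usage: logits of shape (batch, num_classes)
student_logits = torch.randn(4, 2, requires_grad=True)
with torch.no_grad():                      # teacher is frozen
    teacher_logits = torch.randn(4, 2)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```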
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
* **Precision:** fp16 mixed precision |
|
|
* **Optimizer:** AdamW |
|
|
* **Batch size / Epochs:** \[More Information Needed] |
|
|
* **Learning rate:** \[More Information Needed] |
|
|
|
|
|
#### Speeds, Sizes, Times |
|
|
|
|
|
* Parameters: **66M** (vs 109M for teacher) |
|
|
* Avg inference: **0.058s** (vs 0.106s → **1.83× faster**) |
|
|
* Model size: **1.65× smaller** (a benchmarking sketch follows this list) |
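
The latency and parameter figures above depend on hardware. A rough way to reproduce them on your own machine; this is a sketch, not the exact benchmarking script, and the teacher can be measured the same way if it loads via `AutoModel`:

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer

repo = "hamzabouajila/distilled_tunbert"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo).eval()

print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M")

inputs = tokenizer("نحب النموذج هذا يخدم بسرعه", return_tensors="pt")
with torch.no_grad():
    model(**inputs)                      # warm-up pass
    start = time.perf_counter()
    for _ in range(50):
        model(**inputs)
avg = (time.perf_counter() - start) / 50
print(f"Avg inference time: {avg:.3f}s per forward pass")
```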
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
|
|
* **Benchmark task:** Tunisian Sentiment Analysis Corpus (TSAC) |
|
|
* **Metrics:** Perplexity, inference speed, parameter count, embedding cosine similarity (see the similarity sketch below) |
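
The embedding-similarity figure is the kind of value obtained by mean-pooling the last hidden states of teacher and student for the same sentence and taking the cosine. A sketch, assuming both checkpoints load through 🤗 Transformers; this is not the exact evaluation script:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def sentence_embedding(repo_id, text):
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModel.from_pretrained(repo_id).eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state         # (1, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1).float()  # mean-pool over real tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

text = "نحب النموذج هذا يخدم بسرعه"
student = sentence_embedding("hamzabouajila/distilled_tunbert", text)
teacher = sentence_embedding("tunis-ai/TunBERT", text)  # assumes the teacher loads via AutoModel
print(F.cosine_similarity(student, teacher).item())     # near zero, per the table below
```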
|
|
|
|
|
### Results |
|
|
|
|
|
| Metric | Original TunBERT | Distilled TunBERT | Notes | |
|
|
| ------------------------ | ---------------- | ----------------- | ---------------------------------------------------- | |
|
|
| **Perplexity** | 34838.7 | **4.26** | Student reaches strong LM perplexity; the teacher's very high score suggests its LM head was not properly initialized. | |
|
|
| **Inference Time (s)** | 0.106 | **0.058** | **1.83× faster** | |
|
|
| **Parameters** | 109M | **66M** | **1.65× smaller** | |
|
|
| **Embedding Similarity** | — | **0.02** | Near-zero due to tokenizer mismatch | |
|
|
| **Training Data** | Unknown | **Open corpus** | Fully reproducible | |
|
|
|
|
|
#### Summary |
|
|
|
|
|
The distilled model is **faster, lighter, and trained on open data**. |
|
|
It performs competitively on classification tasks, but its embeddings should not be used for similarity-based applications. |
|
|
|
|
|
--- |
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
* **Hardware:** NVIDIA V100 GPU |
|
|
* **Training hours:** \[More Information Needed] |
|
|
* **Cloud provider:** \[More Information Needed] |
|
|
* **Carbon emitted:** Estimated via [ML CO₂ Impact Calculator](https://mlco2.github.io/impact#compute) |
|
|
|
|
|
--- |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture and Objective |
|
|
|
|
|
* **Architecture:** DistilBERT |
|
|
* **Objective:** Knowledge Distillation (logit alignment only) |
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
* **Hardware:** \[e.g., 1× NVIDIA V100 GPU] |
|
|
* **Software:** PyTorch + 🤗 Transformers |
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
**BibTeX:** |
|
|
|
|
|
```bibtex |
|
|
@misc{bouajila2025distilledtunbert, |
|
|
title={Distilled TunBERT: Efficient Tunisian Arabic BERT via Knowledge Distillation}, |
|
|
  author={Bouajila, Hamza},
|
|
year={2025}, |
|
|
publisher={HuggingFace}, |
|
|
howpublished={\url{https://huggingface.co/hamzabouajila/distilled_tunbert}} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
* Hamza Bouajila |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
* Email: bouajilahamza@outlook.com |
|
|
* LinkedIn: [hamzabouajila](https://www.linkedin.com/in/hamzabouajila) |
|
|
|
|
|