---
library_name: transformers
tags:
- tunisian-arabic
- nlp
- transformers
- bert
- distillation
- low-resource
- open-source
- sentiment-analysis
- language-model
license: mit
datasets:
- hamzabouajila/tunisian-derja-unified-raw-corpus
language:
- ar
base_model:
- tunis-ai/TunBERT
---
# Distilled TunBERT
A distilled, efficient version of **TunBERT** for Tunisian Arabic.
This model is **faster, smaller, and fully reproducible** thanks to an **open Tunisian corpus** and a transparent distillation pipeline.
---
## Model Details
### Model Description
* **Developed by:** Hamza Bouajila
* **Model type:** Distilled BERT (student: `distilbert-base-uncased`)
* **Teacher model:** [TunBERT](https://huggingface.co/tunis-ai/TunBERT) (frozen)
* **Language(s):** Tunisian Arabic (Darija)
* **License:** MIT
* **Finetuned from:** `distilbert-base-uncased`
* **Status:** Research prototype (not production-ready)
### Model Sources
* **Repository:** [More Information Needed]
* **Model weights:** [HuggingFace](https://huggingface.co/hamzabouajila/distilled_tunbert)
* **Paper (draft):** Coming soon (arXiv)
---
## Uses
### Direct Use
* Text classification in Tunisian Arabic (e.g., sentiment analysis, topic classification).
* Research on knowledge distillation for low-resource languages.
* Educational use in model efficiency, open corpus training, and reproducibility.
### Downstream Use
* Fine-tuning for **NLP tasks in Tunisian Arabic**: NER, sentiment, intent detection, etc.
* Embedding-based applications (with caution: embeddings are not aligned with the teacher).
### Out-of-Scope Use
* Not suitable for semantic search or cross-model embedding alignment.
* Not recommended for critical applications (e.g., healthcare, law) without further evaluation.
---
## Bias, Risks, and Limitations
* **Bias:** Model inherits cultural/linguistic biases present in the Tunisian corpus.
* **Limitations:**
  * Embeddings show **near-zero similarity** with the teacher (`cosine ≈ 0.02`) due to tokenizer mismatch and the absence of a hidden-state loss.
  * The teacher (TunBERT) itself may have limitations (its training data is not public).
* **Risk:** Misuse in contexts requiring semantic alignment (e.g., search, embeddings).
### Recommendations
* Use for **classification/logit-based tasks**, not for embedding similarity.
* Consider retraining with hidden-state alignment if embeddings are needed (see the sketch below).
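Hidden-state alignment is not part of this release; the sketch below shows, purely for illustration, how a cosine alignment term over mean-pooled hidden states could be added alongside the logit loss. It assumes both models expose matching 768-dimensional hidden states (as BERT-base and DistilBERT-base do); the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def hidden_state_alignment_loss(student_hidden, teacher_hidden):
    """Cosine alignment between final-layer hidden states (illustrative only).

    Assumes tensors of shape (batch, seq_len, 768). Because the student and
    teacher tokenizers differ in this release, a real setup would need token
    alignment; mean-pooling over the sequence is used here to sidestep that.
    """
    student_vec = student_hidden.mean(dim=1)
    teacher_vec = teacher_hidden.mean(dim=1)
    # Target of 1 pushes each pooled pair toward cosine similarity of 1.
    target = torch.ones(student_vec.size(0), device=student_vec.device)
    return F.cosine_embedding_loss(student_vec, teacher_vec, target)
```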
---
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hamzabouajila/distilled_tunbert")
model = AutoModel.from_pretrained("hamzabouajila/distilled_tunbert")

# "I like this model, it works fast" (Tunisian Arabic)
text = "نحب النموذج هذا يخدم بسرعه"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state has shape (1, seq_len, hidden_size)
```
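For the recommended classification use, the sketch below fine-tunes the checkpoint with `AutoModelForSequenceClassification` and the `Trainer` API. The dataset, label count, and hyperparameters are placeholders for illustration, not settings from this release.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_id = "hamzabouajila/distilled_tunbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=2 assumes a binary (positive/negative) sentiment setup.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# `train_ds` and `eval_ds` are placeholders for a labeled Tunisian Arabic dataset
# with "text" and "label" columns (e.g., loaded with `datasets.load_dataset`):
# train_ds = train_ds.map(tokenize, batched=True)
# eval_ds = eval_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="distilled_tunbert_sentiment",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
    # fp16=True,  # enable on GPU to match the mixed-precision setting used for distillation
)

# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```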
---
## Training Details
### Training Data
* **Source:** Curated open Tunisian Arabic corpus (public release).
* **Transparency:** Fully documented and reproducible.
### Training Procedure
* **Teacher:** TunBERT (frozen)
* **Student:** distilbert-base-uncased (English) + Tunisian tokenizer
* **Loss:** KL divergence on logits (no hidden-state loss); a minimal sketch is shown below
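The logit-only objective is roughly the following; the temperature value and function name are illustrative, not the exact training code.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student logits (logit-only).

    No hidden-state or embedding alignment term is included, which is why the
    released model's embeddings are not aligned with the teacher's.
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # "batchmean" is the correct reduction for the KL divergence definition.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return kl * temperature ** 2
```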
#### Training Hyperparameters
* **Precision:** fp16 mixed precision
* **Optimizer:** AdamW
* **Batch size / Epochs:** [More Information Needed]
* **Learning rate:** [More Information Needed]
#### Speeds, Sizes, Times
* Parameters: **66M** (vs 109M for teacher)
* Avg inference: **0.058s** (vs 0.106s → **1.83× faster**)
* Model size: **1.65× smaller**
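The latency figures above depend on hardware, batch size, and sequence length; the helper below is a rough, illustrative way to measure single-example latency, not the benchmark script used for the numbers reported here.

```python
import time
import torch
from transformers import AutoTokenizer, AutoModel

def mean_latency(model_id, text, runs=50):
    """Rough single-example latency; results vary with hardware and input length."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        model(**inputs)  # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            model(**inputs)
    return (time.perf_counter() - start) / runs

print(mean_latency("hamzabouajila/distilled_tunbert", "نحب النموذج هذا يخدم بسرعه"))
# Repeat with the teacher checkpoint for a side-by-side comparison
# (the original TunBERT release may need a different loading path).
```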
---
## Evaluation
### Testing Data, Factors & Metrics
* **Benchmark task:** Tunisian Sentiment Analysis Corpus (TSAC)
* **Metrics:** Perplexity, inference speed, parameter count, embedding cosine similarity
### Results
| Metric | Original TunBERT | Distilled TunBERT | Notes |
| ------------------------ | ---------------- | ----------------- | ---------------------------------------------------- |
| **Perplexity** | 34838.7 | **4.26** | Student shows strong LM performance; the teacher's LM head is likely uninitialized, so its perplexity is not meaningful. |
| **Inference Time (s)** | 0.106 | **0.058** | **1.83× faster** |
| **Parameters** | 109M | **66M** | **1.65× smaller** |
| **Embedding Similarity** | — | **0.02** | Near-zero due to tokenizer mismatch |
| **Training Data** | Unknown | **Open corpus** | Fully reproducible |
#### Summary
The distilled model is **faster, lighter, and trained on open data**.
It performs competitively on classification tasks, but its embeddings should not be used for similarity-based applications.
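The near-zero embedding similarity reported above can be sanity-checked along these lines. This is a rough sketch that mean-pools final hidden states as a sentence embedding and assumes the teacher checkpoint loads with `AutoModel`, which may not hold for the original TunBERT release.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def sentence_embedding(model_id, text):
    """Mean-pooled final hidden state as a crude sentence embedding."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
    return hidden.mean(dim=1)

text = "نحب النموذج هذا يخدم بسرعه"
student = sentence_embedding("hamzabouajila/distilled_tunbert", text)
# teacher = sentence_embedding("tunis-ai/TunBERT", text)  # may need a custom loader
# print(F.cosine_similarity(student, teacher).item())
```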
---
## Environmental Impact
* **Hardware:** NVIDIA V100
* **Training hours:** [More Information Needed]
* **Cloud provider:** [More Information Needed]
* **Carbon emitted:** Estimated via [ML CO₂ Impact Calculator](https://mlco2.github.io/impact#compute)
---
## Technical Specifications
### Model Architecture and Objective
* **Architecture:** DistilBERT
* **Objective:** Knowledge Distillation (logit alignment only)
### Compute Infrastructure
* **Hardware:** [e.g., 1× NVIDIA V100 GPU]
* **Software:** PyTorch + 🤗 Transformers
---
## Citation
**BibTeX:**
```bibtex
@misc{bouajila2025distilledtunbert,
  title        = {Distilled TunBERT: Efficient Tunisian Arabic BERT via Knowledge Distillation},
  author       = {Bouajila, Hamza},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/hamzabouajila/distilled_tunbert}}
}
```
---
## Model Card Authors
* Hamza Bouajila
## Model Card Contact
* Email: bouajilahamza@outlook.com
* LinkedIn: https://www.linkedin.com/in/hamzabouajila