---
library_name: transformers
tags:
- tunisian-arabic
- nlp
- transformers
- bert
- distillation
- low-resource
- open-source
- sentiment-analysis
- language-model
license: mit
datasets:
- hamzabouajila/tunisian-derja-unified-raw-corpus
language:
- ar
base_model:
- tunis-ai/TunBERT
---
# Distilled TunBERT
A distilled, efficient version of **TunBERT** for Tunisian Arabic.
This model is **faster, smaller, and fully reproducible** thanks to an **open Tunisian corpus** and a transparent distillation pipeline.
---
## Model Details
### Model Description
* **Developed by:** Hamza Bouajila
* **Model type:** Distilled BERT (student: `distilbert-base-uncased`)
* **Teacher model:** [TunBERT](https://huggingface.co/tunis-ai/TunBERT) (frozen)
* **Language(s):** Tunisian Arabic (Darija)
* **License:** MIT
* **Finetuned from:** `distilbert-base-uncased`
* **Status:** Research prototype (not production-ready)
### Model Sources
* **Repository:** [More Information Needed]
* **Model weights:** [HuggingFace](https://huggingface.co/hamzabouajila/distilled_tunbert)
* **Paper (draft):** Coming soon (arXiv)
---
## Uses
### Direct Use
* Text classification in Tunisian Arabic (e.g., sentiment analysis, topic classification).
* Research on knowledge distillation for low-resource languages.
* Educational use in model efficiency, open corpus training, and reproducibility.
### Downstream Use
* Fine-tuning for **NLP tasks in Tunisian Arabic**: NER, sentiment, intent detection, etc.
* Embedding-based applications (with caution: embeddings are not aligned with the teacher).
### Out-of-Scope Use
* Not suitable for semantic search or cross-model embedding alignment.
* Not recommended for critical applications (e.g., healthcare, law) without further evaluation.
---
## Bias, Risks, and Limitations
* **Bias:** Model inherits cultural/linguistic biases present in the Tunisian corpus.
* **Limitations:**
  * Embeddings show **near-zero similarity** with the teacher (`cosine ≈ 0.02`) due to tokenizer mismatch and the absence of a hidden-state loss.
  * The teacher (TunBERT) itself may have limitations (its training data is not public).
* **Risk:** Misuse in contexts requiring semantic alignment (e.g., search, embeddings).
### Recommendations
* Use for **classification/logit-based tasks**, not for embedding similarity.
* Consider retraining with hidden-state alignment if embeddings are needed (see the sketch below).
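Hidden-state alignment is not part of this release; the sketch below shows, purely for illustration, how a cosine alignment term over mean-pooled hidden states could be added alongside the logit loss. It assumes both models expose matching 768-dimensional hidden states (as BERT-base and DistilBERT-base do); the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def hidden_state_alignment_loss(student_hidden, teacher_hidden):
    """Cosine alignment between final-layer hidden states (illustrative only).

    Assumes tensors of shape (batch, seq_len, 768). Because the student and
    teacher tokenizers differ in this release, a real setup would need token
    alignment; mean-pooling over the sequence is used here to sidestep that.
    """
    student_vec = student_hidden.mean(dim=1)
    teacher_vec = teacher_hidden.mean(dim=1)
    # Target of 1 pushes each pooled pair toward cosine similarity of 1.
    target = torch.ones(student_vec.size(0), device=student_vec.device)
    return F.cosine_embedding_loss(student_vec, teacher_vec, target)
```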
---
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hamzabouajila/distilled_tunbert")
model = AutoModel.from_pretrained("hamzabouajila/distilled_tunbert")

# "I like this model, it works fast" (Tunisian Arabic)
text = "نحب النموذج هذا يخدم بسرعه"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state has shape (1, seq_len, hidden_size)
```
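For the recommended classification use, the sketch below fine-tunes the checkpoint with `AutoModelForSequenceClassification` and the `Trainer` API. The dataset, label count, and hyperparameters are placeholders for illustration, not settings from this release.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_id = "hamzabouajila/distilled_tunbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=2 assumes a binary (positive/negative) sentiment setup.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# `train_ds` and `eval_ds` are placeholders for a labeled Tunisian Arabic dataset
# with "text" and "label" columns (e.g., loaded with `datasets.load_dataset`):
# train_ds = train_ds.map(tokenize, batched=True)
# eval_ds = eval_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="distilled_tunbert_sentiment",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
    # fp16=True,  # enable on GPU to match the mixed-precision setting used for distillation
)

# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```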
---
## Training Details
### Training Data
* **Source:** Curated open Tunisian Arabic corpus (public release).
* **Transparency:** Fully documented and reproducible.
### Training Procedure
* **Teacher:** TunBERT (frozen)
* **Student:** distilbert-base-uncased (English) + Tunisian tokenizer
* **Loss:** KL divergence on logits (no hidden-state loss); a minimal sketch is shown below
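The logit-only objective is roughly the following; the temperature value and function name are illustrative, not the exact training code.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student logits (logit-only).

    No hidden-state or embedding alignment term is included, which is why the
    released model's embeddings are not aligned with the teacher's.
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # "batchmean" is the correct reduction for the KL divergence definition.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return kl * temperature ** 2
```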
#### Training Hyperparameters
* **Precision:** fp16 mixed precision
* **Optimizer:** AdamW
* **Batch size / Epochs:** [More Information Needed]
* **Learning rate:** [More Information Needed]
#### Speeds, Sizes, Times
* Parameters: **66M** (vs 109M for teacher)
* Avg inference: **0.058s** (vs 0.106s → **1.83× faster**)
* Model size: **1.65× smaller**
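The latency figures above depend on hardware, batch size, and sequence length; the helper below is a rough, illustrative way to measure single-example latency, not the benchmark script used for the numbers reported here.

```python
import time
import torch
from transformers import AutoTokenizer, AutoModel

def mean_latency(model_id, text, runs=50):
    """Rough single-example latency; results vary with hardware and input length."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        model(**inputs)  # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            model(**inputs)
    return (time.perf_counter() - start) / runs

print(mean_latency("hamzabouajila/distilled_tunbert", "نحب النموذج هذا يخدم بسرعه"))
# Repeat with the teacher checkpoint for a side-by-side comparison
# (the original TunBERT release may need a different loading path).
```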
---
## Evaluation
### Testing Data, Factors & Metrics
* **Benchmark task:** Tunisian Sentiment Analysis Corpus (TSAC)
* **Metrics:** Perplexity, inference speed, parameter count, embedding cosine similarity
### Results
| Metric | Original TunBERT | Distilled TunBERT | Notes |
| ------------------------ | ---------------- | ----------------- | ---------------------------------------------------- |
| **Perplexity** | 34838.7 | **4.26** | Student shows strong LM performance; the teacher's LM head is likely uninitialized, so its perplexity is not meaningful. |
| **Inference Time (s)** | 0.106 | **0.058** | **1.83× faster** |
| **Parameters** | 109M | **66M** | **1.65× smaller** |
| **Embedding Similarity** | — | **0.02** | Near-zero due to tokenizer mismatch |
| **Training Data** | Unknown | **Open corpus** | Fully reproducible |
#### Summary
The distilled model is **faster, lighter, and trained on open data**.
It performs competitively on classification tasks, but its embeddings should not be used for similarity-based applications.
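The near-zero embedding similarity reported above can be sanity-checked along these lines. This is a rough sketch that mean-pools final hidden states as a sentence embedding and assumes the teacher checkpoint loads with `AutoModel`, which may not hold for the original TunBERT release.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def sentence_embedding(model_id, text):
    """Mean-pooled final hidden state as a crude sentence embedding."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
    return hidden.mean(dim=1)

text = "نحب النموذج هذا يخدم بسرعه"
student = sentence_embedding("hamzabouajila/distilled_tunbert", text)
# teacher = sentence_embedding("tunis-ai/TunBERT", text)  # may need a custom loader
# print(F.cosine_similarity(student, teacher).item())
```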
---
## Environmental Impact
* **Hardware:** NVIDIA V100
* **Training hours:** [More Information Needed]
* **Cloud provider:** [More Information Needed]
* **Carbon emitted:** Estimated via [ML CO₂ Impact Calculator](https://mlco2.github.io/impact#compute)
---
## Technical Specifications
### Model Architecture and Objective
* **Architecture:** DistilBERT
* **Objective:** Knowledge Distillation (logit alignment only)
### Compute Infrastructure
* **Hardware:** [e.g., 1× NVIDIA V100 GPU]
* **Software:** PyTorch + 🤗 Transformers
---
## Citation
**BibTeX:**
```bibtex
@misc{bouajila2025distilledtunbert,
  title        = {Distilled TunBERT: Efficient Tunisian Arabic BERT via Knowledge Distillation},
  author       = {Bouajila, Hamza},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/hamzabouajila/distilled_tunbert}}
}
```
---
## Model Card Authors
* Hamza Bouajila
## Model Card Contact
* Email: bouajilahamza@outlook.com
* LinkedIn: https://www.linkedin.com/in/hamzabouajila