MAFT Language Identification (FastText)
Summary
This model performs multiclass language identification over 9 language labels, including multiple Arabic varieties and Latin-script Moroccan Arabic (Arabizi).
Beyond standard classification metrics, the model supports selective prediction (accept / abstain) using a confidence threshold τ, enabling high-trust deployment scenarios where incorrect predictions are more costly than abstentions.
The model was trained on our MAFT dataset, publicly available on the Hugging Face Hub:
https://huggingface.co/datasets/Fatnaoui/maft
Reproducibility
The full code for model building (scripts, configs, and execution steps) is maintained here:
GitHub (model build pipeline): https://github.com/Fatnaoui/helpers/tree/main/fasttext
This repository is the reference implementation for regenerating the MAFT model release as published on the Hub.
Supported Labels
| Label | Description |
|---|---|
| `__label__en` | English |
| `__label__fr` | French |
| `__label__es` | Spanish |
| `__label__it` | Italian |
| `__label__ar_msa` | Modern Standard Arabic |
| `__label__ar_ma` | Moroccan Arabic (Arabic script) |
| `__label__ar_ma_latin` | Moroccan Arabic (Latin / Arabizi) |
| `__label__other_ar` | Other Arabic varieties |
| `__label__other_lg` | Other Latin languages |
Standard Metrics (Validation)
| Metric | Value |
|---|---|
| P@1 | 0.98919 |
| R@1 | 0.98919 |
| Accuracy (micro) | 0.9892 |
| Macro-F1 | 0.9891 |
Micro accuracy (0.9892) and macro-F1 (0.9891) are nearly identical, which indicates strong performance that is balanced across all labels.
Selective Prediction Metrics (τ = 0.98)
Selective prediction allows the model to abstain when confidence is below τ.
| Metric | Value |
|---|---|
| TrustPrecision_A | 0.9964 |
| Coverage_A | 0.7509 |
| A2A error rate | 0.001312 |
| MacroPrecision_A | 0.9964 |
Interpretation
- 99.64% of accepted predictions are correct
- The model answers ~75% of inputs at this confidence level
- Very few high-confidence mistakes leak through (A2A ≈ 0.13%)
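The accept / abstain bookkeeping behind these numbers can be sketched as follows. This is a minimal illustration, not the evaluation script from the pipeline repository: it assumes that coverage is the fraction of inputs whose top-1 confidence reaches τ and that trust precision is the accuracy on those accepted inputs. The function name `selective_metrics` and the toy data are hypothetical, and the A2A error rate is not reproduced because its definition is not given here.

```python
from typing import Sequence, Tuple


def selective_metrics(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    confidence: Sequence[float],
    tau: float,
) -> Tuple[float, float]:
    """Return (coverage, precision_on_accepted) for threshold tau.

    Assumed definitions (not taken from the model card):
      coverage  = accepted predictions / all predictions
      precision = correct accepted predictions / accepted predictions
    """
    # Keep only the (true, predicted) pairs whose confidence reaches tau.
    accepted = [(t, p) for t, p, c in zip(y_true, y_pred, confidence) if c >= tau]
    coverage = len(accepted) / len(y_true) if len(y_true) else 0.0
    if not accepted:
        return coverage, 0.0
    correct = sum(1 for t, p in accepted if t == p)
    return coverage, correct / len(accepted)


# Toy data, invented for illustration only.
y_true = ["en", "fr", "ar_ma", "ar_msa"]
y_pred = ["en", "fr", "other_ar", "ar_msa"]
confidence = [0.999, 0.991, 0.62, 0.97]

coverage, precision = selective_metrics(y_true, y_pred, confidence, tau=0.98)
print(f"coverage={coverage:.2f} precision_on_accepted={precision:.2f}")
```

Raising τ generally trades coverage for precision on the accepted subset, so sweeping τ over a validation set is one way to pick an operating point.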
Diagnostic Confidence Signals
| Diagnostic | Value |
|---|---|
| `avg_top1_p_accept` | 0.9964 |
| `avg_top1_p_abstain` | 0.9657 |
| `avg_margin_accept` | 0.9937 |
| `avg_margin_abstain` | 0.9386 |
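Accepted predictions have a higher average top-1 probability (0.9964 vs. 0.9657) and a larger average margin (0.9937 vs. 0.9386) than abstained ones, so the two groups are separated on both signals.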
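A sketch of how diagnostics of this kind could be computed is shown below. It assumes that "margin" means the top-1 probability minus the top-2 probability, which is a common convention but is not defined in this card. The function name `confidence_diagnostics` and the toy scores are hypothetical. With fastText, the two highest scores for a text can be obtained with `model.predict(text, k=2)`.

```python
from statistics import mean
from typing import Dict, List, Sequence


def confidence_diagnostics(
    top2_scores: Sequence[Sequence[float]], tau: float
) -> Dict[str, float]:
    """Average top-1 probability and margin, split by accept / abstain.

    Each item of top2_scores holds the two highest class probabilities
    for one input, highest first. Margin is ASSUMED to be top1 - top2.
    """
    buckets: Dict[str, List[float]] = {
        "avg_top1_p_accept": [],
        "avg_top1_p_abstain": [],
        "avg_margin_accept": [],
        "avg_margin_abstain": [],
    }
    for scores in top2_scores:
        top1, top2 = float(scores[0]), float(scores[1])
        group = "accept" if top1 >= tau else "abstain"
        buckets[f"avg_top1_p_{group}"].append(top1)
        buckets[f"avg_margin_{group}"].append(top1 - top2)
    # An empty group has no meaningful average, so report NaN for it.
    return {k: (mean(v) if v else float("nan")) for k, v in buckets.items()}


# Toy scores, invented for illustration only.
toy_scores = [(0.995, 0.003), (0.990, 0.006), (0.90, 0.08), (0.60, 0.30)]
diagnostics = confidence_diagnostics(toy_scores, tau=0.98)
for name, value in diagnostics.items():
    print(f"{name}: {value:.4f}")
```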
Per-Class Performance (Worst → Best by F1)
| Label | F1 | Precision | Recall | Support |
|---|---|---|---|---|
| `other_ar` | 0.9764 | 0.9844 | 0.9686 | 10,076 |
| `ar_msa` | 0.9857 | 0.9784 | 0.9932 | 10,076 |
| `ar_ma` | 0.9867 | 0.9881 | 0.9854 | 10,076 |
| `en` | 0.9875 | 0.9821 | 0.9930 | 10,076 |
| `es` | 0.9886 | 0.9893 | 0.9880 | 10,076 |
| `it` | 0.9901 | 0.9938 | 0.9864 | 10,076 |
| `fr` | 0.9948 | 0.9948 | 0.9947 | 10,076 |
| `ar_ma_latin` | 0.9950 | 0.9944 | 0.9956 | 10,076 |
| `other_lg` | 0.9971 | 0.9970 | 0.9971 | 10,858 |
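As a sanity check on the per-class figures, the snippet below recomputes F1 as the harmonic mean of precision and recall and averages it across labels. The precision and recall values are copied from the per-class results above, while the helper name `f1_score` is only illustrative. It assumes macro-F1 is the unweighted mean of per-class F1, which is the usual definition.

```python
from typing import Dict, Tuple

# (precision, recall) per label, copied from the per-class results above.
per_class: Dict[str, Tuple[float, float]] = {
    "other_ar": (0.9844, 0.9686),
    "ar_msa": (0.9784, 0.9932),
    "ar_ma": (0.9881, 0.9854),
    "en": (0.9821, 0.9930),
    "es": (0.9893, 0.9880),
    "it": (0.9938, 0.9864),
    "fr": (0.9948, 0.9947),
    "ar_ma_latin": (0.9944, 0.9956),
    "other_lg": (0.9970, 0.9971),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_by_label = {label: f1_score(p, r) for label, (p, r) in per_class.items()}

# Macro-F1: unweighted mean over labels (assumed definition).
macro_f1 = sum(f1_by_label.values()) / len(f1_by_label)

for label, f1 in sorted(f1_by_label.items(), key=lambda item: item[1]):
    print(f"{label:12s} F1={f1:.4f}")
print(f"macro-F1={macro_f1:.4f}")
```

Because the inputs are rounded to four decimals, the recomputed values can differ from the reported ones in the last digit.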
Most Frequent Confusions (True → Predicted)
- `other_ar` → `ar_msa`: 152
- `other_ar` → `ar_ma`: 106
- `ar_ma` → `other_ar`: 79
- `es` → `en`: 72
- `ar_ma` → `ar_msa`: 68
- `it` → `en`: 65
- `ar_msa` → `other_ar`: 55
Five of these seven confusions occur between Arabic varieties (`other_ar`, `ar_msa`, `ar_ma`). These errors are linguistically plausible and reflect genuine language overlap rather than model failure.
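A confusion ranking like this one can be produced by counting the (true, predicted) pairs that disagree. The sketch below uses only the standard library. The function name `top_confusions` and the toy labels are hypothetical and do not come from the evaluation scripts.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_confusions(
    y_true: Sequence[str], y_pred: Sequence[str], n: int = 5
) -> List[Tuple[Tuple[str, str], int]]:
    """Return the n most frequent (true, predicted) pairs where they differ."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(n)


# Toy labels, invented for illustration only.
toy_true = ["other_ar", "other_ar", "ar_ma", "es", "en"]
toy_pred = ["ar_msa", "ar_msa", "other_ar", "en", "en"]

for (true_label, pred_label), count in top_confusions(toy_true, toy_pred):
    print(f"{true_label} → {pred_label}: {count}")
```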
Intended Use
- Language identification for short to medium texts
- Multilingual NLP preprocessing pipelines
- Dialect-aware Arabic text routing
- High-trust or risk-sensitive applications using selective prediction
Usage
```python
from huggingface_hub import hf_hub_download
import fasttext

# Download the model file from the Hub and load it with fastText.
model = fasttext.load_model(hf_hub_download("Morocco-MTNRA-Labs/MAFT_LangID", "model.bin"))

# Confidence threshold for this demo (the evaluation above used τ = 0.98).
# Make sure you are working with numpy==1.x.
TAU = 0.9

examples = [
    "had lblad zwina بزاف",  # Moroccan Arabic (Latin / Arabizi)
    "i can't trust this person on my personal life",  # English
    "Questo è un testo scritto in italiano",  # Italian
]

for text in examples:
    # k=1 returns only the top label and its probability.
    labels, scores = model.predict(text, k=1)
    label, score = labels[0], scores[0]
    if score >= TAU:
        print(f"ACCEPT → {label} (confidence={score:.3f}) | text='{text[:30]}'")
    else:
        print(f"ABSTAIN → confidence={score:.3f} | text='{text[:30]}'")
```

Output:

```
ACCEPT → __label__ar_ma_latin (confidence=0.999) | text='had lblad zwina بزاف'
ACCEPT → __label__en (confidence=0.941) | text='i can't trust this person on m'
ACCEPT → __label__it (confidence=1.000) | text='Questo è un testo scritto in i'
```
Recommended Usage Pattern
- Run prediction with confidence scores
- Apply threshold τ (e.g., 0.98)
- Accept high-confidence predictions
- Abstain on uncertain samples and route them to:
  - a stronger model,
  - domain-specific rules,
  - or human review
This enables controlled deployment with predictable risk.
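The pattern above could be wrapped in a small routing helper along the following lines. This is a sketch under stated assumptions: `route`, `stub_predict`, and `stub_fallback` are hypothetical names, and the stubs stand in for the real model and for whatever fallback (stronger model, rules, human review) a deployment uses. The predictor is passed in as a callable so the example runs without downloading the model.

```python
from typing import Callable, Sequence, Tuple

PredictFn = Callable[[str], Tuple[Sequence[str], Sequence[float]]]
FallbackFn = Callable[[str], str]


def route(
    text: str, predict: PredictFn, fallback: FallbackFn, tau: float = 0.98
) -> Tuple[str, str]:
    """Return (label, source): the top-1 label if confident, else the fallback's."""
    # fastText's predict() handles one line at a time, so newlines are replaced.
    labels, scores = predict(text.replace("\n", " "))
    if float(scores[0]) >= tau:
        return labels[0], "fasttext"
    return fallback(text), "fallback"


# Stand-ins with made-up scores so the sketch is self-contained.
def stub_predict(text: str) -> Tuple[Sequence[str], Sequence[float]]:
    if "italiano" in text:
        return ["__label__it"], [0.999]
    return ["__label__en"], [0.70]


def stub_fallback(text: str) -> str:
    return "needs_human_review"


print(route("Questo è un testo scritto in italiano", stub_predict, stub_fallback))
print(route("ok", stub_predict, stub_fallback))
```

With the real model loaded as in the Usage section, `predict` could be `lambda t: model.predict(t, k=1)`.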
Design Philosophy
This model follows a trust-first strategy:
- Do not guess when uncertain
- Expose risk explicitly
- Prefer abstention over silent errors
Selective metrics (TrustPrecision, Coverage, A2A error rate) are treated as primary signals, not secondary diagnostics.
Data Collection, Filtering, and Training Pipeline
The complete pipeline used to prepare the MAFT dataset and to run training, evaluation, and selective prediction is documented in a separate GitHub repository.
This repository includes:
- MAFT dataset preparation utilities (formatting, splits, packaging),
- FastText training configuration and scripts,
- evaluation scripts (standard and selective metrics).
🔗 GitHub repository (full pipeline):
https://github.com/Fatnaoui/helpers/tree/main/fasttext
Acknowledgements
This work was guided and advised by Mr. Abderrahman Skiredj
🔗 https://www.linkedin.com/in/abderrahman-skiredj-99a80510b/
Model development, experimentation, and evaluation were carried out by Hamza Fatnaoui
🔗 https://www.linkedin.com/in/fatnaoui/
License
Released under the Apache License 2.0.
Contact
For questions, feedback, or collaboration:
- Open an issue on the Hugging Face repository
- Or reach out via LinkedIn (see Acknowledgements)