MAFT Language Identification (FastText)

Summary

This model performs multiclass language identification over 9 language labels, including multiple Arabic varieties and Latin-script Moroccan Arabic (Arabizi).

Beyond standard classification metrics, the model supports selective prediction (accept / abstain) using a confidence threshold tau, enabling high-trust deployment scenarios where incorrect predictions are more costly than abstentions.

The model was trained on our MAFT dataset, publicly available on the Hugging Face Hub:
https://huggingface.co/datasets/Fatnaoui/maft

Reproducibility

The full code for model building (scripts, configs, and execution steps) is maintained here:

GitHub (model build pipeline): https://github.com/Fatnaoui/helpers/tree/main/fasttext

This repository is the reference implementation for regenerating the MAFT model release as published on the Hub.

Supported Labels

Label	Description
`__label__en`	English
`__label__fr`	French
`__label__es`	Spanish
`__label__it`	Italian
`__label__ar_msa`	Modern Standard Arabic
`__label__ar_ma`	Moroccan Arabic (Arabic script)
`__label__ar_ma_latin`	Moroccan Arabic (Latin / Arabizi)
`__label__other_ar`	Other Arabic varieties
`__label__other_lg`	Other Latin languages
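
For display or logging, the `__label__` prefix that fastText attaches to each label can be stripped and mapped to the descriptions above. This is a minimal sketch: the names `LABEL_DESCRIPTIONS` and `describe` are illustrative and are not part of the released code.

```python
PREFIX = "__label__"

# Descriptions copied from the Supported Labels table.
LABEL_DESCRIPTIONS = {
    "en": "English",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "ar_msa": "Modern Standard Arabic",
    "ar_ma": "Moroccan Arabic (Arabic script)",
    "ar_ma_latin": "Moroccan Arabic (Latin / Arabizi)",
    "other_ar": "Other Arabic varieties",
    "other_lg": "Other Latin languages",
}


def describe(label: str) -> str:
    """Return the human-readable description for a raw fastText label."""
    short = label[len(PREFIX):] if label.startswith(PREFIX) else label
    return LABEL_DESCRIPTIONS.get(short, f"unknown label: {short}")
```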

Standard Metrics (Validation)

Metric	Value
P@1	0.98919
R@1	0.98919
Accuracy (micro)	0.9892
Macro-F1	0.9891

Micro accuracy and macro-F1 are nearly identical, which indicates strong and balanced performance across all labels.

Selective Prediction Metrics (τ = 0.98)

Selective prediction allows the model to abstain when confidence is below τ.

Metric	Value
TrustPrecision_A	0.9964
Coverage_A	0.7509
A2A error rate	0.001312
MacroPrecision_A	0.9964

Interpretation

99.64% of accepted predictions are correct
The model answers ~75% of inputs at this confidence level
Very few high-confidence mistakes leak through (A2A ≈ 0.13%)
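
This card does not give formal definitions for the selective metrics, so the sketch below rests on two assumptions: coverage is the fraction of inputs whose top-1 confidence is at least τ, and trust precision is the accuracy over those accepted inputs. The A2A error rate is left out because its definition is not stated here. The function name `selective_metrics` and the toy data are illustrative only.

```python
from typing import Sequence, Tuple


def selective_metrics(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    confidence: Sequence[float],
    tau: float,
) -> Tuple[float, float]:
    """Return (coverage, precision_on_accepted) under the assumed definitions."""
    if not (len(y_true) == len(y_pred) == len(confidence)):
        raise ValueError("inputs must have the same length")
    # Keep only predictions whose confidence reaches the threshold.
    accepted = [(t, p) for t, p, c in zip(y_true, y_pred, confidence) if c >= tau]
    coverage = len(accepted) / len(y_true) if len(y_true) else 0.0
    correct = sum(1 for t, p in accepted if t == p)
    precision = correct / len(accepted) if accepted else 0.0
    return coverage, precision


# Toy data, made up for illustration (not drawn from MAFT).
y_true = ["en", "fr", "ar_ma", "es"]
y_pred = ["en", "fr", "other_ar", "en"]
conf = [0.999, 0.991, 0.620, 0.985]
coverage, precision = selective_metrics(y_true, y_pred, conf, tau=0.98)
```

Read the same way, the reported values imply that out of 10,000 inputs roughly 7,509 are accepted at τ = 0.98, and about 27 of those are wrong (7,509 × (1 − 0.9964) ≈ 27).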

Diagnostic Confidence Signals

Diagnostic	Value
`avg_top1_p_accept`	0.9964
`avg_top1_p_abstain`	0.9657
`avg_margin_accept`	0.9937
`avg_margin_abstain`	0.9386

Accepted predictions have a higher average top-1 probability and a wider average margin than abstained ones, which supports using the confidence threshold to separate the two groups.
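
The diagnostics above are not defined formally in this card. The sketch below assumes that "margin" means the top-1 probability minus the top-2 probability, and that the scores come from a top-k call such as `model.predict(text, k=2)`. The function name `confidence_diagnostics` and the toy scores are illustrative.

```python
from statistics import mean
from typing import Dict, List, Sequence


def confidence_diagnostics(
    topk_scores: Sequence[Sequence[float]], tau: float
) -> Dict[str, float]:
    """Average top-1 probability and top-1/top-2 margin, split by accept/abstain.

    Each entry of topk_scores holds the probabilities of the k best labels for
    one input, sorted in descending order (k >= 2).
    """
    groups: Dict[str, List[Sequence[float]]] = {"accept": [], "abstain": []}
    for scores in topk_scores:
        groups["accept" if scores[0] >= tau else "abstain"].append(scores)

    out: Dict[str, float] = {}
    for name, rows in groups.items():
        if rows:  # an empty group has no average, so it is skipped
            out[f"avg_top1_p_{name}"] = mean(r[0] for r in rows)
            out[f"avg_margin_{name}"] = mean(r[0] - r[1] for r in rows)
    return out


# Toy scores, made up for illustration.
diag = confidence_diagnostics([(0.99, 0.005), (0.995, 0.002), (0.90, 0.08)], tau=0.98)
```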

Per-Class Performance (Worst → Best by F1)

Label	F1	Precision	Recall	Support
`other_ar`	0.9764	0.9844	0.9686	10,076
`ar_msa`	0.9857	0.9784	0.9932	10,076
`ar_ma`	0.9867	0.9881	0.9854	10,076
`en`	0.9875	0.9821	0.9930	10,076
`es`	0.9886	0.9893	0.9880	10,076
`it`	0.9901	0.9938	0.9864	10,076
`fr`	0.9948	0.9948	0.9947	10,076
`ar_ma_latin`	0.9950	0.9944	0.9956	10,076
`other_lg`	0.9971	0.9970	0.9971	10,858
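
As a consistency check, F1 can be recomputed from each row's precision and recall, and macro-F1 as the unweighted mean of the per-class F1 values. The sketch below uses only the numbers in the table; averaging the nine reported F1 values reproduces the Macro-F1 of 0.9891 given under Standard Metrics. The names `PER_CLASS` and `f1` are illustrative.

```python
# (precision, recall, reported F1) copied from the per-class table.
PER_CLASS = {
    "other_ar": (0.9844, 0.9686, 0.9764),
    "ar_msa": (0.9784, 0.9932, 0.9857),
    "ar_ma": (0.9881, 0.9854, 0.9867),
    "en": (0.9821, 0.9930, 0.9875),
    "es": (0.9893, 0.9880, 0.9886),
    "it": (0.9938, 0.9864, 0.9901),
    "fr": (0.9948, 0.9947, 0.9948),
    "ar_ma_latin": (0.9944, 0.9956, 0.9950),
    "other_lg": (0.9970, 0.9971, 0.9971),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# F1 recomputed from the rounded precision and recall of each row.
recomputed = {label: f1(p, r) for label, (p, r, _) in PER_CLASS.items()}

# Macro-F1: unweighted mean of the reported per-class F1 values.
macro_f1 = sum(reported for _, _, reported in PER_CLASS.values()) / len(PER_CLASS)
print(f"macro-F1 from the table: {macro_f1:.4f}")
```

Because the table values are rounded to four decimals, the recomputed F1 can differ from the reported one in the last digit.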

Most Frequent Confusions (True → Predicted)

other_ar → ar_msa : 152
other_ar → ar_ma : 106
ar_ma → other_ar : 79
es → en : 72
ar_ma → ar_msa : 68
it → en : 65
ar_msa → other_ar : 55

These errors are linguistically plausible and reflect genuine language overlap rather than model failure.
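
A confusion list like the one above can be produced by counting misclassified (true, predicted) pairs. This is a minimal sketch with made-up labels; `top_confusions` is an illustrative name, not a function from the released pipeline.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_confusions(
    y_true: Sequence[str], y_pred: Sequence[str], n: int = 5
) -> List[Tuple[Tuple[str, str], int]]:
    """Count misclassified (true, predicted) pairs, most frequent first."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(n)


# Toy labels, made up for illustration.
y_true = ["other_ar", "other_ar", "es", "ar_ma", "en"]
y_pred = ["ar_msa", "ar_msa", "en", "ar_ma", "en"]

for (true, pred), count in top_confusions(y_true, y_pred):
    print(f"{true} → {pred} : {count}")
```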

Intended Use

Language identification for short to medium texts
Multilingual NLP preprocessing pipelines
Dialect-aware Arabic text routing
High-trust or risk-sensitive applications using selective prediction

Usage

from huggingface_hub import hf_hub_download
import fasttext

model = fasttext.load_model(hf_hub_download("Morocco-MTNRA-Labs/MAFT_LangID", "model.bin"))

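# Confidence threshold. The selective metrics in this card were computed at 0.98;
# 0.9 is a more permissive setting that accepts more inputs.
# Use NumPy 1.x (numpy<2) with the fasttext package.
TAU = 0.9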

examples = [
    "had lblad zwina بزاف",                            # Moroccan Arabic (Latin / Arabizi)
    "i can't trust this person on my personal life",   # English
    "Questo è un testo scritto in italiano"            # Italian
]

for text in examples:
    labels, scores = model.predict(text, k=1)
    label, score = labels[0], scores[0]

    if score >= TAU:
        print(f"ACCEPT → {label} (confidence={score:.3f}) | text='{text[:30]}'")
    else:
        print(f"ABSTAIN → confidence={score:.3f} | text='{text[:30]}'")

### Output ###
ACCEPT → __label__ar_ma_latin (confidence=0.999) | text='had lblad zwina بزاف'
ACCEPT → __label__en (confidence=0.941) | text='i can't trust this person on m'
ACCEPT → __label__it (confidence=1.000) | text='Questo è un testo scritto in i'

Recommended Usage Pattern

Run prediction with confidence scores
Apply threshold τ (e.g., 0.98)
Accept high-confidence predictions
Abstain on uncertain samples and route them to:
- a stronger model,
- domain-specific rules,
- or human review

This enables controlled deployment with predictable risk.
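
The pattern above can be sketched as a small routing helper. The names `route`, `Predictor`, and `stub_predict` are hypothetical, and the stub stands in for the real model so the example runs without downloading anything. With the loaded fastText model, the predictor would be `lambda t: model.predict(t, k=1)`.

```python
from typing import Callable, Optional, Sequence, Tuple

# A predictor takes one line of text and returns (labels, scores), best first.
Predictor = Callable[[str], Tuple[Sequence[str], Sequence[float]]]


def route(
    text: str,
    predict: Predictor,
    tau: float = 0.98,
    fallback: Optional[Callable[[str], str]] = None,
) -> Tuple[str, Optional[str], float]:
    """Return (decision, label, score) where decision is accept/fallback/abstain."""
    # Collapse all whitespace runs to single spaces; this also removes newlines,
    # which fastText's Python predict() does not accept.
    clean = " ".join(text.split())
    labels, scores = predict(clean)
    label, score = labels[0], float(scores[0])
    if score >= tau:
        return "accept", label, score
    if fallback is not None:
        # Uncertain input: hand off to a stronger model, rules, or a review queue.
        return "fallback", fallback(clean), score
    return "abstain", None, score


def stub_predict(text: str) -> Tuple[Sequence[str], Sequence[float]]:
    """Stand-in for model.predict(text, k=1); always returns the same answer."""
    return (["__label__en"], [0.941])


strict = route("i can't trust this person\non my personal life", stub_predict)
lenient = route("i can't trust this person on my personal life", stub_predict, tau=0.9)
```

With a score of 0.941, the same input is abstained on at τ = 0.98 and accepted at τ = 0.9, which is the coverage versus risk trade-off that the threshold controls.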

Design Philosophy

This model follows a trust-first strategy:

Do not guess when uncertain
Expose risk explicitly
Prefer abstention over silent errors

Selective metrics (TrustPrecision, Coverage, A2A error rate) are treated as primary signals, not secondary diagnostics.

Data Collection, Filtering, and Training Pipeline

The complete pipeline used to prepare the MAFT dataset and to run training, evaluation, and selective prediction is documented in a separate GitHub repository.

This repository includes:

MAFT dataset preparation utilities (formatting, splits, packaging),
FastText training configuration and scripts,
evaluation scripts (standard and selective metrics).

🔗 GitHub repository (full pipeline):

https://github.com/Fatnaoui/helpers/tree/main/fasttext

Acknowledgements

This work was guided and advised by Mr. Abderrahman Skiredj
🔗 https://www.linkedin.com/in/abderrahman-skiredj-99a80510b/

Model development, experimentation, and evaluation were carried out by Hamza Fatnaoui
🔗 https://www.linkedin.com/in/fatnaoui/

License

Released under the Apache License 2.0.

Contact

For questions, feedback, or collaboration:

Open an issue on the Hugging Face repository
Or reach out via LinkedIn (see Acknowledgements)
