MAFT Language Identification (FastText)

Summary

This model performs multiclass language identification over 9 language labels, including multiple Arabic varieties and Latin-script Moroccan Arabic (Arabizi).

Beyond standard classification metrics, the model supports selective prediction (accept / abstain) using a confidence threshold tau, enabling high-trust deployment scenarios where incorrect predictions are more costly than abstentions.

The model was trained on our MAFT dataset, publicly available on the Hugging Face Hub:
https://huggingface.co/datasets/Fatnaoui/maft


Reproducibility

The full code for model building (scripts, configs, and execution steps) is maintained here:

GitHub (model build pipeline): https://github.com/Fatnaoui/helpers/tree/main/fasttext

This repository is the reference implementation for regenerating the MAFT model release as published on the Hub.


Supported Labels

Label                   Description
__label__en             English
__label__fr             French
__label__es             Spanish
__label__it             Italian
__label__ar_msa         Modern Standard Arabic
__label__ar_ma          Moroccan Arabic (Arabic script)
__label__ar_ma_latin    Moroccan Arabic (Latin / Arabizi)
__label__other_ar       Other Arabic varieties
__label__other_lg       Other Latin languages
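
For display or routing, the raw FastText labels can be mapped back to readable names. The snippet below is a minimal sketch; `LABEL_NAMES` and `describe` are illustrative names and are not part of the released model.

```python
# Hypothetical helper: maps the raw FastText labels to the descriptions listed above.
LABEL_NAMES = {
    "__label__en": "English",
    "__label__fr": "French",
    "__label__es": "Spanish",
    "__label__it": "Italian",
    "__label__ar_msa": "Modern Standard Arabic",
    "__label__ar_ma": "Moroccan Arabic (Arabic script)",
    "__label__ar_ma_latin": "Moroccan Arabic (Latin / Arabizi)",
    "__label__other_ar": "Other Arabic varieties",
    "__label__other_lg": "Other Latin languages",
}


def describe(label: str) -> str:
    """Return the readable description, or the raw label if it is unknown."""
    return LABEL_NAMES.get(label, label)


print(describe("__label__ar_ma_latin"))
```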

Standard Metrics (Validation)

Metric Value
P@1 0.98919
R@1 0.98919
Accuracy (micro) 0.9892
Macro-F1 0.9891

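These results indicate strong overall performance. Per-class F1 scores (listed under Per-Class Performance) range from 0.9764 to 0.9971, so no label falls far behind the rest.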


Selective Prediction Metrics (τ = 0.98)

Selective prediction allows the model to abstain when confidence is below τ.
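
The accept / abstain rule can be written as a small pure function. This is a sketch only: the name `selective_label` and the use of `None` to signal abstention are assumptions made for illustration.

```python
from typing import Optional


def selective_label(label: str, score: float, tau: float = 0.98) -> Optional[str]:
    """Return the label when score >= tau; return None to signal abstention."""
    return label if score >= tau else None


# Toy values, not model outputs
print(selective_label("__label__fr", 0.995))  # accepted
print(selective_label("__label__fr", 0.941))  # abstained, prints None
```

With a loaded model, `labels, scores = model.predict(text, k=1)` supplies the inputs as `labels[0]` and `float(scores[0])`.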

Metric Value
TrustPrecision_A 0.9964
Coverage_A 0.7509
A2A error rate 0.001312
MacroPrecision_A 0.9964

Interpretation

  • 99.64% of accepted predictions are correct
  • The model answers ~75% of inputs at this confidence level
  • Very few high-confidence mistakes leak through (A2A ≈ 0.13%)
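
The first two numbers can be recomputed from gold labels, predictions, and confidence scores. The sketch below assumes that TrustPrecision_A is the share of accepted predictions that are correct and that Coverage_A is the share of inputs that are accepted, as the interpretation above suggests. The A2A error rate is left out because its exact definition is not given here. The function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def coverage_and_accepted_precision(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    scores: Sequence[float],
    tau: float = 0.98,
) -> Tuple[float, float]:
    """Return (coverage, precision on accepted predictions) under the assumed definitions."""
    if not y_true:
        raise ValueError("empty evaluation set")
    # Keep only the predictions whose confidence reaches the threshold
    accepted: List[Tuple[str, str]] = [
        (t, p) for t, p, s in zip(y_true, y_pred, scores) if s >= tau
    ]
    coverage = len(accepted) / len(y_true)
    if not accepted:
        return coverage, float("nan")
    correct = sum(1 for t, p in accepted if t == p)
    return coverage, correct / len(accepted)


# Toy data, not taken from the MAFT validation set
cov, prec = coverage_and_accepted_precision(
    ["en", "fr", "es", "it"],
    ["en", "fr", "en", "it"],
    [0.99, 0.995, 0.99, 0.50],
)
print(cov, prec)
```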

Diagnostic Confidence Signals

Diagnostic Value
avg_top1_p_accept 0.9964
avg_top1_p_abstain 0.9657
avg_margin_accept 0.9937
avg_margin_abstain 0.9386

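Accepted predictions have a higher average top-1 probability (0.9964 vs. 0.9657) and a larger average margin (0.9937 vs. 0.9386) than abstained ones. Abstained inputs still score high on average, so the threshold τ = 0.98 separates the two groups within a narrow band near 1.0.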
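
The margin is not defined in this card. A common convention, assumed here, is the gap between the highest and second-highest predicted probabilities. The helper name `top1_margin` is hypothetical.

```python
from typing import Sequence


def top1_margin(scores: Sequence[float]) -> float:
    """Gap between the best and second-best probability (assumed definition of margin)."""
    ordered = sorted((float(s) for s in scores), reverse=True)
    if not ordered:
        raise ValueError("at least one score is required")
    if len(ordered) == 1:
        # Only one candidate returned: treat the runner-up probability as 0
        return ordered[0]
    return ordered[0] - ordered[1]


# Toy scores, not model outputs
print(top1_margin([0.97, 0.02]))
```

With FastText, the two scores would come from `model.predict(text, k=2)`.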


Per-Class Performance (Worst → Best by F1)

Label        F1      Precision  Recall  Support
other_ar     0.9764  0.9844     0.9686  10,076
ar_msa       0.9857  0.9784     0.9932  10,076
ar_ma        0.9867  0.9881     0.9854  10,076
en           0.9875  0.9821     0.9930  10,076
es           0.9886  0.9893     0.9880  10,076
it           0.9901  0.9938     0.9864  10,076
fr           0.9948  0.9948     0.9947  10,076
ar_ma_latin  0.9950  0.9944     0.9956  10,076
other_lg     0.9971  0.9970     0.9971  10,858
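
As a consistency check, the aggregate metrics can be recomputed from the per-class rows above. The sketch takes Macro-F1 as the unweighted mean of per-class F1 and micro accuracy as support-weighted recall, which holds for single-label classification. Because the inputs are rounded to four decimals, small deviations from the reported values are expected.

```python
# (label, F1, precision, recall, support), copied from the table above
rows = [
    ("other_ar", 0.9764, 0.9844, 0.9686, 10076),
    ("ar_msa", 0.9857, 0.9784, 0.9932, 10076),
    ("ar_ma", 0.9867, 0.9881, 0.9854, 10076),
    ("en", 0.9875, 0.9821, 0.9930, 10076),
    ("es", 0.9886, 0.9893, 0.9880, 10076),
    ("it", 0.9901, 0.9938, 0.9864, 10076),
    ("fr", 0.9948, 0.9948, 0.9947, 10076),
    ("ar_ma_latin", 0.9950, 0.9944, 0.9956, 10076),
    ("other_lg", 0.9971, 0.9970, 0.9971, 10858),
]

# Macro-F1: every class counts equally
macro_f1 = sum(f1 for _, f1, _, _, _ in rows) / len(rows)

# Micro accuracy: correct predictions over all examples
total_support = sum(support for *_, support in rows)
micro_accuracy = sum(recall * support for _, _, _, recall, support in rows) / total_support

print(f"Macro-F1       = {macro_f1:.4f}")
print(f"Micro accuracy = {micro_accuracy:.4f}")
print(f"Total support  = {total_support}")
```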

Most Frequent Confusions (True → Predicted)

  • other_ar → ar_msa : 152
  • other_ar → ar_ma : 106
  • ar_ma → other_ar : 79
  • es → en : 72
  • ar_ma → ar_msa : 68
  • it → en : 65
  • ar_msa → other_ar : 55

Five of the seven pairs involve closely related Arabic varieties, so these errors are linguistically plausible and reflect genuine language overlap rather than arbitrary model failure.
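
A ranking like the one above can be produced by counting misclassified (true, predicted) pairs. This is a minimal sketch with a hypothetical helper name and toy data. It is not the evaluation script from the GitHub repository.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_confusions(
    y_true: Sequence[str], y_pred: Sequence[str], n: int = 7
) -> List[Tuple[Tuple[str, str], int]]:
    """Return the n most frequent (true, predicted) pairs among the errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(n)


# Toy labels, not the MAFT validation data
y_true = ["other_ar", "other_ar", "es", "fr"]
y_pred = ["ar_msa", "ar_msa", "en", "fr"]
for (true_label, pred_label), count in top_confusions(y_true, y_pred):
    print(f"{true_label} -> {pred_label} : {count}")
```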


Intended Use

  • Language identification for short to medium texts
  • Multilingual NLP preprocessing pipelines
  • Dialect-aware Arabic text routing
  • High-trust or risk-sensitive applications using selective prediction

Usage

from huggingface_hub import hf_hub_download
import fasttext

model = fasttext.load_model(hf_hub_download("Morocco-MTNRA-Labs/MAFT_LangID", "model.bin"))

# Confidence threshold for this demo; the selective metrics reported above use tau = 0.98
# Make sure numpy 1.x is installed (e.g. pip install "numpy<2")
TAU = 0.9

examples = [
    "had lblad zwina بزاف",                            # Moroccan Arabic (Latin / Arabizi)
    "i can't trust this person on my personal life",   # English
    "Questo è un testo scritto in italiano",           # Italian
]

for text in examples:
    labels, scores = model.predict(text, k=1)
    label, score = labels[0], scores[0]

    if score >= TAU:
        print(f"ACCEPT → {label} (confidence={score:.3f}) | text='{text[:30]}'")
    else:
        print(f"ABSTAIN → confidence={score:.3f} | text='{text[:30]}'")
### Output ###
ACCEPT  →  __label__ar_ma_latin (confidence=0.999)   |   text='had lblad zwina بزاف'
ACCEPT  →  __label__en (confidence=0.941)            |   text='i can't trust this person on m'
ACCEPT  →  __label__it (confidence=1.000)            |   text='Questo è un testo scritto in i'
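
The FastText Python bindings process one line at a time, and `predict` raises a `ValueError` when the input contains a newline. A small normalization step avoids this for multi-line inputs. The helper name is hypothetical, and whether the training pipeline applied the same whitespace normalization is not stated here, so treat that as an assumption.

```python
def to_single_line(text: str) -> str:
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    return " ".join(text.split())


raw = "had lblad\nzwina  بزاف"
print(to_single_line(raw))
```

The cleaned string can then be passed to `model.predict(...)` as in the example above.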

Recommended Usage Pattern

  1. Run prediction with confidence scores
  2. Apply threshold τ (e.g., 0.98)
  3. Accept high-confidence predictions
  4. Abstain on uncertain samples and route them to:
    • a stronger model,
    • domain-specific rules,
    • or human review

This enables controlled deployment with predictable risk.
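
The pattern above can be sketched as a routing function. The `fallback` callable stands in for whichever stronger model, rule set, or review queue is used. It and the returned source tags are hypothetical names chosen for illustration.

```python
from typing import Callable, Tuple


def route(
    text: str,
    label: str,
    score: float,
    fallback: Callable[[str], str],
    tau: float = 0.98,
) -> Tuple[str, str]:
    """Return (final_label, source), where source is 'fasttext' or 'fallback'."""
    if score >= tau:
        # Step 3: accept the high-confidence prediction
        return label, "fasttext"
    # Step 4: abstain and hand the input to the fallback path
    return fallback(text), "fallback"


def send_to_review(text: str) -> str:
    """Toy fallback that flags the input for human review."""
    return "needs_review"


print(route("salam", "__label__ar_ma_latin", 0.999, send_to_review))
print(route("ok", "__label__en", 0.62, send_to_review))
```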


Design Philosophy

This model follows a trust-first strategy:

  • Do not guess when uncertain
  • Expose risk explicitly
  • Prefer abstention over silent errors

Selective metrics (TrustPrecision, Coverage, A2A error rate) are treated as primary signals, not secondary diagnostics.


Data Collection, Filtering, and Training Pipeline

The complete pipeline used to prepare the MAFT dataset and to run training, evaluation, and selective prediction is documented in a separate GitHub repository.

This repository includes:

  • MAFT dataset preparation utilities (formatting, splits, packaging),
  • FastText training configuration and scripts,
  • evaluation scripts (standard and selective metrics).

🔗 GitHub repository (full pipeline):

https://github.com/Fatnaoui/helpers/tree/main/fasttext


Acknowledgements

This work was guided and advised by Mr. Abderrahman Skiredj
🔗 https://www.linkedin.com/in/abderrahman-skiredj-99a80510b/

Model development, experimentation, and evaluation were carried out by Hamza Fatnaoui
🔗 https://www.linkedin.com/in/fatnaoui/


License

Released under the Apache License 2.0.


Contact

For questions, feedback, or collaboration:

  • Open an issue on the Hugging Face repository
  • Or reach out via LinkedIn (see Acknowledgements)