NE-LID: Northeast Language Identification


NE-LID is a sentence-level language identification model for low-resource languages of Northeast India, trained using a character n-gram fastText classifier.

The model achieves near-ceiling accuracy (99.1%) and is designed to be fast, robust, and reproducible, especially for script-diverse and low-resource settings.


Supported Languages (11)

| Language | Family | Script |
|----------|--------|--------|
| Assamese | Indo-Aryan | Bengali-Assamese |
| Bodo | Tibeto-Burman | Devanagari |
| English | Germanic | Latin |
| Garo | Tibeto-Burman | Latin |
| Hindi | Indo-Aryan | Devanagari |
| Khasi | Austroasiatic | Latin |
| Kokborok | Tibeto-Burman | Latin |
| Meitei | Tibeto-Burman | Bengali |
| Mizo | Tibeto-Burman | Latin |
| Naga | Tibeto-Burman | Latin |
| Nyishi | Tibeto-Burman | Latin |

Model Details

  • Model type: fastText supervised classifier
  • Architecture: Character n-grams (2–5)
  • Task: Sentence-level Language Identification (LID)
  • Training data: 22,000 sentences (2,000 per language)
  • Train / Dev / Test split: 70% / 15% / 15% (stratified)
  • Evaluation accuracy: 99.09% (macro-F1: 0.99)
  • Model size: ~10 MB
  • Inference speed: <5ms per sentence

Why fastText?

Extensive experiments show that character-level models outperform transformer-based language models (e.g., NE-BERT, XLM-R) for Northeast Indian LID.

Key findings:

  • Transformer models (NE-BERT, XLM-R) achieved only 9-37% accuracy on challenging samples
  • fastText maintained 99%+ accuracy even on script-diverse, low-resource languages
  • Character n-grams capture orthographic patterns better than subword tokenization for these languages
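Character n-gram features are simple to compute. The sketch below (plain Python, not the fastText internals) illustrates the kind of 2-5-gram features the model extracts from a short word:

```python
def char_ngrams(text, n_min=2, n_max=5):
    """Return all character n-grams of length n_min..n_max, in order."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

# For a 5-character word, the 2- and 3-grams are:
print(char_ngrams("khasi", 2, 3))
# → ['kh', 'ha', 'as', 'si', 'kha', 'has', 'asi']
```

Because these features operate directly on characters, they pick up script- and orthography-specific patterns (diacritics, digraphs, script blocks) without any tokenizer vocabulary.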

This model therefore prioritizes:

  • ✅ Script awareness
  • ✅ Orthographic cues
  • ✅ Low-resource robustness

Performance

| Language | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| Assamese | 1.00 | 1.00 | 1.00 | 300 |
| Bodo | 0.99 | 0.98 | 0.99 | 300 |
| English | 0.96 | 0.99 | 0.98 | 300 |
| Garo | 0.99 | 1.00 | 1.00 | 300 |
| Hindi | 0.96 | 0.97 | 0.97 | 300 |
| Khasi | 1.00 | 0.99 | 0.99 | 300 |
| Kokborok | 1.00 | 0.99 | 1.00 | 300 |
| Meitei | 1.00 | 0.99 | 1.00 | 300 |
| Mizo | 0.99 | 0.99 | 0.99 | 300 |
| Naga | 1.00 | 1.00 | 1.00 | 300 |
| Nyishi | 1.00 | 0.99 | 0.99 | 300 |
| **Overall** | 0.99 | 0.99 | 0.99 | 3,300 |

Test Accuracy: 99.09%



Benchmark Comparison

NE-LID significantly outperforms existing language identification systems on Northeast Indian languages:

| Model | Overall Accuracy | Coverage (11 languages) |
|-------|------------------|-------------------------|
| NE-LID (Ours) | 99.09% | 11/11 ✅ |
| GlotLID | 73.12% | 9/11 (missing Garo, Naga) |
| OpenLID (Meta) | 42.03% | 5/11 |
| IndicLID (AI4Bharat) | 39.30% | 4/11 |
| LangDetect (Google) | 24.33% | 3/11 |


Key Findings:

  • NE-LID outperforms the best competitor (GlotLID) by roughly 26 accuracy points (99.09% vs. 73.12%)
  • Apart from GlotLID (9/11), existing multilingual systems cover at most 5 of the 11 Northeast Indian languages
  • Character n-gram approach outperforms transformer-based models for script-diverse, low-resource languages

Installation

pip install fasttext

Usage

Basic Usage (Python)

import fasttext

# Load the model
model = fasttext.load_model("ne_lid.bin")

# Predict language
text = "Ki paidbah shnong ki la ia shim bynta ha ka jingïalang"
labels, probs = model.predict(text)

print(f"Language: {labels[0].replace('__label__', '')}")
print(f"Confidence: {probs[0]:.4f}")

Output:

Language: khasi
Confidence: 0.9999

Batch Prediction

texts = [
    "Ka sngi ka lieh",
    "আজি মই বজাৰলৈ গৈছিলোঁ",
    "Mizo tawng hi a ṭha hle"
]

# For a list of inputs, fastText returns parallel lists: one list of
# label tuples and one list of probability arrays, in input order
labels, probs = model.predict(texts)
for text, label, prob in zip(texts, labels, probs):
    lang = label[0].replace('__label__', '')
    print(f"{text[:30]:30}{lang:10} ({prob[0]:.3f})")

Get Top-K Predictions

# Get top 3 language predictions
labels, probs = model.predict(text, k=3)

for label, prob in zip(labels, probs):
    lang = label.replace('__label__', '')
    print(f"{lang}: {prob:.4f}")
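Downstream applications often want to reject uncertain predictions rather than return a wrong label. A minimal sketch of confidence thresholding over the `(labels, probs)` output shape shown above; the threshold value here is an illustrative assumption and should be tuned on held-out data:

```python
MIN_CONFIDENCE = 0.5  # hypothetical cutoff; tune for your application

def best_prediction(labels, probs, min_conf=MIN_CONFIDENCE):
    """Map fastText predict() output to (language, confidence),
    falling back to 'unknown' below the confidence threshold."""
    lang = labels[0].replace("__label__", "")
    conf = float(probs[0])
    if conf < min_conf:
        return "unknown", conf
    return lang, conf

# Example with the shape model.predict(text, k=3) returns:
labels = ("__label__khasi", "__label__english", "__label__garo")
probs = (0.93, 0.04, 0.02)
print(best_prediction(labels, probs))  # → ('khasi', 0.93)
```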

Limitations

  • Designed for monolingual sentences – not optimized for code-mixed text
  • Sentence-level only – not designed for word-level or document-level LID
  • Performance may degrade on extremely short inputs (≤2 tokens)
  • English and Hindi show slightly lower precision (0.96) than the other languages, an expected effect of loanwords and code-mixed text
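Because the model is sentence-level, longer documents are best split into sentences before prediction. A minimal sketch using naive punctuation-based splitting (including the Devanagari danda "।"); real-world text may need a proper sentence segmenter:

```python
import re

def predict_per_sentence(model, document):
    """Split a document on sentence-final punctuation and run the
    model on each sentence, returning (sentence, language, confidence)."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?।])\s+", document)
                 if s.strip()]
    results = []
    for sentence in sentences:
        labels, probs = model.predict(sentence)
        results.append((sentence, labels[0].replace("__label__", ""),
                        float(probs[0])))
    return results
```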

Model Files

  • ne_lid.bin - Main fastText model (binary format)
  • ne_lid.ftz - Compressed model (optional, for smaller deployments)

Training Details

Data Sources:

  • Training corpus derived from NE-BERT dataset
  • 2,000 sentences per language, stratified by length and script
  • Balanced across language families (Austroasiatic, Tibeto-Burman, Indo-Aryan)

Hyperparameters:

  • Learning rate: 0.1
  • Epochs: 25
  • Word n-grams: 1-3
  • Character n-grams: 2-5
  • Loss function: Softmax
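Assuming the standard fastText Python API, the hyperparameters above map onto `train_supervised` roughly as follows. The file name `train.txt` is a placeholder, and the training call is commented out because it needs the labelled corpus:

```python
# Hyperparameters from the card, as fastText keyword arguments.
# wordNgrams is the maximum word n-gram order; minn/maxn bound the
# character n-gram lengths.
params = dict(
    lr=0.1,           # learning rate
    epoch=25,
    wordNgrams=3,     # word n-grams up to 3
    minn=2, maxn=5,   # character n-grams 2-5
    loss="softmax",
)

# import fasttext
# model = fasttext.train_supervised(input="train.txt", **params)
# model.save_model("ne_lid.bin")
print(params)
```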

License

This model is released under Creative Commons Attribution 4.0 International (CC BY 4.0).

You are free to:

  • ✅ Share — copy and redistribute the material
  • ✅ Adapt — remix, transform, and build upon the material

Under the following terms:

  • Attribution — You must give appropriate credit to MWire Labs

Citation

If you use NE-LID in your research or applications, please cite:

@misc{mwirelabs2025nelid,
  title={NE-LID: Northeast Language Identification},
  author={MWire Labs},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/MWirelabs/ne-lid}}
}

About MWire Labs

MWire Labs is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.

Repository: MWirelabs/ne-lid
Contact: MWire Labs


Acknowledgments

We thank the open-source community and contributors to the NE-BERT corpus that made this work possible.


Last Updated: January 2026
Version: 1.0.0
