# NE-LID: Northeast Language Identification
NE-LID is a sentence-level language identification model for low-resource languages of Northeast India, trained using a character n-gram fastText classifier.
The model achieves near-ceiling accuracy (99.1%) and is designed to be fast, robust, and reproducible, especially for script-diverse and low-resource settings.
## Supported Languages (11)
| Language | Family | Script |
|---|---|---|
| Assamese | Indo-Aryan | Bengali-Assamese |
| Bodo | Tibeto-Burman | Devanagari |
| English | Germanic | Latin |
| Garo | Tibeto-Burman | Latin |
| Hindi | Indo-Aryan | Devanagari |
| Khasi | Austroasiatic | Latin |
| Kokborok | Tibeto-Burman | Latin |
| Meitei | Tibeto-Burman | Bengali |
| Mizo | Tibeto-Burman | Latin |
| Naga | Tibeto-Burman | Latin |
| Nyishi | Tibeto-Burman | Latin |
## Model Details
- Model type: fastText supervised classifier
- Architecture: Character n-grams (2–5)
- Task: Sentence-level Language Identification (LID)
- Training data: 22,000 sentences (2,000 per language)
- Train / Dev / Test split: 70% / 15% / 15% (stratified)
- Evaluation accuracy: 99.09% (macro-F1: 0.99)
- Model size: ~10 MB
- Inference speed: <5ms per sentence
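The stratified 70% / 15% / 15% split above can be sketched in plain Python. This is an illustrative reconstruction, not the authors' script: sentences are grouped by language label so each split preserves the per-language proportions, and the seed is an arbitrary choice.

```python
# Illustrative sketch of a stratified 70/15/15 split (not the original script).
import random
from collections import defaultdict

def stratified_split(samples, seed=42):
    """samples: list of (label, sentence); returns (train, dev, test)."""
    by_label = defaultdict(list)
    for label, text in samples:
        by_label[label].append((label, text))
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_train = int(len(group) * 0.70)
        n_dev = int(len(group) * 0.15)
        train.extend(group[:n_train])
        dev.extend(group[n_train:n_train + n_dev])
        test.extend(group[n_train + n_dev:])
    return train, dev, test

# 2,000 sentences per language → 1,400 / 300 / 300 per language
samples = [(lang, f"sentence {i}") for lang in ("khasi", "mizo") for i in range(2000)]
train, dev, test = stratified_split(samples)
print(len(train), len(dev), len(test))  # 2800 600 600
```

With 2,000 sentences per language this yields exactly 300 test sentences per language, matching the per-language support in the performance table.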
## Why fastText?
Extensive experiments show that character-level models outperform transformer-based language models (e.g., NE-BERT, XLM-R) for Northeast Indian LID.
Key findings:
- Transformer models (NE-BERT, XLM-R) achieved only 9-37% accuracy on challenging samples
- fastText maintained 99%+ accuracy even on script-diverse, low-resource languages
- Character n-grams capture orthographic patterns better than subword tokenization for these languages
This model therefore prioritizes:
- ✅ Script awareness
- ✅ Orthographic cues
- ✅ Low-resource robustness
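The script-awareness argument can be made concrete with a small sketch of the character 2–5-gram features that fastText hashes internally, written here in plain Python purely for illustration (fastText computes these itself; this is not part of the model's API):

```python
# Illustrative: enumerate the character n-grams (2-5) a sentence contributes.
def char_ngrams(text, n_min=2, n_max=5):
    ngrams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            ngrams.add(text[i:i + n])
    return ngrams

khasi = char_ngrams("ka sngi")    # Latin-script Khasi
assamese = char_ngrams("আজি মই")  # Bengali-Assamese script
# Disjoint scripts share zero n-grams - an easy, reliable signal
print(len(khasi & assamese))  # 0
```

For same-script pairs (e.g. Khasi vs. Mizo, both Latin), classification instead falls back on language-specific orthographic patterns such as frequent digraphs, which these n-grams also capture.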
## Performance
| Language | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Assamese | 1.00 | 1.00 | 1.00 | 300 |
| Bodo | 0.99 | 0.98 | 0.99 | 300 |
| English | 0.96 | 0.99 | 0.98 | 300 |
| Garo | 0.99 | 1.00 | 1.00 | 300 |
| Hindi | 0.96 | 0.97 | 0.97 | 300 |
| Khasi | 1.00 | 0.99 | 0.99 | 300 |
| Kokborok | 1.00 | 0.99 | 1.00 | 300 |
| Meitei | 1.00 | 0.99 | 1.00 | 300 |
| Mizo | 0.99 | 0.99 | 0.99 | 300 |
| Naga | 1.00 | 1.00 | 1.00 | 300 |
| Nyishi | 1.00 | 0.99 | 0.99 | 300 |
| Overall | 0.99 | 0.99 | 0.99 | 3,300 |
Test Accuracy: 99.09%
## Benchmark Comparison
NE-LID significantly outperforms existing language identification systems on Northeast Indian languages:
| Model | Overall Accuracy | Coverage (11 languages) |
|---|---|---|
| NE-LID (Ours) | 99.09% | 11/11 ✅ |
| GlotLID | 73.12% | 9/11 (missing Garo, Naga) |
| OpenLID (Meta) | 42.03% | 5/11 |
| IndicLID (AI4Bharat) | 39.30% | 4/11 |
| LangDetect (Google) | 24.33% | 3/11 |
Key Findings:
- NE-LID scores roughly 26 percentage points higher than the best competitor (GlotLID, 73.12%)
- Existing multilingual models fail to support 6-7 Northeast Indian languages
- Character n-gram approach outperforms transformer-based models for script-diverse, low-resource languages
## Installation

```bash
pip install fasttext
```
## Usage

### Basic Usage (Python)
```python
import fasttext

# Load the model
model = fasttext.load_model("ne_lid.bin")

# Predict language
text = "Ki paidbah shnong ki la ia shim bynta ha ka jingïalang"
labels, probs = model.predict(text)
print(f"Language: {labels[0].replace('__label__', '')}")
print(f"Confidence: {probs[0]:.4f}")
```
Output:

```
Language: khasi
Confidence: 0.9999
```
### Batch Prediction

```python
texts = [
    "Ka sngi ka lieh",
    "আজি মই বজাৰলৈ গৈছিলোঁ",
    "Mizo tawng hi a ṭha hle"
]

# Predicting on a list returns parallel lists of labels and probabilities
all_labels, all_probs = model.predict(texts)
for text, labels, probs in zip(texts, all_labels, all_probs):
    lang = labels[0].replace('__label__', '')
    print(f"{text[:30]:30} → {lang:10} ({probs[0]:.3f})")
```
### Get Top-K Predictions

```python
# Get top 3 language predictions
labels, probs = model.predict(text, k=3)
for label, prob in zip(labels, probs):
    lang = label.replace('__label__', '')
    print(f"{lang}: {prob:.4f}")
```
## Limitations
- Designed for monolingual sentences – not optimized for code-mixed text
- Sentence-level only – not designed for word-level or document-level LID
- Performance may degrade on extremely short inputs (≤2 tokens)
- English and Hindi precision dips to 0.96–0.97 (expected, due to loanwords and cross-language overlap)
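One practical way to guard against the short-input and code-mixed cases above is to abstain when the model's confidence is low. The sketch below mocks the predictor for self-containedness; with the real model, `model.predict` returns the same `(labels, probs)` shape shown in the usage examples. The threshold value is an assumption to be tuned on a dev set.

```python
# Hedged sketch: confidence-thresholded prediction (threshold is illustrative).
CONF_THRESHOLD = 0.90

def safe_predict(predict_fn, text, threshold=CONF_THRESHOLD):
    """Return a language code, or None when confidence is below threshold."""
    labels, probs = predict_fn(text)
    if probs[0] < threshold:
        return None  # abstain: likely code-mixed, noisy, or too short
    return labels[0].replace("__label__", "")

# Mock predictor standing in for model.predict: unsure on very short inputs
mock = lambda t: (["__label__khasi"], [0.55 if len(t.split()) <= 2 else 0.99])
print(safe_predict(mock, "ka"))               # None
print(safe_predict(mock, "Ka sngi ka lieh"))  # khasi
```

Abstaining and routing low-confidence inputs to a fallback (or a human) is usually preferable to emitting a wrong label in downstream pipelines.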
## Model Files

- `ne_lid.bin` – Main fastText model (binary format)
- `ne_lid.ftz` – Compressed model (optional, for smaller deployments)
## Training Details
Data Sources:
- Training corpus derived from NE-BERT dataset
- 2,000 sentences per language, stratified by length and script
- Balanced across language families (Austroasiatic, Tibeto-Burman, Indo-Aryan)
Hyperparameters:
- Learning rate: 0.1
- Epochs: 25
- Word n-grams: 1-3
- Character n-grams: 2-5
- Loss function: Softmax
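The hyperparameters above map onto a fastText supervised training call as sketched below. The `__label__<lang> <sentence>` line format is fastText's standard training format; the file path and tiny sample corpus are placeholders, not the actual training data.

```python
# Sketch of the training setup (paths and sample rows are illustrative).
rows = [("khasi", "Ka sngi ka lieh"), ("mizo", "Mizo tawng hi a ṭha hle")]
with open("train.txt", "w", encoding="utf-8") as f:
    for lang, sent in rows:
        # fastText supervised format: "__label__<lang> <sentence>"
        f.write(f"__label__{lang} {sent}\n")

# Training call (requires `pip install fasttext`), mirroring the
# hyperparameters listed above:
# import fasttext
# model = fasttext.train_supervised(
#     input="train.txt",
#     lr=0.1, epoch=25,
#     wordNgrams=3,     # word n-grams up to 3
#     minn=2, maxn=5,   # character n-grams 2-5
#     loss="softmax",
# )
```

`minn`/`maxn` are the parameters that enable the character n-gram features discussed in "Why fastText?".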
## License
This model is released under Creative Commons Attribution 4.0 International (CC BY 4.0).
You are free to:
- ✅ Share — copy and redistribute the material
- ✅ Adapt — remix, transform, and build upon the material
Under the following terms:
- Attribution — You must give appropriate credit to MWire Labs
## Citation
If you use NE-LID in your research or applications, please cite:
```bibtex
@misc{mwirelabs2025nelid,
  title={NE-LID: Northeast Language Identification},
  author={MWire Labs},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/MWirelabs/ne-lid}}
}
```
## About MWire Labs
MWire Labs is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.
Repository: MWirelabs/ne-lid
Contact: MWire Labs
## Acknowledgments
We thank the open-source community and contributors to the NE-BERT corpus that made this work possible.
Last Updated: January 2026
Version: 1.0.0