# NE-LID: Northeast Language Identification
NE-LID is a sentence-level language identification model for low-resource languages of Northeast India, trained using a character n-gram fastText classifier.
The model achieves near-ceiling accuracy (99.1%) and is designed to be fast, robust, and reproducible, especially for script-diverse and low-resource settings.
## Supported Languages (11)
| Language | Family | Script |
|---|---|---|
| Assamese | Indo-Aryan | Bengali-Assamese |
| Bodo | Tibeto-Burman | Devanagari |
| English | Germanic | Latin |
| Garo | Tibeto-Burman | Latin |
| Hindi | Indo-Aryan | Devanagari |
| Khasi | Austroasiatic | Latin |
| Kokborok | Tibeto-Burman | Latin |
| Meitei | Tibeto-Burman | Bengali |
| Mizo | Tibeto-Burman | Latin |
| Naga | Tibeto-Burman | Latin |
| Nyishi | Tibeto-Burman | Latin |
## Model Details
- Model type: fastText supervised classifier
- Architecture: Character n-grams (2–5)
- Task: Sentence-level Language Identification (LID)
- Training data: 22,000 sentences (2,000 per language)
- Train / Dev / Test split: 70% / 15% / 15% (stratified)
- Evaluation accuracy: 99.09% (macro-F1: 0.99)
- Model size: ~10 MB
- Inference speed: <5ms per sentence
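The stratified 70% / 15% / 15% split above can be sketched in plain Python. This is an illustrative reconstruction, not the authors' script: sentences are grouped by language label so each split preserves the per-language proportions, and the seed is an arbitrary choice.

```python
# Illustrative sketch of a stratified 70/15/15 split (not the original script).
import random
from collections import defaultdict

def stratified_split(samples, seed=42):
    """samples: list of (label, sentence); returns (train, dev, test)."""
    by_label = defaultdict(list)
    for label, text in samples:
        by_label[label].append((label, text))
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_train = int(len(group) * 0.70)
        n_dev = int(len(group) * 0.15)
        train.extend(group[:n_train])
        dev.extend(group[n_train:n_train + n_dev])
        test.extend(group[n_train + n_dev:])
    return train, dev, test

# 2,000 sentences per language → 1,400 / 300 / 300 per language
samples = [(lang, f"sentence {i}") for lang in ("khasi", "mizo") for i in range(2000)]
train, dev, test = stratified_split(samples)
print(len(train), len(dev), len(test))  # 2800 600 600
```

With 2,000 sentences per language this yields exactly 300 test sentences per language, matching the per-language support in the performance table.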
## Why fastText?
Extensive experiments show that character-level models outperform transformer-based language models (e.g., NE-BERT, XLM-R) for Northeast Indian LID.
Key findings:
- Transformer models (NE-BERT, XLM-R) achieved only 9-37% accuracy on challenging samples
- fastText maintained 99%+ accuracy even on script-diverse, low-resource languages
- Character n-grams capture orthographic patterns better than subword tokenization for these languages
This model therefore prioritizes:
- ✅ Script awareness
- ✅ Orthographic cues
- ✅ Low-resource robustness
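The script-awareness argument can be made concrete with a small sketch of the character 2–5-gram features that fastText hashes internally, written here in plain Python purely for illustration (fastText computes these itself; this is not part of the model's API):

```python
# Illustrative: enumerate the character n-grams (2-5) a sentence contributes.
def char_ngrams(text, n_min=2, n_max=5):
    ngrams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            ngrams.add(text[i:i + n])
    return ngrams

khasi = char_ngrams("ka sngi")    # Latin-script Khasi
assamese = char_ngrams("আজি মই")  # Bengali-Assamese script
# Disjoint scripts share zero n-grams - an easy, reliable signal
print(len(khasi & assamese))  # 0
```

For same-script pairs (e.g. Khasi vs. Mizo, both Latin), classification instead falls back on language-specific orthographic patterns such as frequent digraphs, which these n-grams also capture.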
## Performance
| Language | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Assamese | 1.00 | 1.00 | 1.00 | 300 |
| Bodo | 0.99 | 0.98 | 0.99 | 300 |
| English | 0.96 | 0.99 | 0.98 | 300 |
| Garo | 0.99 | 1.00 | 1.00 | 300 |
| Hindi | 0.96 | 0.97 | 0.97 | 300 |
| Khasi | 1.00 | 0.99 | 0.99 | 300 |
| Kokborok | 1.00 | 0.99 | 1.00 | 300 |
| Meitei | 1.00 | 0.99 | 1.00 | 300 |
| Mizo | 0.99 | 0.99 | 0.99 | 300 |
| Naga | 1.00 | 1.00 | 1.00 | 300 |
| Nyishi | 1.00 | 0.99 | 0.99 | 300 |
| Overall | 0.99 | 0.99 | 0.99 | 3,300 |
Test Accuracy: 99.09%
## Benchmark Comparison
NE-LID significantly outperforms existing language identification systems on Northeast Indian languages:
| Model | Overall Accuracy | Coverage (11 languages) |
|---|---|---|
| NE-LID (Ours) | 99.09% | 11/11 ✅ |
| GlotLID | 73.12% | 9/11 (missing Garo, Naga) |
| OpenLID (Meta) | 42.03% | 5/11 |
| IndicLID (AI4Bharat) | 39.30% | 4/11 |
| LangDetect (Google) | 24.33% | 3/11 |
Key Findings:
- NE-LID scores roughly 26 percentage points higher than the best competitor (GlotLID, 73.12%)
- Existing multilingual models fail to support 6-7 Northeast Indian languages
- Character n-gram approach outperforms transformer-based models for script-diverse, low-resource languages
## Installation

```bash
pip install fasttext
```
## Usage

### Basic Usage (Python)
```python
import fasttext

# Load the model
model = fasttext.load_model("ne_lid.bin")

# Predict language
text = "Ki paidbah shnong ki la ia shim bynta ha ka jingïalang"
labels, probs = model.predict(text)
print(f"Language: {labels[0].replace('__label__', '')}")
print(f"Confidence: {probs[0]:.4f}")
```
Output:

```
Language: khasi
Confidence: 0.9999
```
### Batch Prediction

```python
texts = [
    "Ka sngi ka lieh",
    "আজি মই বজাৰলৈ গৈছিলোঁ",
    "Mizo tawng hi a ṭha hle"
]

# Predicting on a list returns parallel lists of labels and probabilities
all_labels, all_probs = model.predict(texts)
for text, labels, probs in zip(texts, all_labels, all_probs):
    lang = labels[0].replace('__label__', '')
    print(f"{text[:30]:30} → {lang:10} ({probs[0]:.3f})")
```
### Get Top-K Predictions

```python
# Get top 3 language predictions
labels, probs = model.predict(text, k=3)
for label, prob in zip(labels, probs):
    lang = label.replace('__label__', '')
    print(f"{lang}: {prob:.4f}")
```
## Limitations
- Designed for monolingual sentences – not optimized for code-mixed text
- Sentence-level only – not designed for word-level or document-level LID
- Performance may degrade on extremely short inputs (≤2 tokens)
- English and Hindi precision dips to 0.96–0.97 (expected, due to loanwords and cross-language overlap)
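One practical way to guard against the short-input and code-mixed cases above is to abstain when the model's confidence is low. The sketch below mocks the predictor for self-containedness; with the real model, `model.predict` returns the same `(labels, probs)` shape shown in the usage examples. The threshold value is an assumption to be tuned on a dev set.

```python
# Hedged sketch: confidence-thresholded prediction (threshold is illustrative).
CONF_THRESHOLD = 0.90

def safe_predict(predict_fn, text, threshold=CONF_THRESHOLD):
    """Return a language code, or None when confidence is below threshold."""
    labels, probs = predict_fn(text)
    if probs[0] < threshold:
        return None  # abstain: likely code-mixed, noisy, or too short
    return labels[0].replace("__label__", "")

# Mock predictor standing in for model.predict: unsure on very short inputs
mock = lambda t: (["__label__khasi"], [0.55 if len(t.split()) <= 2 else 0.99])
print(safe_predict(mock, "ka"))               # None
print(safe_predict(mock, "Ka sngi ka lieh"))  # khasi
```

Abstaining and routing low-confidence inputs to a fallback (or a human) is usually preferable to emitting a wrong label in downstream pipelines.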
## Model Files

- `ne_lid.bin` – Main fastText model (binary format)
- `ne_lid.ftz` – Compressed model (optional, for smaller deployments)
## Training Details
Data Sources:
- Training corpus derived from NE-BERT dataset
- 2,000 sentences per language, stratified by length and script
- Balanced across language families (Austroasiatic, Tibeto-Burman, Indo-Aryan)
Hyperparameters:
- Learning rate: 0.1
- Epochs: 25
- Word n-grams: 1-3
- Character n-grams: 2-5
- Loss function: Softmax
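The hyperparameters above map onto a fastText supervised training call as sketched below. The `__label__<lang> <sentence>` line format is fastText's standard training format; the file path and tiny sample corpus are placeholders, not the actual training data.

```python
# Sketch of the training setup (paths and sample rows are illustrative).
rows = [("khasi", "Ka sngi ka lieh"), ("mizo", "Mizo tawng hi a ṭha hle")]
with open("train.txt", "w", encoding="utf-8") as f:
    for lang, sent in rows:
        # fastText supervised format: "__label__<lang> <sentence>"
        f.write(f"__label__{lang} {sent}\n")

# Training call (requires `pip install fasttext`), mirroring the
# hyperparameters listed above:
# import fasttext
# model = fasttext.train_supervised(
#     input="train.txt",
#     lr=0.1, epoch=25,
#     wordNgrams=3,     # word n-grams up to 3
#     minn=2, maxn=5,   # character n-grams 2-5
#     loss="softmax",
# )
```

`minn`/`maxn` are the parameters that enable the character n-gram features discussed in "Why fastText?".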
## License
This model is released under Creative Commons Attribution 4.0 International (CC BY 4.0).
You are free to:
- ✅ Share — copy and redistribute the material
- ✅ Adapt — remix, transform, and build upon the material
Under the following terms:
- Attribution — You must give appropriate credit to MWire Labs
## Citation
If you use NE-LID in your research or applications, please cite:
```bibtex
@misc{mwirelabs2025nelid,
  title={NE-LID: Northeast Language Identification},
  author={MWire Labs},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/MWirelabs/ne-lid}}
}
```
## About MWire Labs
MWire Labs is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.
Repository: MWirelabs/ne-lid
Contact: MWire Labs
## Acknowledgments
We thank the open-source community and contributors to the NE-BERT corpus that made this work possible.
Last Updated: January 2026
Version: 1.0.0