Badnyal committed on
Commit 7cf7d0a · verified · 1 Parent(s): e90c82a

Update README.md

Files changed (1): README.md +250 −2
README.md CHANGED
@@ -1,3 +1,251 @@
- # NE-LID fastText model
-
- Model files uploaded. Model card coming next.
---
language:
- as
- brx
- en
- grt
- hi
- kha
- trp
- mni
- lus
- njz
- njo
tags:
- language-identification
- fasttext
- northeast-india
- low-resource
- multilingual
license: cc-by-4.0
metrics:
- accuracy
- f1
library_name: fasttext
pipeline_tag: text-classification
model-index:
- name: NE-LID
  results:
  - task:
      type: text-classification
      name: Language Identification
    metrics:
    - type: accuracy
      value: 99.09
      name: Test Accuracy
    - type: f1
      value: 99
      name: Macro F1-Score
---
# NE-LID: Northeast Language Identification

![License](https://img.shields.io/badge/License-CC%20BY%204.0-blue.svg)
![Accuracy](https://img.shields.io/badge/Accuracy-99.09%25-brightgreen)

NE-LID is a **sentence-level language identification model** for low-resource languages of **Northeast India**, trained using a **character n-gram fastText classifier**.

The model achieves **near-ceiling accuracy (99.1%)** and is designed to be **fast, robust, and reproducible**, especially for script-diverse and low-resource settings.

---

## 🌐 Supported Languages (11)

| Language | Family | Script |
|----------|--------|--------|
| Assamese | Indo-Aryan | Bengali-Assamese |
| Bodo | Tibeto-Burman | Devanagari |
| English | Germanic | Latin |
| Garo | Tibeto-Burman | Latin |
| Hindi | Indo-Aryan | Devanagari |
| Khasi | Austroasiatic | Latin |
| Kokborok | Tibeto-Burman | Latin |
| Meitei | Tibeto-Burman | Bengali |
| Mizo | Tibeto-Burman | Latin |
| Naga | Tibeto-Burman | Latin |
| Nyishi | Tibeto-Burman | Latin |
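
If downstream code needs the ISO 639 codes from the metadata above rather than language names, a small lookup table can sit next to the model. This is a sketch under one loud assumption: only the `khasi` label string is confirmed by the usage example further down, and the other label strings are assumed to follow the same lowercase-name pattern.

```python
# Hypothetical mapping from predicted label strings to the ISO 639 codes
# declared in the model card metadata. Only "khasi" is confirmed by the
# example output below; the other label strings are assumed.
LABEL_TO_ISO = {
    "assamese": "as", "bodo": "brx", "english": "en", "garo": "grt",
    "hindi": "hi", "khasi": "kha", "kokborok": "trp", "meitei": "mni",
    "mizo": "lus", "naga": "njo", "nyishi": "njz",
}
```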

---

## 📊 Model Details

- **Model type**: fastText supervised classifier
- **Architecture**: Character n-grams (2–5)
- **Task**: Sentence-level Language Identification (LID)
- **Training data**: 22,000 sentences (2,000 per language)
- **Train / Dev / Test split**: 70% / 15% / 15% (stratified)
- **Evaluation accuracy**: **99.09%** (macro-F1: 0.99)
- **Model size**: ~10 MB
- **Inference speed**: <5 ms per sentence (see the timing sketch below)
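
The latency figure can be sanity-checked with a quick micro-benchmark. This is a minimal sketch, assuming `ne_lid.bin` sits in the working directory; absolute numbers will vary with hardware.

```python
import time

import fasttext

model = fasttext.load_model("ne_lid.bin")
sentence = "Ki paidbah shnong ki la ia shim bynta ha ka jingïalang"

model.predict(sentence)  # warm-up call before timing
n = 1000
start = time.perf_counter()
for _ in range(n):
    model.predict(sentence)
elapsed_ms = (time.perf_counter() - start) * 1000 / n
print(f"~{elapsed_ms:.3f} ms per sentence")
```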

---

## 🎯 Why fastText?

Extensive experiments show that **character-level models outperform transformer-based language models** (e.g., NE-BERT, XLM-R) for Northeast Indian LID.

**Key findings:**
- Transformer models (NE-BERT, XLM-R) achieved only 9–37% accuracy on challenging samples
- fastText maintained 99%+ accuracy even on script-diverse, low-resource languages
- Character n-grams capture orthographic patterns better than subword tokenization for these languages

This model therefore prioritizes:
- ✅ Script awareness
- ✅ Orthographic cues
- ✅ Low-resource robustness

---

## 📈 Performance

| Language | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| Assamese | 1.00 | 1.00 | 1.00 | 300 |
| Bodo | 0.99 | 0.98 | 0.99 | 300 |
| English | 0.96 | 0.99 | 0.98 | 300 |
| Garo | 0.99 | 1.00 | 1.00 | 300 |
| Hindi | 0.96 | 0.97 | 0.97 | 300 |
| Khasi | 1.00 | 0.99 | 0.99 | 300 |
| Kokborok | 1.00 | 0.99 | 1.00 | 300 |
| Meitei | 1.00 | 0.99 | 1.00 | 300 |
| Mizo | 0.99 | 0.99 | 0.99 | 300 |
| Naga | 1.00 | 1.00 | 1.00 | 300 |
| Nyishi | 1.00 | 0.99 | 0.99 | 300 |
| **Overall** | **0.99** | **0.99** | **0.99** | **3,300** |

**Test Accuracy: 99.09%**

---

## 🚀 Installation

```bash
pip install fasttext
```

---

## 💻 Usage

### Basic Usage (Python)

```python
import fasttext

# Load the model
model = fasttext.load_model("ne_lid.bin")

# Predict the language of a single sentence
text = "Ki paidbah shnong ki la ia shim bynta ha ka jingïalang"
labels, probs = model.predict(text)

print(f"Language: {labels[0].replace('__label__', '')}")
print(f"Confidence: {probs[0]:.4f}")
```

**Output:**
```
Language: khasi
Confidence: 0.9999
```

### Batch Prediction

```python
texts = [
    "Ka sngi ka lieh",
    "আজি মই বজাৰলৈ গৈছিলোঁ",
    "Mizo tawng hi a ṭha hle"
]

# predict() on a list returns one tuple of labels and one array of
# probabilities per input sentence, so iterate over them in parallel
all_labels, all_probs = model.predict(texts)
for text, labels, probs in zip(texts, all_labels, all_probs):
    lang = labels[0].replace('__label__', '')
    print(f"{text[:30]:30} → {lang:10} ({probs[0]:.3f})")
```

### Get Top-K Predictions

```python
# Get the top 3 language predictions for a single sentence
labels, probs = model.predict(text, k=3)

for label, prob in zip(labels, probs):
    lang = label.replace('__label__', '')
    print(f"{lang}: {prob:.4f}")
```

---

## ⚠️ Limitations

- **Designed for monolingual sentences** – not optimized for code-mixed text
- **Sentence-level only** – not designed for word-level or document-level LID
- **Performance may degrade** on extremely short inputs (≤2 tokens); a confidence-threshold fallback, sketched below, can help
- **English/Hindi confusion** at 96–97% (expected due to loanwords and script overlap)
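
Since very short and code-mixed inputs are where mistakes concentrate, one common mitigation is to reject predictions below a confidence cutoff. This is a minimal sketch, not part of the released model; the 0.90 threshold is a hypothetical value to tune on your own data.

```python
# Hedged sketch: fall back to "unknown" when the classifier is not
# confident, rather than trusting a weak guess on a 2-token input.
THRESHOLD = 0.90  # hypothetical cutoff; tune on held-out data

labels, probs = model.predict("Ka sngi", k=1)
if probs[0] >= THRESHOLD:
    print(labels[0].replace("__label__", ""))
else:
    print("unknown")
```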

---

## 📦 Model Files

- `ne_lid.bin` - Main fastText model (binary format)
- `ne_lid.ftz` - Compressed model (optional, for smaller deployments; see the quantization sketch below)
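
For reference, a `.ftz` like the one shipped here can be produced from the `.bin` with fastText's built-in quantization, and it loads through the same API. This is a sketch of the general recipe, not necessarily the exact settings used for this release.

```python
import fasttext

model = fasttext.load_model("ne_lid.bin")

# Product-quantize the model to shrink it on disk; the arguments here
# are library defaults, not confirmed settings for the released .ftz
model.quantize(retrain=False)
model.save_model("ne_lid.ftz")

# The compressed model is loaded exactly like the .bin
small = fasttext.load_model("ne_lid.ftz")
print(small.predict("Ka sngi ka lieh"))
```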

---

## 🔬 Training Details

**Data Sources:**
- Training corpus derived from the NE-BERT dataset
- 2,000 sentences per language, stratified by length and script
- Balanced across language families (Austroasiatic, Tibeto-Burman, Indo-Aryan)

**Hyperparameters:**
- Learning rate: 0.1
- Epochs: 25
- Word n-grams: 1–3
- Character n-grams: 2–5
- Loss function: Softmax
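
In fastText's Python API these hyperparameters map onto `train_supervised` as sketched below. The file name `train.txt` is hypothetical; it stands for a file in fastText's supervised format, one sentence per line prefixed with its `__label__<language>` tag.

```python
import fasttext

# Sketch of the training call implied by the hyperparameters above;
# "train.txt" is a hypothetical path, not a file shipped with the model
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.1,          # learning rate
    epoch=25,        # training epochs
    wordNgrams=3,    # word n-grams up to length 3
    minn=2,          # minimum character n-gram length
    maxn=5,          # maximum character n-gram length
    loss="softmax",  # plain softmax loss
)
model.save_model("ne_lid.bin")
```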

---

## 📄 License

This model is released under **Creative Commons Attribution 4.0 International (CC BY 4.0)**.

You are free to:
- ✅ Share — copy and redistribute the material
- ✅ Adapt — remix, transform, and build upon the material

Under the following terms:
- 📌 Attribution — you must give appropriate credit to MWire Labs

---

## 📚 Citation

If you use NE-LID in your research or applications, please cite:

```bibtex
@misc{mwirelabs2025nelid,
  title={NE-LID: Northeast Language Identification},
  author={MWire Labs},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/MWirelabs/ne-lid}}
}
```

---

## 🏢 About MWire Labs

**MWire Labs** is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.

**Repository:** [MWirelabs/ne-lid](https://huggingface.co/MWirelabs/ne-lid)
**Contact:** [MWire Labs](https://mwirelabs.com)

---

## 🙏 Acknowledgments

We thank the open-source community and the contributors to the NE-BERT corpus that made this work possible.

---

**Last Updated:** January 2025
**Version:** 1.0.0