Update Model Card with Benchmark & CC-BY-4.0 License
README.md
ADDED

---
language:
- asm
- mni
- kha
- lus
- grt
- trp
- njz
- nag
- pnr
- eng
- hin
tags:
- modernbert
- masked-language-modeling
- northeast-india
- low-resource-nlp
- mwirelabs
license: cc-by-4.0
datasets:
- MWirelabs/NE-BERT-Raw-Corpus
pipeline_tag: fill-mask
widget:
- text: "Nga leit sha <mask>."
  example_title: "Khasi (Location)"
- text: "মই <mask> ভাল পাওঁ।"
  example_title: "Assamese (Love)"
- text: "Eina <mask> nungshi."
  example_title: "Meitei (Love)"
inference:
  parameters:
    mask_token: "<mask>"
---

# NE-BERT: Northeast India's First Multilingual ModernBERT

<div align="center">
<img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ne_bert_loss_chart.png" alt="NE-BERT Training Loss" width="800"/>
</div>

**NE-BERT** is a state-of-the-art transformer model designed specifically for the low-resource languages of Northeast India. Unlike generic multilingual models (mBERT/XLM-R), which often fail on under-represented languages such as Pnar or Kokborok due to vocabulary fragmentation, NE-BERT uses a **Weighted Tokenizer** and **Balanced Sampling** to ensure high-quality representation for its nine indigenous languages.

Built on the **ModernBERT** architecture, it supports a context length of **8192 tokens**, uses Flash Attention 2 for efficient inference, and treats Northeast Indian languages as first-class citizens.
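
Flash Attention 2 is opt-in in `transformers`; a minimal loading sketch, assuming a CUDA GPU and the `flash-attn` package (it falls back to the default attention otherwise):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/ne-bert")
try:
    # Flash Attention 2 requires flash-attn and a half-precision dtype on GPU.
    model = AutoModelForMaskedLM.from_pretrained(
        "MWirelabs/ne-bert",
        attn_implementation="flash_attention_2",
        torch_dtype=torch.bfloat16,
    ).to("cuda")
except (ImportError, ValueError):
    # flash-attn not installed or unsupported: use the default attention.
    model = AutoModelForMaskedLM.from_pretrained("MWirelabs/ne-bert")
```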

## 🏆 Benchmark: NE-BERT vs. IndicBERT

We evaluated NE-BERT against `ai4bharat/indic-bert` (a widely used baseline for Indian languages) on a held-out test set of grammatically correct sentences across the supported languages. **Lower perplexity (PPL) is better.**

| Model | Perplexity (PPL) | Verdict |
| :--- | :--- | :--- |
| **IndicBERT** | 26.29 | Confused / Random Guessing |
| **NE-BERT (Ours)** | **5.28** | **Native-Level Fluency** |

*Result: NE-BERT achieves roughly 5x lower perplexity than a generic Indian-language model, reflecting a far stronger grasp of the context, grammar, and vocabulary of Northeast Indian languages.*

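The card does not spell out the PPL protocol; a common choice for masked LMs is pseudo-perplexity (mask each token in turn and average the negative log-likelihoods). A minimal sketch under that assumption, using a Khasi sentence from the example below:

```python
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("MWirelabs/ne-bert")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/ne-bert").eval()

def pseudo_perplexity(sentence: str) -> float:
    """Mask each position in turn and average its negative log-likelihood."""
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    for i in range(1, len(ids) - 1):  # skip the special tokens at the ends
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        nlls.append(-torch.log_softmax(logits, dim=-1)[ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

print(pseudo_perplexity("Nga leit sha iew."))  # lower is better
```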

## 🌍 Supported Languages & Data

The model was trained on a custom corpus curated by **MWirelabs**, combining verified monolingual data with aggressive oversampling for micro-languages (a sketch of the sampling scheme follows the table).

| Language | ISO Code | Script | Corpus Size | Training Strategy |
| :--- | :--- | :--- | :--- | :--- |
| **Assamese** | `asm` | Bengali-Assamese | ~1M Sentences | Native |
| **Meitei (Manipuri)** | `mni` | Bengali-Assamese | ~1.3M Sentences | Native |
| **Khasi** | `kha` | Roman | ~1M Sentences | Native |
| **Mizo** | `lus` | Roman | ~1M Sentences | Native |
| **Nyishi** | `njz` | Roman | ~55k Sentences | Oversampled (20x) |
| **Garo** | `grt` | Roman | ~10k Sentences | Oversampled (20x) |
| **Nagamese** | `nag` | Roman | ~14k Sentences | Oversampled (20x) |
| **Kokborok** | `trp` | Roman | ~2.5k Sentences | Oversampled (100x) |
| **Pnar** | `pnr` | Roman | ~1k Sentences | Oversampled (100x) |
| **English/Hindi** | `eng`/`hin` | Roman/Devanagari | ~660k Sentences | Anchor Languages |
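
The exact data-mixing code is not published with the card; the following is a minimal sketch of one way to realize the multipliers above with the `datasets` library. The per-language config names for `MWirelabs/NE-BERT-Raw-Corpus` are hypothetical, and the English/Hindi anchors would be added the same way.

```python
from datasets import interleave_datasets, load_dataset

# Oversampling multipliers from the table above; config names are hypothetical.
weights = {"asm": 1, "mni": 1, "kha": 1, "lus": 1,
           "njz": 20, "grt": 20, "nag": 20, "trp": 100, "pnr": 100}

parts = [load_dataset("MWirelabs/NE-BERT-Raw-Corpus", lang, split="train")
         for lang in weights]

# Sampling probability proportional to (corpus size x multiplier).
sizes = [len(d) * w for d, w in zip(parts, weights.values())]
probs = [s / sum(sizes) for s in sizes]

balanced = interleave_datasets(parts, probabilities=probs, seed=42,
                               stopping_strategy="all_exhausted")
```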

## 🚀 Quick Use

You can use NE-BERT directly with the Hugging Face `pipeline`.
**Note:** NE-BERT uses the RoBERTa-style mask token `<mask>`, not `[MASK]`.

```python
from transformers import pipeline

# 1. Load Model
unmasker = pipeline("fill-mask", model="MWirelabs/ne-bert", tokenizer="MWirelabs/ne-bert")

# 2. Test Example (Khasi)
# Input: "I go to <mask>." (Market/School/Home)
sentence = "Nga leit sha <mask>."

predictions = unmasker(sentence)
for p in predictions[:3]:
    print(f"{p['token_str']}: {p['score']:.1%}")

# Expected Output:
# iew: 25.4% (Market)
# skul: 15.1% (School)
# iing: 8.2% (Home)
```
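
If you skip the `pipeline` helper and mask inputs yourself, read the mask token off the tokenizer rather than hardcoding the string:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("MWirelabs/ne-bert")
print(tok.mask_token)     # "<mask>"
print(tok.mask_token_id)  # integer id to place at masked positions
```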

## 🔧 Technical Specifications

* **Architecture:** ModernBERT-Base (Pre-Norm, Rotary Embeddings)
* **Parameters:** ~149 Million
* **Context Window:** 8192 Tokens
* **Tokenizer:** Custom Unigram SentencePiece (Vocab: 50,368)
* **Training Hardware:** NVIDIA A40 (48GB)
* **Training Duration:** 10 Epochs
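
These numbers can be sanity-checked against the hosted config; a quick check (attribute names follow the ModernBERT config class in `transformers`):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("MWirelabs/ne-bert")
print(cfg.max_position_embeddings)  # context window, expected 8192
print(cfg.vocab_size)               # expected 50368
```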

## ⚠️ Limitations & Bias

While NE-BERT significantly outperforms existing models on these languages, users should be aware:

* **Script Sensitivity:** Meitei and Assamese must be provided in the Bengali-Assamese script. Romanized inputs (e.g., "Moi") may yield suboptimal results.
* **Domain Specificity:** The model is trained largely on general web text and wiki-style articles. It may struggle with highly technical or poetic domains in Pnar/Kokborok due to limited data size.

## 📚 Citation

If you use this model in your research, please cite:

```bibtex
@misc{ne-bert-2025,
  author       = {MWirelabs},
  title        = {NE-BERT: A Multilingual ModernBERT for Northeast India},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/MWirelabs/ne-bert}}
}
```