---
language:
- asm
- mni
- kha
- lus
- grt
- trp
- njz
- pbv
- nag
- eng
- hin
tags:
- modernbert
- masked-language-modeling
- northeast-india
- low-resource-nlp
- northeast-bert
- mwirelabs
- token-efficiency
license: cc-by-4.0
pipeline_tag: fill-mask
model-index:
- name: NE-BERT
  results:
  - task:
      type: masked-language-modeling
      name: Masked Language Modeling
    dataset:
      name: NE-BERT Evaluation Corpus
      type: synthetic
    metrics:
    - name: Perplexity
      type: perplexity
      value: 2.9811
widget:
- text: "Nga leit sha <mask>."
  example_title: "Khasi (Location)"
- text: "মই ভাত <mask> ভাল পাওঁ।"
  example_title: "Assamese (Action)"
- text: "Anga <mask> cha·jok."
  example_title: "Garo (Food)"
inference:
  parameters:
    mask_token: "<mask>"
---
<p align="center">
<!-- Model -->
<img alt="Model" src="https://img.shields.io/badge/Model-ModernBERT-0A84FF">
<!-- Task -->
<img alt="Task" src="https://img.shields.io/badge/Task-Masked%20Language%20Modeling-34C759">
<!-- Languages -->
<img alt="Languages" src="https://img.shields.io/badge/Supported%20Languages-9%20(+%20EN%2FHI)-AF52DE">
<!-- Region -->
<img alt="Region" src="https://img.shields.io/badge/Region-Northeast%20India-FF9F0A">
<!-- License -->
<img alt="License" src="https://img.shields.io/badge/License-CC--BY--4.0-FC5C65">
</p>
# NE-BERT: Northeast India's Multilingual ModernBERT
**NE-BERT** is a transformer model designed specifically for the complex, low-resource linguistic landscape of Northeast India. It achieves regional state-of-the-art (SOTA) performance across multiple Northeast Indian languages and delivers **2x to 3x faster inference** than general multilingual models.
Built on the **ModernBERT** architecture, it supports a context length of **1024 tokens**, utilizes Flash Attention 2 for high-efficiency inference, and treats Northeast languages as first-class citizens.
---
## Quick Start
NE-BERT is built on the **ModernBERT** architecture, which requires `transformers>=4.48.0`.
```python
# First, install the library:
# pip install -U transformers
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
# Load NE-BERT (No remote code needed for transformers >= 4.48)
model_name = "MWirelabs/ne-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
# Example: Nagamese Creole (ISO: nag)
text = "Moi bhat <mask>." # "I [eat] rice"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Retrieve the top prediction at the masked position
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_token_id = logits[0, mask_token_index].argmax(dim=-1)
print(tokenizer.decode(predicted_token_id))
# Expected output: "khai" (eat)
```
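For quick experiments, the `fill-mask` pipeline does the same thing in fewer lines; the snippet below is a minimal sketch. On supported GPUs, passing `attn_implementation="flash_attention_2"` to `from_pretrained` enables the Flash Attention 2 path mentioned above (requires the `flash-attn` package).
```python
from transformers import pipeline

# Minimal fill-mask usage; top_k controls how many candidates are returned
fill_mask = pipeline("fill-mask", model="MWirelabs/ne-bert")

for pred in fill_mask("Moi bhat <mask>.", top_k=3):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```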
---
## Training Data & Strategy
NE-BERT was trained on a carefully curated corpus using a **Smart-Weighted Sampling** strategy so that low-resource languages were not drowned out by the anchor languages; a sketch of the weighting scheme follows the table below.
<div align="center">
<img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ne_bert_data_dist.png" alt="Data Distribution Pie Chart" width="600"/>
</div>
| Language | HF Tag | Script | Corpus Size | Training Strategy |
| :--- | :--- | :--- | :--- | :--- |
| **Assamese** | `asm-Beng` | Bengali-Assamese | ~1M Sentences | Native |
| **Meitei (Manipuri)** | `mni-Beng` | Bengali-Assamese | ~1.3M Sentences | Native |
| **Khasi** | `kha-Latn` | Roman | ~1M Sentences | Native |
| **Mizo** | `lus-Latn` | Roman | ~1M Sentences | Native |
| **Nyishi** | `njz-Latn` | Roman | ~55k Sentences | **Oversampled** (20x) |
| **Nagamese** | `nag-Latn` | Roman | ~13k Sentences | **Oversampled** (20x) |
| **Garo** | `grt-Latn` | Roman | ~10k Sentences | **Oversampled** (20x) |
| **Pnar** | `pbv-Latn` | Roman | ~1k Sentences | **Oversampled** (100x) |
| **Kokborok** | `trp-Latn` | Roman | ~2.5k Sentences | **Oversampled** (100x) |
| **Anchor Languages** | `eng-Latn`/`hin-Deva` | Roman/Devanagari | ~660k Sentences | Downsampled |
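The exact sampling code is not published with this card; the sketch below shows one plausible way to realize the weights above with the 🤗 `datasets` library, converting each language's oversampling factor into an interleaving probability. The file paths and the language subset shown are placeholders.
```python
from datasets import load_dataset, interleave_datasets

# Hypothetical per-language corpora with the factors from the table above
corpora = {
    "asm": ("data/asm.txt", 1),    # Native
    "kha": ("data/kha.txt", 1),    # Native
    "grt": ("data/grt.txt", 20),   # Oversampled 20x
    "pbv": ("data/pbv.txt", 100),  # Oversampled 100x
}

subsets, weights = [], []
for lang, (path, factor) in corpora.items():
    ds = load_dataset("text", data_files=path, split="train")
    subsets.append(ds)
    weights.append(len(ds) * factor)  # effective size after oversampling

# Normalize effective sizes into sampling probabilities
probs = [w / sum(weights) for w in weights]
mixed = interleave_datasets(subsets, probabilities=probs, seed=42,
                            stopping_strategy="all_exhausted")
```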
### Note on Oversampling
To address the extreme data imbalance (e.g., 1k Pnar sentences vs 3M Hindi sentences), we applied aggressive upsampling to micro-languages. To prevent overfitting on these repeated examples, we utilized **Dynamic Masking** during training. This ensures that the model sees different masking patterns for the same sentence across epochs, forcing it to learn semantic relationships rather than memorizing token sequences.
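Dynamic masking falls out naturally when masking happens at collation time rather than during preprocessing. A minimal sketch with the standard `DataCollatorForLanguageModeling`; the 15% masking rate is an assumed default, not a published NE-BERT hyperparameter (the original ModernBERT recipe used 30%).
```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/ne-bert")

# Masks are re-sampled every time a batch is built, so a sentence repeated
# 100x (e.g., Pnar) receives a different mask pattern on each pass.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # assumed rate; not confirmed by the model card
)
```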
---
## Evaluation and Benchmarks: Regional SOTA
We evaluated NE-BERT against two widely used multilingual baselines, mBERT and IndicBERT, on a held-out test set.
### 1. The "Eye Test": Qualitative Comparison
NE-BERT's advantage is most visible when predicting missing words in low-resource languages: where generic models fall back to punctuation or sub-word fragments, NE-BERT produces coherent, culturally relevant words. A reproduction sketch follows the table.
| Language | Input Sentence | **NE-BERT (Ours)** | mBERT | IndicBERT |
| :--- | :--- | :--- | :--- | :--- |
| **Assamese** | `মই ভাত <mask> ভাল পাওঁ।` <br>*(I like to [eat] rice)* | **খাই** (Eat) <br> *Correct Verb* | `##ি` <br> *Fragment* | `,` <br> *Punctuation* |
| **Khasi** | `Nga leit sha <mask>.` <br>*(I go to [home/market])* | **iing** (Home) <br> *Correct Noun* | `.` <br> *Period* | `s` <br> *Character* |
| **Garo** | `Anga <mask> cha·jok.` <br>*(I [ate] ...)* | **nokni** (Of house) <br> *Real Word* | `-` <br> *Symbol* | `.` <br> *Period* |
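These rows can be reproduced with the `fill-mask` pipeline. The baseline checkpoint IDs below (`bert-base-multilingual-cased` for mBERT, `ai4bharat/indic-bert` for IndicBERT) are assumptions based on the models' usual Hub identifiers, not confirmed by this card.
```python
from transformers import pipeline

prompts = {
    "Assamese": "মই ভাত <mask> ভাল পাওঁ।",
    "Khasi": "Nga leit sha <mask>.",
    "Garo": "Anga <mask> cha·jok.",
}

# Assumed Hub IDs for the baselines in the table above
for name in ["MWirelabs/ne-bert", "bert-base-multilingual-cased", "ai4bharat/indic-bert"]:
    fm = pipeline("fill-mask", model=name)
    for lang, text in prompts.items():
        # Each model has its own mask token ([MASK] vs <mask>)
        text = text.replace("<mask>", fm.tokenizer.mask_token)
        print(name, lang, fm(text, top_k=1)[0]["token_str"])
```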
### 2. Effectiveness: Perplexity (PPL)
Perplexity measures a model's fluency on held-out text (lower is better). NE-BERT leads on most languages, with the largest margins in the lowest-resource settings.
<div align="center">
<img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ppl_benchmark_chart.png" alt="Perplexity Benchmark Chart" width="800"/>
</div>
| Language | **NE-BERT** | mBERT | IndicBERT | Verdict |
| :--- | :--- | :--- | :--- | :--- |
| **Pnar** (`pbv`) | **2.51** | 3.74 | 8.25 | **3x Better than IndicBERT** |
| **Khasi** (`kha`) | **2.58** | 2.94 | 6.16 | **Best Specialized Model** |
| **Kokborok** (`trp`) | **2.67** | 3.79 | 7.91 | **Strong SOTA** |
| **Assamese** (`asm`) | 4.19 | **2.34** | 7.26 | *Competitive* |
| **Mizo** (`lus`) | **3.09** | 3.13 | 6.45 | **Best Specialized Model** |
| **Garo** (`grt`) | 3.80 | **3.32** | 8.64 | *Beats IndicBERT; trails mBERT* |
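This card does not spell out the PPL protocol. For masked LMs a common choice is pseudo-perplexity: mask each token in turn, score it, and exponentiate the mean negative log-likelihood. A minimal sketch under that assumption:
```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/ne-bert")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/ne-bert")

def pseudo_perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    nlls = []
    for i in range(1, len(ids) - 1):  # skip special tokens at the edges
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        nlls.append(-torch.log_softmax(logits, dim=-1)[ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

print(pseudo_perplexity("Nga leit sha iing."))  # Khasi example from above
```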
### 3. Efficiency: Token Fertility (Inference Speed)
Token fertility (tokens per word) is a key driver of inference speed and memory footprint (lower is better). NE-BERT's custom Unigram tokenizer delivers substantial efficiency gains.
<div align="center">
<img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/fertility_benchmark_chart.png" alt="Token Fertility Benchmark Chart" width="600"/>
</div>
*Result: NE-BERT is **2x to 3x more token-efficient** on major languages than mBERT and IndicBERT, translating directly to **faster inference** and **lower VRAM consumption** in production.*
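Fertility itself is easy to measure: tokenize a sample and divide the subword count by the whitespace word count. A minimal sketch (the sample sentence and the mBERT checkpoint ID are illustrative):
```python
from transformers import AutoTokenizer

def fertility(tokenizer_name: str, text: str) -> float:
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    return len(tok.tokenize(text)) / len(text.split())

sample = "Nga leit sha iing."  # Khasi example from above
for name in ["MWirelabs/ne-bert", "bert-base-multilingual-cased"]:
    print(name, round(fertility(name, sample), 2))
```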
---
## Training Performance
<div align="center">
<img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ne_bert_loss_chart.png" alt="Training Convergence Chart" width="800"/>
</div>
* **Final Training Loss:** 1.62
* **Final Validation Loss:** 1.64
* **Convergence:** Validation loss tracked training loss closely throughout, indicating robust generalization despite the small corpus sizes of the rarest languages.
## Technical Specifications
* **Architecture:** ModernBERT-Base (Pre-Norm, Rotary Embeddings)
* **Parameters:** ~149 Million
* **Context Window:** **1024 Tokens**
* **Tokenizer:** Custom Unigram SentencePiece (Vocab: 50,368)
* **Training Hardware:** NVIDIA A40 (48GB)
* **Training Duration:** 10 Epochs
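These numbers can be sanity-checked straight from the released checkpoint, assuming the standard ModernBERT config attributes:
```python
from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM

cfg = AutoConfig.from_pretrained("MWirelabs/ne-bert")
tok = AutoTokenizer.from_pretrained("MWirelabs/ne-bert")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/ne-bert")

print(cfg.max_position_embeddings)    # context window (expected: 1024)
print(len(tok))                       # vocabulary size (expected: 50,368)
print(f"{model.num_parameters():,}")  # parameter count (expected: ~149M)
```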
## Limitations and Bias
While NE-BERT outperforms existing models on most of these languages, users should be aware of the following:
* **Meitei/Hindi Leakage:** Due to the shared script and the high volume of Hindi anchor data, the model may sometimes predict Hindi/Sanskrit words (e.g., "Narayan") in Meitei contexts if the sentence structure is ambiguous.
* **Domain Specificity:** The model is trained largely on general web text. It may struggle with highly technical or poetic domains in micro-languages due to limited data size.
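One lightweight mitigation for the leakage issue is to look beyond the top-1 candidate; a minimal sketch that surfaces the top-k predictions so out-of-language forms are easy to spot and filter downstream:
```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="MWirelabs/ne-bert")

# Reviewing several candidates (not just top-1) makes leaked Hindi/Sanskrit
# forms visible; substitute your own Meitei sentence here.
for pred in fill_mask("মই ভাত <mask> ভাল পাওঁ।", top_k=5):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```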
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{ne-bert-2025,
  author       = {MWirelabs},
  title        = {NE-BERT: A Multilingual ModernBERT for Northeast India},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/MWirelabs/ne-bert}}
}
```