---
language:
- asm
- mni
- kha
- lus
- grt
- trp
- njz
- pbv
- nag
- eng
- hin
tags:
- modernbert
- masked-language-modeling
- northeast-india
- low-resource-nlp
- northeast-bert
- mwirelabs
- token-efficiency
license: cc-by-4.0
pipeline_tag: fill-mask
model-index:
- name: NE-BERT
  results:
  - task:
      type: masked-language-modeling
      name: Masked Language Modeling
    dataset:
      name: NE-BERT Evaluation Corpus
      type: synthetic
    metrics:
    - name: Perplexity
      type: perplexity
      value: 2.9811
widget:
- text: "Nga leit sha <mask>."
  example_title: "Khasi (Location)"
- text: "মই ভাত <mask> ভাল পাওঁ।"
  example_title: "Assamese (Action)"
- text: "Anga <mask> cha·jok."
  example_title: "Garo (Food)"
inference:
  parameters:
    mask_token: "<mask>"
---

<p align="center">

  <!-- Model -->
  <img alt="Model" src="https://img.shields.io/badge/Model-ModernBERT-0A84FF">

  <!-- Task -->
  <img alt="Task" src="https://img.shields.io/badge/Task-Masked%20Language%20Modeling-34C759">

  <!-- Languages -->
  <img alt="Languages" src="https://img.shields.io/badge/Supported%20Languages-9%20(+%20EN%2FHI)-AF52DE">

  <!-- Region -->
  <img alt="Region" src="https://img.shields.io/badge/Region-Northeast%20India-FF9F0A">

  <!-- License -->
  <img alt="License" src="https://img.shields.io/badge/License-CC--BY--4.0-FC5C65">

</p>


# NE-BERT: Northeast India's Multilingual ModernBERT

**NE-BERT** is a transformer model designed specifically for the complex, low-resource linguistic landscape of Northeast India. It achieves **Regional State-of-the-Art (SOTA)** performance across multiple Northeast Indian languages and is **2x to 3x more token-efficient** than general multilingual models, which translates directly into faster inference.

Built on the **ModernBERT** architecture, it supports a context length of **1024 tokens**, utilizes Flash Attention 2 for high-efficiency inference, and treats Northeast languages as first-class citizens.

---
## Quick Start

NE-BERT is built on the **ModernBERT** architecture. You must use `transformers>=4.48.0`.

```python
# First, install the library:
# pip install -U transformers

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load NE-BERT (No remote code needed for transformers >= 4.48)
model_name = "MWirelabs/ne-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Example: Nagamese Creole (ISO: nag)
text = "Moi bhat <mask>." # "I [eat] rice"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Retrieve top prediction
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
print(tokenizer.decode(predicted_token_id))
# Output: "khai" (eat)
```
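
For quick experiments, the same checkpoint also works with the generic `fill-mask` pipeline. The snippet below is a minimal sketch; the pipeline API here is standard `transformers` functionality, not anything NE-BERT-specific.

```python
from transformers import pipeline

# Generic fill-mask pipeline; top_k controls how many candidates are returned.
fill_mask = pipeline("fill-mask", model="MWirelabs/ne-bert")

# Khasi example from the widget section: "I go to [home/market]."
for prediction in fill_mask("Nga leit sha <mask>.", top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```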

---

## Training Data & Strategy

NE-BERT was trained on a meticulously curated corpus using a **Smart-Weighted Sampling** strategy so that low-resource languages were not drowned out by the anchor languages (a sketch of one way to implement such mixing follows the table below).

<div align="center">
  <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ne_bert_data_dist.png" alt="Data Distribution Pie Chart" width="600"/>
</div>

| Language | HF Tag | Script | Corpus Size | Training Strategy |
| :--- | :--- | :--- | :--- | :--- |
| **Assamese** | `asm-Beng` | Bengali-Assamese | ~1M Sentences | Native |
| **Meitei (Manipuri)** | `mni-Beng` | Bengali-Assamese | ~1.3M Sentences | Native |
| **Khasi** | `kha-Latn` | Roman | ~1M Sentences | Native |
| **Mizo** | `lus-Latn` | Roman | ~1M Sentences | Native |
| **Nyishi** | `njz-Latn` | Roman | ~55k Sentences | **Oversampled** (20x) |
| **Nagamese** | `nag-Latn` | Roman | ~13k Sentences | **Oversampled** (20x) |
| **Garo** | `grt-Latn` | Roman | ~10k Sentences | **Oversampled** (20x) |
| **Pnar** | `pbv-Latn` | Roman | ~1k Sentences | **Oversampled** (100x) |
| **Kokborok** | `trp-Latn` | Roman | ~2.5k Sentences | **Oversampled** (100x) |
| **Anchor Languages** | `eng-Latn`/`hin-Deva` | Roman/Devanagari | ~660k Sentences | Downsampled |
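
The exact sampling code is not released with this card; the sketch below shows one common way to realize such weighted mixing with the `datasets` library. The file names and probabilities are illustrative placeholders, not the corpora or weights actually used for NE-BERT.

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical per-language text files -- placeholders for the real corpora.
khasi = load_dataset("text", data_files="kha.txt", split="train")
pnar = load_dataset("text", data_files="pbv.txt", split="train")
hindi = load_dataset("text", data_files="hin.txt", split="train")

# interleave_datasets draws examples according to `probabilities`, which is how a
# tiny language (Pnar) can be oversampled relative to a large anchor (Hindi).
mixed = interleave_datasets(
    [khasi, pnar, hindi],
    probabilities=[0.4, 0.3, 0.3],  # placeholder weights, not the actual ratios
    seed=42,
    stopping_strategy="all_exhausted",
)
```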

### Note on Oversampling
To address the extreme data imbalance (e.g., 1k Pnar sentences vs 3M Hindi sentences), we applied aggressive upsampling to micro-languages. To prevent overfitting on these repeated examples, we utilized **Dynamic Masking** during training. This ensures that the model sees different masking patterns for the same sentence across epochs, forcing it to learn semantic relationships rather than memorizing token sequences.
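
In the Hugging Face stack, dynamic masking corresponds to masking inside the data collator rather than during preprocessing. The sketch below illustrates that setup; the masking probability is a placeholder, since the card does not state the exact ratio used.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/ne-bert")

# The collator re-masks every batch on the fly, so an oversampled sentence is
# seen with a different mask pattern each time it recurs across epochs.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # placeholder value
)

# The same Khasi sentence, masked twice -- the masked positions will differ.
examples = [tokenizer("Nga leit sha iing.") for _ in range(2)]
batch = collator(examples)
print(batch["input_ids"])
print(batch["labels"])
```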

---

## Evaluation and Benchmarks: Regional SOTA

We evaluated NE-BERT against industry-standard multilingual models (mBERT and IndicBERT) on a held-out test set to ensure reproducibility and rigor.

### 1. The "Eye Test": Qualitative Comparison

The superiority of NE-BERT is evident when predicting missing words in low-resource languages. While generic models predict punctuation or sub-word fragments, NE-BERT predicts coherent, culturally relevant words.

| Language | Input Sentence | **NE-BERT (Ours)** | mBERT | IndicBERT |
| :--- | :--- | :--- | :--- | :--- |
| **Assamese** | `মই ভাত <mask> ভাল পাওঁ।` <br>*(I like to [eat] rice)* | **খাই** (Eat) <br> *Correct Verb* | `##ি` <br> *Fragment* | `,` <br> *Punctuation* |
| **Khasi** | `Nga leit sha <mask>.` <br>*(I go to [home/market])* | **iing** (Home) <br> *Correct Noun* | `.` <br> *Period* | `s` <br> *Character* |
| **Garo** | `Anga <mask> cha·jok.` <br>*(I [ate] ...)* | **nokni** (Of house) <br> *Real Word* | `-` <br> *Symbol* | `.` <br> *Period* |

### 2. Effectiveness: Perplexity (PPL)

Perplexity measures the model's fluency and understanding of text (lower is better). The comparison below shows NE-BERT's stronger language modeling on most of the evaluated languages, particularly in the lowest-resource settings.

<div align="center">
  <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ppl_benchmark_chart.png" alt="Perplexity Benchmark Chart" width="800"/>
</div>

| Language | **NE-BERT** | mBERT | IndicBERT | Verdict |
| :--- | :--- | :--- | :--- | :--- |
| **Pnar** (`pbv`) | **2.51** | 3.74 | 8.25 | **3x Better than IndicBERT** |
| **Khasi** (`kha`) | **2.58** | 2.94 | 6.16 | **Best Specialized Model** |
| **Kokborok** (`trp`) | **2.67** | 3.79 | 7.91 | **Strong SOTA** |
| **Assamese** (`asm`) | 4.19 | **2.34** | 7.26 | *Competitive* |
| **Mizo** (`lus`) | **3.09** | 3.13 | 6.45 | **Best Specialized Model** |
| **Garo** (`grt`) | 3.80 | **3.32** | 8.64 | Better than IndicBERT; mBERT leads |
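
No evaluation script ships with this card, so the following is only a minimal sketch of one way to estimate masked-LM perplexity: apply the masking collator to held-out sentences and exponentiate the mean cross-entropy loss. The example sentences are illustrative placeholders, not the actual evaluation corpus.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

model_name = "MWirelabs/ne-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

sentences = ["Nga leit sha iing.", "Moi bhat khai."]  # placeholder held-out data

losses = []
with torch.no_grad():
    for sent in sentences:
        batch = collator([tokenizer(sent)])
        loss = model(**batch).loss.item()
        if not math.isnan(loss):  # a draw may mask zero tokens on very short inputs
            losses.append(loss)

# Perplexity = exp(mean masked cross-entropy loss); lower is better.
print("PPL:", round(math.exp(sum(losses) / len(losses)), 2))
```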

### 3. Efficiency: Token Fertility (Inference Speed)

Token fertility (tokens per word) is a key driver of inference speed and memory footprint (lower is better). NE-BERT's custom Unigram tokenizer delivers substantial efficiency gains.

<div align="center">
  <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/fertility_benchmark_chart.png" alt="Token Fertility Benchmark Chart" width="600"/>
</div>

*Result: NE-BERT is **2x to 3x more token-efficient** on major languages than mBERT and IndicBERT, translating directly to **faster inference** and **lower VRAM consumption** in production.*
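
Token fertility can be reproduced for any tokenizer with a few lines; the sketch below assumes simple whitespace word-splitting, which may differ from the segmentation used for the chart above.

```python
from transformers import AutoTokenizer

def token_fertility(tokenizer, sentences):
    """Average number of subword tokens per whitespace-separated word."""
    total_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

sentences = ["Nga leit sha iing.", "Moi bhat khai."]  # illustrative examples only

for name in ["MWirelabs/ne-bert", "bert-base-multilingual-cased"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {token_fertility(tok, sentences):.2f} tokens/word")
```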

---

## Training Performance

<div align="center">
  <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ne_bert_loss_chart.png" alt="Training Convergence Chart" width="800"/>
</div>

* **Final Training Loss:** 1.62
* **Final Validation Loss:** 1.64
* **Convergence:** Validation loss tracked training loss closely throughout, indicating robust generalization despite the very small corpora of the rarest languages.

## Technical Specifications

* **Architecture:** ModernBERT-Base (Pre-Norm, Rotary Embeddings)
* **Parameters:** ~149 Million
* **Context Window:** **1024 Tokens**
* **Tokenizer:** Custom Unigram SentencePiece (Vocab: 50,368)
* **Training Hardware:** NVIDIA A40 (48GB)
* **Training Duration:** 10 Epochs
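
These figures can be spot-checked directly from the released checkpoint (assuming the config exposes the standard ModernBERT fields):

```python
from transformers import AutoConfig, AutoTokenizer, AutoModelForMaskedLM

name = "MWirelabs/ne-bert"
config = AutoConfig.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

print("Context window:", config.max_position_embeddings)      # expected: 1024
print("Vocabulary size:", len(tokenizer))                      # expected: 50,368
print("Parameters (M):", round(model.num_parameters() / 1e6))  # expected: ~149
```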

## Limitations and Bias
While NE-BERT significantly outperforms existing models on these languages, users should be aware:
* **Meitei/Hindi Leakage:** Due to the shared script and the high volume of Hindi anchor data, the model may sometimes predict Hindi/Sanskrit words (e.g., "Narayan") in Meitei contexts if the sentence structure is ambiguous.
* **Domain Specificity:** The model is trained largely on general web text. It may struggle with highly technical or poetic domains in micro-languages due to limited data size.

## Citation
If you use this model in your research, please cite:

```bibtex
@misc{ne-bert-2025,
  author = {MWirelabs},
  title = {NE-BERT: A Multilingual ModernBERT for Northeast India},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/MWirelabs/ne-bert}}
}
```