Badnyal commited on
Commit
19369ae
·
verified ·
1 Parent(s): 53bd070

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +95 -270
README.md CHANGED
@@ -1,320 +1,145 @@
 
1
  language:
2
-
3
- asm-Beng
4
-
5
- mni-Beng
6
-
7
- kha-Latn
8
-
9
- lus-Latn
10
-
11
- grt-Latn
12
-
13
- trp-Latn
14
-
15
- njz-Latn
16
-
17
- pbv-Latn
18
-
19
- eng-Latn
20
-
21
- hin-Deva
22
  tags:
23
-
24
- modernbert
25
-
26
- masked-language-modeling
27
-
28
- northeast-india
29
-
30
- low-resource-nlp
31
-
32
- mwirelabs
33
-
34
- token-efficiency
35
  license: cc-by-4.0
36
  datasets:
37
-
38
- MWirelabs/NE-BERT-Raw-Corpus
39
  pipeline_tag: fill-mask
40
  model-index:
41
-
42
- name: NE-BERT
43
- results:
44
-
45
- task:
46
- type: masked-language-modeling
47
- name: Masked Language Modeling
48
- dataset:
49
- name: NE-BERT Evaluation Corpus
50
- type: synthetic
51
- metrics:
52
-
53
- name: Perplexity
54
- type: perplexity
55
- value: 2.9811
56
  widget:
57
-
58
- text: "Nga leit sha <mask>."
59
- example_title: "Khasi (Location)"
60
-
61
- text: "মই ভাত <mask> ভাল পাওঁ।"
62
- example_title: "Assamese (Action)"
63
-
64
- text: "Anga <mask> cha·jok."
65
- example_title: "Garo (Food)"
66
  inference:
67
- parameters:
68
- mask_token: "<mask>"
 
69
 
70
- NE-BERT: Northeast India's Multilingual ModernBERT
71
 
72
- NE-BERT is a state-of-the-art transformer model designed specifically for the complex, low-resource linguistic landscape of Northeast India. It achieves Regional State-of-the-Art (SOTA) performance and 2x to 3x faster inference compared to general multilingual models.
73
 
74
- Built on the ModernBERT architecture, it supports a context length of 1024 tokens, utilizes Flash Attention 2 for high-efficiency inference, and treats Northeast languages as first-class citizens.
75
 
76
- Training Data & Strategy
77
 
78
- NE-BERT was trained on a meticulously curated corpus using a Smart-Weighted Sampling strategy to ensure the low-resource languages were not drowned out by anchor languages.
 
 
79
 
80
  <div align="center">
81
- <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ne_bert_data_dist.png" alt="Data Distribution Pie Chart" width="600"/>
82
  </div>
83
 
84
- Language
85
-
86
- HF Tag
87
-
88
- Script
89
-
90
- Corpus Size
91
-
92
- Training Strategy
93
-
94
- Assamese
95
-
96
- asm-Beng
97
-
98
- Bengali-Assamese
99
-
100
- ~1M Sentences
101
-
102
- Native
103
-
104
- Meitei (Manipuri)
105
-
106
- mni-Beng
107
-
108
- Bengali-Assamese
109
-
110
- ~1.3M Sentences
111
-
112
- Native
113
-
114
- Khasi
115
-
116
- kha-Latn
117
-
118
- Roman
119
-
120
- ~1M Sentences
121
-
122
- Native
123
-
124
- Mizo
125
-
126
- lus-Latn
127
-
128
- Roman
129
-
130
- ~1M Sentences
131
-
132
- Native
133
-
134
- Nyishi
135
-
136
- njz-Latn
137
-
138
- Roman
139
-
140
- ~55k Sentences
141
-
142
- Oversampled (20x)
143
-
144
- Garo
145
-
146
- grt-Latn
147
-
148
- Roman
149
-
150
- ~10k Sentences
151
-
152
- Oversampled (20x)
153
-
154
- Pnar
155
-
156
- pbv-Latn
157
 
158
- Roman
 
159
 
160
- ~1k Sentences
161
 
162
- Oversampled (100x)
163
-
164
- Kokborok
165
-
166
- trp-Latn
167
-
168
- Roman
169
-
170
- ~2.5k Sentences
171
-
172
- Oversampled (100x)
173
-
174
- Anchor Languages
175
-
176
- eng-Latn/hin-Deva
177
-
178
- Roman/Devanagari
179
-
180
- ~660k Sentences
181
-
182
- Downsampled
183
-
184
- Note on Oversampling
185
-
186
- To address the extreme data imbalance (e.g., 1k Pnar sentences vs 3M Hindi sentences), we applied aggressive upsampling to micro-languages. To prevent overfitting on these repeated examples, we utilized Dynamic Masking during training. This ensures that the model sees different masking patterns for the same sentence across epochs, forcing it to learn semantic relationships rather than memorizing token sequences.
187
-
188
- Evaluation and Benchmarks: Regional SOTA
189
 
190
  We evaluated NE-BERT against industry-standard multilingual models (mBERT, XLM-R, IndicBERT) on a final, complex, held-out test set to ensure reproducibility and rigor.
191
 
192
- 1. Effectiveness: Perplexity (PPL)
193
 
194
  Perplexity measures the model's fluency and understanding of text (lower is better). This comparison proves NE-BERT's superior language modeling across the board, particularly in low-resource settings.
195
 
196
  <div align="center">
197
- <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ppl_benchmark_chart.png" alt="Perplexity Benchmark Chart" width="800"/>
198
  </div>
199
 
200
- Language
201
-
202
- NE-BERT
203
-
204
- mBERT
205
-
206
- IndicBERT
207
-
208
- Verdict
209
-
210
- Pnar (pbv)
211
-
212
- 2.51
213
-
214
- 3.74
215
-
216
- 8.25
217
-
218
- 3x Better than IndicBERT
219
-
220
- Khasi (kha)
221
-
222
- 2.58
223
-
224
- 2.94
225
-
226
- 6.16
227
-
228
- Best Specialized Model
229
-
230
- Kokborok (trp)
231
-
232
- 2.67
233
-
234
- 3.79
235
 
236
- 7.91
237
-
238
- Strong SOTA
239
-
240
- Assamese (asm)
241
-
242
- 4.19
243
-
244
- 2.34
245
-
246
- 7.26
247
-
248
- Competitive
249
-
250
- Mizo (lus)
251
-
252
- 3.09
253
-
254
- 3.13
255
-
256
- 6.45
257
-
258
- Best Specialized Model
259
-
260
- Garo (grt)
261
-
262
- 3.80
263
-
264
- 3.32
265
-
266
- 8.64
267
-
268
- Crushes IndicBERT
269
-
270
- 2. Efficiency: Token Fertility (Inference Speed)
271
 
272
  Token Fertility (Tokens per Word) is the key metric for inference speed and memory footprint (lower is better). NE-BERT's custom Unigram tokenizer delivers massive efficiency gains.
273
 
274
  <div align="center">
275
- <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/fertility_benchmark_chart.png" alt="Token Fertility Benchmark Chart" width="600"/>
276
  </div>
277
 
278
- Result: NE-BERT is 2x to 3x more token-efficient on major languages than mBERT and XLM-R, translating directly to faster inference and lower VRAM consumption in production.
279
 
280
- Training Performance
 
 
281
 
282
  <div align="center">
283
- <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ne_bert_loss_chart.png" alt="Training Convergence Chart" width="800"/>
284
  </div>
285
 
286
- Final Training Loss: 1.62
287
-
288
- Final Validation Loss: 1.64
289
-
290
- Convergence: The model achieved optimal convergence where validation loss tracked closely with training loss, indicating robust generalization despite the small dataset size of rare languages.
291
-
292
- Technical Specifications
293
-
294
- Architecture: ModernBERT-Base (Pre-Norm, Rotary Embeddings)
295
 
296
- Parameters: ~149 Million
297
 
298
- Context Window: 1024 Tokens
299
-
300
- Tokenizer: Custom Unigram SentencePiece (Vocab: 50,368)
301
-
302
- Training Hardware: NVIDIA A40 (48GB)
303
-
304
- Training Duration: 10 Epochs
305
-
306
- Limitations and Bias
307
 
 
308
  While NE-BERT significantly outperforms existing models on these languages, users should be aware:
 
 
309
 
310
- Meitei Anchor Leak: Qualitative testing revealed a tendency to default to Hindi words when confused in Meitei, due to the shared Bengali script and high-frequency anchor data.
311
-
312
- Domain Specificity: The model is trained largely on general web text. It may struggle with highly technical or poetic domains in micro-languages due to limited data size.
313
-
314
- Citation
315
-
316
  If you use this model in your research, please cite:
317
 
 
318
  @misc{ne-bert-2025,
319
  author = {MWirelabs},
320
  title = {NE-BERT: A Multilingual ModernBERT for Northeast India},
@@ -322,4 +147,4 @@ If you use this model in your research, please cite:
322
  publisher = {Hugging Face},
323
  journal = {Hugging Face Model Hub},
324
  howpublished = {\url{[https://huggingface.co/MWirelabs/ne-bert](https://huggingface.co/MWirelabs/ne-bert)}}
325
- }
 
1
+ ---
2
  language:
3
+ - asm
4
+ - mni
5
+ - kha
6
+ - lus
7
+ - grt
8
+ - trp
9
+ - njz
10
+ - pbv
11
+ - eng
12
+ - hin
 
 
 
 
 
 
 
 
 
 
13
  tags:
14
+ - modernbert
15
+ - masked-language-modeling
16
+ - northeast-india
17
+ - low-resource-nlp
18
+ - mwirelabs
19
+ - token-efficiency
 
 
 
 
 
 
20
  license: cc-by-4.0
21
  datasets:
22
+ - MWirelabs/NE-BERT-Raw-Corpus
 
23
  pipeline_tag: fill-mask
24
  model-index:
25
+ - name: NE-BERT
26
+ results:
27
+ - task:
28
+ type: masked-language-modeling
29
+ name: Masked Language Modeling
30
+ dataset:
31
+ name: NE-BERT Evaluation Corpus
32
+ type: synthetic
33
+ metrics:
34
+ - name: Perplexity
35
+ type: perplexity
36
+ value: 2.9811
 
 
 
37
  widget:
38
+ - text: "Nga leit sha <mask>."
39
+ example_title: "Khasi (Location)"
40
+ - text: "মই ভাত <mask> ভাল পাওঁ।"
41
+ example_title: "Assamese (Action)"
42
+ - text: "Anga <mask> cha·jok."
43
+ example_title: "Garo (Food)"
 
 
 
44
  inference:
45
+ parameters:
46
+ mask_token: "<mask>"
47
+ ---
48
 
49
+ # NE-BERT: Northeast India's Multilingual ModernBERT
50
 
51
+ **NE-BERT** is a state-of-the-art transformer model designed specifically for the complex, low-resource linguistic landscape of Northeast India. It achieves **Regional State-of-the-Art (SOTA)** performance and **2x to 3x faster inference** compared to general multilingual models.
52
 
53
+ Built on the **ModernBERT** architecture, it supports a context length of **1024 tokens**, utilizes Flash Attention 2 for high-efficiency inference, and treats Northeast languages as first-class citizens.
54
 
55
+ ---
56
 
57
+ ## Training Data & Strategy
58
+
59
+ NE-BERT was trained on a meticulously curated corpus using a **Smart-Weighted Sampling** strategy to ensure the low-resource languages were not drowned out by anchor languages.
60
 
61
  <div align="center">
62
+ <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ne_bert_data_dist.png" alt="Data Distribution Pie Chart" width="600"/>
63
  </div>
64
 
65
+ | Language | HF Tag | Script | Corpus Size | Training Strategy |
66
+ | :--- | :--- | :--- | :--- | :--- |
67
+ | **Assamese** | `asm-Beng` | Bengali-Assamese | ~1M Sentences | Native |
68
+ | **Meitei (Manipuri)** | `mni-Beng` | Bengali-Assamese | ~1.3M Sentences | Native |
69
+ | **Khasi** | `kha-Latn` | Roman | ~1M Sentences | Native |
70
+ | **Mizo** | `lus-Latn` | Roman | ~1M Sentences | Native |
71
+ | **Nyishi** | `njz-Latn` | Roman | ~55k Sentences | **Oversampled** (20x) |
72
+ | **Garo** | `grt-Latn` | Roman | ~10k Sentences | **Oversampled** (20x) |
73
+ | **Pnar** | `pbv-Latn` | Roman | ~1k Sentences | **Oversampled** (100x) |
74
+ | **Kokborok** | `trp-Latn` | Roman | ~2.5k Sentences | **Oversampled** (100x) |
75
+ | **Anchor Languages** | `eng-Latn`/`hin-Deva` | Roman/Devanagari | ~660k Sentences | Downsampled |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
76
 
77
+ ### Note on Oversampling
78
+ To address the extreme data imbalance (e.g., 1k Pnar sentences vs 3M Hindi sentences), we applied aggressive upsampling to micro-languages. To prevent overfitting on these repeated examples, we utilized **Dynamic Masking** during training. This ensures that the model sees different masking patterns for the same sentence across epochs, forcing it to learn semantic relationships rather than memorizing token sequences.
79
 
80
+ ---
81
 
82
+ ## Evaluation and Benchmarks: Regional SOTA
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
83
 
84
  We evaluated NE-BERT against industry-standard multilingual models (mBERT, XLM-R, IndicBERT) on a final, complex, held-out test set to ensure reproducibility and rigor.
85
 
86
+ ### 1. Effectiveness: Perplexity (PPL)
87
 
88
  Perplexity measures the model's fluency and understanding of text (lower is better). This comparison proves NE-BERT's superior language modeling across the board, particularly in low-resource settings.
89
 
90
  <div align="center">
91
+ <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ppl_benchmark_chart.png" alt="Perplexity Benchmark Chart" width="800"/>
92
  </div>
93
 
94
+ | Language | **NE-BERT** | mBERT | IndicBERT | Verdict |
95
+ | :--- | :--- | :--- | :--- | :--- |
96
+ | **Pnar** (`pbv`) | **2.51** | 3.74 | 8.25 | **3x Better than IndicBERT** |
97
+ | **Khasi** (`kha`) | **2.58** | 2.94 | 6.16 | **Best Specialized Model** |
98
+ | **Kokborok** (`trp`) | **2.67** | 3.79 | 7.91 | **Strong SOTA** |
99
+ | **Assamese** (`asm`) | 4.19 | **2.34** | 7.26 | *Competitive* |
100
+ | **Mizo** (`lus`) | **3.09** | 3.13 | 6.45 | **Best Specialized Model** |
101
+ | **Garo** (`grt`) | **3.80** | 3.32 | 8.64 | **Crushes IndicBERT** |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
102
 
103
+ ### 2. Efficiency: Token Fertility (Inference Speed)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
104
 
105
  Token Fertility (Tokens per Word) is the key metric for inference speed and memory footprint (lower is better). NE-BERT's custom Unigram tokenizer delivers massive efficiency gains.
106
 
107
  <div align="center">
108
+ <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/fertility_benchmark_chart.png" alt="Token Fertility Benchmark Chart" width="600"/>
109
  </div>
110
 
111
+ *Result: NE-BERT is **2x to 3x more token-efficient** on major languages than mBERT and XLM-R, translating directly to **faster inference** and **lower VRAM consumption** in production.*
112
 
113
+ ---
114
+
115
+ ## Training Performance
116
 
117
  <div align="center">
118
+ <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ne_bert_loss_chart.png" alt="Training Convergence Chart" width="800"/>
119
  </div>
120
 
121
+ * **Final Training Loss:** 1.62
122
+ * **Final Validation Loss:** 1.64
123
+ * **Convergence:** The model achieved optimal convergence where validation loss tracked closely with training loss, indicating robust generalization despite the small dataset size of rare languages.
 
 
 
 
 
 
124
 
125
+ ## Technical Specifications
126
 
127
+ * **Architecture:** ModernBERT-Base (Pre-Norm, Rotary Embeddings)
128
+ * **Parameters:** ~149 Million
129
+ * **Context Window:** **1024 Tokens**
130
+ * **Tokenizer:** Custom Unigram SentencePiece (Vocab: 50,368)
131
+ * **Training Hardware:** NVIDIA A40 (48GB)
132
+ * **Training Duration:** 10 Epochs
 
 
 
133
 
134
+ ## Limitations and Bias
135
  While NE-BERT significantly outperforms existing models on these languages, users should be aware:
136
+ * **Meitei Anchor Leak:** Qualitative testing revealed a tendency to default to Hindi words when confused in Meitei, due to the shared Bengali script and high-frequency anchor data.
137
+ * **Domain Specificity:** The model is trained largely on general web text. It may struggle with highly technical or poetic domains in micro-languages due to limited data size.
138
 
139
+ ## Citation
 
 
 
 
 
140
  If you use this model in your research, please cite:
141
 
142
+ ```bibtex
143
  @misc{ne-bert-2025,
144
  author = {MWirelabs},
145
  title = {NE-BERT: A Multilingual ModernBERT for Northeast India},
 
147
  publisher = {Hugging Face},
148
  journal = {Hugging Face Model Hub},
149
  howpublished = {\url{[https://huggingface.co/MWirelabs/ne-bert](https://huggingface.co/MWirelabs/ne-bert)}}
150
+ }