---
language:
- asm-Beng
- mni-Beng
- kha-Latn
- lus-Latn
- grt-Latn
- trp-Latn
- njz-Latn
- pbv-Latn
- eng-Latn
- hin-Deva
tags:
- modernbert
- masked-language-modeling
- northeast-india
- low-resource-nlp
- mwirelabs
- token-efficiency
license: cc-by-4.0
datasets:
- MWirelabs/NE-BERT-Raw-Corpus
pipeline_tag: fill-mask
model-index:
- name: NE-BERT
  results:
  - task:
      type: masked-language-modeling
      name: Masked Language Modeling
    dataset:
      name: NE-BERT Evaluation Corpus
      type: synthetic
    metrics:
    - name: Perplexity
      type: perplexity
      value: 2.9811
widget:
- text: "Nga leit sha <mask>."
  example_title: "Khasi (Location)"
- text: "মই ভাত <mask> ভাল পাওঁ।"
  example_title: "Assamese (Action)"
- text: "Anga <mask> cha·jok."
  example_title: "Garo (Food)"
inference:
  parameters:
    mask_token: "<mask>"
---
 
47
- # NE-BERT: Northeast India's Multilingual ModernBERT 🚀
 
48
 
49
- **NE-BERT** is a state-of-the-art transformer model designed specifically for the complex, low-resource linguistic landscape of Northeast India. It achieves **Regional State-of-the-Art (SOTA)** performance and **$2\text{x}$ to $3\text{x}$ faster inference** compared to general multilingual models.
 
 
 
 
 
 
 
50
 
51
- Built on the **ModernBERT** architecture, it supports a context length of **$1024$ tokens**, utilizes Flash Attention 2 for high-efficiency inference, and treats Northeast languages as first-class citizens.
52
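Since the pipeline tag is `fill-mask`, the model can be queried with the standard 🤗 `transformers` pipeline. A minimal sketch — the `predict` helper and `EXAMPLES` list are illustrative (not part of the released code); the sentences are the widget prompts from the metadata above:

```python
# Minimal fill-mask inference sketch; requires `pip install transformers torch`.
EXAMPLES = [
    "Nga leit sha <mask>.",     # Khasi
    "মই ভাত <mask> ভাল পাওঁ।",   # Assamese
    "Anga <mask> cha·jok.",     # Garo
]

def predict(text, model_id="MWirelabs/ne-bert", top_k=5):
    """Return the top-k fill-mask candidates for `text` (downloads weights on first call)."""
    from transformers import pipeline  # lazy import so this module loads without transformers
    fill = pipeline("fill-mask", model=model_id)
    return fill(text, top_k=top_k)

if __name__ == "__main__":
    for sentence in EXAMPLES:
        print(predict(sentence))
```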
 
---

## 💾 Training Data & Strategy

NE-BERT was trained on a meticulously curated corpus using a **Smart-Weighted Sampling** strategy to ensure the low-resource languages were not drowned out by anchor languages.

<div align="center">
  <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ne_bert_data_dist.png" alt="Data Distribution Pie Chart" width="600"/>
</div>
| Language | HF Tag | Script | Corpus Size | Training Strategy |
| :--- | :--- | :--- | :--- | :--- |
| **Assamese** | `asm-Beng` | Bengali-Assamese | ~1M Sentences | Native |
| **Meitei (Manipuri)** | `mni-Beng` | Bengali-Assamese | ~1.3M Sentences | Native |
| **Khasi** | `kha-Latn` | Roman | ~1M Sentences | Native |
| **Mizo** | `lus-Latn` | Roman | ~1M Sentences | Native |
| **Nyishi** | `njz-Latn` | Roman | ~55k Sentences | **Oversampled** (20x) |
| **Garo** | `grt-Latn` | Roman | ~10k Sentences | **Oversampled** (20x) |
| **Pnar** | `pbv-Latn` | Roman | ~1k Sentences | **Oversampled** (100x) |
| **Kokborok** | `trp-Latn` | Roman | ~2.5k Sentences | **Oversampled** (100x) |
| **Anchor Languages** | `eng-Latn`/`hin-Deva` | Roman/Devanagari | ~660k Sentences | Downsampled |

### Note on Oversampling

To address the extreme data imbalance (e.g., ~1k Pnar sentences vs. ~3M raw Hindi sentences), we applied aggressive upsampling to the micro-languages. To prevent overfitting on these repeated examples, we used **Dynamic Masking** during training: the model sees a different masking pattern for the same sentence in each epoch, forcing it to learn semantic relationships rather than memorize token sequences.
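To make the sampling strategy concrete, here is a hypothetical sketch of how the oversampling multipliers shift the batch distribution. The corpus sizes and multipliers come from the table above (anchor languages are omitted, since the card does not state their downsampling factor); the dict and function names are illustrative, not from the NE-BERT training code:

```python
# language: (raw sentence count, oversampling multiplier) -- values from the table above
CORPUS = {
    "asm": (1_000_000, 1),
    "mni": (1_300_000, 1),
    "kha": (1_000_000, 1),
    "lus": (1_000_000, 1),
    "njz": (55_000, 20),
    "grt": (10_000, 20),
    "pbv": (1_000, 100),
    "trp": (2_500, 100),
}

def sampling_weights(corpus):
    """Effective size after oversampling, normalised to sampling probabilities."""
    effective = {lang: n * mult for lang, (n, mult) in corpus.items()}
    total = sum(effective.values())
    return {lang: size / total for lang, size in effective.items()}

weights = sampling_weights(CORPUS)
# Pnar's share of sampled batches rises from ~0.02% of the raw data to ~1.7%.
```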
---
## 📈 Evaluation and Benchmarks: Regional SOTA

  We evaluated NE-BERT against industry-standard multilingual models (mBERT, XLM-R, IndicBERT) on a final, complex, held-out test set to ensure reproducibility and rigor.
### 1. Effectiveness: Perplexity (PPL)

Perplexity measures the model's fluency and understanding of text (lower is better). NE-BERT leads on most languages, with the largest margins in low-resource settings.

<div align="center">
  <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ppl_benchmark_chart.png" alt="Perplexity Benchmark Chart" width="800"/>
</div>

| Language | **NE-BERT** | mBERT | IndicBERT | Verdict |
| :--- | :--- | :--- | :--- | :--- |
| **Pnar** (`pbv`) | **2.51** | 3.74 | 8.25 | **3x better than IndicBERT** |
| **Khasi** (`kha`) | **2.58** | 2.94 | 6.16 | **Best specialized model** |
| **Kokborok** (`trp`) | **2.67** | 3.79 | 7.91 | **Strong SOTA** |
| Assamese (`asm`) | 4.19 | **2.34** | 7.26 | *Competitive; mBERT leads* |
| Mizo (`lus`) | **3.09** | 3.13 | 6.45 | **Best specialized model** |
| Garo (`grt`) | 3.80 | **3.32** | 8.64 | *Far better than IndicBERT; mBERT leads* |
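For reference, perplexity is simply the exponential of the mean negative log-likelihood per masked token, so the gaps above correspond directly to loss gaps. A minimal helper showing the relationship:

```python
import math

def perplexity(nll_per_token):
    """PPL = exp(mean negative log-likelihood per token); lower is better."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# A mean masked-token loss of ln(2.51) ~= 0.920 corresponds to Pnar's PPL of 2.51.
```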
### 2. Efficiency: Token Fertility (Inference Speed)

  Token Fertility (Tokens per Word) is the key metric for inference speed and memory footprint (lower is better). NE-BERT's custom Unigram tokenizer delivers massive efficiency gains.

<div align="center">
  <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/fertility_benchmark_chart.png" alt="Token Fertility Benchmark Chart" width="600"/>
</div>

*Result: NE-BERT is **2x to 3x more token-efficient** on major languages than mBERT and XLM-R, translating directly to **faster inference** and **lower VRAM consumption** in production.*
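Token fertility is straightforward to measure: total subword tokens divided by total whitespace-separated words over a corpus. A sketch — any callable tokenizer works; the real measurement would use the model's own SentencePiece tokenizer, stubbed here with a toy splitter:

```python
def token_fertility(sentences, tokenize):
    """Average subword tokens per whitespace word (lower = cheaper inference)."""
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

# With a real tokenizer:
#   token_fertility(corpus, AutoTokenizer.from_pretrained("MWirelabs/ne-bert").tokenize)
# Toy check: a pure whitespace "tokenizer" has fertility exactly 1.0.
print(token_fertility(["Nga leit sha Shillong.", "Anga mi cha·jok."], str.split))  # → 1.0
```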

---

## Training Performance

<div align="center">
  <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ne_bert_loss_chart.png" alt="Training Convergence Chart" width="800"/>
</div>

* **Final Training Loss:** 1.62
* **Final Validation Loss:** 1.64
* **Convergence:** Validation loss (1.64) tracked training loss (1.62) closely throughout, indicating robust generalization despite the small corpora of the rare languages.
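Because oversampled micro-language sentences recur many times across the 10 epochs, mask positions were resampled on every pass (dynamic masking) rather than fixed once. A simplified sketch, assuming the conventional 15% mask rate and omitting the usual 80/10/10 mask/random/keep mix; the function is illustrative, not the NE-BERT training code:

```python
import random

def dynamic_mask(token_ids, mask_id, p=0.15, rng=None):
    """Resample mask positions for one pass; labels are -100 (ignored) except at masked slots."""
    rng = rng or random.Random()
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tid in enumerate(token_ids):
        if rng.random() < p:
            labels[i] = tid    # the model must reconstruct the original token here
            inputs[i] = mask_id
    return inputs, labels

# The same (oversampled) sentence gets a different masking pattern on each epoch:
ids = [11, 12, 13, 14, 15, 16, 17, 18]
epoch1, _ = dynamic_mask(ids, mask_id=4, rng=random.Random(1))
epoch2, _ = dynamic_mask(ids, mask_id=4, rng=random.Random(2))
```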
 
## Technical Specifications

* **Architecture:** ModernBERT-Base (Pre-Norm, Rotary Embeddings)
* **Parameters:** ~149 Million
* **Context Window:** 1024 Tokens
* **Tokenizer:** Custom Unigram SentencePiece (Vocab: 50,368)
* **Training Hardware:** NVIDIA A40 (48GB)
* **Training Duration:** 10 Epochs
 
## Limitations and Bias

While NE-BERT significantly outperforms existing models on these languages, users should be aware of the following:

* **Meitei Anchor Leak:** Qualitative testing revealed a tendency to default to Hindi words when the model is uncertain in Meitei, due to the shared Bengali script and the high frequency of anchor-language data.
* **Domain Specificity:** The model is trained largely on general web text. It may struggle with highly technical or poetic domains in the micro-languages due to limited data.
 
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{ne-bert-2025,
  author       = {MWirelabs},
  title        = {NE-BERT: A Multilingual ModernBERT for Northeast India},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/MWirelabs/ne-bert}}
}
```
 
 