---
language:
- asm-Beng
- mni-Beng
- kha-Latn
- lus-Latn
- grt-Latn
- trp-Latn
- njz-Latn
- pbv-Latn
- eng-Latn
- hin-Deva
tags:
- modernbert
- masked-language-modeling
- northeast-india
- low-resource-nlp
- mwirelabs
- token-efficiency
license: cc-by-4.0
datasets:
- MWirelabs/NE-BERT-Raw-Corpus
pipeline_tag: fill-mask
model-index:
- name: NE-BERT
  results:
  - task:
      type: masked-language-modeling
      name: Masked Language Modeling
    dataset:
      name: NE-BERT Evaluation Corpus
      type: synthetic
    metrics:
    - name: Perplexity
      type: perplexity
      value: 2.9811
widget:
- text: "Nga leit sha <mask>."
  example_title: "Khasi (Location)"
- text: "মই ভাত <mask> ভাল পাওঁ।"
  example_title: "Assamese (Action)"
- text: "Anga <mask> cha·jok."
  example_title: "Garo (Food)"
inference:
  parameters:
    mask_token: "<mask>"
---
 
47
- # NE-BERT: Northeast India's Multilingual ModernBERT 🚀
 
48
 
49
- **NE-BERT** is a state-of-the-art transformer model designed specifically for the complex, low-resource linguistic landscape of Northeast India. It achieves **Regional State-of-the-Art (SOTA)** performance and **$2\text{x}$ to $3\text{x}$ faster inference** compared to general multilingual models.
 
 
 
 
 
 
 
50
 
51
- Built on the **ModernBERT** architecture, it supports a context length of **$1024$ tokens**, utilizes Flash Attention 2 for high-efficiency inference, and treats Northeast languages as first-class citizens.
52
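Since the pipeline tag is `fill-mask`, the model can be queried with the standard 🤗 `transformers` pipeline. A minimal sketch — the `predict` helper and `EXAMPLES` list are illustrative (not part of the released code); the sentences are the widget prompts from the metadata above:

```python
# Minimal fill-mask inference sketch; requires `pip install transformers torch`.
EXAMPLES = [
    "Nga leit sha <mask>.",     # Khasi
    "মই ভাত <mask> ভাল পাওঁ।",   # Assamese
    "Anga <mask> cha·jok.",     # Garo
]

def predict(text, model_id="MWirelabs/ne-bert", top_k=5):
    """Return the top-k fill-mask candidates for `text` (downloads weights on first call)."""
    from transformers import pipeline  # lazy import so this module loads without transformers
    fill = pipeline("fill-mask", model=model_id)
    return fill(text, top_k=top_k)

if __name__ == "__main__":
    for sentence in EXAMPLES:
        print(predict(sentence))
```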
 
---

## 💾 Training Data & Strategy

NE-BERT was trained on a meticulously curated corpus using a **Smart-Weighted Sampling** strategy to ensure the low-resource languages were not drowned out by anchor languages.

<div align="center">
  <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ne_bert_data_dist.png" alt="Data Distribution Pie Chart" width="600"/>
</div>
| Language | HF Tag | Script | Corpus Size | Training Strategy |
| :--- | :--- | :--- | :--- | :--- |
| **Assamese** | `asm-Beng` | Bengali-Assamese | ~1M Sentences | Native |
| **Meitei (Manipuri)** | `mni-Beng` | Bengali-Assamese | ~1.3M Sentences | Native |
| **Khasi** | `kha-Latn` | Roman | ~1M Sentences | Native |
| **Mizo** | `lus-Latn` | Roman | ~1M Sentences | Native |
| **Nyishi** | `njz-Latn` | Roman | ~55k Sentences | **Oversampled** (20x) |
| **Garo** | `grt-Latn` | Roman | ~10k Sentences | **Oversampled** (20x) |
| **Pnar** | `pbv-Latn` | Roman | ~1k Sentences | **Oversampled** (100x) |
| **Kokborok** | `trp-Latn` | Roman | ~2.5k Sentences | **Oversampled** (100x) |
| **Anchor Languages** | `eng-Latn`/`hin-Deva` | Roman/Devanagari | ~660k Sentences | Downsampled |

### Note on Oversampling

To address the extreme data imbalance (e.g., ~1k Pnar sentences vs. ~3M raw Hindi sentences), we applied aggressive upsampling to the micro-languages. To prevent overfitting on these repeated examples, we used **Dynamic Masking** during training: the model sees a different masking pattern for the same sentence in each epoch, forcing it to learn semantic relationships rather than memorize token sequences.
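To make the sampling strategy concrete, here is a hypothetical sketch of how the oversampling multipliers shift the batch distribution. The corpus sizes and multipliers come from the table above (anchor languages are omitted, since the card does not state their downsampling factor); the dict and function names are illustrative, not from the NE-BERT training code:

```python
# language: (raw sentence count, oversampling multiplier) -- values from the table above
CORPUS = {
    "asm": (1_000_000, 1),
    "mni": (1_300_000, 1),
    "kha": (1_000_000, 1),
    "lus": (1_000_000, 1),
    "njz": (55_000, 20),
    "grt": (10_000, 20),
    "pbv": (1_000, 100),
    "trp": (2_500, 100),
}

def sampling_weights(corpus):
    """Effective size after oversampling, normalised to sampling probabilities."""
    effective = {lang: n * mult for lang, (n, mult) in corpus.items()}
    total = sum(effective.values())
    return {lang: size / total for lang, size in effective.items()}

weights = sampling_weights(CORPUS)
# Pnar's share of sampled batches rises from ~0.02% of the raw data to ~1.7%.
```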
---
## 📈 Evaluation and Benchmarks: Regional SOTA

  We evaluated NE-BERT against industry-standard multilingual models (mBERT, XLM-R, IndicBERT) on a final, complex, held-out test set to ensure reproducibility and rigor.
### 1. Effectiveness: Perplexity (PPL)

Perplexity measures the model's fluency and understanding of text (lower is better). NE-BERT leads on most languages, with the largest margins in low-resource settings.

<div align="center">
  <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ppl_benchmark_chart.png" alt="Perplexity Benchmark Chart" width="800"/>
</div>

| Language | **NE-BERT** | mBERT | IndicBERT | Verdict |
| :--- | :--- | :--- | :--- | :--- |
| **Pnar** (`pbv`) | **2.51** | 3.74 | 8.25 | **3x better than IndicBERT** |
| **Khasi** (`kha`) | **2.58** | 2.94 | 6.16 | **Best specialized model** |
| **Kokborok** (`trp`) | **2.67** | 3.79 | 7.91 | **Strong SOTA** |
| Assamese (`asm`) | 4.19 | **2.34** | 7.26 | *Competitive; mBERT leads* |
| Mizo (`lus`) | **3.09** | 3.13 | 6.45 | **Best specialized model** |
| Garo (`grt`) | 3.80 | **3.32** | 8.64 | *Far better than IndicBERT; mBERT leads* |
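For reference, perplexity is simply the exponential of the mean negative log-likelihood per masked token, so the gaps above correspond directly to loss gaps. A minimal helper showing the relationship:

```python
import math

def perplexity(nll_per_token):
    """PPL = exp(mean negative log-likelihood per token); lower is better."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# A mean masked-token loss of ln(2.51) ~= 0.920 corresponds to Pnar's PPL of 2.51.
```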
### 2. Efficiency: Token Fertility (Inference Speed)

  Token Fertility (Tokens per Word) is the key metric for inference speed and memory footprint (lower is better). NE-BERT's custom Unigram tokenizer delivers massive efficiency gains.

<div align="center">
  <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/fertility_benchmark_chart.png" alt="Token Fertility Benchmark Chart" width="600"/>
</div>

*Result: NE-BERT is **2x to 3x more token-efficient** on major languages than mBERT and XLM-R, translating directly to **faster inference** and **lower VRAM consumption** in production.*
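Token fertility is straightforward to measure: total subword tokens divided by total whitespace-separated words over a corpus. A sketch — any callable tokenizer works; the real measurement would use the model's own SentencePiece tokenizer, stubbed here with a toy splitter:

```python
def token_fertility(sentences, tokenize):
    """Average subword tokens per whitespace word (lower = cheaper inference)."""
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

# With a real tokenizer:
#   token_fertility(corpus, AutoTokenizer.from_pretrained("MWirelabs/ne-bert").tokenize)
# Toy check: a pure whitespace "tokenizer" has fertility exactly 1.0.
print(token_fertility(["Nga leit sha Shillong.", "Anga mi cha·jok."], str.split))  # → 1.0
```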

---

## Training Performance

<div align="center">
  <img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ne_bert_loss_chart.png" alt="Training Convergence Chart" width="800"/>
</div>

* **Final Training Loss:** 1.62
* **Final Validation Loss:** 1.64
* **Convergence:** Validation loss (1.64) tracked training loss (1.62) closely throughout, indicating robust generalization despite the small corpora of the rare languages.
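Because oversampled micro-language sentences recur many times across the 10 epochs, mask positions were resampled on every pass (dynamic masking) rather than fixed once. A simplified sketch, assuming the conventional 15% mask rate and omitting the usual 80/10/10 mask/random/keep mix; the function is illustrative, not the NE-BERT training code:

```python
import random

def dynamic_mask(token_ids, mask_id, p=0.15, rng=None):
    """Resample mask positions for one pass; labels are -100 (ignored) except at masked slots."""
    rng = rng or random.Random()
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tid in enumerate(token_ids):
        if rng.random() < p:
            labels[i] = tid    # the model must reconstruct the original token here
            inputs[i] = mask_id
    return inputs, labels

# The same (oversampled) sentence gets a different masking pattern on each epoch:
ids = [11, 12, 13, 14, 15, 16, 17, 18]
epoch1, _ = dynamic_mask(ids, mask_id=4, rng=random.Random(1))
epoch2, _ = dynamic_mask(ids, mask_id=4, rng=random.Random(2))
```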
 
## Technical Specifications

* **Architecture:** ModernBERT-Base (Pre-Norm, Rotary Embeddings)
* **Parameters:** ~149 Million
* **Context Window:** 1024 Tokens
* **Tokenizer:** Custom Unigram SentencePiece (Vocab: 50,368)
* **Training Hardware:** NVIDIA A40 (48GB)
* **Training Duration:** 10 Epochs
 
## Limitations and Bias

While NE-BERT significantly outperforms existing models on these languages, users should be aware of the following:

* **Meitei Anchor Leak:** Qualitative testing revealed a tendency to default to Hindi words when the model is uncertain in Meitei, due to the shared Bengali script and the high frequency of anchor-language data.
* **Domain Specificity:** The model is trained largely on general web text. It may struggle with highly technical or poetic domains in the micro-languages due to limited data.
 
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{ne-bert-2025,
  author       = {MWirelabs},
  title        = {NE-BERT: A Multilingual ModernBERT for Northeast India},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/MWirelabs/ne-bert}}
}
```
 
 