---
language:
- asm
- mni
- kha
- lus
- grt
- trp
- njz
- pbv
- nag
- eng
- hin
tags:
- modernbert
- masked-language-modeling
- northeast-india
- low-resource-nlp
- northeast bert
- mwirelabs
- token-efficiency
license: cc-by-4.0
pipeline_tag: fill-mask
model-index:
- name: NE-BERT
  results:
  - task:
      type: masked-language-modeling
      name: Masked Language Modeling
    dataset:
      name: NE-BERT Evaluation Corpus
      type: synthetic
    metrics:
    - name: Perplexity
      type: perplexity
      value: 2.9811
widget:
- text: "Nga leit sha <mask>."
  example_title: "Khasi (Location)"
- text: "মই ভাত <mask> ভাল পাওঁ।"
  example_title: "Assamese (Action)"
- text: "Anga <mask> cha·jok."
  example_title: "Garo (Food)"
inference:
  parameters:
    mask_token: "<mask>"
---
<p align="center">
<!-- Model -->
<img alt="Model" src="https://img.shields.io/badge/Model-ModernBERT-0A84FF">
<!-- Task -->
<img alt="Task" src="https://img.shields.io/badge/Task-Masked%20Language%20Modeling-34C759">
<!-- Languages -->
<img alt="Languages" src="https://img.shields.io/badge/Supported%20Languages-9%20(+%20EN%2FHI)-AF52DE">
<!-- Region -->
<img alt="Region" src="https://img.shields.io/badge/Region-Northeast%20India-FF9F0A">
<!-- License -->
<img alt="License" src="https://img.shields.io/badge/License-CC--BY--4.0-FC5C65">
</p>
# NE-BERT: Northeast India's Multilingual ModernBERT
**NE-BERT** is a transformer model designed specifically for the complex, low-resource linguistic landscape of Northeast India. It achieves **regional state-of-the-art (SOTA)** masked-language-modeling performance across multiple Northeast Indian languages and is **2x to 3x more token-efficient** than general multilingual models, which translates directly into faster inference.
Built on the **ModernBERT** architecture, it supports a context length of **1024 tokens**, utilizes Flash Attention 2 for high-efficiency inference, and treats Northeast languages as first-class citizens.
---
## Quick Start
NE-BERT is built on the **ModernBERT** architecture. You must use `transformers>=4.48.0`.
```python
# First, install the library:
# pip install -U transformers
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load NE-BERT (no remote code needed for transformers >= 4.48)
model_name = "MWirelabs/ne-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Example: Nagamese Creole (ISO: nag)
text = "Moi bhat <mask>."  # "I [eat] rice"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Retrieve the top prediction at the masked position
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_token_id = logits[0, mask_token_index].argmax(dim=-1)
print(tokenizer.decode(predicted_token_id))
# Output: "khai" (eat)
```
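For quick experiments, the same checkpoint can also be queried through the `fill-mask` pipeline. A minimal sketch; the Khasi sentence is taken from the widget examples above, and `top_k=5` is an arbitrary choice:

```python
from transformers import pipeline

# Build a fill-mask pipeline on top of NE-BERT.
fill_mask = pipeline("fill-mask", model="MWirelabs/ne-bert")

# Khasi example from the widget: "Nga leit sha <mask>." ("I go to [home/market]")
for prediction in fill_mask("Nga leit sha <mask>.", top_k=5):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```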
---
## Training Data & Strategy
NE-BERT was trained on a meticulously curated corpus using a **Smart-Weighted Sampling** strategy to ensure the low-resource languages were not drowned out by anchor languages.
<div align="center">
<img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ne_bert_data_dist.png" alt="Data Distribution Pie Chart" width="600"/>
</div>
| Language | HF Tag | Script | Corpus Size | Training Strategy |
| :--- | :--- | :--- | :--- | :--- |
| **Assamese** | `asm-Beng` | Bengali-Assamese | ~1M Sentences | Native |
| **Meitei (Manipuri)** | `mni-Beng` | Bengali-Assamese | ~1.3M Sentences | Native |
| **Khasi** | `kha-Latn` | Roman | ~1M Sentences | Native |
| **Mizo** | `lus-Latn` | Roman | ~1M Sentences | Native |
| **Nyishi** | `njz-Latn` | Roman | ~55k Sentences | **Oversampled** (20x) |
| **Nagamese** | `nag-Latn` | Roman | ~13k Sentences | **Oversampled** (20x) |
| **Garo** | `grt-Latn` | Roman | ~10k Sentences | **Oversampled** (20x) |
| **Pnar** | `pbv-Latn` | Roman | ~1k Sentences | **Oversampled** (100x) |
| **Kokborok** | `trp-Latn` | Roman | ~2.5k Sentences | **Oversampled** (100x) |
| **Anchor Languages** | `eng-Latn`/`hin-Deva` | Roman/Devanagari | ~660k Sentences | Downsampled |
### Note on Oversampling
To address the extreme data imbalance (e.g., 1k Pnar sentences vs 3M Hindi sentences), we applied aggressive upsampling to micro-languages. To prevent overfitting on these repeated examples, we utilized **Dynamic Masking** during training. This ensures that the model sees different masking patterns for the same sentence across epochs, forcing it to learn semantic relationships rather than memorizing token sequences.
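The exact data pipeline is not published with this card, but the combination of oversampling and dynamic masking described above can be reproduced with standard `datasets` and `transformers` tooling. A minimal sketch, where the toy corpora, repetition factors, and `mlm_probability=0.15` are illustrative assumptions rather than NE-BERT's actual training configuration:

```python
from datasets import Dataset, concatenate_datasets
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Toy per-language shards; in practice these are the corpora listed in the table above.
corpora = {
    "kha": Dataset.from_dict({"text": ["khasi sentence 1", "khasi sentence 2"]}),
    "pbv": Dataset.from_dict({"text": ["pnar sentence 1"]}),
}
# Illustrative oversampling factors (1x for large corpora, 100x for Pnar, per the table).
repeat_factor = {"kha": 1, "pbv": 100}

# Oversample: repeat the small shards before shuffling the mixture.
shards = []
for lang, shard in corpora.items():
    shards.extend([shard] * repeat_factor[lang])
mixed = concatenate_datasets(shards).shuffle(seed=42)

# Dynamic masking: the collator re-draws masked positions every time a batch is built,
# so a repeated sentence is seen with a different masking pattern in each epoch.
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/ne-bert")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
```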
---
## Evaluation and Benchmarks: Regional SOTA
We evaluated NE-BERT against industry-standard multilingual models (mBERT and IndicBERT) on a held-out test set to ensure reproducibility and rigor.
### 1. The "Eye Test": Qualitative Comparison
The superiority of NE-BERT is evident when predicting missing words in low-resource languages. While generic models predict punctuation or sub-word fragments, NE-BERT predicts coherent, culturally relevant words.
| Language | Input Sentence | **NE-BERT (Ours)** | mBERT | IndicBERT |
| :--- | :--- | :--- | :--- | :--- |
| **Assamese** | `মই ভাত <mask> ভাল পাওঁ।` <br>*(I like to [eat] rice)* | **খাই** (Eat) <br> *Correct Verb* | `##ি` <br> *Fragment* | `,` <br> *Punctuation* |
| **Khasi** | `Nga leit sha <mask>.` <br>*(I go to [home/market])* | **iing** (Home) <br> *Correct Noun* | `.` <br> *Period* | `s` <br> *Character* |
| **Garo** | `Anga <mask> cha·jok.` <br>*(I [ate] ...)* | **nokni** (Of house) <br> *Real Word* | `-` <br> *Symbol* | `.` <br> *Period* |
### 2. Effectiveness: Perplexity (PPL)
Perplexity measures the model's fluency and understanding of text (lower is better). NE-BERT leads in most of the evaluated languages, with the largest margins in the lowest-resource settings; mBERT remains ahead on Assamese and Garo.
<div align="center">
<img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ppl_benchmark_chart.png" alt="Perplexity Benchmark Chart" width="800"/>
</div>
| Language | **NE-BERT** | mBERT | IndicBERT | Verdict |
| :--- | :--- | :--- | :--- | :--- |
| **Pnar** (`pbv`) | **2.51** | 3.74 | 8.25 | **3x Better than IndicBERT** |
| **Khasi** (`kha`) | **2.58** | 2.94 | 6.16 | **Best Specialized Model** |
| **Kokborok** (`trp`) | **2.67** | 3.79 | 7.91 | **Strong SOTA** |
| **Assamese** (`asm`) | 4.19 | **2.34** | 7.26 | *Competitive* |
| **Mizo** (`lus`) | **3.09** | 3.13 | 6.45 | **Best Specialized Model** |
| **Garo** (`grt`) | 3.80 | **3.32** | 8.64 | *Competitive; far ahead of IndicBERT* |
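For a masked language model, perplexity is typically reported as pseudo-perplexity: mask each token in turn, score it with the model, and exponentiate the mean negative log-likelihood. The benchmark script itself is not included here; the sketch below shows that standard procedure as an assumption about the setup, not a copy of it:

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/ne-bert")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/ne-bert").eval()

def pseudo_perplexity(sentence: str) -> float:
    """Mask each token in turn and average the negative log-likelihood."""
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc.input_ids[0]
    nlls = []
    for i in range(1, input_ids.size(0) - 1):  # skip the special tokens at both ends
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        nlls.append(-log_probs[input_ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

print(pseudo_perplexity("Nga leit sha iing."))
```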
### 3. Efficiency: Token Fertility (Inference Speed)
Token fertility (tokens per word) is a key driver of inference speed and memory footprint (lower is better). NE-BERT's custom Unigram tokenizer delivers substantial efficiency gains.
<div align="center">
<img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/fertility_benchmark_chart.png" alt="Token Fertility Benchmark Chart" width="600"/>
</div>
*Result: NE-BERT is **2x to 3x more token-efficient** on major languages than mBERT and IndicBERT, translating directly to **faster inference** and **lower VRAM consumption** in production.*
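Token fertility can be measured directly from the tokenizers. A minimal sketch of such a comparison; the example sentence and the baseline checkpoint names (`bert-base-multilingual-cased`, `ai4bharat/indic-bert`) are assumptions about which public models correspond to the mBERT and IndicBERT baselines:

```python
from transformers import AutoTokenizer

def fertility(tokenizer_name: str, sentence: str) -> float:
    """Subword tokens per whitespace-separated word (lower means cheaper inference)."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    n_tokens = len(tok.tokenize(sentence))
    n_words = len(sentence.split())
    return n_tokens / n_words

sentence = "Nga leit sha iing."  # Khasi example from the widget
for name in ["MWirelabs/ne-bert", "bert-base-multilingual-cased", "ai4bharat/indic-bert"]:
    print(f"{name:35s} fertility = {fertility(name, sentence):.2f}")
```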
---
## Training Performance
<div align="center">
<img src="https://huggingface.co/MWirelabs/ne-bert/resolve/main/ne_bert_loss_chart.png" alt="Training Convergence Chart" width="800"/>
</div>
* **Final Training Loss:** 1.62
* **Final Validation Loss:** 1.64
* **Convergence:** Validation loss tracked training loss closely throughout, indicating good generalization despite the small corpus sizes of the rarest languages.
## Technical Specifications
* **Architecture:** ModernBERT-Base (Pre-Norm, Rotary Embeddings)
* **Parameters:** ~149 Million
* **Context Window:** **1024 Tokens**
* **Tokenizer:** Custom Unigram SentencePiece (Vocab: 50,368)
* **Training Hardware:** NVIDIA A40 (48GB)
* **Training Duration:** 10 Epochs
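The figures above can be cross-checked against the published config and tokenizer; a small sketch, assuming the standard ModernBERT config attributes exposed by `transformers`:

```python
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("MWirelabs/ne-bert")
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/ne-bert")

print(config.model_type)               # expected: "modernbert"
print(config.max_position_embeddings)  # context window (1024 per this card)
print(len(tokenizer))                  # vocabulary size (50,368 per this card)
```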
## Limitations and Bias
While NE-BERT significantly outperforms existing models on these languages, users should be aware:
* **Meitei/Hindi Leakage:** Due to the shared script and the high volume of Hindi anchor data, the model may sometimes predict Hindi/Sanskrit words (e.g., "Narayan") in Meitei contexts if the sentence structure is ambiguous.
* **Domain Specificity:** The model is trained largely on general web text. It may struggle with highly technical or poetic domains in micro-languages due to limited data size.
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{ne-bert-2025,
  author       = {MWirelabs},
  title        = {NE-BERT: A Multilingual ModernBERT for Northeast India},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/MWirelabs/ne-bert}}
}
```