Update README.md
README.md
CHANGED

@@ -15,11 +15,10 @@ tags:
 - masked-language-modeling
 - northeast-india
 - low-resource-nlp
+- northeast bert
 - mwirelabs
 - token-efficiency
 license: cc-by-4.0
-datasets:
-- MWirelabs/NE-BERT-Raw-Corpus
 pipeline_tag: fill-mask
 model-index:
 - name: NE-BERT
@@ -48,7 +47,7 @@ inference:

 # NE-BERT: Northeast India's Multilingual ModernBERT

-**NE-BERT** is a state-of-the-art transformer model designed specifically for the complex, low-resource linguistic landscape of Northeast India. It achieves **Regional State-of-the-Art (SOTA)** performance and **2x to 3x faster inference** compared to general multilingual models.
+**NE-BERT** is a state-of-the-art transformer model designed specifically for the complex, low-resource linguistic landscape of Northeast India. It achieves strong **Regional State-of-the-Art (SOTA)** performance across multiple Northeast Indian languages and **2x to 3x faster inference** compared to general multilingual models.

 Built on the **ModernBERT** architecture, it supports a context length of **1024 tokens**, utilizes Flash Attention 2 for high-efficiency inference, and treats Northeast languages as first-class citizens.

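Since the metadata above advertises the standard `fill-mask` pipeline, here is a minimal sketch of querying the model with `transformers`, reusing the Khasi example from the qualitative table further down. The hub id `MWirelabs/NE-BERT` is an assumption inferred from the organization tag and model name, not stated in this diff; adjust it to the published repository id.

```python
# Minimal fill-mask sketch; "MWirelabs/NE-BERT" is an assumed hub id.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="MWirelabs/NE-BERT")

# Use the tokenizer's own mask token rather than hard-coding "<mask>".
sentence = f"Nga leit sha {fill_mask.tokenizer.mask_token}."  # Khasi: "I go to [home/market]."
for prediction in fill_mask(sentence, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```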
@@ -83,7 +82,17 @@ To address the extreme data imbalance (e.g., 1k Pnar sentences vs 3M Hindi sente

 We evaluated NE-BERT against industry-standard multilingual models (mBERT, XLM-R, IndicBERT) on a final, complex, held-out test set to ensure reproducibility and rigor.

-### 1. Effectiveness: Perplexity (PPL)
+### 1. The "Eye Test": Qualitative Comparison
+
+The superiority of NE-BERT is evident when predicting missing words in low-resource languages. While generic models predict punctuation or sub-word fragments, NE-BERT predicts coherent, culturally relevant words.
+
+| Language | Input Sentence | **NE-BERT (Ours)** | mBERT | IndicBERT | XLM-R |
+| :--- | :--- | :--- | :--- | :--- | :--- |
+| **Assamese** | `মই ভাত <mask> ভাল পাওঁ।` <br>*(I like to [eat] rice)* | **খাই** (Eat) <br> *Correct Verb* | `##ি` <br> *Fragment* | `,` <br> *Punctuation* | `খুব` <br> *Adverb* |
+| **Khasi** | `Nga leit sha <mask>.` <br>*(I go to [home/market])* | **iing** (Home) <br> *Correct Noun* | `.` <br> *Period* | `s` <br> *Character* | `Allah` <br> *Hallucination* |
+| **Garo** | `Anga <mask> cha·jok.` <br>*(I [ate] ...)* | **nokni** (Of house) <br> *Real Word* | `-` <br> *Symbol* | `.` <br> *Period* | `,` <br> *Comma* |
+
+### 2. Effectiveness: Perplexity (PPL)

 Perplexity measures the model's fluency and understanding of text (lower is better). This comparison proves NE-BERT's superior language modeling across the board, particularly in low-resource settings.

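The diff does not show how the perplexity numbers were computed; for encoder-only masked language models one common recipe is pseudo-perplexity (mask each token in turn and exponentiate the mean negative log-likelihood), sketched below under that assumption. The hub id `MWirelabs/NE-BERT` is again an assumption.

```python
# Illustrative pseudo-perplexity for a masked LM (lower is better).
# The exact evaluation protocol behind the PPL table is not given here,
# and "MWirelabs/NE-BERT" is an assumed hub id.
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "MWirelabs/NE-BERT"  # assumed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

def pseudo_perplexity(sentence: str) -> float:
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    for i in range(1, input_ids.size(0) - 1):  # skip the special tokens at each end
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        nlls.append(-torch.log_softmax(logits, dim=-1)[input_ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

print(round(pseudo_perplexity("Nga leit sha iing."), 2))
```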
@@ -100,7 +109,7 @@ Perplexity measures the model's fluency and understanding of text (lower is bett
 | **Mizo** (`lus`) | **3.09** | 3.13 | 6.45 | **Best Specialized Model** |
 | **Garo** (`grt`) | **3.80** | 3.32 | 8.64 | **Crushes IndicBERT** |

-### 2. Efficiency: Token Fertility (Inference Speed)
+### 3. Efficiency: Token Fertility (Inference Speed)

 Token Fertility (Tokens per Word) is the key metric for inference speed and memory footprint (lower is better). NE-BERT's custom Unigram tokenizer delivers massive efficiency gains.

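Token fertility as defined here (tokens per whitespace-separated word, lower is better) can be checked for any tokenizer. A rough sketch comparing against the mBERT and XLM-R baselines named in the card follows; the NE-BERT hub id and the sample sentences are assumptions for illustration.

```python
# Tokens-per-word ("fertility") sketch; lower is better. The baseline ids are the
# standard hub names for mBERT and XLM-R; "MWirelabs/NE-BERT" and the sample
# sentences are assumptions for illustration only.
from transformers import AutoTokenizer

def fertility(tokenizer, sentences):
    tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    words = sum(len(s.split()) for s in sentences)
    return tokens / words

samples = ["Nga leit sha iing.", "মই ভাত খাই ভাল পাওঁ।"]  # Khasi, Assamese
for name in ["MWirelabs/NE-BERT", "bert-base-multilingual-cased", "xlm-roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {fertility(tok, samples):.2f} tokens/word")
```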
@@ -133,7 +142,7 @@ Token Fertility (Tokens per Word) is the key metric for inference speed and memo

 ## Limitations and Bias
 While NE-BERT significantly outperforms existing models on these languages, users should be aware:
-* **Meitei
+* **Meitei/Hindi Leakage:** Due to the shared script and the high volume of Hindi anchor data, the model may sometimes predict Hindi/Sanskrit words (e.g., "Narayan") in Meitei contexts if the sentence structure is ambiguous.
 * **Domain Specificity:** The model is trained largely on general web text. It may struggle with highly technical or poetic domains in micro-languages due to limited data size.

 ## Citation