Update README.md
README.md
CHANGED

@@ -15,11 +15,10 @@ tags:
 - masked-language-modeling
 - northeast-india
 - low-resource-nlp
+- northeast bert
 - mwirelabs
 - token-efficiency
 license: cc-by-4.0
-datasets:
-- MWirelabs/NE-BERT-Raw-Corpus
 pipeline_tag: fill-mask
 model-index:
 - name: NE-BERT
@@ -48,7 +47,7 @@ inference:

 # NE-BERT: Northeast India's Multilingual ModernBERT

-**NE-BERT** is a state-of-the-art transformer model designed specifically for the complex, low-resource linguistic landscape of Northeast India. It achieves **Regional State-of-the-Art (SOTA)** performance and **2x to 3x faster inference** compared to general multilingual models.
+**NE-BERT** is a state-of-the-art transformer model designed specifically for the complex, low-resource linguistic landscape of Northeast India. It achieves strong **Regional State-of-the-Art (SOTA)** performance across multiple Northeast Indian languages and **2x to 3x faster inference** compared to general multilingual models.

 Built on the **ModernBERT** architecture, it supports a context length of **1024 tokens**, utilizes Flash Attention 2 for high-efficiency inference, and treats Northeast languages as first-class citizens.

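Since the metadata above advertises the standard `fill-mask` pipeline, here is a minimal sketch of querying the model with `transformers`, reusing the Khasi example from the qualitative table further down. The hub id `MWirelabs/NE-BERT` is an assumption inferred from the organization tag and model name, not stated in this diff; adjust it to the published repository id.

```python
# Minimal fill-mask sketch; "MWirelabs/NE-BERT" is an assumed hub id.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="MWirelabs/NE-BERT")

# Use the tokenizer's own mask token rather than hard-coding "<mask>".
sentence = f"Nga leit sha {fill_mask.tokenizer.mask_token}."  # Khasi: "I go to [home/market]."
for prediction in fill_mask(sentence, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```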
@@ -83,7 +82,17 @@ To address the extreme data imbalance (e.g., 1k Pnar sentences vs 3M Hindi sente

 We evaluated NE-BERT against industry-standard multilingual models (mBERT, XLM-R, IndicBERT) on a final, complex, held-out test set to ensure reproducibility and rigor.

-### 1. Effectiveness: Perplexity (PPL)
+### 1. The "Eye Test": Qualitative Comparison
+
+The superiority of NE-BERT is evident when predicting missing words in low-resource languages. While generic models predict punctuation or sub-word fragments, NE-BERT predicts coherent, culturally relevant words.
+
+| Language | Input Sentence | **NE-BERT (Ours)** | mBERT | IndicBERT | XLM-R |
+| :--- | :--- | :--- | :--- | :--- | :--- |
+| **Assamese** | `মই ভাত <mask> ভাল পাওঁ।` <br>*(I like to [eat] rice)* | **খাই** (Eat) <br> *Correct Verb* | `##ি` <br> *Fragment* | `,` <br> *Punctuation* | `খুব` <br> *Adverb* |
+| **Khasi** | `Nga leit sha <mask>.` <br>*(I go to [home/market])* | **iing** (Home) <br> *Correct Noun* | `.` <br> *Period* | `s` <br> *Character* | `Allah` <br> *Hallucination* |
+| **Garo** | `Anga <mask> cha·jok.` <br>*(I [ate] ...)* | **nokni** (Of house) <br> *Real Word* | `-` <br> *Symbol* | `.` <br> *Period* | `,` <br> *Comma* |
+
+### 2. Effectiveness: Perplexity (PPL)

 Perplexity measures the model's fluency and understanding of text (lower is better). This comparison proves NE-BERT's superior language modeling across the board, particularly in low-resource settings.

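The diff does not show how the perplexity numbers were computed; for encoder-only masked language models one common recipe is pseudo-perplexity (mask each token in turn and exponentiate the mean negative log-likelihood), sketched below under that assumption. The hub id `MWirelabs/NE-BERT` is again an assumption.

```python
# Illustrative pseudo-perplexity for a masked LM (lower is better).
# The exact evaluation protocol behind the PPL table is not given here,
# and "MWirelabs/NE-BERT" is an assumed hub id.
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "MWirelabs/NE-BERT"  # assumed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

def pseudo_perplexity(sentence: str) -> float:
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    for i in range(1, input_ids.size(0) - 1):  # skip the special tokens at each end
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        nlls.append(-torch.log_softmax(logits, dim=-1)[input_ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

print(round(pseudo_perplexity("Nga leit sha iing."), 2))
```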
@@ -100,7 +109,7 @@ Perplexity measures the model's fluency and understanding of text (lower is bett
 | **Mizo** (`lus`) | **3.09** | 3.13 | 6.45 | **Best Specialized Model** |
 | **Garo** (`grt`) | **3.80** | 3.32 | 8.64 | **Crushes IndicBERT** |

-### 2. Efficiency: Token Fertility (Inference Speed)
+### 3. Efficiency: Token Fertility (Inference Speed)

 Token Fertility (Tokens per Word) is the key metric for inference speed and memory footprint (lower is better). NE-BERT's custom Unigram tokenizer delivers massive efficiency gains.

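Token fertility as defined here (tokens per whitespace-separated word, lower is better) can be checked for any tokenizer. A rough sketch comparing against the mBERT and XLM-R baselines named in the card follows; the NE-BERT hub id and the sample sentences are assumptions for illustration.

```python
# Tokens-per-word ("fertility") sketch; lower is better. The baseline ids are the
# standard hub names for mBERT and XLM-R; "MWirelabs/NE-BERT" and the sample
# sentences are assumptions for illustration only.
from transformers import AutoTokenizer

def fertility(tokenizer, sentences):
    tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    words = sum(len(s.split()) for s in sentences)
    return tokens / words

samples = ["Nga leit sha iing.", "মই ভাত খাই ভাল পাওঁ।"]  # Khasi, Assamese
for name in ["MWirelabs/NE-BERT", "bert-base-multilingual-cased", "xlm-roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {fertility(tok, samples):.2f} tokens/word")
```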
@@ -133,7 +142,7 @@ Token Fertility (Tokens per Word) is the key metric for inference speed and memo

 ## Limitations and Bias
 While NE-BERT significantly outperforms existing models on these languages, users should be aware:
-* **Meitei
+* **Meitei/Hindi Leakage:** Due to the shared script and the high volume of Hindi anchor data, the model may sometimes predict Hindi/Sanskrit words (e.g., "Narayan") in Meitei contexts if the sentence structure is ambiguous.
 * **Domain Specificity:** The model is trained largely on general web text. It may struggle with highly technical or poetic domains in micro-languages due to limited data size.

 ## Citation