Badnyal committed on
Commit 0d656e7 · verified · 1 Parent(s): 19369ae

Update README.md

Files changed (1)
  1. README.md +15 -6
README.md CHANGED
@@ -15,11 +15,10 @@ tags:
  - masked-language-modeling
  - northeast-india
  - low-resource-nlp
+ - northeast bert
  - mwirelabs
  - token-efficiency
  license: cc-by-4.0
- datasets:
- - MWirelabs/NE-BERT-Raw-Corpus
  pipeline_tag: fill-mask
  model-index:
  - name: NE-BERT
@@ -48,7 +47,7 @@ inference:

  # NE-BERT: Northeast India's Multilingual ModernBERT

- **NE-BERT** is a state-of-the-art transformer model designed specifically for the complex, low-resource linguistic landscape of Northeast India. It achieves **Regional State-of-the-Art (SOTA)** performance and **2x to 3x faster inference** compared to general multilingual models.
+ **NE-BERT** is a state-of-the-art transformer model designed specifically for the complex, low-resource linguistic landscape of Northeast India. It achieves strong **Regional State-of-the-Art (SOTA)** performance across multiple Northeast Indian languages and **2x to 3x faster inference** compared to general multilingual models.

  Built on the **ModernBERT** architecture, it supports a context length of **1024 tokens**, utilizes Flash Attention 2 for high-efficiency inference, and treats Northeast languages as first-class citizens.

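For readers who want to try the fill-mask behaviour the card describes, here is a minimal sketch. The repository id `MWirelabs/NE-BERT` is an assumption inferred from the card metadata (the MWirelabs org and the `NE-BERT` model-index name); substitute the actual Hub id if it differs.

```python
# Minimal fill-mask sketch. The repo id is assumed from the card metadata
# (MWirelabs org + "NE-BERT" model-index name) and may differ from the real Hub id.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="MWirelabs/NE-BERT")  # assumed repo id

# Khasi example from the qualitative comparison shown later in this diff: "Nga leit sha <mask>."
text = f"Nga leit sha {fill_mask.tokenizer.mask_token}."
for pred in fill_mask(text, top_k=5):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```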
@@ -83,7 +82,17 @@ To address the extreme data imbalance (e.g., 1k Pnar sentences vs 3M Hindi sente

  We evaluated NE-BERT against industry-standard multilingual models (mBERT, XLM-R, IndicBERT) on a final, complex, held-out test set to ensure reproducibility and rigor.

- ### 1. Effectiveness: Perplexity (PPL)
+ ### 1. The "Eye Test": Qualitative Comparison
+
+ The superiority of NE-BERT is evident when predicting missing words in low-resource languages. While generic models predict punctuation or sub-word fragments, NE-BERT predicts coherent, culturally relevant words.
+
+ | Language | Input Sentence | **NE-BERT (Ours)** | mBERT | IndicBERT | XLM-R |
+ | :--- | :--- | :--- | :--- | :--- | :--- |
+ | **Assamese** | `মই ভাত <mask> ভাল পাওঁ।` <br>*(I like to [eat] rice)* | **খাই** (Eat) <br> *Correct Verb* | `##ি` <br> *Fragment* | `,` <br> *Punctuation* | `খুব` <br> *Adverb* |
+ | **Khasi** | `Nga leit sha <mask>.` <br>*(I go to [home/market])* | **iing** (Home) <br> *Correct Noun* | `.` <br> *Period* | `s` <br> *Character* | `Allah` <br> *Hallucination* |
+ | **Garo** | `Anga <mask> cha·jok.` <br>*(I [ate] ...)* | **nokni** (Of house) <br> *Real Word* | `-` <br> *Symbol* | `.` <br> *Period* | `,` <br> *Comma* |
+
+ ### 2. Effectiveness: Perplexity (PPL)

  Perplexity measures the model's fluency and understanding of text (lower is better). This comparison proves NE-BERT's superior language modeling across the board, particularly in low-resource settings.

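Because NE-BERT is a masked LM, perplexity is commonly estimated as pseudo-perplexity: mask each token in turn and average the negative log-likelihood of the original token. The card does not state the exact protocol behind the reported numbers, so the sketch below is only illustrative, and the repo id is again an assumption.

```python
# Illustrative pseudo-perplexity for a masked LM: mask one position at a time,
# score the original token, and exponentiate the mean negative log-likelihood.
# This is not necessarily the protocol behind the table's numbers; repo id assumed.
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "MWirelabs/NE-BERT"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

def pseudo_perplexity(sentence: str) -> float:
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):  # skip the special tokens at both ends
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            nlls.append(-torch.log_softmax(logits, dim=-1)[input_ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

print(pseudo_perplexity("Nga leit sha iing."))  # Khasi sample sentence
```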
@@ -100,7 +109,7 @@ Perplexity measures the model's fluency and understanding of text (lower is bett
  | **Mizo** (`lus`) | **3.09** | 3.13 | 6.45 | **Best Specialized Model** |
  | **Garo** (`grt`) | **3.80** | 3.32 | 8.64 | **Crushes IndicBERT** |

- ### 2. Efficiency: Token Fertility (Inference Speed)
+ ### 3. Efficiency: Token Fertility (Inference Speed)

  Token Fertility (Tokens per Word) is the key metric for inference speed and memory footprint (lower is better). NE-BERT's custom Unigram tokenizer delivers massive efficiency gains.

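Token fertility as used here is simply subword tokens divided by whitespace-separated words, so it can be spot-checked on your own text. In the sketch below the NE-BERT repo id is an assumption, `bert-base-multilingual-cased` stands in for mBERT, and whitespace splitting is a rough simplification for scripts without clear word boundaries.

```python
# Spot-check token fertility (subword tokens per whitespace-separated word; lower is better).
# NE-BERT repo id is assumed; whitespace splitting is only a rough proxy for word counts.
from transformers import AutoTokenizer

def fertility(tokenizer, sentences):
    n_words = sum(len(s.split()) for s in sentences)
    n_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    return n_tokens / n_words

samples = ["Nga leit sha iing.", "মই ভাত খাই ভাল পাওঁ।"]  # Khasi and Assamese examples

for repo_id in ["MWirelabs/NE-BERT", "bert-base-multilingual-cased"]:  # ours (assumed) vs mBERT
    tok = AutoTokenizer.from_pretrained(repo_id)
    print(f"{repo_id}: {fertility(tok, samples):.2f} tokens/word")
```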
@@ -133,7 +142,7 @@ Token Fertility (Tokens per Word) is the key metric for inference speed and memo

  ## Limitations and Bias
  While NE-BERT significantly outperforms existing models on these languages, users should be aware:
- * **Meitei Anchor Leak:** Qualitative testing revealed a tendency to default to Hindi words when confused in Meitei, due to the shared Bengali script and high-frequency anchor data.
+ * **Meitei/Hindi Leakage:** Due to the shared script and the high volume of Hindi anchor data, the model may sometimes predict Hindi/Sanskrit words (e.g., "Narayan") in Meitei contexts if the sentence structure is ambiguous.
  * **Domain Specificity:** The model is trained largely on general web text. It may struggle with highly technical or poetic domains in micro-languages due to limited data size.

  ## Citation
 