---
language:
- nag
license: cc-by-4.0
tags:
- bert
- roberta
- nagamese
- low-resource
- creole
- northeast-india
- token-classification
- fill-mask
datasets:
- agnivamaiti/naganlp-ner-annotated-corpus
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: NagameseBERT
  results:
  - task:
      type: token-classification
      name: Part-of-Speech Tagging
    dataset:
      name: NagaNLP Annotated Corpus
      type: agnivamaiti/naganlp-ner-annotated-corpus
    metrics:
    - type: accuracy
      value: 88.35
      name: Accuracy
    - type: f1
      value: 80.72
      name: F1 (macro)
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: NagaNLP Annotated Corpus
      type: agnivamaiti/naganlp-ner-annotated-corpus
    metrics:
    - type: accuracy
      value: 91.74
      name: Accuracy
    - type: f1
      value: 56.51
      name: F1 (macro)
---

# NagameseBERT

[![HuggingFace Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-yellow)](https://huggingface.co/MWirelabs/nagamesebert)
[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-blue.svg)](https://creativecommons.org/licenses/by/4.0/)
[![Language](https://img.shields.io/badge/Language-Nagamese-green)](https://en.wikipedia.org/wiki/Nagamese_Creole)

**A foundational BERT model for Nagamese Creole** - a compact, efficient language model for a low-resource Northeast Indian language.

---

## Overview

NagameseBERT is a 7M parameter RoBERTa-style BERT model pre-trained on 42,552 Nagamese sentences. Despite being 15× smaller than multilingual models like mBERT (110M) and XLM-RoBERTa (125M), it achieves competitive performance on downstream NLP tasks while offering significant efficiency advantages.

**Key Features:**
- **Compact**: 6.9M parameters (15× smaller than mBERT)
- **Efficient**: Pre-trained in 35 minutes on a single A40 GPU
- **Custom tokenizer**: 8K BPE vocabulary optimized for Nagamese
- **Rigorous evaluation**: Multi-seed testing (n=3) with reproducible results
- **Open**: Model, code, and data splits publicly available

---

## Performance

Multi-seed evaluation results (mean ± std, n=3):

| Model | Parameters | POS Accuracy | POS F1 | NER Accuracy | NER F1 |
|-------|-----------|--------------|--------|--------------|--------|
| **NagameseBERT** | **7M** | **88.35 ± 0.71%** | **0.807 ± 0.013** | **91.74 ± 0.68%** | **0.565 ± 0.054** |
| mBERT | 110M | 95.14 ± 0.47% | 0.916 ± 0.008 | 96.11 ± 0.72% | 0.750 ± 0.064 |
| XLM-RoBERTa | 125M | 95.64 ± 0.56% | 0.919 ± 0.008 | 96.38 ± 0.26% | 0.819 ± 0.066 |

**Trade-off**: 4-7 percentage points lower accuracy than the multilingual baselines, for a 15× parameter reduction that enables resource-constrained deployment.

---

## Model Details

### Architecture
- **Type**: RoBERTa-style BERT (no token type embeddings)
- **Hidden size**: 256
- **Layers**: 6 transformer blocks
- **Attention heads**: 4 per layer
- **Intermediate size**: 1,024
- **Max sequence length**: 64 tokens
- **Total parameters**: 6,878,528
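
For reference, a model of this shape can be instantiated from the hyperparameters above. This is an illustrative sketch, not the exact released configuration; in particular, the `max_position_embeddings` offset follows RoBERTa's usual convention and is an assumption:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Hyperparameters from the Architecture section above.
# max_position_embeddings = max length + 2 is RoBERTa's usual
# position-id offset (an assumption, not confirmed by this card).
config = RobertaConfig(
    vocab_size=8000,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,
    max_position_embeddings=64 + 2,
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")  # roughly 7M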

### Tokenizer
- **Type**: Byte-Pair Encoding (BPE)
- **Vocabulary size**: 8,000 tokens
- **Special tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Normalization**: NFD Unicode + accent stripping
- **Case**: Preserved (for proper nouns and code-switched English)
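
A tokenizer with these properties could be trained with the `tokenizers` library along these lines (a sketch only; `corpus.txt` is a placeholder path, and the exact training script is not part of this card):

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# BPE with NFD normalization and accent stripping; no Lowercase
# normalizer, so case is preserved as described above.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.StripAccents()]
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=8000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder path
```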

### Training Data
- **Corpus size**: 42,552 Nagamese sentences
- **Average length**: 11.82 tokens/sentence
- **Split**: 90% train (38,296) / 10% validation (4,256)
- **Sources**: Web, social media, community contributions (deduplicated)

### Pre-training
- **Objective**: Masked Language Modeling (15% masking)
- **Optimizer**: AdamW (lr=5e-4, weight_decay=0.01)
- **Batch size**: 64
- **Epochs**: 50
- **Training time**: ~35 minutes
- **Hardware**: NVIDIA A40 (48GB)
- **Final validation loss**: 2.79
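
The MLM objective with these hyperparameters maps onto the standard `transformers` training loop roughly as follows. This is a sketch for continued pre-training; whether the original run used dynamic masking is an assumption, and `train_dataset` stands in for the tokenized corpus:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/nagamesebert")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/nagamesebert")

# Masks 15% of tokens per batch, as documented above
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="./mlm-pretrain",
    learning_rate=5e-4,
    weight_decay=0.01,
    per_device_train_batch_size=64,
    num_train_epochs=50,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,  # tokenized Nagamese sentences (assumed)
)
trainer.train()
```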

---

## Usage

### Load Model and Tokenizer
```python
from transformers import AutoTokenizer, AutoModel

model_name = "MWirelabs/nagamesebert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Example usage
text = "Toi moi laga sathi hobo pare?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
```
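
### Fill-Mask Inference
Because the model was pre-trained with masked language modeling, it can also be queried through the `fill-mask` pipeline. The masked sentence below is an illustrative variant of the example above:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="MWirelabs/nagamesebert")

# [MASK] is the tokenizer's mask token (see Tokenizer section)
for pred in fill_mask("Toi moi laga [MASK] hobo pare?"):
    print(pred["token_str"], round(pred["score"], 3))
```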

### Fine-tuning for Token Classification
```python
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

num_labels = 13  # e.g., the 13 Universal Dependencies POS tags (see Evaluation)

# Load model with a fresh token-classification head
model = AutoModelForTokenClassification.from_pretrained(
    "MWirelabs/nagamesebert",
    num_labels=num_labels
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=100,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01
)

# Train (train_dataset and eval_dataset are assumed to be tokenized
# datasets with aligned labels; see the alignment sketch below)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
```
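
Because the BPE tokenizer splits words into sub-word pieces, per-word POS/NER labels must be aligned to sub-tokens before training. A common alignment convention is sketched below (an assumption, not necessarily the exact preprocessing behind the reported results): label the first piece of each word and ignore the rest with `-100`. The `tokens` and `labels` column names are hypothetical.

```python
def tokenize_and_align(examples, tokenizer):
    """Align word-level labels to sub-word tokens (common HF recipe)."""
    encoded = tokenizer(
        examples["tokens"], is_split_into_words=True,
        truncation=True, max_length=64,
    )
    all_labels = []
    for i, word_labels in enumerate(examples["labels"]):
        previous = None
        label_ids = []
        for word_id in encoded.word_ids(batch_index=i):
            if word_id is None:
                label_ids.append(-100)                  # special tokens
            elif word_id != previous:
                label_ids.append(word_labels[word_id])  # first sub-token
            else:
                label_ids.append(-100)                  # continuation pieces
            previous = word_id
        all_labels.append(label_ids)
    encoded["labels"] = all_labels
    return encoded
```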

---

## Evaluation

### Dataset
- **Source**: [NagaNLP Annotated Corpus](https://huggingface.co/datasets/agnivamaiti/naganlp-ner-annotated-corpus)
- **Total**: 214 sentences
- **Split** (seed=42): 171 train / 21 dev / 22 test (80/10/10)
- **POS tags**: 13 Universal Dependencies tags
- **NER tags**: 4 entity types (PER, LOC, ORG, MISC) in IOB2 format

### Experimental Setup
- **Seeds**: 42, 123, 456 (n=3 for variance estimation)
- **Batch size**: 32
- **Learning rate**: 3e-5
- **Epochs**: 100
- **Optimization**: AdamW with 100 warmup steps
- **Hardware**: NVIDIA A40
- **Metrics**: Token-level accuracy and macro-averaged F1
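
The reported metrics correspond to something like the following computation (a sketch over flattened gold and predicted label lists with ignored positions already removed; the toy labels are purely illustrative, and the exact evaluation script is not included in this card):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy flattened gold/predicted POS labels (hypothetical values)
y_true = ["NOUN", "VERB", "PRON", "NOUN", "ADP"]
y_pred = ["NOUN", "VERB", "NOUN", "NOUN", "ADP"]

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"Macro F1: {f1_score(y_true, y_pred, average='macro'):.4f}")
```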

**Data Leakage Statement**: All splits were created with a fixed seed (42), and there is no sentence overlap between the train/dev/test sets.

---

## Limitations

- **Corpus size**: 42K sentences is modest; expansion to 100K+ could improve performance
- **Evaluation scale**: Small test set (22 sentences) limits statistical power
- **Task scope**: Only evaluated on token classification; needs broader task assessment
- **Efficiency metrics**: No quantitative inference benchmarks (latency, memory) yet provided
- **Data documentation**: Complete data provenance and licenses to be formalized

---

## Citation

If you use NagameseBERT in your research, please cite:
```bibtex
@misc{nagamesebert2025,
  title={Bootstrapping BERT for Nagamese: A Low-Resource Creole Language},
  author={MWire Labs},
  year={2025},
  url={https://huggingface.co/MWirelabs/nagamesebert}
}
```

---

## Contact

**MWire Labs**  
Shillong, Meghalaya, India  
Website: [MWire Labs](https://mwirelabs.com)

---

## License

This model is released under [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

You are free to:
- **Share** — copy and redistribute the material
- **Adapt** — remix, transform, and build upon the material

Under the following terms:
- **Attribution** — You must give appropriate credit to MWire Labs

---

## Acknowledgments

We thank the Nagamese-speaking community for their contributions to corpus development and validation.