---
language:
- nag
license: cc-by-4.0
tags:
- bert
- roberta
- nagamese
- low-resource
- creole
- northeast-india
- token-classification
- fill-mask
datasets:
- agnivamaiti/naganlp-ner-annotated-corpus
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: NagameseBERT
results:
- task:
type: token-classification
name: Part-of-Speech Tagging
dataset:
name: NagaNLP Annotated Corpus
type: agnivamaiti/naganlp-ner-annotated-corpus
metrics:
- type: accuracy
value: 88.35
name: Accuracy
- type: f1
value: 80.72
name: F1 (macro)
- task:
type: token-classification
name: Named Entity Recognition
dataset:
name: NagaNLP Annotated Corpus
type: agnivamaiti/naganlp-ner-annotated-corpus
metrics:
- type: accuracy
value: 91.74
name: Accuracy
- type: f1
value: 56.51
name: F1 (macro)
---
# NagameseBERT
[Model on Hugging Face](https://huggingface.co/MWirelabs/nagamesebert) · [License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) · [Nagamese Creole (Wikipedia)](https://en.wikipedia.org/wiki/Nagamese_Creole)
**A foundational BERT model for Nagamese Creole**: a compact, efficient language model for a low-resource Northeast Indian language.
---
## Overview
NagameseBERT is a 7M parameter RoBERTa-style BERT model pre-trained on 42,552 Nagamese sentences. Despite being 15× smaller than multilingual models like mBERT (110M) and XLM-RoBERTa (125M), it achieves competitive performance on downstream NLP tasks while offering significant efficiency advantages.
**Key Features:**
- **Compact**: 6.9M parameters (15× smaller than mBERT)
- **Efficient**: Pre-trained in 35 minutes on a single A40 GPU
- **Custom tokenizer**: 8K BPE vocabulary optimized for Nagamese
- **Rigorous evaluation**: Multi-seed testing (n=3) with reproducible results
- **Open**: Model, code, and data splits publicly available
---
## Performance
Multi-seed evaluation results (mean ± std, n=3):
| Model | Parameters | POS Accuracy | POS F1 (macro) | NER Accuracy | NER F1 (macro) |
|-------|-----------|--------------|--------|--------------|--------|
| **NagameseBERT** | **7M** | **88.35 ± 0.71%** | **0.807 ± 0.013** | **91.74 ± 0.68%** | **0.565 ± 0.054** |
| mBERT | 110M | 95.14 ± 0.47% | 0.916 ± 0.008 | 96.11 ± 0.72% | 0.750 ± 0.064 |
| XLM-RoBERTa | 125M | 95.64 ± 0.56% | 0.919 ± 0.008 | 96.38 ± 0.26% | 0.819 ± 0.066 |
**Trade-off**: 4-7 percentage points lower accuracy (roughly 4.5 on NER, 7 on POS) with a 15× parameter reduction, enabling deployment in resource-constrained settings.
---
## Model Details
### Architecture
- **Type**: RoBERTa-style BERT (no token type embeddings)
- **Hidden size**: 256
- **Layers**: 6 transformer blocks
- **Attention heads**: 4 per layer
- **Intermediate size**: 1,024
- **Max sequence length**: 64 tokens
- **Total parameters**: 6,878,528
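
The hyperparameters above can be expressed as a `RobertaConfig` for reference. This is a reconstruction for illustration (the authoritative values ship in the checkpoint's `config.json`); the position-embedding offset follows the standard RoBERTa convention.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Reconstruction of the architecture listed above; read the released config.json
# (AutoConfig.from_pretrained) for the authoritative values.
config = RobertaConfig(
    vocab_size=8_000,                 # custom BPE vocabulary
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1_024,
    max_position_embeddings=64 + 2,   # assumes the usual RoBERTa padding offset
    type_vocab_size=1,                # RoBERTa-style: no token type embeddings
)

model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,}")  # 6,878,528 with the default tied LM head
```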
### Tokenizer
- **Type**: Byte-Pair Encoding (BPE)
- **Vocabulary size**: 8,000 tokens
- **Special tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Normalization**: NFD Unicode + accent stripping
- **Case**: Preserved (for proper nouns and code-switched English)
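
The tokenizer description above maps onto the Hugging Face `tokenizers` library roughly as follows. This is a sketch of how a comparable tokenizer could be trained, not the released artifact; the corpus file name is a placeholder and the whitespace pre-tokenizer is an assumption.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Sketch of training a comparable BPE tokenizer; not the released artifact.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.StripAccents()]  # NFD + accent stripping, case preserved
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # assumption: whitespace pre-tokenization

trainer = trainers.BpeTrainer(
    vocab_size=8_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["nagamese_corpus.txt"], trainer=trainer)  # placeholder corpus path
```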
### Training Data
- **Corpus size**: 42,552 Nagamese sentences
- **Average length**: 11.82 tokens/sentence
- **Split**: 90% train (38,296) / 10% validation (4,256)
- **Sources**: Web, social media, community contributions (deduplicated)
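
A minimal sketch of the corresponding preprocessing, assuming a plain-text corpus file and an unspecified shuffle seed (both are placeholders, not the authors' pipeline):

```python
import random

# Deduplicate sentences and carve out a 90/10 train/validation split,
# mirroring the counts reported above. File name and seed are assumptions.
with open("nagamese_corpus.txt", encoding="utf-8") as f:
    sentences = list(dict.fromkeys(line.strip() for line in f if line.strip()))

random.seed(42)
random.shuffle(sentences)
cut = int(0.9 * len(sentences))
train_sents, valid_sents = sentences[:cut], sentences[cut:]
```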
### Pre-training
- **Objective**: Masked Language Modeling (15% masking)
- **Optimizer**: AdamW (lr=5e-4, weight_decay=0.01)
- **Batch size**: 64
- **Epochs**: 50
- **Training time**: ~35 minutes
- **Hardware**: NVIDIA A40 (48GB)
- **Final validation loss**: 2.79
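
Pieced together, the recipe looks roughly like the sketch below. This is not the authors' script: `train_dataset` / `eval_dataset` stand in for the tokenized corpus splits, and `config` is the configuration shown in the Architecture section.

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/nagamesebert")
model = RobertaForMaskedLM(config)  # config as sketched in the Architecture section

# 15% dynamic masking for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="./nagamesebert-pretrain",
    num_train_epochs=50,
    per_device_train_batch_size=64,
    learning_rate=5e-4,            # AdamW is the Trainer's default optimizer
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,   # placeholder: tokenized training sentences
    eval_dataset=eval_dataset,     # placeholder: tokenized validation sentences
)
trainer.train()
```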
---
## Usage
### Load Model and Tokenizer
```python
from transformers import AutoTokenizer, AutoModel
model_name = "MWirelabs/nagamesebert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# Example usage
text = "Toi moi laga sathi hobo pare?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state: (1, seq_len, 256) contextual embeddings
```
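Because the model was pre-trained with masked language modeling, it can also be queried through the fill-mask pipeline, assuming the checkpoint ships with its MLM head. The example reuses the sentence above with one word masked out.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="MWirelabs/nagamesebert")
mask = fill.tokenizer.mask_token  # "[MASK]" per the tokenizer's special tokens

# Mask out "sathi" in the example sentence and inspect the top predictions
for pred in fill(f"Toi moi laga {mask} hobo pare?"):
    print(pred["token_str"], round(pred["score"], 3))
```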
### Fine-tuning for Token Classification
```python
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# Load model with a token-classification head
num_labels = 13  # e.g. the 13 Universal Dependencies POS tags used for evaluation
model = AutoModelForTokenClassification.from_pretrained(
    "MWirelabs/nagamesebert",
    num_labels=num_labels
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=100,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01
)

# Train: train_dataset / eval_dataset are tokenized datasets with word-level
# labels aligned to subwords (see the alignment sketch below)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
```
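For token classification, word-level labels have to be aligned to the subword tokens produced by the BPE tokenizer. A common alignment helper is sketched below; this is an illustration, not the authors' code. The `tokens` / `tags` column names are assumptions, and `word_ids()` requires a fast tokenizer.

```python
def tokenize_and_align(examples, tokenizer):
    # Tokenize pre-split words and re-map word-level tags to subword positions.
    enc = tokenizer(
        examples["tokens"],            # assumed column: list of words per sentence
        is_split_into_words=True,
        truncation=True,
        max_length=64,                 # matches the model's max sequence length
    )
    all_labels = []
    for i, tags in enumerate(examples["tags"]):  # assumed column: word-level label ids
        word_ids = enc.word_ids(batch_index=i)   # requires a fast tokenizer
        labels, prev = [], None
        for wid in word_ids:
            if wid is None:
                labels.append(-100)              # special tokens: ignored by the loss
            elif wid != prev:
                labels.append(tags[wid])         # label the first subword of each word
            else:
                labels.append(-100)              # mask continuation subwords
            prev = wid
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc
```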
---
## Evaluation
### Dataset
- **Source**: [NagaNLP Annotated Corpus](https://huggingface.co/datasets/agnivamaiti/naganlp-ner-annotated-corpus)
- **Total**: 214 sentences
- **Split** (seed=42): 171 train / 21 dev / 22 test (80/10/10)
- **POS tags**: 13 Universal Dependencies tags
- **NER tags**: 4 entity types (PER, LOC, ORG, MISC) in IOB2 format
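
The split can be approximated from the published dataset as follows; the split utility and split name are assumptions, and only the seed, proportions, and resulting counts come from the description above.

```python
from datasets import load_dataset

ds = load_dataset("agnivamaiti/naganlp-ner-annotated-corpus", split="train")  # split name assumed

# 80/10/10 with seed 42: hold out 20%, then halve it into dev and test
tmp = ds.train_test_split(test_size=0.2, seed=42)
dev_test = tmp["test"].train_test_split(test_size=0.5, seed=42)
train, dev, test = tmp["train"], dev_test["train"], dev_test["test"]
print(len(train), len(dev), len(test))  # expected to land near 171 / 21 / 22
```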
### Experimental Setup
- **Seeds**: 42, 123, 456 (n=3 for variance estimation)
- **Batch size**: 32
- **Learning rate**: 3e-5
- **Epochs**: 100
- **Optimization**: AdamW with 100 warmup steps
- **Hardware**: NVIDIA A40
- **Metrics**: Token-level accuracy and macro-averaged F1
**Data Leakage Statement**: All splits were created with a fixed seed (42); there is no sentence overlap between the train, dev, and test sets.
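
The reported metrics correspond to a standard token-level evaluation. A sketch of the scoring step is below; it is an assumption about the exact implementation, using scikit-learn.

```python
from sklearn.metrics import accuracy_score, f1_score

def score(predictions, labels):
    # Flatten per-sentence predictions, skip ignored (-100) positions, then score.
    y_pred, y_true = [], []
    for pred_row, label_row in zip(predictions, labels):
        for p, l in zip(pred_row, label_row):
            if l != -100:
                y_pred.append(p)
                y_true.append(l)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
    }
```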
---
## Limitations
- **Corpus size**: 42K sentences is modest; expansion to 100K+ could improve performance
- **Evaluation scale**: Small test set (22 sentences) limits statistical power
- **Task scope**: Only evaluated on token classification; needs broader task assessment
- **Efficiency metrics**: No quantitative inference benchmarks (latency, memory) yet provided
- **Data documentation**: Complete data provenance and licenses to be formalized
---
## Citation
If you use NagameseBERT in your research, please cite:
```bibtex
@misc{nagamesebert2025,
title={Bootstrapping BERT for Nagamese: A Low-Resource Creole Language},
author={MWire Labs},
year={2025},
url={https://huggingface.co/MWirelabs/nagamesebert}
}
```
---
## Contact
**MWire Labs**
Shillong, Meghalaya, India
Website: [MWire Labs](https://mwirelabs.com)
---
## License
This model is released under [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).
You are free to:
- **Share** — copy and redistribute the material
- **Adapt** — remix, transform, and build upon the material
Under the following terms:
- **Attribution** — You must give appropriate credit to MWire Labs
---
## Acknowledgments
We thank the Nagamese-speaking community for their contributions to corpus development and validation.