---
language:
- nag
license: cc-by-4.0
tags:
- bert
- roberta
- nagamese
- low-resource
- creole
- northeast-india
- token-classification
- fill-mask
datasets:
- agnivamaiti/naganlp-ner-annotated-corpus
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: NagameseBERT
  results:
  - task:
      type: token-classification
      name: Part-of-Speech Tagging
    dataset:
      name: NagaNLP Annotated Corpus
      type: agnivamaiti/naganlp-ner-annotated-corpus
    metrics:
    - type: accuracy
      value: 88.35
      name: Accuracy
    - type: f1
      value: 80.72
      name: F1 (macro)
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: NagaNLP Annotated Corpus
      type: agnivamaiti/naganlp-ner-annotated-corpus
    metrics:
    - type: accuracy
      value: 91.74
      name: Accuracy
    - type: f1
      value: 56.51
      name: F1 (macro)
---

# NagameseBERT

[Model on Hugging Face](https://huggingface.co/MWirelabs/nagamesebert) | [License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) | [Language: Nagamese Creole](https://en.wikipedia.org/wiki/Nagamese_Creole)

**A Foundational BERT model for Nagamese Creole**: a compact, efficient language model for a low-resource Northeast Indian language.

---

## Overview

NagameseBERT is a 7M-parameter, RoBERTa-style BERT model pre-trained on 42,552 Nagamese sentences. Despite being roughly 15× smaller than multilingual models such as mBERT (110M) and XLM-RoBERTa (125M), it achieves competitive performance on downstream NLP tasks while offering significant efficiency advantages.

**Key Features:**

- **Compact**: 6.9M parameters (15× smaller than mBERT)
- **Efficient**: Pre-trained in 35 minutes on a single A40 GPU
- **Custom tokenizer**: 8K BPE vocabulary optimized for Nagamese
- **Rigorous evaluation**: Multi-seed testing (n=3) with reproducible results
- **Open**: Model, code, and data splits publicly available

---

## Performance

Multi-seed evaluation results (mean ± std, n=3):

| Model | Parameters | POS Accuracy | POS F1 | NER Accuracy | NER F1 |
|-------|-----------|--------------|--------|--------------|--------|
| **NagameseBERT** | **7M** | **88.35 ± 0.71%** | **0.807 ± 0.013** | **91.74 ± 0.68%** | **0.565 ± 0.054** |
| mBERT | 110M | 95.14 ± 0.47% | 0.916 ± 0.008 | 96.11 ± 0.72% | 0.750 ± 0.064 |
| XLM-RoBERTa | 125M | 95.64 ± 0.56% | 0.919 ± 0.008 | 96.38 ± 0.26% | 0.819 ± 0.066 |

**Trade-off**: NagameseBERT gives up 6-7 percentage points of accuracy in exchange for a 15× reduction in parameters, enabling deployment in resource-constrained settings.

---

## Model Details

### Architecture

- **Type**: RoBERTa-style BERT (no token type embeddings)
- **Hidden size**: 256
- **Layers**: 6 transformer blocks
- **Attention heads**: 4 per layer
- **Intermediate size**: 1,024
- **Max sequence length**: 64 tokens
- **Total parameters**: 6,878,528
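
For concreteness, the listed hyperparameters map onto a standard `RobertaConfig`. This is an illustrative sketch, not the released training code; the position-embedding offset of 2 follows RoBERTa's padding-index convention and is an assumption here:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Hyperparameters as listed above; values not stated on this card
# (e.g. the position-embedding offset) are assumptions.
config = RobertaConfig(
    vocab_size=8000,                  # 8K BPE vocabulary
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,
    max_position_embeddings=64 + 2,   # 64 tokens + RoBERTa's 2 reserved positions
    type_vocab_size=1,                # no token type embeddings
)

model = RobertaForMaskedLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,}")  # ~6.9M parameters
```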

### Tokenizer

- **Type**: Byte-Pair Encoding (BPE)
- **Vocabulary size**: 8,000 tokens
- **Special tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Normalization**: NFD Unicode + accent stripping
- **Case**: Preserved (for proper nouns and code-switched English)
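
To see the tokenizer in action, you can inspect it directly once downloaded; the sentence below reuses the example from the Usage section, and the exact subword splits depend on the learned vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/nagamesebert")

print(tokenizer.vocab_size)   # 8000
print(tokenizer.mask_token)   # [MASK]
print(tokenizer.tokenize("Toi moi laga sathi hobo pare?"))
```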

### Training Data

- **Corpus size**: 42,552 Nagamese sentences
- **Average length**: 11.82 tokens/sentence
- **Split**: 90% train (38,296) / 10% validation (4,256)
- **Sources**: Web, social media, community contributions (deduplicated)

### Pre-training

- **Objective**: Masked Language Modeling (15% masking)
- **Optimizer**: AdamW (lr=5e-4, weight_decay=0.01)
- **Batch size**: 64
- **Epochs**: 50
- **Training time**: ~35 minutes
- **Hardware**: NVIDIA A40 (48 GB)
- **Final validation loss**: 2.79
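
The recipe above corresponds to a standard `transformers` MLM setup. A minimal sketch, assuming `tokenizer` and `model` are defined as in the previous sections and `train_ds`/`val_ds` are tokenized versions of the 38,296/4,256-sentence split (the released training script may differ):

```python
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Randomly masks 15% of tokens on the fly, as described above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="nagamesebert-pretrain",
    num_train_epochs=50,
    per_device_train_batch_size=64,
    learning_rate=5e-4,
    weight_decay=0.01,
)

Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_ds,
    eval_dataset=val_ds,
).train()
```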

---

## Usage

### Load Model and Tokenizer

```python
from transformers import AutoTokenizer, AutoModel

model_name = "MWirelabs/nagamesebert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a Nagamese sentence and extract contextual embeddings
text = "Toi moi laga sathi hobo pare?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
```
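
Because pre-training used masked language modeling, the checkpoint can also be queried directly through the fill-mask pipeline; the example sentence and its predictions are purely illustrative:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="MWirelabs/nagamesebert")

# The tokenizer's mask token is [MASK]
for pred in fill("Toi moi laga [MASK] hobo pare?"):
    print(pred["token_str"], round(pred["score"], 3))
```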

### Fine-tuning for Token Classification

```python
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# Load the pre-trained encoder with a fresh token-classification head.
# Set num_labels to the size of your tag set (e.g. 13 for the POS task below).
model = AutoModelForTokenClassification.from_pretrained(
    "MWirelabs/nagamesebert",
    num_labels=num_labels
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=100,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01
)

# Train; train_dataset and eval_dataset must be tokenized datasets with
# labels aligned to BPE subwords (see the alignment sketch below).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
```
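
The `train_dataset` and `eval_dataset` above must carry one label per subword, not per word. A common alignment recipe, sketched below, labels only the first subword of each word and masks the rest with `-100` so they are ignored by the loss; the column names `tokens` and `tags` are assumptions about the dataset schema:

```python
def tokenize_and_align(examples, tokenizer):
    """Tokenize pre-split words and align word-level tags to BPE subwords."""
    enc = tokenizer(
        examples["tokens"],            # assumed: lists of words per sentence
        is_split_into_words=True,
        truncation=True,
        max_length=64,                 # the model's max sequence length
    )
    all_labels = []
    for i, tags in enumerate(examples["tags"]):  # assumed: word-level tag ids
        labels, prev = [], None
        for wid in enc.word_ids(batch_index=i):
            if wid is None:
                labels.append(-100)       # special tokens: ignored by the loss
            elif wid != prev:
                labels.append(tags[wid])  # first subword carries the word's tag
            else:
                labels.append(-100)       # remaining subwords: ignored
            prev = wid
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc
```

Apply it with `dataset.map(lambda ex: tokenize_and_align(ex, tokenizer), batched=True)` before passing the splits to the `Trainer`.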

---

## Evaluation

### Dataset

- **Source**: [NagaNLP Annotated Corpus](https://huggingface.co/datasets/agnivamaiti/naganlp-ner-annotated-corpus)
- **Total**: 214 sentences
- **Split** (seed=42): 171 train / 21 dev / 22 test (80/10/10)
- **POS tags**: 13 Universal Dependencies tags
- **NER tags**: 4 entity types (PER, LOC, ORG, MISC) in IOB2 format
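
A sketch of loading the corpus and reproducing an 80/10/10 split with seed 42; the available split names and column layout of the dataset are assumptions, so check the dataset card:

```python
from datasets import load_dataset

# Assumes the corpus is published as a single "train" split.
ds = load_dataset("agnivamaiti/naganlp-ner-annotated-corpus", split="train")

# 80/10/10 train/dev/test with a fixed seed, as described above.
tmp = ds.train_test_split(test_size=0.2, seed=42)
dev_test = tmp["test"].train_test_split(test_size=0.5, seed=42)
train_ds, dev_ds, test_ds = tmp["train"], dev_test["train"], dev_test["test"]
```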

### Experimental Setup

- **Seeds**: 42, 123, 456 (n=3 for variance estimation)
- **Batch size**: 32
- **Learning rate**: 3e-5
- **Epochs**: 100
- **Optimization**: AdamW with 100 warmup steps
- **Hardware**: NVIDIA A40
- **Metrics**: Token-level accuracy and macro-averaged F1
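
Token-level accuracy and macro F1 can be computed with a `compute_metrics` function along these lines; this is a sketch using scikit-learn, not necessarily the evaluation code behind the reported numbers:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Keep only real tokens; -100 marks special/padding/subword positions.
    mask = labels != -100
    y_true, y_pred = labels[mask], preds[mask]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
    }
```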

**Data Leakage Statement**: All splits were created with a fixed seed (42); there is no sentence overlap between the train, dev, and test sets.

---

## Limitations

- **Corpus size**: 42K sentences is modest; expanding to 100K+ could improve performance
- **Evaluation scale**: The small test set (22 sentences) limits statistical power
- **Task scope**: Evaluated only on token classification; broader task coverage is needed
- **Efficiency metrics**: No quantitative inference benchmarks (latency, memory) provided yet
- **Data documentation**: Complete data provenance and licensing remain to be formalized

---

## Citation

If you use NagameseBERT in your research, please cite:

```bibtex
@misc{nagamesebert2025,
  title={Bootstrapping BERT for Nagamese: A Low-Resource Creole Language},
  author={MWire Labs},
  year={2025},
  url={https://huggingface.co/MWirelabs/nagamesebert}
}
```

---

## Contact

**MWire Labs**

Shillong, Meghalaya, India

Website: [MWire Labs](https://mwirelabs.com)

---

## License

This model is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license.

You are free to:

- **Share** — copy and redistribute the material
- **Adapt** — remix, transform, and build upon the material

Under the following terms:

- **Attribution** — you must give appropriate credit to MWire Labs

---

## Acknowledgments

We thank the Nagamese-speaking community for their contributions to corpus development and validation.