---
language:
- nag
license: cc-by-4.0
tags:
- bert
- roberta
- nagamese
- low-resource
- creole
- northeast-india
- token-classification
- fill-mask
datasets:
- agnivamaiti/naganlp-ner-annotated-corpus
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: NagameseBERT
  results:
  - task:
      type: token-classification
      name: Part-of-Speech Tagging
    dataset:
      name: NagaNLP Annotated Corpus
      type: agnivamaiti/naganlp-ner-annotated-corpus
    metrics:
    - type: accuracy
      value: 88.35
      name: Accuracy
    - type: f1
      value: 80.72
      name: F1 (macro)
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: NagaNLP Annotated Corpus
      type: agnivamaiti/naganlp-ner-annotated-corpus
    metrics:
    - type: accuracy
      value: 91.74
      name: Accuracy
    - type: f1
      value: 56.51
      name: F1 (macro)
---

# NagameseBERT

[![HuggingFace Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-yellow)](https://huggingface.co/MWirelabs/nagamesebert)
[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-blue.svg)](https://creativecommons.org/licenses/by/4.0/)
[![Language](https://img.shields.io/badge/Language-Nagamese-green)](https://en.wikipedia.org/wiki/Nagamese_Creole)

**A foundational BERT model for Nagamese Creole** - A compact, efficient language model for a low-resource Northeast Indian language.

---

## Overview

NagameseBERT is a 6.9M-parameter RoBERTa-style BERT model pre-trained on 42,552 Nagamese sentences. Despite being roughly 15× smaller than the multilingual baselines mBERT (110M) and XLM-RoBERTa (125M), it stays within about 7 points of their POS accuracy and 5 points of their NER accuracy (see Performance), with a far smaller parameter footprint.

**Key Features:**
- **Compact**: 6.9M parameters (15× smaller than mBERT)
- **Efficient**: Pre-trained in 35 minutes on a single A40 GPU
- **Custom tokenizer**: 8K BPE vocabulary optimized for Nagamese
- **Rigorous evaluation**: Multi-seed testing (n=3) with reproducible results
- **Open**: Model, code, and data splits publicly available

---

## Performance

Multi-seed evaluation results (mean ± std, n=3):

| Model | Parameters | POS Accuracy | POS F1 | NER Accuracy | NER F1 |
|-------|-----------|--------------|--------|--------------|--------|
| **NagameseBERT** | **7M** | **88.35 ± 0.71%** | **0.807 ± 0.013** | **91.74 ± 0.68%** | **0.565 ± 0.054** |
| mBERT | 110M | 95.14 ± 0.47% | 0.916 ± 0.008 | 96.11 ± 0.72% | 0.750 ± 0.064 |
| XLM-RoBERTa | 125M | 95.64 ± 0.56% | 0.919 ± 0.008 | 96.38 ± 0.26% | 0.819 ± 0.066 |

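**Trade-off**: About 7 percentage points lower POS accuracy and 4-5 points lower NER accuracy than the multilingual baselines, with a larger gap in NER macro F1 (0.565 vs. 0.750-0.819), in exchange for a roughly 15× parameter reduction that enables resource-constrained deployment.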

---

## Model Details

### Architecture
- **Type**: RoBERTa-style BERT (no token type embeddings)
- **Hidden size**: 256
- **Layers**: 6 transformer blocks
- **Attention heads**: 4 per layer
- **Intermediate size**: 1,024
- **Max sequence length**: 64 tokens
- **Total parameters**: 6,878,528
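
The reported total can be reproduced from the dimensions listed above. The sketch below is a back-of-the-envelope count, not the authors' code. It assumes details this card does not state: a RoBERTa-style position table with two extra slots (66 rows), a single token-type row, and an MLM head whose decoder weights are tied to the word embeddings, with no pooler counted.

```python
# Architecture values taken from the list above
vocab_size, hidden, layers, intermediate, max_len = 8_000, 256, 6, 1_024, 64

# Assumptions (not stated in this card): two extra position slots,
# one token-type row, tied MLM decoder, no pooler
position_slots = max_len + 2
token_type_rows = 1


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

embeddings = (vocab_size + position_slots + token_type_rows) * hidden + layer_norm
attention = 4 * linear(hidden, hidden) + layer_norm  # Q, K, V, output projection
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = layers * (attention + feed_forward)
mlm_head = linear(hidden, hidden) + layer_norm + vocab_size  # tied decoder: bias only

total = embeddings + encoder + mlm_head
print(f"{total:,}")  # 6,878,528
```

Under these assumptions the word embeddings alone account for about 2.05M of the 6.9M parameters, so vocabulary size is the main lever on model size at this scale.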

### Tokenizer
- **Type**: Byte-Pair Encoding (BPE)
- **Vocabulary size**: 8,000 tokens
- **Special tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Normalization**: NFD Unicode + accent stripping
- **Case**: Preserved (for proper nouns and code-switched English)
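
As an illustration of the normalization step, the sketch below applies NFD decomposition and then drops combining marks using only the Python standard library. It approximates the behaviour described above and is not the tokenizer's actual implementation; the function name `normalize` is hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """NFD-decompose, then drop combining marks (Unicode category Mn). Case is kept."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(normalize("Café Kohima"))  # Cafe Kohima
```

Because case is preserved, `Kohima` and `kohima` remain distinct inputs to the BPE model.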

### Training Data
- **Corpus size**: 42,552 Nagamese sentences
- **Average length**: 11.82 tokens/sentence
- **Split**: 90% train (38,296) / 10% validation (4,256)
- **Sources**: Web, social media, community contributions (deduplicated)

### Pre-training
- **Objective**: Masked Language Modeling (15% masking)
- **Optimizer**: AdamW (lr=5e-4, weight_decay=0.01)
- **Batch size**: 64
- **Epochs**: 50
- **Training time**: ~35 minutes
- **Hardware**: NVIDIA A40 (48GB)
- **Final validation loss**: 2.79
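
To make the masking objective concrete, here is a minimal pure-Python sketch of 15% token masking. The 80/10/10 replacement rule (mask / random token / unchanged) is the standard BERT recipe and is an assumption here, since this card only states the 15% rate. It also works on whole words for readability, whereas the model masks BPE subword tokens. The function name `mask_tokens` is hypothetical.

```python
import random

SPECIAL = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}


def mask_tokens(tokens, vocab, rng, mlm_probability=0.15):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never selected; others are selected with probability 0.15
        if tok in SPECIAL or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


rng = random.Random(0)
tokens = ["[CLS]", "Toi", "moi", "laga", "sathi", "hobo", "pare", "?", "[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab=tokens[1:-1], rng=rng)
```

The loss is computed only at positions where `labels` is not `None`.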

---

## Usage

### Load Model and Tokenizer
```python
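import torch
from transformers import AutoTokenizer, AutoModel

model_name = "MWirelabs/nagamesebert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # disable dropout for inference

# Encode one sentence; the model was pre-trained with a 64-token maximum length
text = "Toi moi laga sathi hobo pare?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)

with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings, shape (batch, sequence_length, 256)
embeddings = outputs.last_hidden_state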
```

### Fine-tuning for Token Classification
```python
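from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# Number of output labels for the task: 13 for the UD POS tags in the
# evaluation corpus; IOB2 NER over 4 entity types would need 9 (B-/I- per type plus O)
num_labels = 13

# Load model with a randomly initialized token-classification head
model = AutoModelForTokenClassification.from_pretrained(
    "MWirelabs/nagamesebert",
    num_labels=num_labels
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=100,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01
)

# train_dataset and eval_dataset are placeholders: supply your own tokenized
# datasets with word-aligned `labels` before running this step
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()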
```

---

## Evaluation

### Dataset
- **Source**: [NagaNLP Annotated Corpus](https://huggingface.co/datasets/agnivamaiti/naganlp-ner-annotated-corpus)
- **Total**: 214 sentences
- **Split** (seed=42): 171 train / 21 dev / 22 test (80/10/10)
- **POS tags**: 13 Universal Dependencies tags
- **NER tags**: 4 entity types (PER, LOC, ORG, MISC) in IOB2 format
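
The 171 / 21 / 22 counts are consistent with flooring 80% and 10% of 214 sentences and assigning the remainder to the test set. The sketch below shows one way to produce such a seeded, non-overlapping split. The actual split script is not reproduced in this card, so the function `split_indices` and its shuffling procedure are assumptions.

```python
import random


def split_indices(n, seed=42, train_frac=0.8, dev_frac=0.1):
    """Shuffle sentence indices once, then cut into disjoint train/dev/test slices."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(train_frac * n)
    n_dev = int(dev_frac * n)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_dev],
        indices[n_train + n_dev:],
    )


train, dev, test = split_indices(214)
print(len(train), len(dev), len(test))  # 171 21 22
```

Slicing a single shuffled list guarantees that no sentence index appears in more than one split.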

### Experimental Setup
- **Seeds**: 42, 123, 456 (n=3 for variance estimation)
- **Batch size**: 32
- **Learning rate**: 3e-5
- **Epochs**: 100
- **Optimization**: AdamW with 100 warmup steps
- **Hardware**: NVIDIA A40
- **Metrics**: Token-level accuracy and macro-averaged F1
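
Token-level accuracy and macro-averaged F1 can diverge sharply on NER, where most tokens carry the `O` tag. The sketch below is a minimal pure-Python version of both metrics on a toy example. It assumes F1 is averaged over every label seen in the gold or predicted tags, which may differ from the exact averaging used in the reported experiments.

```python
def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy example: one missed location among ten tokens
gold = ["O"] * 8 + ["B-PER", "B-LOC"]
pred = ["O"] * 8 + ["B-PER", "O"]
print(f"{token_accuracy(gold, pred):.2f} {macro_f1(gold, pred):.3f}")  # 0.90 0.647
```

A single missed entity costs 10 points of accuracy but zeroes out one of three per-label F1 scores, which is the same pattern as the gap between NER accuracy and NER F1 in the results table.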

**Data Leakage Statement**: All splits were created with a fixed seed (42), and no sentence appears in more than one of the train/dev/test sets.

---

## Limitations

- **Corpus size**: 42K sentences is modest; expansion to 100K+ could improve performance
- **Evaluation scale**: Small test set (22 sentences) limits statistical power
- **Task scope**: Only evaluated on token classification; needs broader task assessment
- **Efficiency metrics**: No quantitative inference benchmarks (latency, memory) yet provided
- **Data documentation**: Complete data provenance and licenses to be formalized

---

## Citation

If you use NagameseBERT in your research, please cite:
```bibtex
@misc{nagamesebert2025,
  title={Bootstrapping BERT for Nagamese: A Low-Resource Creole Language},
  author={MWire Labs},
  year={2025},
  url={https://huggingface.co/MWirelabs/nagamesebert}
}
```

---

## Contact

**MWire Labs**  
Shillong, Meghalaya, India  
Website: [MWire Labs](https://mwirelabs.com)

---

## License

This model is released under [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

You are free to:
- **Share** — copy and redistribute the material
- **Adapt** — remix, transform, and build upon the material

Under the following terms:
- **Attribution** — You must give appropriate credit to MWire Labs

---

## Acknowledgments

We thank the Nagamese-speaking community for their contributions to corpus development and validation.