---
language:
- nag
license: cc-by-4.0
tags:
- bert
- roberta
- nagamese
- low-resource
- creole
- northeast-india
- token-classification
- fill-mask
datasets:
- agnivamaiti/naganlp-ner-annotated-corpus
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: NagameseBERT
results:
- task:
type: token-classification
name: Part-of-Speech Tagging
dataset:
name: NagaNLP Annotated Corpus
type: agnivamaiti/naganlp-ner-annotated-corpus
metrics:
- type: accuracy
value: 88.35
name: Accuracy
- type: f1
value: 80.72
name: F1 (macro)
- task:
type: token-classification
name: Named Entity Recognition
dataset:
name: NagaNLP Annotated Corpus
type: agnivamaiti/naganlp-ner-annotated-corpus
metrics:
- type: accuracy
value: 91.74
name: Accuracy
- type: f1
value: 56.51
name: F1 (macro)
---
# NagameseBERT
[![HuggingFace Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-yellow)](https://huggingface.co/MWirelabs/nagamesebert)
[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-blue.svg)](https://creativecommons.org/licenses/by/4.0/)
[![Language](https://img.shields.io/badge/Language-Nagamese-green)](https://en.wikipedia.org/wiki/Nagamese_Creole)
**A foundational BERT model for Nagamese Creole** - a compact, efficient language model for a low-resource language of Northeast India.
---
## Overview
NagameseBERT is a 7M parameter RoBERTa-style BERT model pre-trained on 42,552 Nagamese sentences. Despite being 15× smaller than multilingual models like mBERT (110M) and XLM-RoBERTa (125M), it achieves competitive performance on downstream NLP tasks while offering significant efficiency advantages.
**Key Features:**
- **Compact**: 6.9M parameters (15× smaller than mBERT)
- **Efficient**: Pre-trained in ~35 minutes on a single A40 GPU
- **Custom tokenizer**: 8K BPE vocabulary optimized for Nagamese
- **Rigorous evaluation**: Multi-seed testing (n=3) with reproducible results
- **Open**: Model, code, and data splits publicly available
---
## Performance
Multi-seed evaluation results (mean ± std, n=3):
| Model | Parameters | POS Accuracy | POS F1 | NER Accuracy | NER F1 |
|-------|-----------|--------------|--------|--------------|--------|
| **NagameseBERT** | **7M** | **88.35 ± 0.71%** | **0.807 ± 0.013** | **91.74 ± 0.68%** | **0.565 ± 0.054** |
| mBERT | 110M | 95.14 ± 0.47% | 0.916 ± 0.008 | 96.11 ± 0.72% | 0.750 ± 0.064 |
| XLM-RoBERTa | 125M | 95.64 ± 0.56% | 0.919 ± 0.008 | 96.38 ± 0.26% | 0.819 ± 0.066 |
**Trade-off**: roughly 6-7 percentage points lower accuracy than the multilingual baselines in exchange for a 15× parameter reduction, enabling deployment in resource-constrained settings.
---
## Model Details
### Architecture
- **Type**: RoBERTa-style BERT (no token type embeddings)
- **Hidden size**: 256
- **Layers**: 6 transformer blocks
- **Attention heads**: 4 per layer
- **Intermediate size**: 1,024
- **Max sequence length**: 64 tokens
- **Total parameters**: 6,878,528
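For reference, these dimensions map onto a `RobertaConfig` as in the sketch below; values not listed above (dropout, position-embedding offset, and other defaults) are assumptions or Hugging Face library defaults, not confirmed training settings.
```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Sketch of a config matching the dimensions above; the vocabulary size
# follows the Tokenizer section below, everything else not listed is an
# assumption or a library default.
config = RobertaConfig(
    vocab_size=8_000,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1_024,
    max_position_embeddings=64 + 2,  # RoBERTa reserves two extra positions for the padding offset
    type_vocab_size=1,               # no token type embeddings
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,}")  # should land close to the ~6.9M reported above
```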
### Tokenizer
- **Type**: Byte-Pair Encoding (BPE)
- **Vocabulary size**: 8,000 tokens
- **Special tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Normalization**: NFD Unicode + accent stripping
- **Case**: Preserved (for proper nouns and code-switched English)
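A quick way to see how the 8K BPE vocabulary segments Nagamese text is to load the released tokenizer directly (the sentence is the same illustrative example used in the Usage section):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/nagamesebert")

# Inspect subword segmentation and the resulting input IDs
text = "Toi moi laga sathi hobo pare?"
print(tokenizer.tokenize(text))
print(tokenizer(text)["input_ids"])
print(tokenizer.vocab_size)  # 8000
```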
### Training Data
- **Corpus size**: 42,552 Nagamese sentences
- **Average length**: 11.82 tokens/sentence
- **Split**: 90% train (38,296) / 10% validation (4,256)
- **Sources**: Web, social media, community contributions (deduplicated)
### Pre-training
- **Objective**: Masked Language Modeling (15% masking)
- **Optimizer**: AdamW (lr=5e-4, weight_decay=0.01)
- **Batch size**: 64
- **Epochs**: 50
- **Training time**: ~35 minutes
- **Hardware**: NVIDIA A40 (48GB)
- **Final validation loss**: 2.79
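A minimal sketch of this setup with the Hugging Face `Trainer` is shown below; the dataset variables and any argument not listed above are assumptions, not the exact training script.
```python
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Dynamic masking at 15%, matching the MLM objective described above
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="./nagamesebert-pretrain",
    num_train_epochs=50,
    per_device_train_batch_size=64,
    learning_rate=5e-4,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,                  # RobertaForMaskedLM as sketched in the Architecture section
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,  # tokenized Nagamese sentences (90% split, not shown here)
    eval_dataset=val_dataset,     # 10% validation split
)
trainer.train()
```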
---
## Usage
### Load Model and Tokenizer
```python
from transformers import AutoTokenizer, AutoModel

model_name = "MWirelabs/nagamesebert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Example usage: encode a Nagamese sentence and obtain contextual embeddings
text = "Toi moi laga sathi hobo pare?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
```
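Because the model is pre-trained with masked language modeling, it can also be used directly through the `fill-mask` pipeline; the masked sentence below is only illustrative.
```python
from transformers import pipeline

# Predict the masked token with the pre-trained MLM head
fill_mask = pipeline("fill-mask", model="MWirelabs/nagamesebert")

# [MASK] is the tokenizer's mask token (see the Tokenizer section)
for prediction in fill_mask("Toi moi laga [MASK] hobo pare?"):
    print(prediction["token_str"], round(prediction["score"], 3))
```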
### Fine-tuning for Token Classification
```python
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# Load model with classification head
model = AutoModelForTokenClassification.from_pretrained(
    "MWirelabs/nagamesebert",
    num_labels=num_labels
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=100,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
```
---
## Evaluation
### Dataset
- **Source**: [NagaNLP Annotated Corpus](https://huggingface.co/datasets/agnivamaiti/naganlp-ner-annotated-corpus)
- **Total**: 214 sentences
- **Split** (seed=42): 171 train / 21 dev / 22 test (80/10/10)
- **POS tags**: 13 Universal Dependencies tags
- **NER tags**: 4 entity types (PER, LOC, ORG, MISC) in IOB2 format
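The corpus can be pulled straight from the Hub; the snippet below re-creates an 80/10/10 split with seed 42, though the column names and exact split procedure are assumptions about the dataset layout rather than the original evaluation script.
```python
from datasets import load_dataset

# Load the annotated corpus (assumes it ships as a single "train" split)
dataset = load_dataset("agnivamaiti/naganlp-ner-annotated-corpus")

# Re-create an 80/10/10 split with the fixed seed; the exact procedure
# behind the reported 171/21/22 split is an assumption
split = dataset["train"].train_test_split(test_size=0.2, seed=42)
dev_test = split["test"].train_test_split(test_size=0.5, seed=42)
train_set, dev_set, test_set = split["train"], dev_test["train"], dev_test["test"]
print(len(train_set), len(dev_set), len(test_set))
```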
### Experimental Setup
- **Seeds**: 42, 123, 456 (n=3 for variance estimation)
- **Batch size**: 32
- **Learning rate**: 3e-5
- **Epochs**: 100
- **Optimization**: AdamW with 100 warmup steps
- **Hardware**: NVIDIA A40
- **Metrics**: Token-level accuracy and macro-averaged F1
**Data Leakage Statement**: All splits were created with a fixed seed (42), and there is no sentence overlap between the train/dev/test sets.
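For completeness, the reported metrics are token-level accuracy and macro-averaged F1, which can be computed with scikit-learn as sketched below (sub-word alignment and padding filtering are omitted).
```python
from sklearn.metrics import accuracy_score, f1_score

# true_labels / pred_labels: flat lists of token-level tag IDs with special
# and padding positions already removed (hypothetical variables, not shown)
accuracy = accuracy_score(true_labels, pred_labels)
macro_f1 = f1_score(true_labels, pred_labels, average="macro")
print(f"accuracy={accuracy:.4f}  macro-F1={macro_f1:.4f}")
```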
---
## Limitations
- **Corpus size**: 42K sentences is modest; expansion to 100K+ could improve performance
- **Evaluation scale**: Small test set (22 sentences) limits statistical power
- **Task scope**: Only evaluated on token classification; needs broader task assessment
- **Efficiency metrics**: No quantitative inference benchmarks (latency, memory) yet provided
- **Data documentation**: Complete data provenance and licenses to be formalized
---
## Citation
If you use NagameseBERT in your research, please cite:
```bibtex
@misc{nagamesebert2025,
  title={Bootstrapping BERT for Nagamese: A Low-Resource Creole Language},
  author={MWire Labs},
  year={2025},
  url={https://huggingface.co/MWirelabs/nagamesebert}
}
```
---
## Contact
**MWire Labs**
Shillong, Meghalaya, India
Website: [MWire Labs](https://mwirelabs.com)
---
## License
This model is released under [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).
You are free to:
- **Share** — copy and redistribute the material
- **Adapt** — remix, transform, and build upon the material
Under the following terms:
- **Attribution** — You must give appropriate credit to MWire Labs
---
## Acknowledgments
We thank the Nagamese-speaking community for their contributions to corpus development and validation.