---
language:
- nag
license: cc-by-4.0
tags:
- bert
- roberta
- nagamese
- low-resource
- creole
- northeast-india
- token-classification
- fill-mask
datasets:
- agnivamaiti/naganlp-ner-annotated-corpus
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: NagameseBERT
  results:
  - task:
      type: token-classification
      name: Part-of-Speech Tagging
    dataset:
      name: NagaNLP Annotated Corpus
      type: agnivamaiti/naganlp-ner-annotated-corpus
    metrics:
    - type: accuracy
      value: 88.35
      name: Accuracy
    - type: f1
      value: 80.72
      name: F1 (macro)
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: NagaNLP Annotated Corpus
      type: agnivamaiti/naganlp-ner-annotated-corpus
    metrics:
    - type: accuracy
      value: 91.74
      name: Accuracy
    - type: f1
      value: 56.51
      name: F1 (macro)
---

# NagameseBERT

[Model on Hugging Face](https://huggingface.co/MWirelabs/nagamesebert) | [License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) | [Language: Nagamese Creole](https://en.wikipedia.org/wiki/Nagamese_Creole)

**A Foundational BERT model for Nagamese Creole**: a compact, efficient language model for a low-resource Northeast Indian language.

---

## Overview

NagameseBERT is a 7M-parameter, RoBERTa-style BERT model pre-trained on 42,552 Nagamese sentences. Despite being roughly 15× smaller than multilingual models such as mBERT (110M) and XLM-RoBERTa (125M), it achieves competitive performance on downstream NLP tasks while offering significant efficiency advantages.

**Key Features:**

- **Compact**: 6.9M parameters (15× smaller than mBERT)
- **Efficient**: Pre-trained in 35 minutes on a single A40 GPU
- **Custom tokenizer**: 8K BPE vocabulary optimized for Nagamese
- **Rigorous evaluation**: Multi-seed testing (n=3) with reproducible results
- **Open**: Model, code, and data splits publicly available

---

## Performance

Multi-seed evaluation results (mean ± std, n=3):

| Model | Parameters | POS Accuracy | POS F1 | NER Accuracy | NER F1 |
|-------|-----------|--------------|--------|--------------|--------|
| **NagameseBERT** | **7M** | **88.35 ± 0.71%** | **0.807 ± 0.013** | **91.74 ± 0.68%** | **0.565 ± 0.054** |
| mBERT | 110M | 95.14 ± 0.47% | 0.916 ± 0.008 | 96.11 ± 0.72% | 0.750 ± 0.064 |
| XLM-RoBERTa | 125M | 95.64 ± 0.56% | 0.919 ± 0.008 | 96.38 ± 0.26% | 0.819 ± 0.066 |

**Trade-off**: NagameseBERT gives up 6-7 percentage points of accuracy in exchange for a 15× reduction in parameters, enabling deployment in resource-constrained settings.

---

## Model Details

### Architecture

- **Type**: RoBERTa-style BERT (no token type embeddings)
- **Hidden size**: 256
- **Layers**: 6 transformer blocks
- **Attention heads**: 4 per layer
- **Intermediate size**: 1,024
- **Max sequence length**: 64 tokens
- **Total parameters**: 6,878,528
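
For concreteness, the listed hyperparameters map onto a standard `RobertaConfig`. This is an illustrative sketch, not the released training code; the position-embedding offset of 2 follows RoBERTa's padding-index convention and is an assumption here:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Hyperparameters as listed above; values not stated on this card
# (e.g. the position-embedding offset) are assumptions.
config = RobertaConfig(
    vocab_size=8000,                  # 8K BPE vocabulary
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,
    max_position_embeddings=64 + 2,   # 64 tokens + RoBERTa's 2 reserved positions
    type_vocab_size=1,                # no token type embeddings
)

model = RobertaForMaskedLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,}")  # ~6.9M parameters
```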

### Tokenizer

- **Type**: Byte-Pair Encoding (BPE)
- **Vocabulary size**: 8,000 tokens
- **Special tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Normalization**: NFD Unicode + accent stripping
- **Case**: Preserved (for proper nouns and code-switched English)
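
To see the tokenizer in action, you can inspect it directly once downloaded; the sentence below reuses the example from the Usage section, and the exact subword splits depend on the learned vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/nagamesebert")

print(tokenizer.vocab_size)   # 8000
print(tokenizer.mask_token)   # [MASK]
print(tokenizer.tokenize("Toi moi laga sathi hobo pare?"))
```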

### Training Data

- **Corpus size**: 42,552 Nagamese sentences
- **Average length**: 11.82 tokens/sentence
- **Split**: 90% train (38,296) / 10% validation (4,256)
- **Sources**: Web, social media, community contributions (deduplicated)

### Pre-training

- **Objective**: Masked Language Modeling (15% masking)
- **Optimizer**: AdamW (lr=5e-4, weight_decay=0.01)
- **Batch size**: 64
- **Epochs**: 50
- **Training time**: ~35 minutes
- **Hardware**: NVIDIA A40 (48 GB)
- **Final validation loss**: 2.79
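
The recipe above corresponds to a standard `transformers` MLM setup. A minimal sketch, assuming `tokenizer` and `model` are defined as in the previous sections and `train_ds`/`val_ds` are tokenized versions of the 38,296/4,256-sentence split (the released training script may differ):

```python
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Randomly masks 15% of tokens on the fly, as described above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="nagamesebert-pretrain",
    num_train_epochs=50,
    per_device_train_batch_size=64,
    learning_rate=5e-4,
    weight_decay=0.01,
)

Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_ds,
    eval_dataset=val_ds,
).train()
```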

---

## Usage

### Load Model and Tokenizer

```python
from transformers import AutoTokenizer, AutoModel

model_name = "MWirelabs/nagamesebert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a Nagamese sentence and extract contextual embeddings
text = "Toi moi laga sathi hobo pare?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
```
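
Because pre-training used masked language modeling, the checkpoint can also be queried directly through the fill-mask pipeline; the example sentence and its predictions are purely illustrative:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="MWirelabs/nagamesebert")

# The tokenizer's mask token is [MASK]
for pred in fill("Toi moi laga [MASK] hobo pare?"):
    print(pred["token_str"], round(pred["score"], 3))
```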

### Fine-tuning for Token Classification

```python
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# Load the pre-trained encoder with a fresh token-classification head.
# Set num_labels to the size of your tag set (e.g. 13 for the POS task below).
model = AutoModelForTokenClassification.from_pretrained(
    "MWirelabs/nagamesebert",
    num_labels=num_labels
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=100,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01
)

# Train; train_dataset and eval_dataset must be tokenized datasets with
# labels aligned to BPE subwords (see the alignment sketch below).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
```
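
The `train_dataset` and `eval_dataset` above must carry one label per subword, not per word. A common alignment recipe, sketched below, labels only the first subword of each word and masks the rest with `-100` so they are ignored by the loss; the column names `tokens` and `tags` are assumptions about the dataset schema:

```python
def tokenize_and_align(examples, tokenizer):
    """Tokenize pre-split words and align word-level tags to BPE subwords."""
    enc = tokenizer(
        examples["tokens"],            # assumed: lists of words per sentence
        is_split_into_words=True,
        truncation=True,
        max_length=64,                 # the model's max sequence length
    )
    all_labels = []
    for i, tags in enumerate(examples["tags"]):  # assumed: word-level tag ids
        labels, prev = [], None
        for wid in enc.word_ids(batch_index=i):
            if wid is None:
                labels.append(-100)       # special tokens: ignored by the loss
            elif wid != prev:
                labels.append(tags[wid])  # first subword carries the word's tag
            else:
                labels.append(-100)       # remaining subwords: ignored
            prev = wid
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc
```

Apply it with `dataset.map(lambda ex: tokenize_and_align(ex, tokenizer), batched=True)` before passing the splits to the `Trainer`.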

---

## Evaluation

### Dataset

- **Source**: [NagaNLP Annotated Corpus](https://huggingface.co/datasets/agnivamaiti/naganlp-ner-annotated-corpus)
- **Total**: 214 sentences
- **Split** (seed=42): 171 train / 21 dev / 22 test (80/10/10)
- **POS tags**: 13 Universal Dependencies tags
- **NER tags**: 4 entity types (PER, LOC, ORG, MISC) in IOB2 format
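
A sketch of loading the corpus and reproducing an 80/10/10 split with seed 42; the available split names and column layout of the dataset are assumptions, so check the dataset card:

```python
from datasets import load_dataset

# Assumes the corpus is published as a single "train" split.
ds = load_dataset("agnivamaiti/naganlp-ner-annotated-corpus", split="train")

# 80/10/10 train/dev/test with a fixed seed, as described above.
tmp = ds.train_test_split(test_size=0.2, seed=42)
dev_test = tmp["test"].train_test_split(test_size=0.5, seed=42)
train_ds, dev_ds, test_ds = tmp["train"], dev_test["train"], dev_test["test"]
```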

### Experimental Setup

- **Seeds**: 42, 123, 456 (n=3 for variance estimation)
- **Batch size**: 32
- **Learning rate**: 3e-5
- **Epochs**: 100
- **Optimization**: AdamW with 100 warmup steps
- **Hardware**: NVIDIA A40
- **Metrics**: Token-level accuracy and macro-averaged F1
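
Token-level accuracy and macro F1 can be computed with a `compute_metrics` function along these lines; this is a sketch using scikit-learn, not necessarily the evaluation code behind the reported numbers:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Keep only real tokens; -100 marks special/padding/subword positions.
    mask = labels != -100
    y_true, y_pred = labels[mask], preds[mask]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
    }
```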

**Data Leakage Statement**: All splits were created with a fixed seed (42); there is no sentence overlap between the train, dev, and test sets.

---

## Limitations

- **Corpus size**: 42K sentences is modest; expanding to 100K+ could improve performance
- **Evaluation scale**: The small test set (22 sentences) limits statistical power
- **Task scope**: Evaluated only on token classification; broader task coverage is needed
- **Efficiency metrics**: No quantitative inference benchmarks (latency, memory) provided yet
- **Data documentation**: Complete data provenance and licensing remain to be formalized

---

## Citation

If you use NagameseBERT in your research, please cite:

```bibtex
@misc{nagamesebert2025,
  title={Bootstrapping BERT for Nagamese: A Low-Resource Creole Language},
  author={MWire Labs},
  year={2025},
  url={https://huggingface.co/MWirelabs/nagamesebert}
}
```

---

## Contact

**MWire Labs**

Shillong, Meghalaya, India

Website: [MWire Labs](https://mwirelabs.com)

---

## License

This model is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license.

You are free to:

- **Share** — copy and redistribute the material
- **Adapt** — remix, transform, and build upon the material

Under the following terms:

- **Attribution** — you must give appropriate credit to MWire Labs

---

## Acknowledgments

We thank the Nagamese-speaking community for their contributions to corpus development and validation.