---
language:
- nag
license: cc-by-4.0
tags:
- bert
- roberta
- nagamese
- low-resource
- creole
- northeast-india
- token-classification
- fill-mask
datasets:
- agnivamaiti/naganlp-ner-annotated-corpus
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: NagameseBERT
  results:
  - task:
      type: token-classification
      name: Part-of-Speech Tagging
    dataset:
      name: NagaNLP Annotated Corpus
      type: agnivamaiti/naganlp-ner-annotated-corpus
    metrics:
    - type: accuracy
      value: 88.35
      name: Accuracy
    - type: f1
      value: 80.72
      name: F1 (macro)
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: NagaNLP Annotated Corpus
      type: agnivamaiti/naganlp-ner-annotated-corpus
    metrics:
    - type: accuracy
      value: 91.74
      name: Accuracy
    - type: f1
      value: 56.51
      name: F1 (macro)
---

# NagameseBERT

[![HuggingFace Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-yellow)](https://huggingface.co/MWirelabs/nagamesebert)
[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-blue.svg)](https://creativecommons.org/licenses/by/4.0/)
[![Language](https://img.shields.io/badge/Language-Nagamese-green)](https://en.wikipedia.org/wiki/Nagamese_Creole)

**A foundational BERT model for Nagamese Creole**: a compact, efficient language model for a low-resource Northeast Indian language.

---

## Overview

NagameseBERT is a 7M-parameter RoBERTa-style BERT model pre-trained on 42,552 Nagamese sentences. Despite being 15× smaller than multilingual models like mBERT (110M) and XLM-RoBERTa (125M), it stays within about 4 to 7 percentage points of their token-level accuracy on downstream POS tagging and NER while offering significant efficiency advantages.

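The 7M figure can be sanity-checked by hand from the architecture listed under Model Details. The sketch below is a rough count, not the authors' code: `count_mlm_parameters` is a hypothetical helper, and it assumes a RoBERTa-style masked-LM layout with 66 position embeddings (64 tokens plus the usual padding offset of 2), a single-row token-type table, an output projection tied to the word embeddings, and no pooler. None of these details are stated in this card.

```python
def count_mlm_parameters(vocab=8000, hidden=256, layers=6,
                         intermediate=1024, positions=66, token_types=1):
    """Rough parameter count for a RoBERTa-style masked LM (hypothetical helper).

    Assumed, not stated in the model card: positions=66, token_types=1,
    tied output projection, no pooler.
    """
    layer_norm = 2 * hidden  # LayerNorm weight + bias

    # Embeddings: word, position, and token-type tables plus one LayerNorm.
    embeddings = (vocab + positions + token_types) * hidden + layer_norm

    # Self-attention: Q, K, V, and output projections (weights + biases),
    # followed by a LayerNorm. The head count does not change the total.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward: up-projection, down-projection, then a LayerNorm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )

    encoder = layers * (attention + feed_forward)

    # MLM head: dense layer, LayerNorm, and a per-token output bias.
    # The vocabulary projection itself is assumed to reuse the word embeddings.
    mlm_head = (hidden * hidden + hidden) + layer_norm + vocab

    return embeddings + encoder + mlm_head


total = count_mlm_parameters()
print(f"{total:,}")  # prints 6,878,528
```

Under these assumptions the count matches the reported total of 6,878,528 parameters, and dividing the 110M quoted for mBERT by it gives a ratio of about 16.
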

**Key Features:**
- **Compact**: 6.9M parameters (15× smaller than mBERT)
- **Efficient**: Pre-trained in 35 minutes on a single A40 GPU
- **Custom tokenizer**: 8K BPE vocabulary optimized for Nagamese
- **Rigorous evaluation**: Multi-seed testing (n=3) with reproducible results
- **Open**: Model, code, and data splits publicly available

---

## Performance

Multi-seed evaluation results (mean ± std, n=3):

| Model | Parameters | POS Accuracy | POS F1 | NER Accuracy | NER F1 |
|-------|-----------|--------------|--------|--------------|--------|
| **NagameseBERT** | **7M** | **88.35 ± 0.71%** | **0.807 ± 0.013** | **91.74 ± 0.68%** | **0.565 ± 0.054** |
| mBERT | 110M | 95.14 ± 0.47% | 0.916 ± 0.008 | 96.11 ± 0.72% | 0.750 ± 0.064 |
| XLM-RoBERTa | 125M | 95.64 ± 0.56% | 0.919 ± 0.008 | 96.38 ± 0.26% | 0.819 ± 0.066 |

**Trade-off**: About 7 percentage points lower POS accuracy and 4-5 points lower NER accuracy than the multilingual baselines, with wider gaps in macro F1, in exchange for a 15× parameter reduction that enables resource-constrained deployment.

---

## Model Details

### Architecture
- **Type**: RoBERTa-style BERT (no token type embeddings)
- **Hidden size**: 256
- **Layers**: 6 transformer blocks
- **Attention heads**: 4 per layer
- **Intermediate size**: 1,024
- **Max sequence length**: 64 tokens
- **Total parameters**: 6,878,528

### Tokenizer
- **Type**: Byte-Pair Encoding (BPE)
- **Vocabulary size**: 8,000 tokens
- **Special tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Normalization**: NFD Unicode + accent stripping
- **Case**: Preserved (for proper nouns and code-switched English)

### Training Data
- **Corpus size**: 42,552 Nagamese sentences
- **Average length**: 11.82 tokens/sentence
- **Split**: 90% train (38,296) / 10% validation (4,256)
- **Sources**: Web, social media, community contributions (deduplicated)

### Pre-training
- **Objective**: Masked Language Modeling (15% masking)
- **Optimizer**: AdamW (lr=5e-4, weight_decay=0.01)
- **Batch size**: 64
- **Epochs**: 50
- **Training time**: ~35 minutes
- **Hardware**: NVIDIA A40 (48GB)
- **Final validation loss**: 2.79

---

## Usage

### Load Model and Tokenizer

```python
from transformers import AutoTokenizer, AutoModel

model_name = "MWirelabs/nagamesebert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Example usage: truncate to the model's 64-token maximum sequence length
text = "Toi moi laga sathi hobo pare?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
outputs = model(**inputs)

# Contextual embeddings, shape (batch, sequence_length, 256)
embeddings = outputs.last_hidden_state
```

### Fine-tuning for Token Classification

```python
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# Number of output labels, e.g. 13 for the POS tag set used in the evaluation
num_labels = 13

# Load model with classification head
model = AutoModelForTokenClassification.from_pretrained(
    "MWirelabs/nagamesebert",
    num_labels=num_labels
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=100,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01
)

# Train: train_dataset and eval_dataset are placeholders for your own
# tokenized datasets with labels aligned to the subword tokens
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
```

---

## Evaluation

### Dataset
- **Source**: [NagaNLP Annotated Corpus](https://huggingface.co/datasets/agnivamaiti/naganlp-ner-annotated-corpus)
- **Total**: 214 sentences
- **Split** (seed=42): 171 train / 21 dev / 22 test (80/10/10)
- **POS tags**: 13 Universal Dependencies tags
- **NER tags**: 4 entity types (PER, LOC, ORG, MISC) in IOB2 format

### Experimental Setup
- **Seeds**: 42, 123, 456 (n=3 for variance estimation)
- **Batch size**: 32
- **Learning rate**: 3e-5
- **Epochs**: 100
- **Optimization**: AdamW with 100 warmup steps
- **Hardware**: NVIDIA A40
- **Metrics**: Token-level accuracy and macro-averaged F1

**Data Leakage Statement**: All splits were created with a fixed seed (42), with no sentence overlap between the train, dev, and test sets.

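
A fixed-seed 80/10/10 split with a leakage check can be sketched in a few lines of plain Python. This is an illustration only, not the authors' splitting code: `split_sentences` and `assert_no_overlap` are hypothetical helpers, the dummy strings stand in for the 214 annotated sentences, and the actual shuffling procedure used for the published splits may differ.

```python
import random


def split_sentences(sentences, seed=42):
    """Shuffle once with a fixed seed, then cut into 80/10/10 (hypothetical helper)."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # local RNG, so global state is untouched
    n_train = len(items) * 8 // 10      # 80% for training, rounded down
    n_dev = len(items) // 10            # 10% for development, rounded down
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]      # remainder goes to the test set
    return train, dev, test


def assert_no_overlap(train, dev, test):
    """Raise AssertionError if any sentence appears in more than one split."""
    train_set, dev_set, test_set = set(train), set(dev), set(test)
    assert not (train_set & dev_set), "train/dev overlap"
    assert not (train_set & test_set), "train/test overlap"
    assert not (dev_set & test_set), "dev/test overlap"


# Placeholder data: 214 distinct dummy sentences standing in for the corpus.
corpus = [f"sentence {i}" for i in range(214)]
train, dev, test = split_sentences(corpus)
assert_no_overlap(train, dev, test)
print(len(train), len(dev), len(test))  # prints 171 21 22
```

Because the check compares sentence strings, duplicate sentences in the source corpus would need to be removed before splitting for it to pass.
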

---

## Limitations

- **Corpus size**: 42K sentences is modest; expansion to 100K+ could improve performance
- **Evaluation scale**: Small test set (22 sentences) limits statistical power
- **Task scope**: Only evaluated on token classification; needs broader task assessment
- **Efficiency metrics**: No quantitative inference benchmarks (latency, memory) yet provided
- **Data documentation**: Complete data provenance and licenses to be formalized

---

## Citation

If you use NagameseBERT in your research, please cite:

```bibtex
@misc{nagamesebert2025,
  title={Bootstrapping BERT for Nagamese: A Low-Resource Creole Language},
  author={MWire Labs},
  year={2025},
  url={https://huggingface.co/MWirelabs/nagamesebert}
}
```

---

## Contact

**MWire Labs**

Shillong, Meghalaya, India

Website: [MWire Labs](https://mwirelabs.com)

---

## License

This model is released under [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

You are free to:
- **Share**: copy and redistribute the material
- **Adapt**: remix, transform, and build upon the material

Under the following terms:
- **Attribution**: You must give appropriate credit to MWire Labs

---

## Acknowledgments

We thank the Nagamese-speaking community for their contributions to corpus development and validation.