# KokborokBERT

KokborokBERT is the first publicly released masked language model for the Kokborok language. It was built via domain-adaptive fine-tuning of XLM-RoBERTa-base on a curated Kokborok corpus.
## Training Performance

The model was trained for 13 epochs on an NVIDIA A40 GPU.
| Metric | Baseline (XLM-R) | KokborokBERT |
|---|---|---|
| Masked LM Loss | 5.9831 | 1.7752 |
| Perplexity | 396.69 | 5.90 |
| Improvement | – | 67.2× lower perplexity |
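Perplexity is simply the exponential of the mean masked LM loss, so the table values can be verified directly from the reported losses (a quick sanity check, not part of the model code):

```python
import math

# Perplexity of a masked language model is exp(mean masked LM loss).
baseline_loss = 5.9831   # XLM-R base, from the table above
finetuned_loss = 1.7752  # KokborokBERT

baseline_ppl = math.exp(baseline_loss)
finetuned_ppl = math.exp(finetuned_loss)

print(f"Baseline perplexity:   {baseline_ppl:.2f}")   # ~396.69
print(f"Fine-tuned perplexity: {finetuned_ppl:.2f}")  # ~5.90
print(f"Reduction factor:      {baseline_ppl / finetuned_ppl:.1f}x")  # ~67.2x
```

The 67.2× figure is just the ratio of the two perplexities, equivalently `exp(5.9831 − 1.7752)`.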
## Usage

You can use this model directly with a pipeline for masked language modeling:
```python
from transformers import pipeline

# Load the fill-mask pipeline with KokborokBERT
mask_filler = pipeline("fill-mask", model="MWirelabs/kokborokbert")

test_text = "O kothar-no nwng jeni-hai-pha-no <mask> khlai-man-nai."
results = mask_filler(test_text)

for res in results:
    print(f"Score: {res['score']:.4f} | Prediction: {res['token_str']}")
```
## License
This model is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
## Limitations and Biases
- Domain Specificity: The model was trained on a specific corpus of ~391k tokens. Performance may vary when applied to dialects or specialized domains (medical, legal) not heavily represented in the training data.
- Base Model Inheritances: As a fine-tuned version of `xlm-roberta-base`, this model may inherit biases present in the original multilingual pre-training data.
- Task Limitation: This is an encoder-only masked language model. It is suited for tasks like NER, classification, and similarity, but is not intended for text generation (NLG).
## Citation
If you use this model in your research, please cite it as follows:
```bibtex
@misc{kokborokbert2026,
  author       = {MWire Labs},
  title        = {KokborokBERT: Domain-Adaptive Fine-Tuning of XLM-RoBERTa for Kokborok},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MWirelabs/kokborokbert}}
}
```