# GaroBERT
GaroBERT is a masked language model for the Garo language, developed by MWire Labs. It is built on XLM-RoBERTa-base and was further pre-trained on a cleaned corpus of 50,673 Garo sentences.
## Model Description
- Model Type: Masked Language Model (MLM)
- Base Model: xlm-roberta-base
- Language: Garo (Latin script)
- Parameters: 278M
- License: CC-BY-4.0
## Training Data
The model was trained on 50,673 Garo sentences (3.1M characters) primarily sourced from parallel corpus creation efforts by the MWire Labs team.
Data Cleaning Pipeline:
- Removed URLs, emails, and HTML tags
- Normalized whitespace and repeated characters
- Filtered sentences with fewer than 3 words or more than 512 words
- Removed exact duplicates
- Removed special artifacts (e.g., "--")
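The exact regexes and thresholds used by the MWire Labs team are not published, so the following is only a minimal sketch of the cleaning steps listed above:

```python
import re

def clean_corpus(sentences):
    """Minimal sketch of the cleaning pipeline described above (assumed regexes)."""
    seen = set()
    cleaned = []
    for s in sentences:
        s = re.sub(r"https?://\S+|www\.\S+", " ", s)  # strip URLs
        s = re.sub(r"\S+@\S+\.\S+", " ", s)           # strip emails
        s = re.sub(r"<[^>]+>", " ", s)                # strip HTML tags
        s = s.replace("--", " ")                      # drop "--" artifacts
        s = re.sub(r"(.)\1{3,}", r"\1\1\1", s)        # squash runs of 4+ repeated chars to 3
        s = re.sub(r"\s+", " ", s).strip()            # normalize whitespace
        n_words = len(s.split())
        if n_words < 3 or n_words > 512:              # length filter
            continue
        if s in seen:                                 # exact-duplicate removal
            continue
        seen.add(s)
        cleaned.append(s)
    return cleaned
```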
Data Split:
- Training: 48,139 sentences (95%)
- Evaluation: 2,534 sentences (5%)
## Training Details
- Hardware: NVIDIA A40 (48GB)
- Training Time: 1 hour 13 minutes
Hyperparameters:
- Epochs: 20
- Learning Rate: 1e-4
- Batch Size: 48 (per device)
- Gradient Accumulation Steps: 21 (effective batch size: 1,008)
- Max Sequence Length: 128
- MLM Probability: 0.15
- Warmup Ratio: 0.06
- Weight Decay: 0.01
- Optimizer: AdamW
- FP16: Enabled

Despite the aggressive learning rate, training remained stable: validation loss decreased consistently across epochs, and the best checkpoint was selected based on held-out evaluation loss.
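Given the 48,139 training sentences and the effective batch size of 1,008, the optimizer step counts implied by these hyperparameters work out as follows (a back-of-the-envelope sketch; actual counts may differ slightly depending on dataloader padding/drop behavior):

```python
import math

train_sentences = 48_139
per_device_batch = 48
grad_accum = 21
epochs = 20
warmup_ratio = 0.06

effective_batch = per_device_batch * grad_accum               # 1,008 sequences per optimizer step
steps_per_epoch = math.ceil(train_sentences / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```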
## Performance
Intrinsic Evaluation (MLM on held-out Garo test set):
| Model | Perplexity | Eval Loss |
|---|---|---|
| XLM-RoBERTa-base (zero-shot) | 678.40 | 6.52 |
| GaroBERT | 2.40 | 0.875 |
GaroBERT achieves roughly 282× lower perplexity than the zero-shot XLM-RoBERTa-base baseline, demonstrating strong language modeling capability for Garo.
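The perplexity figures are consistent with the reported evaluation losses, since MLM perplexity is the exponential of the mean token-level cross-entropy loss (the standard Hugging Face convention; assumed here):

```python
import math

baseline_loss = 6.52   # XLM-RoBERTa-base, zero-shot
garobert_loss = 0.875  # GaroBERT

baseline_ppl = math.exp(baseline_loss)   # ~678
garobert_ppl = math.exp(garobert_loss)   # ~2.40

# Ratio from the rounded losses is ~283; the table's rounded
# perplexities (678.40 / 2.40) give ~282.
print(round(baseline_ppl, 1), round(garobert_ppl, 2))
print(round(baseline_ppl / garobert_ppl))
```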
Tokenization Efficiency:
- Average tokens per word: 2.74
- Vocabulary coverage: ~100% (0% UNK tokens)
- Note: Uses XLM-RoBERTa's original tokenizer without modification
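Fertility (average subword tokens per whitespace word) can be measured as below. To keep the sketch self-contained it uses a toy splitter; with the real model you would pass `tokenizer.tokenize` instead:

```python
def avg_tokens_per_word(tokenize, sentences):
    """Corpus-level fertility: total subword tokens / total whitespace words."""
    total_tokens = 0
    total_words = 0
    for s in sentences:
        total_words += len(s.split())
        total_tokens += len(tokenize(s))
    return total_tokens / total_words

# Toy tokenizer (two-character chunks) purely for illustration;
# real counts come from the XLM-RoBERTa tokenizer.
toy_tokenize = lambda s: [w[i:i + 2] for w in s.split() for i in range(0, len(w), 2)]
print(avg_tokens_per_word(toy_tokenize, ["ia nokni rong"]))
```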
## Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained("MWirelabs/garobert")
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/garobert")

# Example: fill in the masked token
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
text = "ia nokni <mask> rong ong·a"
results = fill_mask(text)
print(results)
```
## Intended Use
Primary Applications:
- Sentiment analysis for Garo text
- Named Entity Recognition (NER)
- Text classification tasks
- Feature extraction for downstream NLP tasks
- Foundation model for Garo language processing
Limitations:
- Trained on ~50k sentences; performance may vary on domains not represented in the training data
- Uses the unmodified XLM-RoBERTa tokenizer (fertility of 2.74 tokens/word); a custom Garo tokenizer could improve efficiency
- Latin script only; does not support other writing systems
- Best suited for sentence-level tasks (maximum sequence length of 128 tokens)
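Because of the 128-token limit, longer inputs need to be truncated or windowed. A simple word-level chunker, using the reported fertility of 2.74 tokens/word as a budget heuristic (actual token counts should be verified with the tokenizer), might look like:

```python
MAX_TOKENS = 128
SPECIAL_TOKENS = 2   # <s> and </s>
FERTILITY = 2.74     # average subword tokens per word (from this card)

# Rough word budget per window, derived from the fertility heuristic.
WORDS_PER_CHUNK = int((MAX_TOKENS - SPECIAL_TOKENS) / FERTILITY)

def chunk_words(text, size=WORDS_PER_CHUNK):
    """Split text into whitespace-word windows expected to fit in 128 tokens."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
```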
## Fine-tuning
This model can be fine-tuned for various downstream tasks. For sequence classification:
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "MWirelabs/garobert",
    num_labels=2,  # adjust to the number of classes in your task
)
```
## Model Card Authors
MWire Labs Team
## Citation
If you use GaroBERT in your research, please cite:
```bibtex
@misc{garobert2025,
  author       = {MWire Labs},
  title        = {GaroBERT: A Masked Language Model for Garo},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MWirelabs/garobert}}
}
```
## Acknowledgments
We thank the Garo-speaking community for their continued support and contribution to language technology development for Northeast Indian languages.
## Contact
For questions or collaboration opportunities, please contact MWire Labs at [contact information].
Part of the MWire Labs Northeast Indian Languages Initiative