# GaroBERT
GaroBERT is a masked language model for the Garo language, developed by MWire Labs. It is built on XLM-RoBERTa-base and was further pre-trained on a cleaned corpus of 50,673 Garo sentences.
## Model Description
- Model Type: Masked Language Model (MLM)
- Base Model: xlm-roberta-base
- Language: Garo (Latin script)
- Parameters: 278M
- License: CC-BY-4.0
## Training Data
The model was trained on 50,673 Garo sentences (3.1M characters) primarily sourced from parallel corpus creation efforts by the MWire Labs team.
Data Cleaning Pipeline:
- Removed URLs, emails, and HTML tags
- Normalized whitespace and repeated characters
- Filtered sentences with fewer than 3 words or more than 512 words
- Removed exact duplicates
- Removed special artifacts (e.g., "--")
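The exact regexes and thresholds used by the MWire Labs team are not published, so the following is only a minimal sketch of the cleaning steps listed above:

```python
import re

def clean_corpus(sentences):
    """Minimal sketch of the cleaning pipeline described above (assumed regexes)."""
    seen = set()
    cleaned = []
    for s in sentences:
        s = re.sub(r"https?://\S+|www\.\S+", " ", s)  # strip URLs
        s = re.sub(r"\S+@\S+\.\S+", " ", s)           # strip emails
        s = re.sub(r"<[^>]+>", " ", s)                # strip HTML tags
        s = s.replace("--", " ")                      # drop "--" artifacts
        s = re.sub(r"(.)\1{3,}", r"\1\1\1", s)        # squash runs of 4+ repeated chars to 3
        s = re.sub(r"\s+", " ", s).strip()            # normalize whitespace
        n_words = len(s.split())
        if n_words < 3 or n_words > 512:              # length filter
            continue
        if s in seen:                                 # exact-duplicate removal
            continue
        seen.add(s)
        cleaned.append(s)
    return cleaned
```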
Data Split:
- Training: 48,139 sentences (95%)
- Evaluation: 2,534 sentences (5%)
## Training Details
- Hardware: NVIDIA A40 (48GB)
- Training Time: 1 hour 13 minutes
Hyperparameters:
- Epochs: 20
- Learning Rate: 1e-4
- Batch Size: 48 (per device)
- Gradient Accumulation Steps: 21 (effective batch size: 1,008)
- Max Sequence Length: 128
- MLM Probability: 0.15
- Warmup Ratio: 0.06
- Weight Decay: 0.01
- Optimizer: AdamW
- FP16: Enabled

Despite the aggressive learning rate, training remained stable: validation loss decreased consistently across epochs, and the best checkpoint was selected based on held-out evaluation loss.
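Given the 48,139 training sentences and the effective batch size of 1,008, the optimizer step counts implied by these hyperparameters work out as follows (a back-of-the-envelope sketch; actual counts may differ slightly depending on dataloader padding/drop behavior):

```python
import math

train_sentences = 48_139
per_device_batch = 48
grad_accum = 21
epochs = 20
warmup_ratio = 0.06

effective_batch = per_device_batch * grad_accum               # 1,008 sequences per optimizer step
steps_per_epoch = math.ceil(train_sentences / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```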
## Performance
Intrinsic Evaluation (MLM on held-out Garo test set):
| Model | Perplexity | Eval Loss |
|---|---|---|
| XLM-RoBERTa-base (zero-shot) | 678.40 | 6.52 |
| GaroBERT | 2.40 | 0.875 |
GaroBERT achieves roughly 282× lower perplexity than the zero-shot XLM-RoBERTa-base baseline, demonstrating strong language modeling capability for Garo.
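The perplexity figures are consistent with the reported evaluation losses, since MLM perplexity is the exponential of the mean token-level cross-entropy loss (the standard Hugging Face convention; assumed here):

```python
import math

baseline_loss = 6.52   # XLM-RoBERTa-base, zero-shot
garobert_loss = 0.875  # GaroBERT

baseline_ppl = math.exp(baseline_loss)   # ~678
garobert_ppl = math.exp(garobert_loss)   # ~2.40

# Ratio from the rounded losses is ~283; the table's rounded
# perplexities (678.40 / 2.40) give ~282.
print(round(baseline_ppl, 1), round(garobert_ppl, 2))
print(round(baseline_ppl / garobert_ppl))
```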
Tokenization Efficiency:
- Average tokens per word: 2.74
- Vocabulary coverage: ~100% (0% UNK tokens)
- Note: Uses XLM-RoBERTa's original tokenizer without modification
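Fertility (average subword tokens per whitespace word) can be measured as below. To keep the sketch self-contained it uses a toy splitter; with the real model you would pass `tokenizer.tokenize` instead:

```python
def avg_tokens_per_word(tokenize, sentences):
    """Corpus-level fertility: total subword tokens / total whitespace words."""
    total_tokens = 0
    total_words = 0
    for s in sentences:
        total_words += len(s.split())
        total_tokens += len(tokenize(s))
    return total_tokens / total_words

# Toy tokenizer (two-character chunks) purely for illustration;
# real counts come from the XLM-RoBERTa tokenizer.
toy_tokenize = lambda s: [w[i:i + 2] for w in s.split() for i in range(0, len(w), 2)]
print(avg_tokens_per_word(toy_tokenize, ["ia nokni rong"]))
```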
## Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained("MWirelabs/garobert")
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/garobert")

# Example: fill in the masked token
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
text = "ia nokni <mask> rong ong·a"
results = fill_mask(text)
print(results)
```
## Intended Use
Primary Applications:
- Sentiment analysis for Garo text
- Named Entity Recognition (NER)
- Text classification tasks
- Feature extraction for downstream NLP tasks
- Foundation model for Garo language processing
Limitations:
- Trained on ~50k sentences; performance may vary on domains not represented in the training data
- Uses the unmodified XLM-RoBERTa tokenizer (fertility of 2.74 tokens/word); a custom Garo tokenizer could improve efficiency
- Latin script only; does not support other writing systems
- Best suited for sentence-level tasks (maximum sequence length of 128 tokens)
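Because of the 128-token limit, longer inputs need to be truncated or windowed. A simple word-level chunker, using the reported fertility of 2.74 tokens/word as a budget heuristic (actual token counts should be verified with the tokenizer), might look like:

```python
MAX_TOKENS = 128
SPECIAL_TOKENS = 2   # <s> and </s>
FERTILITY = 2.74     # average subword tokens per word (from this card)

# Rough word budget per window, derived from the fertility heuristic.
WORDS_PER_CHUNK = int((MAX_TOKENS - SPECIAL_TOKENS) / FERTILITY)

def chunk_words(text, size=WORDS_PER_CHUNK):
    """Split text into whitespace-word windows expected to fit in 128 tokens."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
```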
## Fine-tuning
This model can be fine-tuned for various downstream tasks. For sequence classification:
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "MWirelabs/garobert",
    num_labels=2,  # adjust to the number of classes in your task
)
```
## Model Card Authors
MWire Labs Team
## Citation
If you use GaroBERT in your research, please cite:
```bibtex
@misc{garobert2025,
  author       = {MWire Labs},
  title        = {GaroBERT: A Masked Language Model for Garo},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MWirelabs/garobert}}
}
```
## Acknowledgments
We thank the Garo-speaking community for their continued support and contribution to language technology development for Northeast Indian languages.
## Contact
For questions or collaboration opportunities, please contact MWire Labs at [contact information].
Part of the MWire Labs Northeast Indian Languages Initiative