GaroBERT

GaroBERT is a masked language model for the Garo language, developed by MWire Labs. It is built on XLM-RoBERTa-base and was further pre-trained on a cleaned corpus of 50,673 Garo sentences.

Model Description

  • Model Type: Masked Language Model (MLM)
  • Base Model: xlm-roberta-base
  • Language: Garo (Latin script)
  • Parameters: 278M
  • License: CC-BY-4.0

Training Data

The model was trained on 50,673 Garo sentences (3.1M characters) primarily sourced from parallel corpus creation efforts by the MWire Labs team.

Data Cleaning Pipeline:

  • Removed URLs, emails, and HTML tags
  • Normalized whitespace and repeated characters
  • Filtered sentences with fewer than 3 words or more than 512 words
  • Removed exact duplicates
  • Removed special artifacts (e.g., --)
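The cleaning steps above can be sketched as a small Python pipeline. This is an illustrative reconstruction, not the actual MWire Labs code; the exact regular expressions and replacement order used in practice are not published.

```python
import re

def clean_sentence(text):
    """Apply cleaning steps similar to those described above (a sketch)."""
    # Remove URLs and e-mail addresses
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+\.\S+", " ", text)
    # Strip HTML tags
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse characters repeated three or more times (e.g. "aaaa" -> "aa")
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Remove artifacts such as "--"
    text = text.replace("--", " ")
    # Normalize whitespace
    return re.sub(r"\s+", " ", text).strip()

def keep(sentence, min_words=3, max_words=512):
    """Length filter: keep sentences with 3..512 whitespace words."""
    return min_words <= len(sentence.split()) <= max_words

raw = ["ia nokni   bak <b>rong</b> ong·a -- http://example.com", "too short"]
cleaned = [clean_sentence(s) for s in raw]
# dict.fromkeys removes exact duplicates while preserving order
corpus = [s for s in dict.fromkeys(cleaned) if keep(s)]
print(corpus)
```

Note that exact-duplicate removal runs after normalization here, so sentences differing only in whitespace collapse to one entry.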

Data Split:

  • Training: 48,139 sentences (95%)
  • Evaluation: 2,534 sentences (5%)
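The split sizes follow directly from a 95/5 partition of the 50,673-sentence corpus (the seed and shuffling procedure for the actual split are not published):

```python
# Reproduce the reported 95/5 split sizes
total = 50_673
train_size = int(total * 0.95)   # truncates 48,139.35 to 48,139
eval_size = total - train_size   # remaining 2,534 sentences
print(train_size, eval_size)
```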

Training Details

Hardware: NVIDIA A40 (48GB)

Training Time: 1 hour 13 minutes

Hyperparameters:

  • Epochs: 20
  • Learning Rate: 1e-4
  • Batch Size: 48 (per device)
  • Gradient Accumulation Steps: 21 (effective batch size: 1,008)
  • Max Sequence Length: 128
  • MLM Probability: 0.15
  • Warmup Ratio: 0.06
  • Weight Decay: 0.01
  • Optimizer: AdamW
  • FP16: Enabled

Despite the relatively aggressive learning rate (1e-4), training remained stable: validation loss decreased consistently across epochs, and the best checkpoint was selected based on held-out evaluation loss.
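The hyperparameters above map onto a standard Hugging Face Trainer setup roughly as follows. This is a configuration sketch, not the actual training script: output_dir is a placeholder, train_dataset and eval_dataset are assumed to be pre-tokenized Garo sentences (max_length=128), and argument names may differ slightly across transformers versions (e.g. evaluation_strategy vs. eval_strategy).

```python
from transformers import (
    AutoModelForMaskedLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

args = TrainingArguments(
    output_dir="garobert-mlm",          # placeholder
    num_train_epochs=20,
    learning_rate=1e-4,
    per_device_train_batch_size=48,
    gradient_accumulation_steps=21,     # effective batch size 48 * 21 = 1,008
    warmup_ratio=0.06,
    weight_decay=0.01,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # best checkpoint by held-out loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# Dynamic masking with the stated 15% MLM probability
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,        # assumed: tokenized Garo training split
    eval_dataset=eval_dataset,          # assumed: tokenized 5% evaluation split
)
trainer.train()
```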

Performance

Intrinsic Evaluation (MLM on held-out Garo test set):

  Model                          Perplexity   Eval Loss
  XLM-RoBERTa-base (zero-shot)   678.40       6.52
  GaroBERT                       2.40         0.875

GaroBERT achieves roughly 282× lower perplexity than the zero-shot XLM-RoBERTa-base baseline, demonstrating strong language modeling capability for Garo.
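The perplexity column follows directly from the evaluation loss: MLM perplexity is the exponential of the mean cross-entropy loss, so the table's figures can be recovered (up to rounding of the reported losses) as:

```python
import math

# Perplexity = exp(mean cross-entropy loss)
for name, loss in [("XLM-RoBERTa-base (zero-shot)", 6.52), ("GaroBERT", 0.875)]:
    print(f"{name}: perplexity = {math.exp(loss):.2f}")
```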

Tokenization Efficiency:

  • Average tokens per word: 2.74
  • Vocabulary coverage: ~100% (0% UNK tokens)
  • Note: Uses XLM-RoBERTa's original tokenizer without modification
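The fertility rate (tokens per word) quoted above can be measured with a small helper like the one below. This is a generic sketch: with the real model you would pass `tokenizer.tokenize`; here a toy two-character splitter stands in so the example runs without downloading the model.

```python
def fertility(sentences, tokenize):
    """Average subword tokens per whitespace-delimited word
    (the metric reported as 2.74 for GaroBERT's tokenizer)."""
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

# Stand-in tokenizer: splits each word into 2-character pieces
toy = lambda s: [w[i:i + 2] for w in s.split() for i in range(0, len(w), 2)]
print(fertility(["ia nokni rong"], toy))  # 6 tokens / 3 words = 2.0
```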

Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained("MWirelabs/garobert")
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/garobert")

# Example: fill-mask — predict the masked token in a Garo sentence
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

text = "ia nokni <mask> rong ong·a"
results = fill_mask(text)
print(results)

Intended Use

Primary Applications:

  • Sentiment analysis for Garo text
  • Named Entity Recognition (NER)
  • Text classification tasks
  • Feature extraction for downstream NLP tasks
  • Foundation model for Garo language processing

Limitations:

  • Trained on only ~50k sentences; performance may vary on domains not represented in the training data
  • Uses the XLM-RoBERTa tokenizer with a fertility rate of 2.74 tokens per word; a custom Garo tokenizer could potentially improve efficiency
  • Latin script only - does not support other writing systems
  • Best suited for sentence-level tasks (max 128 tokens)

Fine-tuning

This model can be fine-tuned for various downstream tasks. For sequence classification:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "MWirelabs/garobert",
    num_labels=2  # Adjust based on your task
)
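A minimal end-to-end fine-tuning sketch, building on the snippet above, might look like the following. The dataset, labels, and output_dir are placeholders; real use requires a labeled Garo dataset.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification, AutoTokenizer,
    Trainer, TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/garobert")
model = AutoModelForSequenceClassification.from_pretrained(
    "MWirelabs/garobert", num_labels=2
)

# Placeholder data; replace with your labeled Garo sentences
ds = Dataset.from_dict({"text": ["ia nokni rong ong·a"], "label": [0]})
# Match the pre-training context window of 128 tokens
ds = ds.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="garobert-cls", num_train_epochs=3),
    train_dataset=ds,
)
trainer.train()
```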

Model Card Authors

MWire Labs Team

Citation

If you use GaroBERT in your research, please cite:

@misc{garobert2025,
  author = {MWire Labs},
  title = {GaroBERT: A Masked Language Model for Garo},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MWirelabs/garobert}}
}

Acknowledgments

We thank the Garo-speaking community for their continued support and contribution to language technology development for Northeast Indian languages.

Contact

For questions or collaboration opportunities, please contact MWire Labs at [contact information].


Part of the MWire Labs Northeast Indian Languages Initiative
