Badnyal committed
Commit c190304 (verified)
Parent: 22dccb1

Update README.md

Files changed (1):
  1. README.md (+128, -158)

README.md CHANGED
@@ -1,199 +1,169 @@

*(Removed: the auto-generated default Hugging Face Transformers model card template, consisting of placeholder sections (Model Details, Uses, Bias/Risks and Limitations, Training Details, Evaluation, Environmental Impact, Technical Specifications, Citation, Glossary, Model Card Authors, Model Card Contact) with every field marked "[More Information Needed]".)*
---
language:
- grt
license: cc-by-4.0
tags:
- garo
- masked-lm
- bert
- low-resource
- northeast-india
- meghalaya
- a'chik
datasets:
- custom
metrics:
- perplexity
model-index:
- name: garobert
  results:
  - task:
      type: fill-mask
      name: Masked Language Modeling
    metrics:
    - type: perplexity
      value: 2.40
      name: Perplexity
    - type: loss
      value: 0.875
      name: Eval Loss
---

# GaroBERT

GaroBERT is a masked language model for the Garo language, developed by [MWire Labs](https://mwirelabs.com). It is built on XLM-RoBERTa-base and further pre-trained on a cleaned corpus of 50,673 Garo sentences.

## Model Description

- **Model Type:** Masked Language Model (MLM)
- **Base Model:** xlm-roberta-base
- **Language:** Garo (Latin script)
- **Parameters:** 278M
- **License:** CC-BY-4.0

## Training Data

The model was trained on 50,673 Garo sentences (3.1M characters), sourced primarily from parallel-corpus creation efforts by the MWire Labs team.

**Data Cleaning Pipeline** (a code sketch of these steps follows below):
- Removed URLs, emails, and HTML tags
- Normalized whitespace and repeated characters
- Filtered out sentences with fewer than 3 words or more than 512 words
- Removed exact duplicates
- Removed special artifacts (e.g., `--`)
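
A minimal, hypothetical sketch of such a cleaning pass is shown below; the exact regular expressions and thresholds used by the authors are not published, so the specifics here are illustrative only.

```python
import re

def clean_corpus(sentences):
    """Illustrative cleaning pass: URL/email/HTML removal, whitespace and
    repeated-character normalization, length filtering, deduplication."""
    seen = set()
    cleaned = []
    for text in sentences:
        # Strip URLs, e-mail addresses, and HTML tags (assumed patterns).
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)
        text = re.sub(r"\S+@\S+\.\S+", " ", text)
        text = re.sub(r"<[^>]+>", " ", text)
        # Drop leftover artifacts such as "--".
        text = text.replace("--", " ")
        # Collapse long runs of a repeated character and normalize whitespace.
        text = re.sub(r"(.)\1{3,}", r"\1", text)
        text = re.sub(r"\s+", " ", text).strip()
        # Keep sentences with 3-512 words, dropping exact duplicates.
        n_words = len(text.split())
        if 3 <= n_words <= 512 and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```
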
**Data Split:**
- Training: 48,139 sentences (95%)
- Evaluation: 2,534 sentences (5%)

## Training Details

**Hardware:** NVIDIA A40 (48 GB)

**Training Time:** 1 hour 13 minutes

**Hyperparameters** (a configuration sketch follows below):
- Epochs: 20
- Learning Rate: 1e-4
- Batch Size: 48 (per device)
- Gradient Accumulation Steps: 21 (effective batch size: 1,008)
- Max Sequence Length: 128
- MLM Probability: 0.15
- Warmup Ratio: 0.06
- Weight Decay: 0.01
- Optimizer: AdamW
- FP16: Enabled

Despite the relatively aggressive learning rate, training remained stable: validation loss decreased consistently across epochs, and the best checkpoint was selected by held-out evaluation loss.
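
As a rough guide, continued MLM pre-training with these hyperparameters could be configured as in the hypothetical sketch below; the file names, column handling, and use of the Hugging Face `Trainer` are assumptions, not the authors' published training script.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical corpus files: one cleaned Garo sentence per line.
raw = load_dataset("text", data_files={"train": "garo_train.txt", "eval": "garo_eval.txt"})

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking with the reported 15% MLM probability.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# AdamW is the Trainer's default optimizer.
args = TrainingArguments(
    output_dir="garobert",
    num_train_epochs=20,
    learning_rate=1e-4,
    per_device_train_batch_size=48,
    gradient_accumulation_steps=21,   # effective batch size 48 * 21 = 1,008
    warmup_ratio=0.06,
    weight_decay=0.01,
    fp16=True,
    evaluation_strategy="epoch",      # "eval_strategy" in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["eval"],
    data_collator=collator,
)
trainer.train()
```
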
## Performance

**Intrinsic Evaluation (MLM on a held-out Garo test set):**

| Model | Perplexity | Eval Loss |
|-------|------------|-----------|
| XLM-RoBERTa-base (zero-shot) | 678.40 | 6.52 |
| **GaroBERT** | **2.40** | **0.875** |

GaroBERT achieves roughly **282× lower perplexity** than the pretrained XLM-RoBERTa baseline, demonstrating strong language modeling capability for Garo.
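
The perplexity column is consistent with the exponential of the eval loss (up to rounding), so the two metrics can be cross-checked directly:

```python
import math

# Perplexity as exp(mean cross-entropy eval loss), matching the table above.
print(math.exp(0.875))  # ≈ 2.40  (GaroBERT)
print(math.exp(6.52))   # ≈ 678   (XLM-RoBERTa-base, zero-shot)
```
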
**Tokenization Efficiency** (an illustrative measurement is sketched below):
- Average tokens per word: 2.74
- Vocabulary coverage: ~100% (0% UNK tokens)
- Note: uses XLM-RoBERTa's original tokenizer without modification
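
A tokens-per-word (fertility) figure like the one above can be computed roughly as follows; the authors' exact measurement script is not published, so this is only an illustration.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/garobert")

def fertility(sentences):
    """Average number of subword tokens per whitespace-separated word."""
    n_words = sum(len(s.split()) for s in sentences)
    n_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    return n_tokens / n_words

# fertility(held_out_garo_sentences)  # reported value: ~2.74 tokens/word
```
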
## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained("MWirelabs/garobert")
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/garobert")

# Fill-mask example: predict the masked token in a Garo sentence.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

text = "ia nokni <mask> rong ong·a"
results = fill_mask(text)
print(results)  # candidate tokens with scores, highest first
```
## Intended Use

**Primary Applications:**
- Sentiment analysis for Garo text
- Named Entity Recognition (NER)
- Text classification tasks
- Feature extraction for downstream NLP tasks
- Foundation model for Garo language processing

**Limitations:**
- Trained on ~50k sentences; performance may vary on domains not represented in the training data
- Uses the XLM-RoBERTa tokenizer, with a fertility of 2.74 tokens/word; a custom Garo tokenizer could improve efficiency
- Latin script only; does not support other writing systems
- Best suited for sentence-level tasks (max 128 tokens)
## Fine-tuning

This model can be fine-tuned for various downstream tasks. For sequence classification:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "MWirelabs/garobert",
    num_labels=2,  # adjust based on your task
)
```
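
For the feature-extraction use case listed above, a minimal sketch of pulling sentence embeddings via mean pooling is shown below; the pooling choice is an assumption made for illustration, not a documented recommendation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/garobert")
encoder = AutoModel.from_pretrained("MWirelabs/garobert")

def embed(sentences):
    """Mean-pooled last-hidden-state embeddings for a list of sentences."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (B, H)

# vectors = embed(garo_sentences)  # shape: (num_sentences, 768)
```
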

## Model Card Authors

MWire Labs Team

## Citation

If you use GaroBERT in your research, please cite:

```bibtex
@misc{garobert2025,
  author       = {MWire Labs},
  title        = {GaroBERT: A Masked Language Model for Garo},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MWirelabs/garobert}}
}
```

## Acknowledgments

We thank the Garo-speaking community for their continued support and contributions to language technology development for Northeast Indian languages.

## Contact

For questions or collaboration opportunities, please contact MWire Labs at [contact information].

---

**Part of the MWire Labs Northeast Indian Languages Initiative**

Related Models:
- [KhasiBERT](https://huggingface.co/MWirelabs/khasibert)
- [NyishiBERT](https://huggingface.co/MWirelabs/nyishibert)
- [NagameseBERT](https://huggingface.co/MWirelabs/nagamesebert)