# Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language
[Model](https://huggingface.co/MWireLabs/mizo-roberta) · [Dataset](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M) · [License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
*Advancing NLP for Northeast Indian Languages*
## Overview
**Mizo-RoBERTa** is a transformer-based language model for Mizo, a Tibeto-Burman language spoken by approximately 1.1 million people primarily in Mizoram, Northeast India. Built on the RoBERTa architecture and trained on a large-scale curated corpus, this model provides state-of-the-art language understanding capabilities for Mizo NLP applications.
This work is part of MWireLabs' initiative to develop foundational language models for underserved languages of Northeast India, following our successful [KhasiBERT](https://huggingface.co/MWireLabs/KhasiBERT) model.
### Key Highlights
- **Architecture**: RoBERTa-base (110M parameters)
- **Training Scale**: 5.94M sentences, 138.7M tokens
- **Open Data**: 4M sentences publicly available at [mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M)
- **Custom Tokenizer**: Trained specifically for Mizo (30K BPE vocabulary)
- **Efficient**: Trained in roughly 4-6 hours on a single NVIDIA A40 GPU
- **Open Source**: Model, tokenizer, and training code publicly available
## Model Details
### Architecture
| Component | Specification |
|-----------|--------------|
| Base Architecture | RoBERTa-base |
| Parameters | 109,113,648 (~110M) |
| Layers | 12 transformer layers |
| Attention Heads | 12 |
| Hidden Size | 768 |
| Intermediate Size | 3,072 |
| Max Sequence Length | 512 tokens |
| Vocabulary Size | 30,000 (custom BPE) |
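The published checkpoint's configuration can be inspected directly; a minimal sanity check of the key dimensions (the values in the comments are the ones listed in the table above) looks like:

```python
from transformers import RobertaConfig

# Load the released configuration from the Hub and confirm the key dimensions
config = RobertaConfig.from_pretrained("MWireLabs/mizo-roberta")
print(config.num_hidden_layers)    # 12 transformer layers
print(config.num_attention_heads)  # 12 attention heads
print(config.hidden_size)          # 768
print(config.intermediate_size)    # 3072
print(config.vocab_size)           # 30000
```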
### Training Configuration
| Setting | Value |
|---------|-------|
| Training Data | 5.94M sentences (138.7M tokens) |
| Public Dataset | 4M sentences available on HuggingFace |
| Batch Size | 32 per device |
| Learning Rate | 1e-4 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Warmup Steps | 10,000 |
| Training Epochs | 2 |
| Hardware | 1x NVIDIA A40 (48GB) |
| Training Time | ~4-6 hours |
| Precision | Mixed (FP16) |
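The pretraining itself is a standard masked-language-modeling run; the sketch below reconstructs it from the settings above using the `transformers` Trainer. It is illustrative rather than the exact training script: the original run used a freshly trained 30K BPE tokenizer and the full 5.94M-sentence corpus, whereas here the released tokenizer and the public 4M-sentence dataset stand in for them, and the dataset's `text` column name is an assumption.

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling, RobertaConfig, RobertaForMaskedLM,
    RobertaTokenizerFast, Trainer, TrainingArguments,
)

# Stand-in for the custom 30K BPE tokenizer trained on the Mizo corpus
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")

# Fresh RoBERTa-base model matching the architecture table above
config = RobertaConfig(vocab_size=30_000, max_position_embeddings=514)
model = RobertaForMaskedLM(config)

# Public 4M-sentence corpus (assumes a "text" column)
dataset = load_dataset("MWireLabs/mizo-language-corpus-4M", split="train")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

# Dynamic masking with the standard 15% MLM probability
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./mizo-roberta-pretrain",
    per_device_train_batch_size=32,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=10_000,
    num_train_epochs=2,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=tokenized,
)
trainer.train()
```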
## Training Data
The model was trained on a large-scale Mizo corpus comprising 5.94 million sentences (138.7 million tokens), with an average of 23.3 tokens per sentence. The corpus includes:
- **News articles** from major Mizo publications
- **Literature** and written content
- **Social media** text
- **Government documents** and official communications
- **Web content** from Mizo language websites
**Public Dataset**: 4 million sentences are openly available at [MWireLabs/mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M) for research and development purposes.
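The public dataset can be pulled straight from the Hub with the `datasets` library:

```python
from datasets import load_dataset

# Download the open 4M-sentence Mizo corpus
corpus = load_dataset("MWireLabs/mizo-language-corpus-4M")
print(corpus)  # inspect the available splits and column names
```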
### Data Preprocessing
- Unicode normalization
- Language identification and filtering
- Deduplication (exact and near-duplicate removal)
- Quality filtering based on length and character distributions
- Custom sentence segmentation for Mizo punctuation
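These are standard corpus-cleaning steps; the toy sketch below illustrates the normalization, length-based quality filter, and exact deduplication (the thresholds and the language-ID and near-duplicate stages of the actual pipeline are not reproduced here):

```python
import unicodedata

def clean_corpus(sentences, min_tokens=3, max_tokens=200):
    """Toy cleaning pass: Unicode normalization, length filter, exact dedup."""
    seen = set()
    cleaned = []
    for sent in sentences:
        # Unicode normalization keeps Mizo diacritics in a consistent form
        sent = unicodedata.normalize("NFC", sent).strip()
        # Length-based quality filter
        n_tokens = len(sent.split())
        if not (min_tokens <= n_tokens <= max_tokens):
            continue
        # Exact deduplication
        if sent in seen:
            continue
        seen.add(sent)
        cleaned.append(sent)
    return cleaned

print(clean_corpus(["Mizo tawng hi kan hman thin a ni",
                    "Mizo tawng hi kan hman thin a ni"]))  # one sentence survives
```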
### Data Split
- **Training**: 5,350,122 sentences (90%)
- **Validation**: 297,229 sentences (5%)
- **Test**: 297,230 sentences (5%)
## Performance
### Language Modeling
| Metric | Value |
|--------|-------|
| Test Perplexity | 15.85 |
| Test Loss | 2.76 |
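The two figures are consistent: perplexity is the exponential of the average cross-entropy loss, and exp(2.76) ≈ 15.8, which matches the reported perplexity up to rounding of the loss.

```python
import math

# Perplexity is exp(cross-entropy loss)
print(math.exp(2.76))  # ≈ 15.80, in line with the reported test perplexity of 15.85
```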
### Qualitative Examples
The model demonstrates strong understanding of Mizo linguistic patterns and context:
**Example 1: Geographic Knowledge**
```
Input: "Mizoram hi India rama tak a ni"
Top Predictions:
• pawimawh (important) - 9.0%
• State - 4.9%
• ropui (big) - 4.5%
```
**Example 2: Urban Context**
```
Input: "Aizawl hi Mizoram a ni"
Top Predictions:
• khawpui (city) ✓ - 12.9%
• ta - 5.1%
• chhung - 3.9%
✓ Correctly identifies Aizawl as a city (khawpui)
```
### Comparison with Multilingual Models
While we haven't performed direct evaluation against multilingual models on this test set, similar monolingual approaches for low-resource languages (e.g., KhasiBERT for Khasi) have shown 45-50× improvements in perplexity over multilingual baselines like mBERT and XLM-RoBERTa. We expect Mizo-RoBERTa to demonstrate comparable advantages for Mizo language tasks.
## Usage
### Installation
```bash
pip install transformers torch
```
### Quick Start: Masked Language Modeling
```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline
# Load model and tokenizer
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
# Create fill-mask pipeline
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
# Predict masked words
text = "Mizoram hi rama state a ni"
results = fill_mask(text)
for result in results:
print(f"{result['score']:.3f}: {result['sequence']}")
```
### Extract Embeddings
```python
import torch
# Encode text
text = "Mizo tawng hi kan hman thin a ni"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
# Get contextualized embeddings
model.eval()
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
# Use last hidden state
last_hidden = outputs.hidden_states[-1]
# Mean pooling for sentence embedding
sentence_embedding = last_hidden.mean(dim=1)
print(f"Embedding shape: {sentence_embedding.shape}")
# Output: torch.Size([1, 768])
```
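For padded batches, plain mean pooling would average over `<pad>` positions as well; a small attention-mask-weighted variant (reusing the `model` and `tokenizer` loaded above) is sketched below:

```python
# Attention-mask-aware mean pooling over a padded batch
sentences = ["Mizo tawng hi kan hman thin a ni", "Aizawl khawpui"]
batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    hidden = model(**batch, output_hidden_states=True).hidden_states[-1]

mask = batch["attention_mask"].unsqueeze(-1)        # (batch, seq_len, 1)
summed = (hidden * mask).sum(dim=1)                 # sum over real tokens only
embeddings = summed / mask.sum(dim=1).clamp(min=1)  # divide by true lengths
print(embeddings.shape)  # torch.Size([2, 768])
```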
### Fine-tuning for Classification
```python
from transformers import RobertaForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
# Load model for sequence classification
model = RobertaForSequenceClassification.from_pretrained(
    "MWireLabs/mizo-roberta",
    num_labels=3,  # e.g., for sentiment: positive, neutral, negative
)
# Load your labeled dataset
# Example: sentiment analysis dataset
dataset = load_dataset("your-dataset-name")
# Tokenize
def tokenize_function(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_dataset = dataset.map(tokenize_function, batched=True)
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)
# Train
trainer.train()
```
### Batch Processing
```python
# Process multiple sentences efficiently
sentences = [
    "Aizawl hi Mizoram khawpui ber a ni",
    "Mizo tawng hi Mizoram official language a ni",
    "India ram Northeast a Mizoram hi a awm",
]
# Tokenize batch
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
# Process outputs as needed
```
## Applications
Mizo-RoBERTa can be fine-tuned for various downstream NLP tasks:
- **Text Classification** (sentiment analysis, topic classification, news categorization)
- **Named Entity Recognition** (NER for Mizo entities)
- **Question Answering** (extractive QA systems)
- **Semantic Similarity** (sentence/document similarity)
- **Information Retrieval** (semantic search in Mizo content)
- **Language Understanding** (natural language inference, textual entailment)
## Limitations
- **Dialectal Coverage**: The model may not comprehensively represent all Mizo dialects
- **Domain Balance**: Formal written text may be overrepresented compared to conversational Mizo
- **Pretraining Objective**: Only trained with Masked Language Modeling (MLM); may benefit from additional objectives
- **Context Length**: Limited to 512 tokens; longer documents require chunking (a minimal chunking sketch follows this list)
- **Low-resource Constraints**: While large for Mizo, the training corpus is still smaller than high-resource language datasets
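For the context-length limitation, one simple workaround is to split long documents into overlapping 512-token windows using the fast tokenizer's overflow support; the window and stride values below are illustrative:

```python
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")

long_text = "..."  # any Mizo document longer than 512 tokens

# Overlapping 512-token chunks with a 64-token stride
chunks = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=64,
    padding="max_length",
    return_overflowing_tokens=True,
    return_tensors="pt",
)
print(chunks["input_ids"].shape)  # (num_chunks, 512)
```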
## Ethical Considerations
- **Representation**: The model reflects the content and potential biases present in the training corpus
- **Intended Use**: Designed for research and applications that benefit Mizo language speakers
- **Misuse Potential**: Should not be used for generating misleading information or harmful content
- **Data Privacy**: Training data was collected from publicly available sources; no private information was used
- **Cultural Sensitivity**: Users should be aware of cultural context when deploying for Mizo-speaking communities
## Citation
If you use Mizo-RoBERTa in your research or applications, please cite:
```bibtex
@misc{mizoroberta2025,
title={Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language},
author={MWireLabs},
year={2025},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/MWireLabs/mizo-roberta}}
}
```
## Related Resources
- **Public Training Data**: [mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M)
- **Sister Model**: [KhasiBERT](https://huggingface.co/MWireLabs/KhasiBERT) - RoBERTa model for Khasi language
- **Organization**: [MWireLabs on HuggingFace](https://huggingface.co/MWireLabs)
## Model Card Contact
For questions, issues, or collaboration opportunities:
- **Organization**: MWireLabs
- **Email**: Contact through HuggingFace
- **Issues**: Report on the model's HuggingFace page
## License
This model is released under the Apache 2.0 License. See LICENSE file for details.
## Acknowledgments
We thank the Mizo language community and content creators whose publicly available work made this model possible. Special thanks to all contributors to the open-source NLP ecosystem, particularly the HuggingFace team for their excellent tools and infrastructure.
---
**MWireLabs** - Building AI for Northeast India 🚀