# Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language
[![Model](https://img.shields.io/badge/🤗-Model-yellow)](https://huggingface.co/MWireLabs/mizo-roberta) [![Dataset](https://img.shields.io/badge/🤗-Public%20Dataset-blue)](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M) [![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)

*Advancing NLP for Northeast Indian Languages*
## Overview

**Mizo-RoBERTa** is a transformer-based language model for Mizo, a Tibeto-Burman language spoken by approximately 1.1 million people, primarily in Mizoram, Northeast India. Built on the RoBERTa architecture and trained on a large-scale curated corpus, this model provides state-of-the-art language understanding capabilities for Mizo NLP applications.

This work is part of MWireLabs' initiative to develop foundational language models for underserved languages of Northeast India, following our successful [KhasiBERT](https://huggingface.co/MWireLabs/KhasiBERT) model.

### Key Highlights

- **Architecture**: RoBERTa-base (110M parameters)
- **Training Scale**: 5.94M sentences, 138.7M tokens
- **Open Data**: 4M sentences publicly available at [mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M)
- **Custom Tokenizer**: Trained specifically for Mizo (30K BPE vocabulary)
- **Efficient**: ~4-6 hours of training on a single A40 GPU
- **Open Source**: Model, tokenizer, and training code publicly available

## Model Details

### Architecture

| Component | Specification |
|-----------|--------------|
| Base Architecture | RoBERTa-base |
| Parameters | 109,113,648 (~110M) |
| Layers | 12 transformer layers |
| Attention Heads | 12 |
| Hidden Size | 768 |
| Intermediate Size | 3,072 |
| Max Sequence Length | 512 tokens |
| Vocabulary Size | 30,000 (custom BPE) |

### Training Configuration

| Setting | Value |
|---------|-------|
| Training Data | 5.94M sentences (138.7M tokens) |
| Public Dataset | 4M sentences available on HuggingFace |
| Batch Size | 32 per device |
| Learning Rate | 1e-4 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Warmup Steps | 10,000 |
| Training Epochs | 2 |
| Hardware | 1x NVIDIA A40 (48GB) |
| Training Time | ~4-6 hours |
| Precision | Mixed (FP16) |

## Training Data

Trained on a large-scale Mizo corpus comprising 5.94 million sentences (138.7 million tokens), with an average of 23.3 tokens per sentence.
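The exact parameter count in the table can be reproduced from the architecture figures alone. The breakdown below is a quick sanity check, assuming the standard RoBERTa layout (514 position embeddings, i.e. 512 + 2 offset, and input/output embedding weights tied):

```python
# Reproduce the ~110M parameter count from the architecture table
# (assumes standard RoBERTa layout with tied input/output embeddings).
V, H, L, I, P = 30_000, 768, 12, 3_072, 514  # vocab, hidden, layers, FFN, positions

embeddings = V * H + P * H + 1 * H + 2 * H  # word + position + token-type + LayerNorm
per_layer = (
    4 * (H * H + H)  # Q, K, V, and attention output projections
    + 2 * H          # attention LayerNorm
    + (H * I + I)    # FFN up-projection
    + (I * H + H)    # FFN down-projection
    + 2 * H          # output LayerNorm
)
lm_head = (H * H + H) + 2 * H + V  # dense + LayerNorm + decoder bias (weights tied)

total = embeddings + L * per_layer + lm_head
print(total)  # 109113648 — matches the reported 109,113,648
```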
The corpus includes:

- **News articles** from major Mizo publications
- **Literature** and written content
- **Social media** text
- **Government documents** and official communications
- **Web content** from Mizo language websites

**Public Dataset**: 4 million sentences are openly available at [MWireLabs/mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M) for research and development purposes.

### Data Preprocessing

- Unicode normalization
- Language identification and filtering
- Deduplication (exact and near-duplicate removal)
- Quality filtering based on length and character distributions
- Custom sentence segmentation for Mizo punctuation

### Data Split

- **Training**: 5,350,122 sentences (90%)
- **Validation**: 297,229 sentences (5%)
- **Test**: 297,230 sentences (5%)

## Performance

### Language Modeling

| Metric | Value |
|--------|-------|
| Test Perplexity | 15.85 |
| Test Loss | 2.76 |

### Qualitative Examples

The model demonstrates strong understanding of Mizo linguistic patterns and context:

**Example 1: Geographic Knowledge**

```
Input: "Mizoram hi India rama <mask> tak a ni"
Top Predictions:
• pawimawh (important) - 9.0%
• State - 4.9%
• ropui (big) - 4.5%
```

**Example 2: Urban Context**

```
Input: "Aizawl hi Mizoram <mask> a ni"
Top Predictions:
• khawpui (city) ✓ - 12.9%
• ta - 5.1%
• chhung - 3.9%

✓ Correctly identifies Aizawl as a city (khawpui)
```

### Comparison with Multilingual Models

While we have not performed a direct evaluation against multilingual models on this test set, similar monolingual approaches for low-resource languages (e.g., KhasiBERT for Khasi) have shown 45-50× improvements in perplexity over multilingual baselines such as mBERT and XLM-RoBERTa. We expect Mizo-RoBERTa to demonstrate comparable advantages on Mizo language tasks.
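The reported numbers are internally consistent; a quick check (perplexity is the exponential of the average cross-entropy loss, and the 90/5/5 split sizes should sum to the full corpus):

```python
import math

# Perplexity = exp(cross-entropy loss).
loss = 2.76  # rounded test loss from the table
perplexity = math.exp(loss)
print(round(perplexity, 1))  # 15.8 — consistent with the reported 15.85

# The train/validation/test split should sum to the ~5.94M-sentence corpus.
train, val, test = 5_350_122, 297_229, 297_230
total = train + val + test
print(total)  # 5944581
```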
## Usage

### Installation

```bash
pip install transformers torch
```

### Quick Start: Masked Language Modeling

```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline

# Load model and tokenizer
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")

# Create fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Predict the masked word ("Mizoram is a state in ___")
text = "Mizoram hi <mask> rama state a ni"
results = fill_mask(text)
for result in results:
    print(f"{result['score']:.3f}: {result['sequence']}")
```

### Extract Embeddings

```python
import torch

# Encode text
text = "Mizo tawng hi kan hman thin a ni"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Get contextualized embeddings
model.eval()
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Use the last hidden state
last_hidden = outputs.hidden_states[-1]

# Mean pooling for a sentence embedding
sentence_embedding = last_hidden.mean(dim=1)
print(f"Embedding shape: {sentence_embedding.shape}")  # torch.Size([1, 768])
```

### Fine-tuning for Classification

```python
from transformers import (
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load tokenizer and model for sequence classification
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
model = RobertaForSequenceClassification.from_pretrained(
    "MWireLabs/mizo-roberta",
    num_labels=3,  # e.g., for sentiment: positive, neutral, negative
)

# Load your labeled dataset
# Example: a sentiment analysis dataset
dataset = load_dataset("your-dataset-name")

# Tokenize
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)

# Train
trainer.train()
```

### Batch Processing

```python
import torch

# Process multiple sentences efficiently
sentences = [
    "Aizawl hi Mizoram khawpui ber a ni",
    "Mizo tawng hi Mizoram official language a ni",
    "India ram Northeast a Mizoram hi a awm",
]

# Tokenize the batch
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

# Process outputs as needed
```

## Applications

Mizo-RoBERTa can be fine-tuned for various downstream NLP tasks:

- **Text Classification** (sentiment analysis, topic classification, news categorization)
- **Named Entity Recognition** (NER for Mizo entities)
- **Question Answering** (extractive QA systems)
- **Semantic Similarity** (sentence/document similarity)
- **Information Retrieval** (semantic search in Mizo content)
- **Language Understanding** (natural language inference, textual entailment)

## Limitations

- **Dialectal Coverage**: The model may not comprehensively represent all Mizo dialects
- **Domain Balance**: Formal written text may be overrepresented compared to conversational Mizo
- **Pretraining Objective**: Only trained with Masked Language Modeling (MLM); may benefit from additional objectives
- **Context Length**: Limited to 512 tokens; longer documents require chunking
- **Low-resource Constraints**: While large for Mizo, the training corpus is still smaller than high-resource language datasets

## Ethical Considerations

- **Representation**: The model reflects the content and potential biases present in the training corpus
- **Intended Use**: Designed for research and applications that benefit Mizo language speakers
- **Misuse Potential**: Should
not be used for generating misleading information or harmful content
- **Data Privacy**: Training data was collected from publicly available sources; no private information was used
- **Cultural Sensitivity**: Users should be aware of cultural context when deploying for Mizo-speaking communities

## Citation

If you use Mizo-RoBERTa in your research or applications, please cite:

```bibtex
@misc{mizoroberta2025,
  title={Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language},
  author={MWireLabs},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/MWireLabs/mizo-roberta}}
}
```

## Related Resources

- **Public Training Data**: [mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M)
- **Sister Model**: [KhasiBERT](https://huggingface.co/MWireLabs/KhasiBERT) - a RoBERTa model for the Khasi language
- **Organization**: [MWireLabs on HuggingFace](https://huggingface.co/MWireLabs)

## Model Card Contact

For questions, issues, or collaboration opportunities:

- **Organization**: MWireLabs
- **Email**: Contact through HuggingFace
- **Issues**: Report on the model's HuggingFace page

## License

This model is released under the Apache 2.0 License. See the LICENSE file for details.

## Acknowledgments

We thank the Mizo language community and content creators whose publicly available work made this model possible. Special thanks to all contributors to the open-source NLP ecosystem, particularly the HuggingFace team for their excellent tools and infrastructure.

---

**MWireLabs** - Building AI for Northeast India 🚀