# Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language
<div align="center">
[Model](https://huggingface.co/MWireLabs/mizo-roberta) · [Dataset](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M) · [License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
*Advancing NLP for Northeast Indian Languages*
</div>
## Overview
**Mizo-RoBERTa** is a transformer-based language model for Mizo, a Tibeto-Burman language spoken by approximately 1.1 million people primarily in Mizoram, Northeast India. Built on the RoBERTa architecture and trained on a large-scale curated corpus, this model provides state-of-the-art language understanding capabilities for Mizo NLP applications.
This work is part of MWireLabs' initiative to develop foundational language models for underserved languages of Northeast India, following our successful [KhasiBERT](https://huggingface.co/MWireLabs/KhasiBERT) model.
### Key Highlights
- **Architecture**: RoBERTa-base (110M parameters)
- **Training Scale**: 5.94M sentences, 138.7M tokens
- **Open Data**: 4M sentences publicly available at [mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M)
- **Custom Tokenizer**: Trained specifically for Mizo (30K BPE vocabulary)
- **Efficient**: Trained in ~4-6 hours on a single A40 GPU
- **Open Source**: Model, tokenizer, and training code publicly available
## Model Details
### Architecture
| Component | Specification |
|-----------|--------------|
| Base Architecture | RoBERTa-base |
| Parameters | 109,113,648 (~110M) |
| Layers | 12 transformer layers |
| Attention Heads | 12 |
| Hidden Size | 768 |
| Intermediate Size | 3,072 |
| Max Sequence Length | 512 tokens |
| Vocabulary Size | 30,000 (custom BPE) |
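The exact parameter count in the table can be reproduced from the listed hyperparameters, assuming the standard RoBERTa-base layout (512 max positions plus 2 offset tokens = 514 position embeddings, tied input/output embeddings, and an MLM head with a dense projection, LayerNorm, and decoder bias):

```python
# Sanity-check the reported parameter count from the table's hyperparameters.
V, H, L, I, P = 30_000, 768, 12, 3_072, 514  # vocab, hidden, layers, intermediate, positions

embeddings = V * H + P * H + 1 * H + 2 * H   # word + position + token-type + LayerNorm
per_layer = (
    4 * (H * H + H)   # Q, K, V, and attention output projections
    + 2 * H           # attention LayerNorm
    + (H * I + I)     # feed-forward up-projection
    + (I * H + H)     # feed-forward down-projection
    + 2 * H           # output LayerNorm
)
lm_head = (H * H + H) + 2 * H + V            # dense + LayerNorm + decoder bias (weights tied)

total = embeddings + L * per_layer + lm_head
print(f"{total:,}")  # 109,113,648
```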
### Training Configuration
| Setting | Value |
|---------|-------|
| Training Data | 5.94M sentences (138.7M tokens) |
| Public Dataset | 4M sentences available on HuggingFace |
| Batch Size | 32 per device |
| Learning Rate | 1e-4 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Warmup Steps | 10,000 |
| Training Epochs | 2 |
| Hardware | 1x NVIDIA A40 (48GB) |
| Training Time | ~4-6 hours |
| Precision | Mixed (FP16) |
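From the batch size above and the 90% training split (5,350,122 sentences), a rough optimizer-step count can be estimated; this sketch assumes no gradient accumulation, which the card does not state:

```python
import math

# Rough steps-per-epoch estimate from the training configuration above,
# assuming no gradient accumulation (an assumption, not stated in the card).
train_sentences = 5_350_122
per_device_batch = 32

steps_per_epoch = math.ceil(train_sentences / per_device_batch)
print(steps_per_epoch)  # 167,192 optimizer steps per epoch
```

At this scale, the 10,000 warmup steps cover roughly 6% of one epoch.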
## Training Data
The model was trained on a large-scale Mizo corpus comprising 5.94 million sentences (138.7 million tokens), with an average of 23.3 tokens per sentence. The corpus includes:
- **News articles** from major Mizo publications
- **Literature** and written content
- **Social media** text
- **Government documents** and official communications
- **Web content** from Mizo language websites
**Public Dataset**: 4 million sentences are openly available at [MWireLabs/mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M) for research and development purposes.
### Data Preprocessing
- Unicode normalization
- Language identification and filtering
- Deduplication (exact and near-duplicate removal)
- Quality filtering based on length and character distributions
- Custom sentence segmentation for Mizo punctuation
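A minimal sketch of two of the steps above, exact deduplication and length-based quality filtering. The thresholds (3-512 whitespace tokens) are illustrative assumptions, not the card's actual values:

```python
import unicodedata

def preprocess(sentences, min_tokens=3, max_tokens=512):
    """Normalize, length-filter, and exact-deduplicate a list of sentences."""
    seen = set()
    kept = []
    for s in sentences:
        s = unicodedata.normalize("NFC", s).strip()     # Unicode normalization
        n_tokens = len(s.split())
        if not (min_tokens <= n_tokens <= max_tokens):  # length-based quality filter
            continue
        if s in seen:                                   # exact-duplicate removal
            continue
        seen.add(s)
        kept.append(s)
    return kept

corpus = ["Mizo tawng hi kan hman thin a ni", "Mizo tawng hi kan hman thin a ni", "a"]
print(preprocess(corpus))  # duplicate and too-short lines removed
```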
### Data Split
- **Training**: 5,350,122 sentences (90%)
- **Validation**: 297,229 sentences (5%)
- **Test**: 297,230 sentences (5%)
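The three split sizes above sum to the full 5.94M-sentence corpus and match the stated 90/5/5 ratios to within rounding:

```python
# Verify the data-split arithmetic from the figures above.
train, val, test = 5_350_122, 297_229, 297_230
total = train + val + test

print(f"{total:,}")              # 5,944,581 sentences (~5.94M)
print(round(train / total, 3))   # 0.9
print(round(val / total, 3))     # 0.05
```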
## Performance
### Language Modeling
| Metric | Value |
|--------|-------|
| Test Perplexity | 15.85 |
| Test Loss | 2.76 |
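The two metrics are consistent: perplexity is the exponential of the cross-entropy loss, and the small gap comes from the reported loss being rounded to two decimals:

```python
import math

# Perplexity = exp(cross-entropy loss); the rounded loss of 2.76
# gives ~15.80, close to the reported perplexity of 15.85.
print(round(math.exp(2.76), 2))  # 15.8
```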
### Qualitative Examples
The model demonstrates strong understanding of Mizo linguistic patterns and context:
**Example 1: Geographic Knowledge**
```
Input: "Mizoram hi India rama <mask> tak a ni"
Top Predictions:
• pawimawh (important) - 9.0%
• State - 4.9%
• ropui (big) - 4.5%
```
**Example 2: Urban Context**
```
Input: "Aizawl hi Mizoram <mask> a ni"
Top Predictions:
• khawpui (city) ✓ - 12.9%
• ta - 5.1%
• chhung - 3.9%
✓ Correctly identifies Aizawl as a city (khawpui)
```
### Comparison with Multilingual Models
While we haven't performed direct evaluation against multilingual models on this test set, similar monolingual approaches for low-resource languages (e.g., KhasiBERT for Khasi) have shown 45-50× improvements in perplexity over multilingual baselines like mBERT and XLM-RoBERTa. We expect Mizo-RoBERTa to demonstrate comparable advantages for Mizo language tasks.
## Usage
### Installation
```bash
pip install transformers torch
```
### Quick Start: Masked Language Modeling
```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline
# Load model and tokenizer
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
# Create fill-mask pipeline
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
# Predict masked words
text = "Mizoram hi <mask> rama state a ni"
results = fill_mask(text)
for result in results:
    print(f"{result['score']:.3f}: {result['sequence']}")
```
### Extract Embeddings
```python
import torch

# `model` and `tokenizer` are loaded as in the Quick Start above

# Encode text
text = "Mizo tawng hi kan hman thin a ni"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Get contextualized embeddings
model.eval()
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Use the last hidden state
last_hidden = outputs.hidden_states[-1]

# Mean pooling for a sentence embedding
sentence_embedding = last_hidden.mean(dim=1)
print(f"Embedding shape: {sentence_embedding.shape}")
# Output: torch.Size([1, 768])
```
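Note that a plain mean over the sequence dimension averages padding tokens too, which is fine for a single sentence but skews embeddings in padded batches. A mask-aware mean can be sketched with a dummy tensor; with the real model, `hidden` would be `outputs.hidden_states[-1]` and `mask` the tokenizer's `attention_mask`:

```python
import torch

# Mask-aware mean pooling over a dummy hidden-state tensor with the
# model's hidden size (768). The second "sentence" has 3 real tokens
# followed by 7 padding positions.
batch, seq, hidden_size = 2, 10, 768
hidden = torch.randn(batch, seq, hidden_size)
mask = torch.tensor([[1] * 10, [1] * 3 + [0] * 7])

m = mask.unsqueeze(-1).float()                       # (batch, seq, 1)
embeddings = (hidden * m).sum(dim=1) / m.sum(dim=1)  # average real tokens only
print(embeddings.shape)  # torch.Size([2, 768])
```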
### Fine-tuning for Classification
```python
from transformers import (
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load model and tokenizer for sequence classification
model = RobertaForSequenceClassification.from_pretrained(
    "MWireLabs/mizo-roberta",
    num_labels=3,  # e.g., for sentiment: positive, neutral, negative
)
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")

# Load your labeled dataset
# Example: sentiment analysis dataset
dataset = load_dataset("your-dataset-name")

# Tokenize
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)

# Train
trainer.train()
```
### Batch Processing
```python
# Process multiple sentences efficiently
# (`model` and `tokenizer` loaded as in the Quick Start above)
sentences = [
    "Aizawl hi Mizoram khawpui ber a ni",
    "Mizo tawng hi Mizoram official language a ni",
    "India ram Northeast a Mizoram hi a awm",
]

# Tokenize the batch
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

# Process outputs as needed
```
## Applications
Mizo-RoBERTa can be fine-tuned for various downstream NLP tasks:
- **Text Classification** (sentiment analysis, topic classification, news categorization)
- **Named Entity Recognition** (NER for Mizo entities)
- **Question Answering** (extractive QA systems)
- **Semantic Similarity** (sentence/document similarity)
- **Information Retrieval** (semantic search in Mizo content)
- **Language Understanding** (natural language inference, textual entailment)
## Limitations
- **Dialectal Coverage**: The model may not comprehensively represent all Mizo dialects
- **Domain Balance**: Formal written text may be overrepresented compared to conversational Mizo
- **Pretraining Objective**: Only trained with Masked Language Modeling (MLM); may benefit from additional objectives
- **Context Length**: Limited to 512 tokens; longer documents require chunking
- **Low-resource Constraints**: While large for Mizo, the training corpus is still smaller than high-resource language datasets
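For the 512-token context limit above, a simple sliding-window chunker over token IDs is one workaround; the window and stride values here are illustrative, and the overlap preserves some context across chunk boundaries:

```python
def chunk_tokens(token_ids, window=512, stride=384):
    """Split a long token-ID sequence into overlapping fixed-size windows."""
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # final window reaches the end of the document
        start += stride
    return chunks

ids = list(range(1000))  # stand-in for a long document's token IDs
chunks = chunk_tokens(ids)
print([len(c) for c in chunks])  # [512, 512, 232]
```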
## Ethical Considerations
- **Representation**: The model reflects the content and potential biases present in the training corpus
- **Intended Use**: Designed for research and applications that benefit Mizo language speakers
- **Misuse Potential**: Should not be used for generating misleading information or harmful content
- **Data Privacy**: Training data was collected from publicly available sources; no private information was used
- **Cultural Sensitivity**: Users should be aware of cultural context when deploying for Mizo-speaking communities
## Citation
If you use Mizo-RoBERTa in your research or applications, please cite:
```bibtex
@misc{mizoroberta2025,
  title={Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language},
  author={MWireLabs},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/MWireLabs/mizo-roberta}}
}
```
## Related Resources
- **Public Training Data**: [mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M)
- **Sister Model**: [KhasiBERT](https://huggingface.co/MWireLabs/KhasiBERT) - RoBERTa model for Khasi language
- **Organization**: [MWireLabs on HuggingFace](https://huggingface.co/MWireLabs)
## Model Card Contact
For questions, issues, or collaboration opportunities:
- **Organization**: MWireLabs
- **Email**: Contact through HuggingFace
- **Issues**: Report on the model's HuggingFace page
## License
This model is released under the Apache 2.0 License. See LICENSE file for details.
## Acknowledgments
We thank the Mizo language community and content creators whose publicly available work made this model possible. Special thanks to all contributors to the open-source NLP ecosystem, particularly the HuggingFace team for their excellent tools and infrastructure.
---
**MWireLabs** - Building AI for Northeast India 🚀