# Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language

<div align="center">

[Model](https://huggingface.co/MWireLabs/mizo-roberta) · [Dataset](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M) · [License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

*Advancing NLP for Northeast Indian Languages*

</div>
## Overview

**Mizo-RoBERTa** is a transformer-based language model for Mizo, a Tibeto-Burman language spoken by approximately 1.1 million people, primarily in Mizoram, Northeast India. Built on the RoBERTa architecture and trained on a large-scale curated corpus, this model provides state-of-the-art language understanding capabilities for Mizo NLP applications.

This work is part of MWireLabs' initiative to develop foundational language models for underserved languages of Northeast India, following our successful [KhasiBERT](https://huggingface.co/MWireLabs/KhasiBERT) model.
### Key Highlights

- **Architecture**: RoBERTa-base (110M parameters)
- **Training Scale**: 5.94M sentences, 138.7M tokens
- **Open Data**: 4M sentences publicly available at [mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M)
- **Custom Tokenizer**: Trained specifically for Mizo (30K BPE vocabulary)
- **Efficient**: Trained in ~4-6 hours on a single NVIDIA A40 GPU
- **Open Source**: Model, tokenizer, and training code publicly available
## Model Details

### Architecture

| Component | Specification |
|-----------|---------------|
| Base Architecture | RoBERTa-base |
| Parameters | 109,113,648 (~110M) |
| Layers | 12 transformer layers |
| Attention Heads | 12 |
| Hidden Size | 768 |
| Intermediate Size | 3,072 |
| Max Sequence Length | 512 tokens |
| Vocabulary Size | 30,000 (custom BPE) |
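These values can be verified against the published checkpoint. A quick sanity check, assuming the config and weights are hosted on the Hub as usual:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Read the published configuration; values should match the table above
config = RobertaConfig.from_pretrained("MWireLabs/mizo-roberta")
print(config.num_hidden_layers, config.num_attention_heads)  # 12 12
print(config.hidden_size, config.intermediate_size)          # 768 3072
print(config.vocab_size)                                     # 30000
# RoBERTa stores 512 usable positions plus 2 offset slots
print(config.max_position_embeddings)                        # 514

# Count parameters to confirm the ~110M figure
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")
print(sum(p.numel() for p in model.parameters()))            # ≈ 109,113,648
```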
### Training Configuration

| Setting | Value |
|---------|-------|
| Training Data | 5.94M sentences (138.7M tokens) |
| Public Dataset | 4M sentences available on HuggingFace |
| Batch Size | 32 per device |
| Learning Rate | 1e-4 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Warmup Steps | 10,000 |
| Training Epochs | 2 |
| Hardware | 1x NVIDIA A40 (48GB) |
| Training Time | ~4-6 hours |
| Precision | Mixed (FP16) |
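For reference, a minimal sketch of an MLM pretraining setup that mirrors these hyperparameters. This is not the exact training script: the corpus loading, the `"text"` column name, and the standard 15% masking probability are assumptions.

```python
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
# For true from-scratch pretraining you would initialize from a fresh RobertaConfig instead
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")

# The public 4M-sentence subset stands in for the full corpus here
dataset = load_dataset("MWireLabs/mizo-language-corpus-4M")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)

# Dynamic masking, as in RoBERTa; 15% is the conventional mask probability
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="./mizo-roberta-pretrain",
    per_device_train_batch_size=32,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=10_000,
    num_train_epochs=2,
    fp16=True,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"], data_collator=collator)
trainer.train()
```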
## Training Data

Mizo-RoBERTa was trained on a large-scale Mizo corpus comprising 5.94 million sentences (138.7 million tokens), averaging 23.3 tokens per sentence. The corpus includes:

- **News articles** from major Mizo publications
- **Literature** and written content
- **Social media** text
- **Government documents** and official communications
- **Web content** from Mizo language websites

**Public Dataset**: 4 million sentences are openly available at [MWireLabs/mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M) for research and development purposes.
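The public subset can be pulled directly with the `datasets` library. A sketch; the `train` split name and streaming usage are assumptions based on common Hub conventions:

```python
from datasets import load_dataset

# Stream the public corpus so nothing is downloaded up front
corpus = load_dataset("MWireLabs/mizo-language-corpus-4M", split="train", streaming=True)

# Peek at the first few examples
for i, example in enumerate(corpus):
    print(example)
    if i == 2:
        break
```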
### Data Preprocessing

- Unicode normalization
- Language identification and filtering
- Deduplication (exact and near-duplicate removal)
- Quality filtering based on length and character distributions
- Custom sentence segmentation for Mizo punctuation
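A minimal sketch of what a pipeline along these lines can look like; the thresholds and filters below are illustrative assumptions, not the exact ones used for this corpus:

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    # Unicode normalization (NFC keeps composed characters consistent)
    return unicodedata.normalize("NFC", text).strip()

def passes_quality_filter(text: str, min_tokens: int = 3, max_tokens: int = 200) -> bool:
    # Illustrative length and character-distribution filters
    n_tokens = len(text.split())
    if not min_tokens <= n_tokens <= max_tokens:
        return False
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / max(len(text), 1)
    return alpha_ratio > 0.8

def deduplicate(sentences):
    # Exact dedup via hashing; near-duplicate removal (e.g., MinHash) is a separate pass
    seen = set()
    for s in sentences:
        key = hashlib.sha1(s.casefold().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield s

# Hypothetical raw input with an exact duplicate
raw_sentences = ["Aizawl hi Mizoram khawpui ber a ni", "Aizawl hi Mizoram khawpui ber a ni"]
clean = list(deduplicate(s for s in map(normalize, raw_sentences) if passes_quality_filter(s)))
print(clean)  # the duplicate is removed
```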
### Data Split

- **Training**: 5,350,122 sentences (90%)
- **Validation**: 297,229 sentences (5%)
- **Test**: 297,230 sentences (5%)
## Performance

### Language Modeling

| Metric | Value |
|--------|-------|
| Test Perplexity | 15.85 |
| Test Loss | 2.76 |
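The two figures are mutually consistent: perplexity is the exponential of the mean cross-entropy loss.

```python
import math

# Perplexity = exp(cross-entropy loss); exp(2.76) ≈ 15.80,
# which matches the reported 15.85 up to rounding of the loss
print(math.exp(2.76))
```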
### Qualitative Examples

The model produces contextually appropriate predictions for masked tokens, reflecting Mizo linguistic patterns:

**Example 1: Geographic Knowledge**

```
Input: "Mizoram hi India rama <mask> tak a ni"
Top Predictions:
• pawimawh (important) - 9.0%
• State - 4.9%
• ropui (big) - 4.5%
```

**Example 2: Urban Context**

```
Input: "Aizawl hi Mizoram <mask> a ni"
Top Predictions:
• khawpui (city) ✓ - 12.9%
• ta - 5.1%
• chhung - 3.9%

✓ Correctly identifies Aizawl as a city (khawpui)
```
### Comparison with Multilingual Models

We have not yet evaluated multilingual models directly on this test set. However, comparable monolingual approaches for low-resource languages (e.g., KhasiBERT for Khasi) have achieved 45-50× lower perplexity than multilingual baselines such as mBERT and XLM-RoBERTa, and we expect Mizo-RoBERTa to show similar advantages on Mizo language tasks.
## Usage

### Installation

```bash
pip install transformers torch
```
### Quick Start: Masked Language Modeling

```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline

# Load model and tokenizer
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")

# Create fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Predict masked words
text = "Mizoram hi <mask> rama state a ni"
results = fill_mask(text)

for result in results:
    print(f"{result['score']:.3f}: {result['sequence']}")
```
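The `<mask>` literal has to match the tokenizer's mask token; building the prompt from `tokenizer.mask_token` avoids hard-coding it:

```python
# Build the masked prompt programmatically so it always matches the tokenizer
text = f"Mizoram hi {tokenizer.mask_token} rama state a ni"
for result in fill_mask(text, top_k=3):
    print(f"{result['score']:.3f}: {result['token_str']}")
```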
### Extract Embeddings

```python
import torch

# Reuses `model` and `tokenizer` loaded in the Quick Start example above

# Encode text
text = "Mizo tawng hi kan hman thin a ni"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Get contextualized embeddings
model.eval()
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Use last hidden state
last_hidden = outputs.hidden_states[-1]

# Mean pooling for sentence embedding
sentence_embedding = last_hidden.mean(dim=1)
print(f"Embedding shape: {sentence_embedding.shape}")
# Output: torch.Size([1, 768])
```
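As a usage example, the pooled vectors can be compared with cosine similarity (the sentence pair below is illustrative):

```python
import torch
import torch.nn.functional as F

def embed(text: str) -> torch.Tensor:
    # Mean-pooled sentence embedding, as above
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[-1].mean(dim=1)

a = embed("Aizawl hi Mizoram khawpui ber a ni")
b = embed("Mizo tawng hi Mizoram official language a ni")
print(F.cosine_similarity(a, b).item())  # closer to 1.0 means more similar
```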
### Fine-tuning for Classification

```python
from transformers import (
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load tokenizer and model with a classification head
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
model = RobertaForSequenceClassification.from_pretrained(
    "MWireLabs/mizo-roberta",
    num_labels=3  # e.g., for sentiment: positive, neutral, negative
)

# Load your labeled dataset
# Example: sentiment analysis dataset
dataset = load_dataset("your-dataset-name")

# Tokenize
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    eval_strategy="epoch",  # `evaluation_strategy` on transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)

# Train
trainer.train()
```
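After training, the fine-tuned model can be served through a pipeline; the example output below is hypothetical:

```python
from transformers import pipeline

# `trainer.model` holds the best checkpoint (load_best_model_at_end=True)
classifier = pipeline("text-classification", model=trainer.model, tokenizer=tokenizer)
print(classifier("Mizo tawng hi kan hman thin a ni"))
# e.g. [{'label': 'LABEL_1', 'score': 0.87}]
```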
### Batch Processing

```python
import torch

# Process multiple sentences efficiently, reusing the MLM `model` and
# `tokenizer` from the Quick Start example
sentences = [
    "Aizawl hi Mizoram khawpui ber a ni",
    "Mizo tawng hi Mizoram official language a ni",
    "India ram Northeast a Mizoram hi a awm"
]

# Tokenize batch
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

# Process outputs as needed
```
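One caveat when pooling batched outputs: padding positions should be excluded from the mean, or shorter sentences get diluted. A sketch of mask-aware pooling:

```python
# Mask-aware mean pooling over the batch
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

hidden = outputs.hidden_states[-1]                     # (batch, seq_len, 768)
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)

# Zero out padding positions, then divide by each sentence's true length
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([3, 768])
```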
## Applications

Mizo-RoBERTa can be fine-tuned for various downstream NLP tasks:

- **Text Classification** (sentiment analysis, topic classification, news categorization)
- **Named Entity Recognition** (NER for Mizo entities; see the sketch after this list)
- **Question Answering** (extractive QA systems)
- **Semantic Similarity** (sentence/document similarity)
- **Information Retrieval** (semantic search in Mizo content)
- **Language Understanding** (natural language inference, textual entailment)
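For token-level tasks such as NER, the same checkpoint loads into a token-classification head. A sketch; the BIO label set is hypothetical:

```python
from transformers import RobertaForTokenClassification, RobertaTokenizerFast

# Hypothetical BIO label set for a Mizo NER task
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

model = RobertaForTokenClassification.from_pretrained(
    "MWireLabs/mizo-roberta",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# RoBERTa tokenizers need add_prefix_space=True for pre-tokenized (word-split) input
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta", add_prefix_space=True)
```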
## Limitations

- **Dialectal Coverage**: The model may not comprehensively represent all Mizo dialects
- **Domain Balance**: Formal written text may be overrepresented compared to conversational Mizo
- **Pretraining Objective**: Only trained with Masked Language Modeling (MLM); may benefit from additional objectives
- **Context Length**: Limited to 512 tokens; longer documents require chunking (see the sketch after this list)
- **Low-resource Constraints**: While large for Mizo, the training corpus is still smaller than high-resource language datasets
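A common workaround for the 512-token limit is a sliding window with overlap, which fast tokenizers support directly. A sketch; the stride value is an arbitrary choice:

```python
def chunk_document(text: str, tokenizer, max_length: int = 512, stride: int = 128):
    """Split a long document into overlapping max_length-token windows."""
    return tokenizer(
        text,
        max_length=max_length,
        stride=stride,
        truncation=True,
        return_overflowing_tokens=True,
        return_tensors="pt",
    )

# Usage: encoded = chunk_document(long_text, tokenizer)
# encoded["input_ids"] has shape (num_chunks, max_length); run the model per
# chunk and aggregate, e.g., by averaging pooled embeddings across chunks
```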
## Ethical Considerations

- **Representation**: The model reflects the content and potential biases present in the training corpus
- **Intended Use**: Designed for research and applications that benefit Mizo language speakers
- **Misuse Potential**: Should not be used for generating misleading information or harmful content
- **Data Privacy**: Training data was collected from publicly available sources; no private information was used
- **Cultural Sensitivity**: Users should be aware of cultural context when deploying for Mizo-speaking communities
## Citation

If you use Mizo-RoBERTa in your research or applications, please cite:

```bibtex
@misc{mizoroberta2025,
  title={Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language},
  author={MWireLabs},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/MWireLabs/mizo-roberta}}
}
```
## Related Resources

- **Public Training Data**: [mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M)
- **Sister Model**: [KhasiBERT](https://huggingface.co/MWireLabs/KhasiBERT) - RoBERTa model for the Khasi language
- **Organization**: [MWireLabs on HuggingFace](https://huggingface.co/MWireLabs)
## Model Card Contact

For questions, issues, or collaboration opportunities:

- **Organization**: MWireLabs
- **Email**: Contact through HuggingFace
- **Issues**: Report on the model's HuggingFace page
## License

This model is released under the Apache 2.0 License. See the LICENSE file for details.

## Acknowledgments

We thank the Mizo language community and content creators whose publicly available work made this model possible. Special thanks to all contributors to the open-source NLP ecosystem, particularly the HuggingFace team for their excellent tools and infrastructure.

---

**MWireLabs** - Building AI for Northeast India 🚀