# Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language
[![Model](https://img.shields.io/badge/🤗-Model-yellow)](https://huggingface.co/MWireLabs/mizo-roberta) [![Dataset](https://img.shields.io/badge/🤗-Public%20Dataset-blue)](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M) [![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)

*Advancing NLP for Northeast Indian Languages*
## Overview

**Mizo-RoBERTa** is a transformer-based language model for Mizo, a Tibeto-Burman language spoken by approximately 1.1 million people, primarily in Mizoram, Northeast India. Built on the RoBERTa architecture and trained on a large-scale curated corpus, this model provides state-of-the-art language understanding capabilities for Mizo NLP applications.

This work is part of MWireLabs' initiative to develop foundational language models for underserved languages of Northeast India, following our successful [KhasiBERT](https://huggingface.co/MWireLabs/KhasiBERT) model.

### Key Highlights

- **Architecture**: RoBERTa-base (110M parameters)
- **Training Scale**: 5.94M sentences, 138.7M tokens
- **Open Data**: 4M sentences publicly available at [mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M)
- **Custom Tokenizer**: Trained specifically for Mizo (30K BPE vocabulary)
- **Efficient**: ~4-6 hours of training on a single A40 GPU
- **Open Source**: Model, tokenizer, and training code publicly available

## Model Details

### Architecture

| Component | Specification |
|-----------|--------------|
| Base Architecture | RoBERTa-base |
| Parameters | 109,113,648 (~110M) |
| Layers | 12 transformer layers |
| Attention Heads | 12 |
| Hidden Size | 768 |
| Intermediate Size | 3,072 |
| Max Sequence Length | 512 tokens |
| Vocabulary Size | 30,000 (custom BPE) |

### Training Configuration

| Setting | Value |
|---------|-------|
| Training Data | 5.94M sentences (138.7M tokens) |
| Public Dataset | 4M sentences available on HuggingFace |
| Batch Size | 32 per device |
| Learning Rate | 1e-4 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Warmup Steps | 10,000 |
| Training Epochs | 2 |
| Hardware | 1x NVIDIA A40 (48GB) |
| Training Time | ~4-6 hours |
| Precision | Mixed (FP16) |

## Training Data

Trained on a large-scale Mizo corpus comprising 5.94 million sentences (138.7 million tokens), with an average of 23.3 tokens per sentence.
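The exact parameter count in the table can be reproduced from the architecture figures alone. The breakdown below is a quick sanity check, assuming the standard RoBERTa layout (514 position embeddings, i.e. 512 + 2 offset, and input/output embedding weights tied):

```python
# Reproduce the ~110M parameter count from the architecture table
# (assumes standard RoBERTa layout with tied input/output embeddings).
V, H, L, I, P = 30_000, 768, 12, 3_072, 514  # vocab, hidden, layers, FFN, positions

embeddings = V * H + P * H + 1 * H + 2 * H  # word + position + token-type + LayerNorm
per_layer = (
    4 * (H * H + H)  # Q, K, V, and attention output projections
    + 2 * H          # attention LayerNorm
    + (H * I + I)    # FFN up-projection
    + (I * H + H)    # FFN down-projection
    + 2 * H          # output LayerNorm
)
lm_head = (H * H + H) + 2 * H + V  # dense + LayerNorm + decoder bias (weights tied)

total = embeddings + L * per_layer + lm_head
print(total)  # 109113648 — matches the reported 109,113,648
```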
The corpus includes:

- **News articles** from major Mizo publications
- **Literature** and written content
- **Social media** text
- **Government documents** and official communications
- **Web content** from Mizo language websites

**Public Dataset**: 4 million sentences are openly available at [MWireLabs/mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M) for research and development purposes.

### Data Preprocessing

- Unicode normalization
- Language identification and filtering
- Deduplication (exact and near-duplicate removal)
- Quality filtering based on length and character distributions
- Custom sentence segmentation for Mizo punctuation

### Data Split

- **Training**: 5,350,122 sentences (90%)
- **Validation**: 297,229 sentences (5%)
- **Test**: 297,230 sentences (5%)

## Performance

### Language Modeling

| Metric | Value |
|--------|-------|
| Test Perplexity | 15.85 |
| Test Loss | 2.76 |

### Qualitative Examples

The model demonstrates strong understanding of Mizo linguistic patterns and context:

**Example 1: Geographic Knowledge**

```
Input: "Mizoram hi India rama <mask> tak a ni"
Top Predictions:
• pawimawh (important) - 9.0%
• State - 4.9%
• ropui (big) - 4.5%
```

**Example 2: Urban Context**

```
Input: "Aizawl hi Mizoram <mask> a ni"
Top Predictions:
• khawpui (city) ✓ - 12.9%
• ta - 5.1%
• chhung - 3.9%

✓ Correctly identifies Aizawl as a city (khawpui)
```

### Comparison with Multilingual Models

While we have not performed a direct evaluation against multilingual models on this test set, similar monolingual approaches for low-resource languages (e.g., KhasiBERT for Khasi) have shown 45-50× improvements in perplexity over multilingual baselines such as mBERT and XLM-RoBERTa. We expect Mizo-RoBERTa to demonstrate comparable advantages on Mizo language tasks.
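The reported numbers are internally consistent; a quick check (perplexity is the exponential of the average cross-entropy loss, and the 90/5/5 split sizes should sum to the full corpus):

```python
import math

# Perplexity = exp(cross-entropy loss).
loss = 2.76  # rounded test loss from the table
perplexity = math.exp(loss)
print(round(perplexity, 1))  # 15.8 — consistent with the reported 15.85

# The train/validation/test split should sum to the ~5.94M-sentence corpus.
train, val, test = 5_350_122, 297_229, 297_230
total = train + val + test
print(total)  # 5944581
```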
## Usage

### Installation

```bash
pip install transformers torch
```

### Quick Start: Masked Language Modeling

```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline

# Load model and tokenizer
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")

# Create fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Predict the masked word ("Mizoram is a state in ___")
text = "Mizoram hi <mask> rama state a ni"
results = fill_mask(text)
for result in results:
    print(f"{result['score']:.3f}: {result['sequence']}")
```

### Extract Embeddings

```python
import torch

# Encode text
text = "Mizo tawng hi kan hman thin a ni"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Get contextualized embeddings
model.eval()
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Use the last hidden state
last_hidden = outputs.hidden_states[-1]

# Mean pooling for a sentence embedding
sentence_embedding = last_hidden.mean(dim=1)
print(f"Embedding shape: {sentence_embedding.shape}")  # torch.Size([1, 768])
```

### Fine-tuning for Classification

```python
from transformers import (
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load tokenizer and model for sequence classification
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
model = RobertaForSequenceClassification.from_pretrained(
    "MWireLabs/mizo-roberta",
    num_labels=3,  # e.g., for sentiment: positive, neutral, negative
)

# Load your labeled dataset
# Example: a sentiment analysis dataset
dataset = load_dataset("your-dataset-name")

# Tokenize
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)

# Train
trainer.train()
```

### Batch Processing

```python
import torch

# Process multiple sentences efficiently
sentences = [
    "Aizawl hi Mizoram khawpui ber a ni",
    "Mizo tawng hi Mizoram official language a ni",
    "India ram Northeast a Mizoram hi a awm",
]

# Tokenize the batch
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

# Process outputs as needed
```

## Applications

Mizo-RoBERTa can be fine-tuned for various downstream NLP tasks:

- **Text Classification** (sentiment analysis, topic classification, news categorization)
- **Named Entity Recognition** (NER for Mizo entities)
- **Question Answering** (extractive QA systems)
- **Semantic Similarity** (sentence/document similarity)
- **Information Retrieval** (semantic search in Mizo content)
- **Language Understanding** (natural language inference, textual entailment)

## Limitations

- **Dialectal Coverage**: The model may not comprehensively represent all Mizo dialects
- **Domain Balance**: Formal written text may be overrepresented compared to conversational Mizo
- **Pretraining Objective**: Only trained with Masked Language Modeling (MLM); may benefit from additional objectives
- **Context Length**: Limited to 512 tokens; longer documents require chunking
- **Low-resource Constraints**: While large for Mizo, the training corpus is still smaller than high-resource language datasets

## Ethical Considerations

- **Representation**: The model reflects the content and potential biases present in the training corpus
- **Intended Use**: Designed for research and applications that benefit Mizo language speakers
- **Misuse Potential**: Should
not be used for generating misleading information or harmful content
- **Data Privacy**: Training data was collected from publicly available sources; no private information was used
- **Cultural Sensitivity**: Users should be aware of cultural context when deploying for Mizo-speaking communities

## Citation

If you use Mizo-RoBERTa in your research or applications, please cite:

```bibtex
@misc{mizoroberta2025,
  title={Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language},
  author={MWireLabs},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/MWireLabs/mizo-roberta}}
}
```

## Related Resources

- **Public Training Data**: [mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M)
- **Sister Model**: [KhasiBERT](https://huggingface.co/MWireLabs/KhasiBERT) - a RoBERTa model for the Khasi language
- **Organization**: [MWireLabs on HuggingFace](https://huggingface.co/MWireLabs)

## Model Card Contact

For questions, issues, or collaboration opportunities:

- **Organization**: MWireLabs
- **Email**: Contact through HuggingFace
- **Issues**: Report on the model's HuggingFace page

## License

This model is released under the Apache 2.0 License. See the LICENSE file for details.

## Acknowledgments

We thank the Mizo language community and content creators whose publicly available work made this model possible. Special thanks to all contributors to the open-source NLP ecosystem, particularly the HuggingFace team for their excellent tools and infrastructure.

---

**MWireLabs** - Building AI for Northeast India 🚀