# Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language

<div align="center">

[Model](https://huggingface.co/MWireLabs/mizo-roberta) · [Dataset](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M) · [License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

*Advancing NLP for Northeast Indian Languages*

</div>
## Overview

**Mizo-RoBERTa** is a transformer-based language model for Mizo, a Tibeto-Burman language spoken by approximately 1.1 million people, primarily in Mizoram, Northeast India. Built on the RoBERTa architecture and trained on a large-scale curated corpus, this model provides state-of-the-art language understanding capabilities for Mizo NLP applications.

This work is part of MWireLabs' initiative to develop foundational language models for underserved languages of Northeast India, following our successful [KhasiBERT](https://huggingface.co/MWireLabs/KhasiBERT) model.
### Key Highlights

- **Architecture**: RoBERTa-base (110M parameters)
- **Training Scale**: 5.94M sentences, 138.7M tokens
- **Open Data**: 4M sentences publicly available at [mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M)
- **Custom Tokenizer**: Trained specifically for Mizo (30K BPE vocabulary)
- **Efficient**: Trained in ~4-6 hours on a single NVIDIA A40 GPU
- **Open Source**: Model, tokenizer, and training code publicly available
## Model Details

### Architecture

| Component | Specification |
|-----------|---------------|
| Base Architecture | RoBERTa-base |
| Parameters | 109,113,648 (~110M) |
| Layers | 12 transformer layers |
| Attention Heads | 12 |
| Hidden Size | 768 |
| Intermediate Size | 3,072 |
| Max Sequence Length | 512 tokens |
| Vocabulary Size | 30,000 (custom BPE) |
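These values can be verified against the published checkpoint. A quick sanity check, assuming the config and weights are hosted on the Hub as usual:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Read the published configuration; values should match the table above
config = RobertaConfig.from_pretrained("MWireLabs/mizo-roberta")
print(config.num_hidden_layers, config.num_attention_heads)  # 12 12
print(config.hidden_size, config.intermediate_size)          # 768 3072
print(config.vocab_size)                                     # 30000
# RoBERTa stores 512 usable positions plus 2 offset slots
print(config.max_position_embeddings)                        # 514

# Count parameters to confirm the ~110M figure
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")
print(sum(p.numel() for p in model.parameters()))            # ≈ 109,113,648
```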
### Training Configuration

| Setting | Value |
|---------|-------|
| Training Data | 5.94M sentences (138.7M tokens) |
| Public Dataset | 4M sentences available on HuggingFace |
| Batch Size | 32 per device |
| Learning Rate | 1e-4 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Warmup Steps | 10,000 |
| Training Epochs | 2 |
| Hardware | 1x NVIDIA A40 (48GB) |
| Training Time | ~4-6 hours |
| Precision | Mixed (FP16) |
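For reference, a minimal sketch of an MLM pretraining setup that mirrors these hyperparameters. This is not the exact training script: the corpus loading, the `"text"` column name, and the standard 15% masking probability are assumptions.

```python
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
# For true from-scratch pretraining you would initialize from a fresh RobertaConfig instead
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")

# The public 4M-sentence subset stands in for the full corpus here
dataset = load_dataset("MWireLabs/mizo-language-corpus-4M")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)

# Dynamic masking, as in RoBERTa; 15% is the conventional mask probability
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="./mizo-roberta-pretrain",
    per_device_train_batch_size=32,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=10_000,
    num_train_epochs=2,
    fp16=True,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"], data_collator=collator)
trainer.train()
```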
## Training Data

Mizo-RoBERTa was trained on a large-scale Mizo corpus comprising 5.94 million sentences (138.7 million tokens), averaging 23.3 tokens per sentence. The corpus includes:

- **News articles** from major Mizo publications
- **Literature** and written content
- **Social media** text
- **Government documents** and official communications
- **Web content** from Mizo language websites

**Public Dataset**: 4 million sentences are openly available at [MWireLabs/mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M) for research and development purposes.
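The public subset can be pulled directly with the `datasets` library. A sketch; the `train` split name and streaming usage are assumptions based on common Hub conventions:

```python
from datasets import load_dataset

# Stream the public corpus so nothing is downloaded up front
corpus = load_dataset("MWireLabs/mizo-language-corpus-4M", split="train", streaming=True)

# Peek at the first few examples
for i, example in enumerate(corpus):
    print(example)
    if i == 2:
        break
```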
### Data Preprocessing

- Unicode normalization
- Language identification and filtering
- Deduplication (exact and near-duplicate removal)
- Quality filtering based on length and character distributions
- Custom sentence segmentation for Mizo punctuation
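A minimal sketch of what a pipeline along these lines can look like; the thresholds and filters below are illustrative assumptions, not the exact ones used for this corpus:

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    # Unicode normalization (NFC keeps composed characters consistent)
    return unicodedata.normalize("NFC", text).strip()

def passes_quality_filter(text: str, min_tokens: int = 3, max_tokens: int = 200) -> bool:
    # Illustrative length and character-distribution filters
    n_tokens = len(text.split())
    if not min_tokens <= n_tokens <= max_tokens:
        return False
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / max(len(text), 1)
    return alpha_ratio > 0.8

def deduplicate(sentences):
    # Exact dedup via hashing; near-duplicate removal (e.g., MinHash) is a separate pass
    seen = set()
    for s in sentences:
        key = hashlib.sha1(s.casefold().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield s

# Hypothetical raw input with an exact duplicate
raw_sentences = ["Aizawl hi Mizoram khawpui ber a ni", "Aizawl hi Mizoram khawpui ber a ni"]
clean = list(deduplicate(s for s in map(normalize, raw_sentences) if passes_quality_filter(s)))
print(clean)  # the duplicate is removed
```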
### Data Split

- **Training**: 5,350,122 sentences (90%)
- **Validation**: 297,229 sentences (5%)
- **Test**: 297,230 sentences (5%)
## Performance

### Language Modeling

| Metric | Value |
|--------|-------|
| Test Perplexity | 15.85 |
| Test Loss | 2.76 |
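The two figures are mutually consistent: perplexity is the exponential of the mean cross-entropy loss.

```python
import math

# Perplexity = exp(cross-entropy loss); exp(2.76) ≈ 15.80,
# which matches the reported 15.85 up to rounding of the loss
print(math.exp(2.76))
```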
### Qualitative Examples

The model produces contextually appropriate predictions for masked tokens, reflecting Mizo linguistic patterns:

**Example 1: Geographic Knowledge**

```
Input: "Mizoram hi India rama <mask> tak a ni"
Top Predictions:
• pawimawh (important) - 9.0%
• State - 4.9%
• ropui (big) - 4.5%
```

**Example 2: Urban Context**

```
Input: "Aizawl hi Mizoram <mask> a ni"
Top Predictions:
• khawpui (city) ✓ - 12.9%
• ta - 5.1%
• chhung - 3.9%

✓ Correctly identifies Aizawl as a city (khawpui)
```
### Comparison with Multilingual Models

We have not yet evaluated multilingual models directly on this test set. However, comparable monolingual approaches for low-resource languages (e.g., KhasiBERT for Khasi) have achieved 45-50× lower perplexity than multilingual baselines such as mBERT and XLM-RoBERTa, and we expect Mizo-RoBERTa to show similar advantages on Mizo language tasks.
## Usage

### Installation

```bash
pip install transformers torch
```
### Quick Start: Masked Language Modeling

```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline

# Load model and tokenizer
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")

# Create fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Predict masked words
text = "Mizoram hi <mask> rama state a ni"
results = fill_mask(text)

for result in results:
    print(f"{result['score']:.3f}: {result['sequence']}")
```
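The `<mask>` literal has to match the tokenizer's mask token; building the prompt from `tokenizer.mask_token` avoids hard-coding it:

```python
# Build the masked prompt programmatically so it always matches the tokenizer
text = f"Mizoram hi {tokenizer.mask_token} rama state a ni"
for result in fill_mask(text, top_k=3):
    print(f"{result['score']:.3f}: {result['token_str']}")
```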
### Extract Embeddings

```python
import torch

# Reuses `model` and `tokenizer` loaded in the Quick Start example above

# Encode text
text = "Mizo tawng hi kan hman thin a ni"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Get contextualized embeddings
model.eval()
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Use last hidden state
last_hidden = outputs.hidden_states[-1]

# Mean pooling for sentence embedding
sentence_embedding = last_hidden.mean(dim=1)
print(f"Embedding shape: {sentence_embedding.shape}")
# Output: torch.Size([1, 768])
```
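As a usage example, the pooled vectors can be compared with cosine similarity (the sentence pair below is illustrative):

```python
import torch
import torch.nn.functional as F

def embed(text: str) -> torch.Tensor:
    # Mean-pooled sentence embedding, as above
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[-1].mean(dim=1)

a = embed("Aizawl hi Mizoram khawpui ber a ni")
b = embed("Mizo tawng hi Mizoram official language a ni")
print(F.cosine_similarity(a, b).item())  # closer to 1.0 means more similar
```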
### Fine-tuning for Classification

```python
from transformers import (
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load tokenizer and model with a classification head
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
model = RobertaForSequenceClassification.from_pretrained(
    "MWireLabs/mizo-roberta",
    num_labels=3  # e.g., for sentiment: positive, neutral, negative
)

# Load your labeled dataset
# Example: sentiment analysis dataset
dataset = load_dataset("your-dataset-name")

# Tokenize
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    eval_strategy="epoch",  # `evaluation_strategy` on transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)

# Train
trainer.train()
```
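After training, the fine-tuned model can be served through a pipeline; the example output below is hypothetical:

```python
from transformers import pipeline

# `trainer.model` holds the best checkpoint (load_best_model_at_end=True)
classifier = pipeline("text-classification", model=trainer.model, tokenizer=tokenizer)
print(classifier("Mizo tawng hi kan hman thin a ni"))
# e.g. [{'label': 'LABEL_1', 'score': 0.87}]
```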
### Batch Processing

```python
import torch

# Process multiple sentences efficiently, reusing the MLM `model` and
# `tokenizer` from the Quick Start example
sentences = [
    "Aizawl hi Mizoram khawpui ber a ni",
    "Mizo tawng hi Mizoram official language a ni",
    "India ram Northeast a Mizoram hi a awm"
]

# Tokenize batch
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

# Process outputs as needed
```
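One caveat when pooling batched outputs: padding positions should be excluded from the mean, or shorter sentences get diluted. A sketch of mask-aware pooling:

```python
# Mask-aware mean pooling over the batch
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

hidden = outputs.hidden_states[-1]                     # (batch, seq_len, 768)
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)

# Zero out padding positions, then divide by each sentence's true length
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([3, 768])
```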
## Applications

Mizo-RoBERTa can be fine-tuned for various downstream NLP tasks:

- **Text Classification** (sentiment analysis, topic classification, news categorization)
- **Named Entity Recognition** (NER for Mizo entities; see the sketch after this list)
- **Question Answering** (extractive QA systems)
- **Semantic Similarity** (sentence/document similarity)
- **Information Retrieval** (semantic search in Mizo content)
- **Language Understanding** (natural language inference, textual entailment)
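For token-level tasks such as NER, the same checkpoint loads into a token-classification head. A sketch; the BIO label set is hypothetical:

```python
from transformers import RobertaForTokenClassification, RobertaTokenizerFast

# Hypothetical BIO label set for a Mizo NER task
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

model = RobertaForTokenClassification.from_pretrained(
    "MWireLabs/mizo-roberta",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# RoBERTa tokenizers need add_prefix_space=True for pre-tokenized (word-split) input
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta", add_prefix_space=True)
```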
## Limitations

- **Dialectal Coverage**: The model may not comprehensively represent all Mizo dialects
- **Domain Balance**: Formal written text may be overrepresented compared to conversational Mizo
- **Pretraining Objective**: Only trained with Masked Language Modeling (MLM); may benefit from additional objectives
- **Context Length**: Limited to 512 tokens; longer documents require chunking (see the sketch after this list)
- **Low-resource Constraints**: While large for Mizo, the training corpus is still smaller than high-resource language datasets
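A common workaround for the 512-token limit is a sliding window with overlap, which fast tokenizers support directly. A sketch; the stride value is an arbitrary choice:

```python
def chunk_document(text: str, tokenizer, max_length: int = 512, stride: int = 128):
    """Split a long document into overlapping max_length-token windows."""
    return tokenizer(
        text,
        max_length=max_length,
        stride=stride,
        truncation=True,
        return_overflowing_tokens=True,
        return_tensors="pt",
    )

# Usage: encoded = chunk_document(long_text, tokenizer)
# encoded["input_ids"] has shape (num_chunks, max_length); run the model per
# chunk and aggregate, e.g., by averaging pooled embeddings across chunks
```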
## Ethical Considerations

- **Representation**: The model reflects the content and potential biases present in the training corpus
- **Intended Use**: Designed for research and applications that benefit Mizo language speakers
- **Misuse Potential**: Should not be used for generating misleading information or harmful content
- **Data Privacy**: Training data was collected from publicly available sources; no private information was used
- **Cultural Sensitivity**: Users should be aware of cultural context when deploying for Mizo-speaking communities
## Citation

If you use Mizo-RoBERTa in your research or applications, please cite:

```bibtex
@misc{mizoroberta2025,
  title={Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language},
  author={MWireLabs},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/MWireLabs/mizo-roberta}}
}
```
## Related Resources

- **Public Training Data**: [mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M)
- **Sister Model**: [KhasiBERT](https://huggingface.co/MWireLabs/KhasiBERT) - RoBERTa model for the Khasi language
- **Organization**: [MWireLabs on HuggingFace](https://huggingface.co/MWireLabs)
## Model Card Contact

For questions, issues, or collaboration opportunities:

- **Organization**: MWireLabs
- **Email**: Contact through HuggingFace
- **Issues**: Report on the model's HuggingFace page
## License

This model is released under the Apache 2.0 License. See the LICENSE file for details.

## Acknowledgments

We thank the Mizo language community and content creators whose publicly available work made this model possible. Special thanks to all contributors to the open-source NLP ecosystem, particularly the HuggingFace team for their excellent tools and infrastructure.

---

**MWireLabs** - Building AI for Northeast India 🚀