# Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language

<div align="center">

[![Model](https://img.shields.io/badge/🤗-Model-yellow)](https://huggingface.co/MWireLabs/mizo-roberta)
[![Dataset](https://img.shields.io/badge/🤗-Public%20Dataset-blue)](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)

*Advancing NLP for Northeast Indian Languages*

</div>

## Overview

**Mizo-RoBERTa** is a transformer-based language model for Mizo, a Tibeto-Burman language spoken by approximately 1.1 million people, primarily in Mizoram, Northeast India. Built on the RoBERTa architecture and trained on a large-scale curated corpus, the model provides strong language understanding capabilities for Mizo NLP applications.

This work is part of MWireLabs' initiative to develop foundational language models for underserved languages of Northeast India, following our successful [KhasiBERT](https://huggingface.co/MWireLabs/KhasiBERT) model.

### Key Highlights

- **Architecture**: RoBERTa-base (110M parameters)
- **Training Scale**: 5.94M sentences, 138.7M tokens
- **Open Data**: 4M sentences publicly available at [mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M)
- **Custom Tokenizer**: Trained specifically for Mizo (30K BPE vocabulary)
- **Efficient**: Trained in roughly 4-6 hours on a single NVIDIA A40 GPU
- **Open Source**: Model, tokenizer, and training code publicly available

## Model Details

### Architecture

| Component | Specification |
|-----------|--------------|
| Base Architecture | RoBERTa-base |
| Parameters | 109,113,648 (~110M) |
| Layers | 12 transformer layers |
| Attention Heads | 12 |
| Hidden Size | 768 |
| Intermediate Size | 3,072 |
| Max Sequence Length | 512 tokens |
| Vocabulary Size | 30,000 (custom BPE) |
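
These figures can be cross-checked against the published configuration. A quick sketch (the inline values are the ones the table above implies):

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig.from_pretrained("MWireLabs/mizo-roberta")
print(config.num_hidden_layers)    # 12
print(config.num_attention_heads)  # 12
print(config.hidden_size)          # 768
print(config.intermediate_size)    # 3072
print(config.vocab_size)           # 30000

# Total parameter count (~109M per the table above)
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")
print(sum(p.numel() for p in model.parameters()))
```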

### Training Configuration

| Setting | Value |
|---------|-------|
| Training Data | 5.94M sentences (138.7M tokens) |
| Public Dataset | 4M sentences available on HuggingFace |
| Batch Size | 32 per device |
| Learning Rate | 1e-4 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Warmup Steps | 10,000 |
| Training Epochs | 2 |
| Hardware | 1x NVIDIA A40 (48GB) |
| Training Time | ~4-6 hours |
| Precision | Mixed (FP16) |
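
For reference, here is a sketch of how these settings map onto a standard Hugging Face masked-language-modeling run. This is illustrative only, not the released training script: the from-scratch config, the 15% masking probability, and the placeholder dataset name are assumptions.

```python
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# RoBERTa-base sized model over the 30K custom BPE vocabulary
config = RobertaConfig(
    vocab_size=30_000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,  # 512 usable positions + RoBERTa's position offset
)
model = RobertaForMaskedLM(config)

tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
# Dynamic masking at the standard 15% rate (assumed)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./mizo-roberta-mlm",
    per_device_train_batch_size=32,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=10_000,
    num_train_epochs=2,
    fp16=True,
)

# trainer = Trainer(model=model, args=training_args, data_collator=data_collator,
#                   train_dataset=tokenized_train)  # tokenized_train: pre-tokenized corpus (placeholder)
# trainer.train()
```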

## Training Data

Trained on a large-scale Mizo corpus comprising 5.94 million sentences (138.7 million tokens) with an average of 23.3 tokens per sentence. The corpus includes:

- **News articles** from major Mizo publications
- **Literature** and written content
- **Social media** text
- **Government documents** and official communications
- **Web content** from Mizo language websites

**Public Dataset**: 4 million sentences are openly available at [MWireLabs/mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M) for research and development purposes.
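
The public portion can be pulled straight from the Hub with the `datasets` library. A minimal example (split and column names are assumptions; inspect the returned object to confirm them):

```python
from datasets import load_dataset

corpus = load_dataset("MWireLabs/mizo-language-corpus-4M")
print(corpus)              # available splits and columns
print(corpus["train"][0])  # peek at a single record
```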

### Data Preprocessing

- Unicode normalization
- Language identification and filtering
- Deduplication (exact and near-duplicate removal)
- Quality filtering based on length and character distributions
- Custom sentence segmentation for Mizo punctuation
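
A minimal sketch of such a cleaning pass is shown below. It is illustrative only; the exact filters, thresholds, and language-identification model used for this corpus are not reproduced here.

```python
import unicodedata

def clean_corpus(sentences, min_tokens=3, max_tokens=200):
    """Unicode-normalize, length-filter, and exact-deduplicate a list of sentences."""
    seen = set()
    cleaned = []
    for s in sentences:
        s = unicodedata.normalize("NFC", s).strip()     # Unicode normalization
        n_tokens = len(s.split())
        if not (min_tokens <= n_tokens <= max_tokens):  # crude length-based quality filter
            continue
        if s in seen:                                   # exact deduplication
            continue
        seen.add(s)
        cleaned.append(s)
    return cleaned
```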

### Data Split

- **Training**: 5,350,122 sentences (90%)
- **Validation**: 297,229 sentences (5%)
- **Test**: 297,230 sentences (5%)
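
On the public 4M-sentence release, a comparable 90/5/5 split can be reproduced along these lines (a sketch only; the original split indices are not published):

```python
from datasets import load_dataset

corpus = load_dataset("MWireLabs/mizo-language-corpus-4M", split="train")

# Hold out 10%, then halve it into validation and test
tmp = corpus.train_test_split(test_size=0.10, seed=42)
heldout = tmp["test"].train_test_split(test_size=0.50, seed=42)
train_ds, val_ds, test_ds = tmp["train"], heldout["train"], heldout["test"]
```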

## Performance

### Language Modeling

| Metric | Value |
|--------|-------|
| Test Perplexity | 15.85 |
| Test Loss | 2.76 |
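
As a sanity check, perplexity is the exponential of the test cross-entropy loss, so the two numbers above are consistent up to rounding of the reported loss:

```python
import math

print(math.exp(2.76))  # ≈ 15.80, matching the reported perplexity of 15.85
```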

### Qualitative Examples

The model demonstrates strong understanding of Mizo linguistic patterns and context:

**Example 1: Geographic Knowledge**
```
Input:  "Mizoram hi India rama <mask> tak a ni"
Top Predictions:
  • pawimawh (important) - 9.0%
  • State - 4.9%
  • ropui (big) - 4.5%
```

**Example 2: Urban Context**
```
Input:  "Aizawl hi Mizoram <mask> a ni"
Top Predictions:
  • khawpui (city) ✓ - 12.9%
  • ta - 5.1%
  • chhung - 3.9%

✓ Correctly identifies Aizawl as a city (khawpui)
```

### Comparison with Multilingual Models

While we haven't performed direct evaluation against multilingual models on this test set, similar monolingual approaches for low-resource languages (e.g., KhasiBERT for Khasi) have shown 45-50× improvements in perplexity over multilingual baselines like mBERT and XLM-RoBERTa. We expect Mizo-RoBERTa to demonstrate comparable advantages for Mizo language tasks.

## Usage

### Installation
```bash
pip install transformers torch
```

### Quick Start: Masked Language Modeling
```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline

# Load model and tokenizer
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")

# Create fill-mask pipeline
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# Predict masked words
text = "Mizoram hi <mask> rama state a ni"
results = fill_mask(text)

for result in results:
    print(f"{result['score']:.3f}: {result['sequence']}")
```

### Extract Embeddings
```python
import torch

# Encode text (reuses the model and tokenizer loaded in the Quick Start example above)
text = "Mizo tawng hi kan hman thin a ni"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Get contextualized embeddings
model.eval()
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

    # Use last hidden state
    last_hidden = outputs.hidden_states[-1]

    # Mean pooling for sentence embedding
    sentence_embedding = last_hidden.mean(dim=1)

print(f"Embedding shape: {sentence_embedding.shape}")
# Output: torch.Size([1, 768])
```

### Fine-tuning for Classification
```python
from transformers import (
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load the tokenizer and a classification head on top of the pretrained encoder
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
model = RobertaForSequenceClassification.from_pretrained(
    "MWireLabs/mizo-roberta",
    num_labels=3  # e.g., for sentiment: positive, neutral, negative
)

# Load your labeled dataset
# Example: sentiment analysis dataset
dataset = load_dataset("your-dataset-name")

# Tokenize
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)

# Train
trainer.train()
```

### Batch Processing
```python
# Process multiple sentences efficiently (reusing the model and tokenizer from the Quick Start example)
sentences = [
    "Aizawl hi Mizoram khawpui ber a ni",
    "Mizo tawng hi Mizoram official language a ni",
    "India ram Northeast a Mizoram hi a awm"
]

# Tokenize batch
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

# Process outputs as needed
```

## Applications

Mizo-RoBERTa can be fine-tuned for various downstream NLP tasks:

- **Text Classification** (sentiment analysis, topic classification, news categorization)
- **Named Entity Recognition** (NER for Mizo entities; see the sketch after this list)
- **Question Answering** (extractive QA systems)
- **Semantic Similarity** (sentence/document similarity)
- **Information Retrieval** (semantic search in Mizo content)
- **Language Understanding** (natural language inference, textual entailment)
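
For example, a token-classification head for Mizo NER attaches in the same way as the sequence-classification recipe in the Usage section. A sketch (the tag set and any training data are placeholders):

```python
from transformers import RobertaForTokenClassification, RobertaTokenizerFast

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # hypothetical tag set

tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
model = RobertaForTokenClassification.from_pretrained(
    "MWireLabs/mizo-roberta",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# Fine-tune with Trainer on a token-labelled Mizo dataset,
# mirroring the classification example shown above.
```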

## Limitations

- **Dialectal Coverage**: The model may not comprehensively represent all Mizo dialects
- **Domain Balance**: Formal written text may be overrepresented compared to conversational Mizo
- **Pretraining Objective**: Only trained with Masked Language Modeling (MLM); may benefit from additional objectives
- **Context Length**: Limited to 512 tokens; longer documents require chunking (see the sketch after this list)
- **Low-resource Constraints**: While large for Mizo, the training corpus is still smaller than high-resource language datasets
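
For documents longer than 512 tokens, one common workaround is a sliding window over the tokenized text. A minimal sketch using the fast tokenizer's overflow support (the window size and stride here are arbitrary choices):

```python
def chunk_document(text, tokenizer, max_length=512, stride=64):
    """Split a long document into overlapping fixed-length token windows."""
    enc = tokenizer(
        text,
        max_length=max_length,
        stride=stride,
        truncation=True,
        padding="max_length",
        return_overflowing_tokens=True,
        return_tensors="pt",
    )
    return enc["input_ids"]  # shape: (num_chunks, max_length)

# chunks = chunk_document(long_document, tokenizer)  # long_document is a placeholder
```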

## Ethical Considerations

- **Representation**: The model reflects the content and potential biases present in the training corpus
- **Intended Use**: Designed for research and applications that benefit Mizo language speakers
- **Misuse Potential**: Should not be used for generating misleading information or harmful content
- **Data Privacy**: Training data was collected from publicly available sources; no private information was used
- **Cultural Sensitivity**: Users should be aware of cultural context when deploying for Mizo-speaking communities

## Citation

If you use Mizo-RoBERTa in your research or applications, please cite:
```bibtex
@misc{mizoroberta2025,
  title={Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language},
  author={MWireLabs},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/MWireLabs/mizo-roberta}}
}
```

## Related Resources

- **Public Training Data**: [mizo-language-corpus-4M](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M)
- **Sister Model**: [KhasiBERT](https://huggingface.co/MWireLabs/KhasiBERT) - RoBERTa model for Khasi language
- **Organization**: [MWireLabs on HuggingFace](https://huggingface.co/MWireLabs)

## Model Card Contact

For questions, issues, or collaboration opportunities:
- **Organization**: MWireLabs
- **Email**: Contact through HuggingFace
- **Issues**: Report on the model's HuggingFace page

## License

This model is released under the Apache 2.0 License. See LICENSE file for details.

## Acknowledgments

We thank the Mizo language community and content creators whose publicly available work made this model possible. Special thanks to all contributors to the open-source NLP ecosystem, particularly the HuggingFace team for their excellent tools and infrastructure.

---

**MWireLabs** - Building AI for Northeast India 🚀