---
license: cc-by-4.0
language:
- lus
tags:
- ocr
- mizo
- northeast-india
- trocr
- image-to-text
- low-resource
model_name: mizo-ocr
base_model: microsoft/trocr-base-printed
---
# MizoOCR
The first OCR model for the Mizo language, developed by [MWire Labs](https://huggingface.co/MWirelabs).
## Model Description
MizoOCR is a fine-tuned TrOCR model for recognizing printed Mizo text, including its diacritical characters (â, ê, î, ô, û). It is built on `microsoft/trocr-base-printed` and trained on 70,000 deduplicated image-text pairs, a mix of curated and synthetic samples drawn from a 200k dataset generated by MWire Labs.
## Performance
| Split | Character Accuracy |
|-------|-------------------|
| Validation | 89.61% |
| Test | 90.68% |
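
The card does not say how character accuracy is computed. A common convention, assumed here, is one minus the character error rate: the total edit distance between predictions and references, divided by the total number of reference characters. The sketch below follows that assumption; `edit_distance` and `character_accuracy` are illustrative names, not part of this repository. Both strings are normalized to NFC first, so that a vowel such as â compares equal whether it is stored precomposed or as `a` plus a combining circumflex.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(predictions, references):
    """Assumed definition: 1 - (total edits / total reference characters)."""
    total_edits = 0
    total_chars = 0
    for pred, ref in zip(predictions, references):
        # NFC makes precomposed and decomposed diacritics compare equal.
        pred = unicodedata.normalize("NFC", pred)
        ref = unicodedata.normalize("NFC", ref)
        total_edits += edit_distance(pred, ref)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return 1.0 - total_edits / total_chars
```

Under this definition, a prediction that differs from a four-character reference by one substitution scores 0.75. The value can drop below zero when predictions are much longer than their references.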
## Training Data
- **Total unique samples after deduplication:** 102,171
- **Training samples:** 70,000
- **Validation samples:** 5,000
- **Test samples:** 5,000
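
The counts above imply a deduplicate-then-split procedure, but the card does not describe it. As a minimal sketch, assume duplicates are pairs with identical transcription text and that the splits are drawn by a seeded shuffle. The function name `dedupe_and_split`, the `(image_path, text)` tuple layout, and the seed are all hypothetical.

```python
import random


def dedupe_and_split(pairs, n_train, n_val, n_test, seed=0):
    """Keep the first pair per transcription, then cut disjoint splits.

    `pairs` is a list of (image_path, text) tuples. Deduplicating on the
    text alone is an assumption, not the documented pipeline.
    """
    seen = set()
    unique = []
    for image_path, text in pairs:
        if text not in seen:
            seen.add(text)
            unique.append((image_path, text))

    needed = n_train + n_val + n_test
    if needed > len(unique):
        raise ValueError(f"need {needed} samples, only {len(unique)} unique")

    # A seeded shuffle keeps the split reproducible across runs.
    rng = random.Random(seed)
    rng.shuffle(unique)

    train = unique[:n_train]
    val = unique[n_train:n_train + n_val]
    test = unique[n_train + n_val:needed]
    return train, val, test
```

With the reported sizes, the three splits use 70,000 + 5,000 + 5,000 = 80,000 of the 102,171 unique samples, leaving 22,171 unassigned. The card does not say how those remaining samples are used.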
## Usage
```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load the processor (image preprocessing + tokenizer) and the fine-tuned model
processor = TrOCRProcessor.from_pretrained("MWirelabs/mizo-ocr")
model = VisionEncoderDecoderModel.from_pretrained("MWirelabs/mizo-ocr")

# Convert to 3-channel RGB, which the image processor expects
image = Image.open("mizo_text.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Generate token IDs, then decode them to text without special tokens
generated = model.generate(pixel_values)
text = processor.tokenizer.decode(generated[0], skip_special_tokens=True)
print(text)
```
## Limitations
- Trained primarily on synthetic data with a small curated dataset; accuracy on real scanned documents may vary
- Optimized for printed text, not handwritten
- Performance may vary on heavily degraded or low-quality images
## Citation
If you use this model, please cite:
```
@misc{mwirelabs2026mizoocr,
  title={MizoOCR: First OCR Model for the Mizo Language},
  author={MWire Labs},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/MWirelabs/mizo-ocr}
}
```
## About MWire Labs
MWire Labs is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.