Nyishi OCR
OCR model for the Nyishi language of Arunachal Pradesh, India. Developed by MWire Labs as part of the Northeast India OCR initiative.
Model Details
- Architecture: DocTR ViTSTR-Base (85.3M parameters)
- Script: Latin
- Language: Nyishi (Nishi), spoken by ~300,000 people in Arunachal Pradesh
- License: CC-BY-4.0
Performance
| Split | Char Accuracy |
|---|---|
| Validation | 94.60% |
| Test | 95.51% |
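The card does not say how character accuracy is computed. One common definition, assumed here, is one minus the character error rate: total Levenshtein edits divided by total reference characters. The function names below are illustrative and are not part of the released model.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete from ref
                            curr[j - 1] + 1,      # insert into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level accuracy: 1 - (total edits / total reference characters)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return 1.0 - total_edits / total_chars
```

Under this definition, one wrong character in a four-character reference gives an accuracy of 0.75. Other conventions, such as exact per-position matching, would yield different numbers.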
Training Data
Fine-tuned on a mix of 84k unique synthetic images and 5k curated images. The synthetic images were rendered in 21 fonts and augmented with blur, noise, rotation, and brightness variation.
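As a rough illustration of the augmentation setup described above, the sketch below samples one set of augmentation parameters per synthetic image. Only the font count (21) and the four augmentation types come from this card. The parameter ranges, the `AugmentParams` class, and `sample_params` are hypothetical and do not reflect the actual generation pipeline.

```python
import random
from dataclasses import dataclass

NUM_FONTS = 21  # from the model card; every range below is an illustrative guess


@dataclass
class AugmentParams:
    font_index: int      # which of the 21 fonts to render the text with
    blur_radius: float   # Gaussian blur radius in pixels
    noise_sigma: float   # std-dev of additive pixel noise on a 0-255 scale
    rotation_deg: float  # small in-plane rotation in degrees
    brightness: float    # multiplicative brightness factor (1.0 = unchanged)


def sample_params(rng: random.Random) -> AugmentParams:
    """Draw one random augmentation configuration (hypothetical ranges)."""
    return AugmentParams(
        font_index=rng.randrange(NUM_FONTS),
        blur_radius=rng.uniform(0.0, 1.5),
        noise_sigma=rng.uniform(0.0, 10.0),
        rotation_deg=rng.uniform(-3.0, 3.0),
        brightness=rng.uniform(0.8, 1.2),
    )
```

A rendering step would then apply these parameters with an image library. Passing a seeded `random.Random` instance keeps the generated dataset reproducible.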
Usage
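import json
import torch
from doctr.models import vitstr_base

# Load the character set the model was trained with
with open("nyishi_charset.json", encoding="utf-8") as f:
    charset = json.load(f)["charset"]

# Build the architecture without pretrained weights, then load the fine-tuned checkpoint
model = vitstr_base(pretrained=False, vocab=charset)
model.load_state_dict(torch.load("nyishi_doctr_best.pt", map_location="cpu"))
model.eval()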
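Once the weights are loaded, the model can be wrapped in a DocTR recognition predictor to read pre-cropped word or line images. The following is a minimal sketch, assuming a recent `python-doctr` release with the PyTorch backend. `recognize_crops`, `keep_confident`, the 0.5 threshold, and the file name `word_crop.png` are illustrative and not part of this release.

```python
from typing import List, Sequence, Tuple


def recognize_crops(model, crops: Sequence) -> List[Tuple[str, float]]:
    """Run a DocTR recognition model on pre-cropped text images.

    `crops` is a sequence of HxWx3 uint8 NumPy arrays; the result is one
    (text, confidence) pair per crop.
    """
    # Imported lazily so the helper below stays usable without DocTR installed.
    from doctr.models import recognition_predictor

    predictor = recognition_predictor(arch=model, pretrained=False)
    return predictor(list(crops))


def keep_confident(
    predictions: Sequence[Tuple[str, float]], min_confidence: float = 0.5
) -> List[str]:
    """Keep only the texts whose confidence is at least `min_confidence`."""
    return [text for text, conf in predictions if conf >= min_confidence]


# Example call (hypothetical file name; `model` is the object built above):
# from doctr.io import DocumentFile
# crops = DocumentFile.from_images("word_crop.png")
# print(keep_confident(recognize_crops(model, crops)))
```

This is a recognition-only model, so full-page images would first need a text detector to produce the crops.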
Citation
If you use this model, please cite:
@misc{mwirelabs2026nyishiocr,
title={NyishiOCR: First OCR Model for the Nyishi Language},
author={MWire Labs},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/MWirelabs/nyishi-ocr}
}
About MWire Labs
MWire Labs is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.
Evaluation results
- Test Char Accuracy (self-reported): 95.51%
- Val Char Accuracy (self-reported): 94.60%