metadata
language:
- en
- njz
tags:
- ocr
- doctr
- vitstr
- northeast-india
- arunachal-pradesh
- nyishi
- low-resource
- image-text-to-text
license: cc-by-4.0
library_name: doctr
metrics:
- cer
model-index:
- name: NyishiOCR
  results:
  - task:
      type: image-text-to-text
      name: Optical Character Recognition
    metrics:
    - type: char_accuracy
      value: 95.51
      name: Test Char Accuracy
    - type: char_accuracy
      value: 94.6
      name: Val Char Accuracy
Nyishi OCR
OCR model for the Nyishi language of Arunachal Pradesh, India. Developed by MWire Labs as part of the Northeast India OCR initiative.
Model Details
- Architecture: docTR ViTSTR-Base (85.3M parameters)
- Script: Latin
- Language: Nyishi (Nishi), spoken by ~300,000 people in Arunachal Pradesh
- License: CC-BY-4.0
Performance
| Split | Char Accuracy |
|---|---|
| Validation | 94.60% |
| Test | 95.51% |
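The card does not state how character accuracy was computed. One common definition, assumed here purely for illustration, is 100 × (1 − CER), where CER is the total Levenshtein edit distance divided by the total number of reference characters. The function names below are hypothetical and are not part of docTR.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete from ref
                            curr[j - 1] + 1,      # insert into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level character accuracy in percent, assumed to be 100 * (1 - CER)."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * (1.0 - total_errors / total_chars)
```

Under this definition, one wrong character across eight reference characters gives 87.5%. If the reported numbers were computed per sample and then averaged, or by exact position matching, the values would differ.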
Training Data
Fine-tuned on a mix of 84k unique images: synthetic images plus 5k curated images. The synthetic images were rendered with 21 fonts and augmented with blur, noise, rotation, and brightness variation.
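The four augmentations named above can be sketched in pure Python on a grayscale image stored as a list of rows. This is only an illustration of the idea, not the pipeline used for this model: the function names, the ±3 degree rotation range, the 0.8 to 1.2 brightness range, and the noise level are all assumptions, and a real pipeline would normally use an image library.

```python
import math
import random

Image = list[list[int]]  # grayscale, row-major, values 0..255


def clamp(v: float) -> int:
    return max(0, min(255, int(round(v))))


def adjust_brightness(img: Image, factor: float) -> Image:
    return [[clamp(p * factor) for p in row] for row in img]


def add_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    # Additive Gaussian noise, clipped back to the valid pixel range
    return [[clamp(p + rng.gauss(0, sigma)) for p in row] for row in img]


def box_blur(img: Image) -> Image:
    # 3x3 mean filter; border pixels average over the neighbours that exist
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[yy][xx]
                    for yy in range(max(0, y - 1), min(h, y + 2))
                    for xx in range(max(0, x - 1), min(w, x + 2))]
            row.append(clamp(sum(vals) / len(vals)))
        out.append(row)
    return out


def rotate(img: Image, degrees: float, fill: int = 255) -> Image:
    # Nearest-neighbour rotation about the image centre via inverse mapping
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    a = math.radians(degrees)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = x - cx, y - cy
            sx = round(cx + dx * cos_a + dy * sin_a)
            sy = round(cy - dx * sin_a + dy * cos_a)
            row.append(img[sy][sx] if 0 <= sx < w and 0 <= sy < h else fill)
        out.append(row)
    return out


def augment(img: Image, rng: random.Random) -> Image:
    # Assumed parameter ranges, chosen only for illustration
    img = rotate(img, rng.uniform(-3, 3))
    if rng.random() < 0.5:
        img = box_blur(img)
    img = adjust_brightness(img, rng.uniform(0.8, 1.2))
    return add_noise(img, sigma=5.0, rng=rng)
```

Passing a seeded `random.Random` makes each augmented sample reproducible, which helps when checking that a generated set contains no duplicates.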
Usage
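import json
import torch
from doctr.models import vitstr_base

# Load the character set used during training
with open("nyishi_charset.json", encoding="utf-8") as f:
    charset = json.load(f)["charset"]

# Build the model with the Nyishi vocabulary, then load the fine-tuned weights
model = vitstr_base(pretrained=False, vocab=charset)
model.load_state_dict(torch.load("nyishi_doctr_best.pt", map_location="cpu"))
model.eval()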
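The vocabulary passed to the model has to match the checkpoint character for character and in the same order, because each position corresponds to an output class. The sketch below is a hypothetical helper (not part of docTR) that sanity-checks the charset file before the model is built. It assumes the "charset" entry is either a string or a list of single-character strings; the actual layout of nyishi_charset.json is not documented here.

```python
import json


def load_charset(path: str) -> str:
    """Return the "charset" entry as one string, preserving character order."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)["charset"]
    # Assumption: a string or a list of single-character strings
    chars = list(raw)
    if any(not isinstance(c, str) or len(c) != 1 for c in chars):
        raise ValueError("charset entries must be single characters")
    if len(set(chars)) != len(chars):
        raise ValueError("charset contains duplicate characters")
    return "".join(chars)
```

The helper raises instead of silently removing duplicates, since changing the vocabulary length would no longer match the classification head stored in the checkpoint.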
Citation
If you use this model, please cite:
@misc{mwirelabs2026nyishiocr,
title={NyishiOCR: First OCR Model for the Nyishi Language},
author={MWire Labs},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/MWirelabs/nyishi-ocr}
}
About MWire Labs
MWire Labs is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.