Nyishi OCR

OCR model for the Nyishi language of Arunachal Pradesh, India. Developed by MWire Labs as part of the Northeast India OCR initiative.

Model Details

  • Architecture: DocTR ViTSTR-Base (85.3M parameters)
  • Script: Latin
  • Language: Nyishi (Nishi), spoken by ~300,000 people in Arunachal Pradesh
  • License: CC-BY-4.0

Performance

Split        Char Accuracy
Validation   94.60%
Test         95.51%
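
The card does not say how character accuracy is computed. A common convention, assumed here, is 1 minus the character error rate (CER), where CER is the total edit distance divided by the total number of reference characters. The helper names below are illustrative and are not part of the released code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(predictions, references) -> float:
    """Assumed metric: 1 - CER over the whole set, clamped at 0."""
    errors = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    total = sum(len(r) for r in references)
    if total == 0:
        raise ValueError("references contain no characters")
    return max(0.0, 1.0 - errors / total)
```

Under this definition, a score of 95.51% would correspond to roughly 4 to 5 character edits per 100 reference characters.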

Training Data

Fine-tuned on 84k unique images: a mix of synthetic images and 5k curated images. The synthetic images were rendered in 21 fonts and augmented with blur, noise, rotation, and brightness variation.
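
The generation pipeline itself is not published, so the following is only a rough sketch of what rendering plus the four listed augmentations could look like with Pillow and NumPy. The function names, image size, parameter ranges, and the use of Pillow's default font (standing in for the 21 fonts, which the card does not name) are all assumptions.

```python
import random

import numpy as np
from PIL import Image, ImageDraw, ImageEnhance, ImageFilter, ImageFont


def render_word(text, font=None, size=(128, 32)):
    """Render dark text on a white RGB canvas (size is width, height)."""
    font = font or ImageFont.load_default()  # placeholder for a real font file
    img = Image.new("RGB", size, (255, 255, 255))
    ImageDraw.Draw(img).text((4, 8), text, fill=(0, 0, 0), font=font)
    return img


def augment(img, rng):
    """Apply blur, rotation, brightness variation, and Gaussian noise.

    All ranges below are hypothetical, not the values used in training.
    """
    img = img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.0, 1.0)))
    img = img.rotate(rng.uniform(-3.0, 3.0), fillcolor=(255, 255, 255))
    img = ImageEnhance.Brightness(img).enhance(rng.uniform(0.8, 1.2))
    pixels = np.array(img, dtype=np.float32)
    noise_rng = np.random.default_rng(rng.randrange(2**32))
    noisy = pixels + noise_rng.normal(0.0, 8.0, size=pixels.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))
```

Passing a seeded `random.Random` instance as `rng` keeps each augmented sample reproducible.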

Usage

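import json

import torch
from doctr.models import vitstr_base

# Load the character set the model was trained with
with open("nyishi_charset.json", encoding="utf-8") as f:
    charset = json.load(f)["charset"]

# Build the architecture with the Nyishi vocab, then load the fine-tuned weights
model = vitstr_base(pretrained=False, vocab=charset)
model.load_state_dict(torch.load("nyishi_doctr_best.pt", map_location="cpu"))
model.eval()  # switch to inference mode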

Citation

If you use this model, please cite:

@misc{mwirelabs2026nyishiocr,
  title={NyishiOCR: First OCR Model for the Nyishi Language},
  author={MWire Labs},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/MWirelabs/nyishi-ocr}
}

About MWire Labs

MWire Labs is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.
