metadata
language:
  - en
  - njz
tags:
  - ocr
  - doctr
  - vitstr
  - northeast-india
  - arunachal-pradesh
  - nyishi
  - low-resource
  - image-text-to-text
license: cc-by-4.0
library_name: doctr
metrics:
  - cer
model-index:
  - name: NyishiOCR
    results:
      - task:
          type: image-text-to-text
          name: Optical Character Recognition
        metrics:
          - type: char_accuracy
            value: 95.51
            name: Test Char Accuracy
          - type: char_accuracy
            value: 94.6
            name: Val Char Accuracy

Nyishi OCR

OCR model for the Nyishi language of Arunachal Pradesh, India. Developed by MWire Labs as part of the Northeast India OCR initiative.

Model Details

  • Architecture: DocTR ViTSTR-Base (85.3M parameters)
  • Script: Latin
  • Language: Nyishi (Nishi), spoken by ~300,000 people in Arunachal Pradesh
  • License: CC-BY-4.0

Performance

| Split | Char Accuracy |
|---|---|
| Validation | 94.60% |
| Test | 95.51% |
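
This card does not say how character accuracy was computed. One common definition is 100 × (1 − CER), where CER is the total edit distance divided by the total number of reference characters. The sketch below assumes that definition; the function names `edit_distance` and `char_accuracy` are illustrative and not part of this repository.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level character accuracy in percent, assumed to be 100 * (1 - CER)."""
    total_chars = sum(len(r) for r in refs)
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * (1.0 - total_errors / total_chars)
```

Under this definition, one wrong character in a four-character reference gives 75% accuracy. Other conventions, such as averaging per image rather than over the whole corpus, would give slightly different numbers.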

Training Data

Fine-tuned on 84k unique images, a mix of synthetic images and 5k curated images. The synthetic images were rendered in 21 fonts and augmented with blur, noise, rotation, and brightness variation.
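
As a rough illustration of this kind of augmentation, the sketch below applies two of the four listed transforms (brightness variation and additive noise) to a grayscale image stored as nested lists. It is dependency-free for readability and is not the pipeline used to build the training set. The function name `augment` and the parameter ranges are assumptions, and a real pipeline would more likely use an image library and also cover blur and rotation.

```python
import random


def augment(pixels, rng, brightness_range=(0.8, 1.2), noise_std=8.0):
    """Apply a random brightness gain and Gaussian noise to a grayscale image.

    `pixels` is a list of rows of ints in 0..255. The default ranges are
    illustrative assumptions, not the values used for this model.
    """
    gain = rng.uniform(*brightness_range)  # one brightness factor per image
    out = []
    for row in pixels:
        new_row = []
        for p in row:
            value = p * gain + rng.gauss(0.0, noise_std)  # per-pixel noise
            new_row.append(max(0, min(255, round(value))))  # clamp to 0..255
        out.append(new_row)
    return out
```

Passing a seeded `random.Random` instance as `rng` keeps the augmentation reproducible, which helps when regenerating a synthetic dataset.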

Usage

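import json

import torch
from doctr.models import vitstr_base

# Load the Nyishi character set used as the model vocabulary
with open("nyishi_charset.json", encoding="utf-8") as f:
    charset = json.load(f)["charset"]

# Build ViTSTR-Base with the custom vocabulary, then load the fine-tuned weights
model = vitstr_base(pretrained=False, vocab=charset)
model.load_state_dict(torch.load("nyishi_doctr_best.pt", map_location="cpu"))
model.eval()  # inference mode: disables dropout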
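
Once the model is loaded, one way to run it on images is to wrap it in docTR's recognition predictor, which handles resizing, normalization, and decoding. The helper below is a minimal sketch and is not shipped with this repository: the name `recognize_words` is hypothetical, and it assumes the inputs are already cropped to a single word or text line, since ViTSTR is a recognition model and does not locate text on a full page.

```python
def recognize_words(model, image_paths):
    """Run a loaded docTR recognition model on pre-cropped text images.

    Returns a list of (text, confidence) pairs, one per input image.
    """
    # Imported lazily so the helper can be defined without docTR installed.
    from doctr.io import DocumentFile
    from doctr.models import recognition_predictor

    # Passing the model instance as `arch` reuses the fine-tuned weights.
    predictor = recognition_predictor(arch=model, pretrained=False)
    crops = DocumentFile.from_images(list(image_paths))  # list of HxWx3 uint8 arrays
    return predictor(crops)
```

With the `model` from the loading snippet, `recognize_words(model, ["word.png"])` would return one `(text, confidence)` pair for the hypothetical file `word.png`. Full-page documents would need a text detection model in front of this step.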

Citation

If you use this model, please cite:

@misc{mwirelabs2026nyishiocr,
  title={NyishiOCR: First OCR Model for the Nyishi Language},
  author={MWire Labs},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/MWirelabs/nyishi-ocr}
}

About MWire Labs

MWire Labs is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.