---
license: cc-by-4.0
language:
  - asm
  - mni
  - kha
  - lus
  - grt
  - trp
  - njz
  - brx
  - nag
  - eng
  - hin
tags:
  - ocr
  - northeast-india
  - doctr
  - vitstr
  - mizo
  - garo
  - khasi
  - nyishi
  - kokborok
  - nagamese
  - bodo
  - meitei
---

*MWire Labs Logo*

# NE-OCR

**High-Accuracy OCR for Northeast Indian Scripts**

*Technical Report · License · Benchmark*

Purpose-built OCR for Northeast India with 94.99% average character accuracy across 12 language–script pairs.
Outperforms EasyOCR, Tesseract 5, and TrOCR-large on 9 of 12 language–script pairs.
Fast inference and strong performance where general OCR systems fail.

Developed by MWire Labs, Shillong, Meghalaya.

*NE-OCR Architecture Diagram*

NE-OCR is built on a ViTSTR-Base encoder with CTC decoding. The model processes 32×128 RGB word/line crops across Latin, Bengali, Devanagari, and Meitei Mayek scripts, outputting text from a 1,056-character multilingual vocabulary.

## Model Details

- **Architecture:** DocTR ViTSTR-Base (86M parameters)
- **Vocab size:** 1,056 characters (Latin, Bengali, Devanagari, Meitei Mayek)
- **Input:** 32×128 RGB image crops (word/line level, ≤32 chars)
- **Training data:** ~988k deduplicated samples across 12 languages
- **Trained by:** MWire Labs

## Inference Speed

Measured on an NVIDIA A40 (batch size = 1):

*NE-OCR Latency Comparison*

- NE-OCR: 17.2 ms/image
- EasyOCR: 37.2 ms/image
- TrOCR-large: 92.1 ms/image
- Tesseract 5: 166.1 ms/image
- Chandra (VLM): 313 ms/image

NE-OCR is:

- 2× faster than EasyOCR
- 9× faster than Tesseract 5
- 18× faster than VLM-based OCR systems
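Per-image latency figures like these can be reproduced with a simple timing harness. Below is a minimal sketch; the dummy module, warm-up count, and iteration count are illustrative assumptions, not the methodology used for the numbers above:

```python
import time
import torch
import torch.nn as nn

def latency_ms(model: nn.Module, inp: torch.Tensor, warmup: int = 5, iters: int = 20) -> float:
    """Median wall-clock latency (ms) of a single forward pass.

    On GPU, a torch.cuda.synchronize() before each clock read would be
    needed; this CPU sketch omits it.
    """
    model.eval()
    times = []
    with torch.no_grad():
        for _ in range(warmup):   # warm-up runs are excluded from timing
            model(inp)
        for _ in range(iters):
            t0 = time.perf_counter()
            model(inp)
            times.append((time.perf_counter() - t0) * 1000.0)
    times.sort()
    return times[len(times) // 2]  # median is robust to stragglers

# Illustrative stand-in model for a 32x128 RGB crop at batch size 1
dummy = nn.Sequential(nn.Conv2d(3, 8, 3), nn.Flatten(), nn.Linear(8 * 30 * 126, 10))
ms = latency_ms(dummy, torch.rand(1, 3, 32, 128))
print(f"{ms:.2f} ms/image")
```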

## Benchmark Comparison — Character Accuracy (ChA%)

Evaluated on a fixed 26,000-sample benchmark (2,000 per language–script pair).
Higher is better.

| Language | Script | NE-OCR | EasyOCR | Tesseract 5 | TrOCR-large | Chandra |
|---|---|---|---|---|---|---|
| Assamese | Bengali | 97.46% | 32.25% | 8.79% | 0.80% | 57.83% |
| Bodo | Devanagari | 83.38% | 82.65% | 64.85% | 1.85% | 74.76% |
| English | Latin | 90.35% | 68.91% | 50.77% | 88.87% | 91.30% |
| Garo | Latin | 93.52% | 69.43% | 69.90% | 87.83% | 94.15% |
| Hindi | Devanagari | 97.69% | 49.54% | 41.48% | 1.27% | 85.78% |
| Khasi | Latin | 98.85% | 77.78% | 80.72% | 93.22% | 94.15% |
| Kokborok | Latin | 97.59% | 83.00% | 78.76% | 94.58% | 96.19% |
| Meitei (Bengali) | Bengali | 97.09% | 33.64% | 7.30% | 0.55% | 48.34% |
| Meitei (Mayek) | Meitei Mayek | 95.56% | 2.50% | 2.24% | 2.45% | 2.57% |
| Mizo | Latin | 95.96% | 67.62% | 68.44% | 84.58% | 92.96% |
| Nagamese | Latin | 97.91% | 81.60% | 78.05% | 93.46% | 97.60% |
| Nyishi | Latin | 94.50% | 69.56% | 69.92% | 87.23% | 91.85% |
| **Average** | | 94.99% | 59.87% | 51.77% | 53.06% | 77.29% |
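Character accuracy can be read as 1 minus the character error rate. A minimal sketch of that metric, assuming a standard Levenshtein edit distance normalized by reference length (an illustration, not the card's exact scoring script):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def char_accuracy(pred: str, ref: str) -> float:
    """1 - CER, clipped at 0, with the reference length as denominator."""
    if not ref:
        return float(pred == ref)
    return max(0.0, 1.0 - edit_distance(pred, ref) / len(ref))

print(char_accuracy("khasi", "khasi"))  # 1.0
print(char_accuracy("khazi", "khasi"))  # 0.8 (one substitution over 5 chars)
```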

## Benchmark Test Set

A public benchmark test set is available in the `benchmark/` folder of this repository for reproducing evaluation results and comparing against other OCR models.

- **Combined:** `benchmark/ne_ocr_benchmark.parquet` — 26,000 samples across all 12 languages
- **Per-language:** `benchmark/{lang}_test.parquet` — 2,000 samples each
- **Format:** Parquet with columns `image_path`, `text`, `lang`
- **Filter:** all samples ≤32 characters (word/line-level crops)

Results reported in this model card are computed on this exact test set.
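The schema above can be exercised with a tiny synthetic frame before touching the real parquet files. A sketch assuming pandas; the file names and sample texts here are invented, and the real data ships in `benchmark/`:

```python
import pandas as pd

# Synthetic rows with the benchmark schema (image_path, text, lang)
df = pd.DataFrame(
    {
        "image_path": ["crops/0001.jpg", "crops/0002.jpg", "crops/0003.jpg"],
        "text": ["ka jingim", "x" * 40, "zofate"],
        "lang": ["kha", "kha", "lus"],
    }
)

# The published set keeps only word/line crops of at most 32 characters
df = df[df["text"].str.len() <= 32].reset_index(drop=True)

# Per-language counts mirror the 2,000-samples-per-language layout
counts = df.groupby("lang").size().to_dict()
print(counts)  # {'kha': 1, 'lus': 1}
```

The same filter and groupby apply unchanged after `pd.read_parquet("benchmark/ne_ocr_benchmark.parquet")`.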

## Usage

```python
import torch, json
import numpy as np
from PIL import Image
from huggingface_hub import hf_hub_download
from doctr.models import vitstr_base

# Download files
model_path = hf_hub_download(repo_id='MWirelabs/ne-ocr', filename='ne_ocr_best.pt')
vocab_path = hf_hub_download(repo_id='MWirelabs/ne-ocr', filename='ne_ocr_vocab.json')

# Load vocab
with open(vocab_path, encoding='utf-8') as f:
    vocab_data = json.load(f)
vocab_str = ''.join(vocab_data['vocab'][1:])

# Load model
model = vitstr_base(pretrained=False, vocab=vocab_str)
model.load_state_dict(torch.load(model_path, map_location='cpu'))
model.eval()

# Inference (word/line crop, max 32 chars)
img = Image.open('your_crop.jpg').convert('RGB').resize((128, 32))
img_tensor = torch.tensor(np.array(img, dtype=np.float32) / 255.0).permute(2, 0, 1).unsqueeze(0)
with torch.no_grad():
    out = model(img_tensor)
print(out['preds'][0][0])  # first crop's predicted text
```

## Notes

- The model is designed for word/line-level crops (≤32 characters), not full pages.
- For full-page OCR, run a text detection model first (e.g. DBNet) to extract crops.
- Bodo accuracy is lower due to limited training data; an improvement is planned for V2.
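As glue between a detector and the recognizer, detected regions can be cropped out of the page and resized to the model's input shape. A minimal sketch, assuming the detector yields (xmin, ymin, xmax, ymax) boxes in relative [0, 1] coordinates (the helper name and box format are illustrative, not part of this repository):

```python
from PIL import Image

def extract_crops(page: Image.Image, rel_boxes):
    """Cut word/line crops out of a full page for recognition.

    rel_boxes: iterable of (xmin, ymin, xmax, ymax) in relative [0, 1]
    coordinates, as text detectors such as DBNet typically emit.
    """
    w, h = page.size
    crops = []
    for xmin, ymin, xmax, ymax in rel_boxes:
        box = (round(xmin * w), round(ymin * h), round(xmax * w), round(ymax * h))
        # Resize to the 32x128 input the recognition model expects
        crops.append(page.crop(box).convert('RGB').resize((128, 32)))
    return crops

# Example: two detected regions on a synthetic 640x480 page
page = Image.new('RGB', (640, 480), 'white')
crops = extract_crops(page, [(0.1, 0.1, 0.5, 0.2), (0.1, 0.3, 0.9, 0.4)])
print([c.size for c in crops])  # [(128, 32), (128, 32)]
```

Each returned crop can be normalized and stacked into a batch exactly as in the Usage snippet above.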

## License

CC-BY-4.0 — MWire Labs