---
license: cc-by-4.0
language:
- asm
- mni
- kha
- lus
- grt
- trp
- njz
- brx
- nag
- eng
- hin
tags:
- ocr
- northeast-india
- doctr
- vitstr
- mizo
- garo
- khasi
- nyishi
- kokborok
- nagamese
- bodo
- meitei
---
<p align="center">
<img src="https://huggingface.co/MWirelabs/ne-ocr/resolve/main/assets/mwire.png" width="180" alt="MWire Labs Logo">
</p>
# NE-OCR
### High-Accuracy OCR for Northeast Indian Scripts
[![Technical Report](https://img.shields.io/badge/Technical_Report-PDF-blue)](https://mwirelabs.com/wp-content/uploads/2026/03/NE_OCR_Technical_Report.pdf)
[![License](https://img.shields.io/badge/License-CC--BY--4.0-green)](https://creativecommons.org/licenses/by/4.0/)
[![Benchmark](https://img.shields.io/badge/Benchmark-26k_Samples-orange)](#benchmark-test-set)
**Purpose-built OCR for Northeast India with 94.99% average character accuracy across 12 language–script pairs.**
Outperforms EasyOCR, Tesseract 5, and TrOCR-large on all 12 language–script pairs, and leads every baseline (including the Chandra VLM) on 10 of 12.
Fast inference and strong performance where general OCR systems fail.
Developed by **MWire Labs, Shillong, Meghalaya**.
<p align="center">
<img src="https://huggingface.co/MWirelabs/ne-ocr/resolve/main/assets/neocrarchitecture.jpg" width="850" alt="NE-OCR Architecture Diagram">
</p>
NE-OCR is built on a ViTSTR-Base encoder with CTC decoding. The model processes 32×128 RGB word/line crops across Latin, Bengali, Devanagari, and Meitei Mayek scripts, outputting text from a 1,056-character multilingual vocabulary.
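The CTC decoding step can be illustrated with a minimal greedy decoder: take the argmax class at each timestep, collapse consecutive repeats, and drop blanks. This is a simplified sketch, not NE-OCR's exact post-processor; it assumes index 0 is the blank token (consistent with the vocab handling in the Usage section below).

```python
def greedy_ctc_decode(logits, vocab, blank=0):
    """Greedy CTC decoding: argmax per timestep, collapse repeats, drop blanks.

    logits: list of per-timestep score rows; index `blank` (0) is assumed to
    be the blank class, and class i (i >= 1) maps to vocab[i - 1].
    """
    best = [max(range(len(row)), key=row.__getitem__) for row in logits]
    chars, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:   # collapse repeats, skip blanks
            chars.append(vocab[idx - 1])   # shift by 1: index 0 is blank
        prev = idx
    return ''.join(chars)

# Toy example: 6 timesteps over vocab "ab" (plus blank at index 0)
logits = [
    [0.1, 0.8, 0.1],    # 'a'
    [0.1, 0.8, 0.1],    # 'a' (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # 'a' (new emission after blank)
    [0.1, 0.1, 0.8],    # 'b'
    [0.9, 0.05, 0.05],  # blank
]
print(greedy_ctc_decode(logits, 'ab'))  # -> aab
```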
## Model Details
- **Architecture:** DocTR ViTSTR-Base (86M parameters)
- **Vocab size:** 1,056 characters (Latin, Bengali, Devanagari, Meitei Mayek)
- **Input:** 32×128 RGB image crops (word/line level, ≤32 chars)
- **Training data:** ~988k deduplicated samples across 12 languages
- **Trained by:** MWire Labs
## Inference Speed
Measured on NVIDIA A40 (batch size = 1):
<p align="center">
<img src="https://huggingface.co/MWirelabs/ne-ocr/resolve/main/assets/inferenceneocr.png" width="700" alt="NE-OCR Latency Comparison">
</p>
- **NE-OCR:** 17.2 ms/image
- EasyOCR: 37.2 ms/image
- TrOCR-large: 92.1 ms/image
- Tesseract 5: 166.1 ms/image
- Chandra (VLM): 313 ms/image
NE-OCR is:
- 2× faster than EasyOCR
- 9× faster than Tesseract
- 18× faster than VLM-based OCR systems
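Per-image latencies like these are typically measured with a warm-up phase (to amortize caches, allocator, and kernel compilation) followed by averaged timed runs. A minimal sketch of such a harness — the `run_model` callable here is a stand-in for any OCR forward pass, not NE-OCR's actual benchmark script:

```python
import time

def measure_latency_ms(run_model, n_warmup=10, n_runs=100):
    """Average per-call latency in milliseconds, after warm-up runs."""
    for _ in range(n_warmup):      # warm-up: not included in the timing
        run_model()
    start = time.perf_counter()
    for _ in range(n_runs):
        run_model()
    return (time.perf_counter() - start) / n_runs * 1000.0

# Stand-in CPU workload for illustration
latency = measure_latency_ms(lambda: sum(i * i for i in range(10_000)))
print(f"{latency:.2f} ms/image")
```

For GPU models, a device synchronization (e.g. `torch.cuda.synchronize()`) is needed before reading each timestamp, otherwise only kernel launch time is measured.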
## Benchmark Comparison — Character Accuracy (ChA%)
Evaluated on a fixed 26,000-sample benchmark (2,000 per language–script pair).
Higher is better.
| Language | Script | **NE-OCR** | EasyOCR | Tesseract 5 | TrOCR-large | Chandra |
|----------|--------|------------|----------|-------------|-------------|----------|
| Assamese | Bengali | **97.46%** | 32.25% | 8.79% | 0.80% | 57.83% |
| Bodo | Devanagari | **83.38%** | 82.65% | 64.85% | 1.85% | 74.76% |
| English | Latin | 90.35% | 68.91% | 50.77% | 88.87% | **91.30%** |
| Garo | Latin | 93.52% | 69.43% | 69.90% | 87.83% | **94.15%** |
| Hindi | Devanagari | **97.69%** | 49.54% | 41.48% | 1.27% | 85.78% |
| Khasi | Latin | **98.85%** | 77.78% | 80.72% | 93.22% | 94.15% |
| Kokborok | Latin | **97.59%** | 83.00% | 78.76% | 94.58% | 96.19% |
| Meitei (Bengali) | Bengali | **97.09%** | 33.64% | 7.30% | 0.55% | 48.34% |
| Meitei (Mayek) | Meitei Mayek | **95.56%** | 2.50% | 2.24% | 2.45% | 2.57% |
| Mizo | Latin | **95.96%** | 67.62% | 68.44% | 84.58% | 92.96% |
| Nagamese | Latin | **97.91%** | 81.60% | 78.05% | 93.46% | 97.60% |
| Nyishi | Latin | **94.50%** | 69.56% | 69.92% | 87.23% | 91.85% |
| **Average** | — | **94.99%** | 59.87% | 51.77% | 53.06% | 77.29% |
## Benchmark Test Set
A public benchmark test set is available in the `benchmark/` folder of this repository for reproducing evaluation results and comparing against other OCR models.
- **Combined:** `benchmark/ne_ocr_benchmark.parquet` — 26,000 samples across all 12 languages
- **Per-language:** `benchmark/{lang}_test.parquet` — 2,000 samples each
- **Format:** Parquet with columns: `image_path`, `text`, `lang`
- **Filter:** All samples ≤32 characters (word/line-level crops)
Results reported in this model card are computed on this exact test set.
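Character accuracy (ChA%) is commonly computed from the character-level Levenshtein edit distance between prediction and reference. A sketch under that assumption (the technical report may use a slightly different normalization):

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_accuracy(pred, ref):
    """ChA% = 100 * (1 - edit_distance / len(ref)), floored at 0."""
    if not ref:
        return 100.0 if not pred else 0.0
    return max(0.0, 100.0 * (1 - edit_distance(pred, ref) / len(ref)))

print(char_accuracy("Khublei", "Khublei"))  # -> 100.0
print(char_accuracy("Khub1ei", "Khublei"))  # one substitution over 7 chars
```

Averaging `char_accuracy` over each 2,000-sample per-language file reproduces one row of the table above.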
## Usage
````python
import json

import numpy as np
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from doctr.models import vitstr_base

# Download weights and vocab
model_path = hf_hub_download(repo_id='MWirelabs/ne-ocr', filename='ne_ocr_best.pt')
vocab_path = hf_hub_download(repo_id='MWirelabs/ne-ocr', filename='ne_ocr_vocab.json')

# Load vocab (the first entry is a reserved token, e.g. the CTC blank, and is skipped)
with open(vocab_path, encoding='utf-8') as f:
    vocab_data = json.load(f)
vocab_str = ''.join(vocab_data['vocab'][1:])

# Load model
model = vitstr_base(pretrained=False, vocab=vocab_str)
model.load_state_dict(torch.load(model_path, map_location='cpu'))
model.eval()

# Inference on a single word/line crop (≤32 characters)
img = Image.open('your_crop.jpg').convert('RGB').resize((128, 32))
img_tensor = torch.tensor(np.array(img, dtype=np.float32) / 255.0).permute(2, 0, 1).unsqueeze(0)
with torch.no_grad():
    out = model(img_tensor)
print(out['preds'][0][0])  # out['preds'] is a list of (text, confidence) pairs
````
## Notes
- Model is designed for **word/line-level crops** (≤32 characters), not full pages
- For full page OCR, use a text detection model first (e.g. DBNet) to extract crops
- Bodo accuracy is lower due to limited training data; planned improvement in V2
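Once a detector has produced word/line bounding boxes, the crops just need to be cut out and resized to the 32×128 input the model expects. A minimal Pillow sketch — the boxes here are hypothetical detector output in pixel coordinates, not from a real detection pass:

```python
from PIL import Image

def boxes_to_crops(page, boxes):
    """Cut word/line boxes out of a page image and resize each to 128x32
    (width x height), the input size NE-OCR expects.

    boxes: iterable of (left, top, right, bottom) pixel coordinates.
    """
    return [page.crop(box).resize((128, 32)) for box in boxes]

# Hypothetical detector output on a blank test page
page = Image.new('RGB', (800, 600), 'white')
boxes = [(40, 50, 300, 90), (40, 110, 260, 150)]
crops = boxes_to_crops(page, boxes)
print([c.size for c in crops])  # -> [(128, 32), (128, 32)]
```

Each crop can then be fed through the preprocessing and inference steps shown in the Usage section.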
## License
CC-BY-4.0 — MWire Labs