---
language:
- grt
license: cc-by-4.0
tags:
- ocr
- florence-2
- garo
- northeast-india
- image-to-text
base_model: microsoft/Florence-2-base-ft
metrics:
- character_accuracy
model-index:
- name: MWirelabs/garo-ocr
  results:
  - task:
      type: image-to-text
      name: OCR
    metrics:
    - type: character_accuracy
      value: 93.13
      name: Character Accuracy (1000 samples)
---

# GaroOCR

OCR model for the Garo (grt_Latn) language, fine-tuned from `microsoft/Florence-2-base-ft` on Garo text images.

Developed by **MWire Labs**, Shillong, Meghalaya, as part of an ongoing effort to build foundational AI for Northeast Indian languages.

---

## Model Details

| Property | Value |
|---|---|
| Base model | `microsoft/Florence-2-base-ft` |
| Parameters | 231M |
| Language | Garo (Achik) |
| Task | OCR (image → text) |
| Training samples | 80,000 |
| Epochs | 5 |
| Character Accuracy | 93.13% |
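
The character-accuracy figure above was measured on 1,000 held-out samples. The exact metric implementation isn't published with this card; a common definition is 1 − CER, i.e. one minus the Levenshtein edit distance divided by the reference length. A minimal pure-Python sketch, assuming that definition (the example strings are illustrative, not real evaluation data):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def character_accuracy(reference: str, prediction: str) -> float:
    """1 - CER: share of reference characters reproduced correctly, floored at 0."""
    if not reference:
        return float(reference == prediction)
    return max(0.0, 1.0 - levenshtein(reference, prediction) / len(reference))

# Illustrative strings: one substituted character out of seven.
print(round(character_accuracy("nokdang", "nokdong"), 3))  # 0.857
```

Libraries such as Hugging Face `evaluate` (its `cer` metric) compute the same edit-distance-based quantity at scale.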

---
| | |
## Training Setup

- **Hardware:** NVIDIA A40 (48GB)
- **Precision:** bfloat16
- **Batch size:** 4 (effective 16 with gradient accumulation)
- **Learning rate:** 3e-4 with cosine scheduler
- **Max label length:** 128 tokens
- **Task prompt:** `<OCR>` (Florence-2 uppercase token)
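
The batch-size line can be unpacked: a per-device batch of 4 accumulated over 4 micro-batches (the accumulation factor is inferred from the stated effective size of 16, not given explicitly) fixes the optimizer-step count for 80,000 samples over 5 epochs, and a textbook cosine decay, sketched here without the warmup that schedulers typically add, gives the learning-rate curve:

```python
import math

# Effective batch size from gradient accumulation (factor of 4 is inferred).
per_device_batch = 4
grad_accum_steps = 4
effective_batch = per_device_batch * grad_accum_steps   # 16

# Optimizer steps implied by the training setup above.
train_samples, epochs = 80_000, 5
steps_per_epoch = train_samples // effective_batch      # 5,000
total_steps = steps_per_epoch * epochs                  # 25,000

def cosine_lr(step: int, total: int, base_lr: float = 3e-4) -> float:
    """Plain cosine decay from base_lr to 0; a sketch, no warmup phase."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total))

print(effective_batch, total_steps)         # 16 25000
print(cosine_lr(0, total_steps))            # starts at 3e-4
print(cosine_lr(total_steps, total_steps))  # decays to 0.0
```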

---

## Usage

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("MWirelabs/garo-ocr", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "MWirelabs/garo-ocr",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda()

image = Image.open("your_image.png").convert("RGB")
inputs = processor(text="<OCR>", images=image, return_tensors="pt")
inputs = {k: v.cuda() for k, v in inputs.items()}
# Cast pixel values to match the model's bfloat16 weights.
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

with torch.no_grad():
    generated = model.generate(
        pixel_values=inputs["pixel_values"],
        input_ids=inputs["input_ids"],
        max_new_tokens=128,
    )

text = processor.tokenizer.decode(generated[0], skip_special_tokens=True)
print(text)
```

> **Note:** Use `transformers==4.38.2` for compatibility.

---

## Limitations

- Max reliable output length is ~128 tokens
- Part of MWire Labs' monolingual model series; a multilingual NE-OCR model covering more Northeast Indian languages is in development

---