---
language:
  - grt
license: cc-by-4.0
tags:
  - ocr
  - florence-2
  - garo
  - northeast-india
  - image-to-text
base_model: microsoft/Florence-2-base-ft
metrics:
  - character_accuracy
model-index:
  - name: MWirelabs/garo-ocr
    results:
      - task:
          type: image-to-text
          name: OCR
        metrics:
          - type: character_accuracy
            value: 93.13
            name: Character Accuracy (1000 samples)
---

# GaroOCR


OCR model for the Garo (grt_Latn) language, fine-tuned from microsoft/Florence-2-base-ft on Garo text images.

Developed by MWire Labs, Shillong, Meghalaya, as part of an ongoing effort to build foundational AI for Northeast Indian languages.


## Model Details

| Field | Value |
|---|---|
| Base model | `microsoft/Florence-2-base-ft` |
| Parameters | 231M |
| Language | Garo (Achik) |
| Task | OCR (image → text) |
| Training samples | 80,000 |
| Epochs | 5 |
| Character accuracy | 93.13% |
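The character-accuracy figure is not defined precisely in this card. A minimal sketch of one common definition — 1 minus the normalized character-level edit distance — is shown below; the helper names are illustrative and the exact formula behind the 93.13% number may differ:

```python
# Hypothetical sketch: character accuracy as 1 - edit_distance / len(ref).
# The exact metric definition used for the 93.13% figure is not specified
# in this card; this is one common formulation.
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def character_accuracy(pred: str, ref: str) -> float:
    """Accuracy in [0, 1]; clipped so very bad predictions don't go negative."""
    return max(0.0, 1.0 - edit_distance(pred, ref) / max(len(ref), 1))
```

Averaging this score over a held-out set of image/transcription pairs gives a corpus-level figure comparable to the one reported above.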

## Training Setup

- Hardware: NVIDIA A40 (48GB)
- Precision: bfloat16
- Batch size: 4 (effective 16 with gradient accumulation)
- Learning rate: 3e-4 with a cosine scheduler
- Max label length: 128 tokens
- Task prompt: `<OCR>` (Florence-2 uppercase token)
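For illustration, the cosine schedule above can be sketched as a simple function of the step count. Warmup behavior and the minimum learning rate are assumptions here — the card does not specify them:

```python
import math

# Illustrative sketch of a cosine learning-rate schedule decaying from the
# base LR (3e-4, as in the setup above) toward min_lr over total_steps.
# Warmup and min_lr are assumptions; the exact schedule used in training
# is not specified in this card.
def cosine_lr(step: int, total_steps: int,
              base_lr: float = 3e-4, min_lr: float = 0.0) -> float:
    progress = min(step, total_steps) / total_steps  # fraction of training done
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At step 0 this returns the full base LR, halfway through it returns half of it, and at the final step it reaches `min_lr`.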

## Usage

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("MWirelabs/garo-ocr", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "MWirelabs/garo-ocr",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda()

# Prepare the image and the Florence-2 OCR task prompt
image = Image.open("your_image.png").convert("RGB")
inputs = processor(text="<OCR>", images=image, return_tensors="pt")
inputs = {k: v.cuda() for k, v in inputs.items()}
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

with torch.no_grad():
    generated = model.generate(
        pixel_values=inputs["pixel_values"],
        input_ids=inputs["input_ids"],
        max_new_tokens=128,
    )

text = processor.tokenizer.decode(generated[0], skip_special_tokens=True)
print(text)
```

**Note:** Use `transformers==4.38.2` for compatibility.


## Limitations

- Maximum reliable output length is ~128 tokens; longer passages may be truncated.
- Part of MWire Labs' single-language model series; a multilingual NE-OCR model covering more Northeast Indian languages is in development.
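Given the ~128-token limit, one practical workaround is to OCR long pages in horizontal strips and concatenate the results. The helper below is a hypothetical sketch (the function name, overlap handling, and strip sizing are not part of the model or this card):

```python
# Hypothetical helper: compute (top, bottom) crop bounds that cover an image
# of the given pixel height in horizontal strips, optionally overlapping so
# that text lines cut at a strip boundary appear in both strips.
def strip_boxes(height: int, strip_height: int, overlap: int = 0):
    boxes, top = [], 0
    while top < height:
        bottom = min(top + strip_height, height)
        boxes.append((top, bottom))
        if bottom == height:
            break
        top = bottom - overlap  # back up so boundary lines are not lost
    return boxes
```

Each `(top, bottom)` pair can then be fed to `image.crop((0, top, width, bottom))` and run through the usage snippet above, one strip at a time.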