---
language:
- grt
license: cc-by-4.0
tags:
- ocr
- florence-2
- garo
- northeast-india
- image-to-text
base_model: microsoft/Florence-2-base-ft
metrics:
- character_accuracy
model-index:
- name: MWirelabs/garo-ocr
  results:
  - task:
      type: image-to-text
      name: OCR
    metrics:
    - type: character_accuracy
      value: 93.13
      name: Character Accuracy (1000 samples)
---

# GaroOCR

OCR model for the Garo (grt_Latn) language, fine-tuned from `microsoft/Florence-2-base-ft` on Garo text images.

Developed by **MWire Labs**, Shillong, Meghalaya, as part of an ongoing effort to build foundational AI for Northeast Indian languages.

---

## Model Details

| Property | Value |
|---|---|
| Base model | `microsoft/Florence-2-base-ft` |
| Parameters | 231M |
| Language | Garo (Achik) |
| Task | OCR (image → text) |
| Training samples | 80,000 |
| Epochs | 5 |
| Character Accuracy | 93.13% |
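
The character-accuracy figure above was measured on 1,000 held-out samples. The exact metric implementation isn't published with this card; a common definition is 1 − CER, i.e. one minus the Levenshtein edit distance divided by the reference length. A minimal pure-Python sketch, assuming that definition (the example strings are illustrative, not real evaluation data):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def character_accuracy(reference: str, prediction: str) -> float:
    """1 - CER: share of reference characters reproduced correctly, floored at 0."""
    if not reference:
        return float(reference == prediction)
    return max(0.0, 1.0 - levenshtein(reference, prediction) / len(reference))

# Illustrative strings: one substituted character out of seven.
print(round(character_accuracy("nokdang", "nokdong"), 3))  # 0.857
```

Libraries such as Hugging Face `evaluate` (its `cer` metric) compute the same edit-distance-based quantity at scale.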

---
| | |
## Training Setup

- **Hardware:** NVIDIA A40 (48GB)
- **Precision:** bfloat16
- **Batch size:** 4 (effective 16 with gradient accumulation)
- **Learning rate:** 3e-4 with cosine scheduler
- **Max label length:** 128 tokens
- **Task prompt:** `<OCR>` (Florence-2 uppercase token)
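
The batch-size line can be unpacked: a per-device batch of 4 accumulated over 4 micro-batches (the accumulation factor is inferred from the stated effective size of 16, not given explicitly) fixes the optimizer-step count for 80,000 samples over 5 epochs, and a textbook cosine decay, sketched here without the warmup that schedulers typically add, gives the learning-rate curve:

```python
import math

# Effective batch size from gradient accumulation (factor of 4 is inferred).
per_device_batch = 4
grad_accum_steps = 4
effective_batch = per_device_batch * grad_accum_steps   # 16

# Optimizer steps implied by the training setup above.
train_samples, epochs = 80_000, 5
steps_per_epoch = train_samples // effective_batch      # 5,000
total_steps = steps_per_epoch * epochs                  # 25,000

def cosine_lr(step: int, total: int, base_lr: float = 3e-4) -> float:
    """Plain cosine decay from base_lr to 0; a sketch, no warmup phase."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total))

print(effective_batch, total_steps)         # 16 25000
print(cosine_lr(0, total_steps))            # starts at 3e-4
print(cosine_lr(total_steps, total_steps))  # decays to 0.0
```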

---

## Usage

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("MWirelabs/garo-ocr", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "MWirelabs/garo-ocr",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda()

image = Image.open("your_image.png").convert("RGB")
inputs = processor(text="<OCR>", images=image, return_tensors="pt")
inputs = {k: v.cuda() for k, v in inputs.items()}
# Cast pixel values to match the model's bfloat16 weights.
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

with torch.no_grad():
    generated = model.generate(
        pixel_values=inputs["pixel_values"],
        input_ids=inputs["input_ids"],
        max_new_tokens=128,
    )

text = processor.tokenizer.decode(generated[0], skip_special_tokens=True)
print(text)
```

> **Note:** Use `transformers==4.38.2` for compatibility.

---

## Limitations

- Max reliable output length is ~128 tokens
- Part of MWire Labs' monolingual model series; a multilingual NE-OCR model covering more Northeast Indian languages is in development

---