---
language: dv
license: apache-2.0
tags:
- ocr
- trocr
- dhivehi
- maldives
- thaana
pipeline_tag: image-to-text
base_model: microsoft/trocr-base-handwritten
datasets:
- alakxender/dhivehi-image-text
- alakxender/dhivehi-vrd-batch-1-img-questions
---
# Dhivehi TrOCR Base V6
A fine-tuned [TrOCR](https://huggingface.co/microsoft/trocr-base-handwritten) model for recognizing Dhivehi (Maldivian) text written in Thaana script.
## Model Details
- **Base model:** microsoft/trocr-base-handwritten
- **Parameters:** ~334M
- **Training data:** ~695K samples (315K dhivehi-image-text + 380K dhivehi-vrd)
- **Best CER:** 0.9% (checkpoint-20000)
- **Tokenizer:** character-level WordLevel tokenizer with an EOS token
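For readers who want to reproduce a CER figure on their own data, the following is a minimal sketch of character error rate as edit distance divided by reference length. The card does not state how the reported CER was computed, so the function names `levenshtein` and `cer` and the corpus-level averaging are assumptions for illustration. Note that Thaana vowel signs are separate combining code points, so each one counts as its own character here.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level CER: total edits over total reference characters (assumed convention)."""
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0
```

Under this convention, a CER of 0.9% corresponds to roughly one character error per 111 reference characters.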
## Usage
```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel, PreTrainedTokenizerFast
from PIL import Image
import torch

# Load the image processor, model, and character-level tokenizer from the same repo
processor = TrOCRProcessor.from_pretrained("Serialtechlab/dhivehi-trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("Serialtechlab/dhivehi-trocr-base-handwritten")
tokenizer = PreTrainedTokenizerFast.from_pretrained("Serialtechlab/dhivehi-trocr-base-handwritten")

# The model expects an RGB image of a single text line
image = Image.open("dhivehi_text.png").convert("RGB")
pixel_values = processor(image, return_tensors='pt').pixel_values

# Beam-search decoding without gradient tracking
with torch.no_grad():
    generated_ids = model.generate(pixel_values, max_length=128, num_beams=4)

# Join the character tokens directly, dropping special tokens
tokens = tokenizer.convert_ids_to_tokens(generated_ids[0].tolist())
special = [tokenizer.pad_token, tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token]
text = "".join(t for t in tokens if t not in special)
print(text)
```
## Training
Fine-tuned from microsoft/trocr-base-handwritten on Google Colab (A100) for 6 epochs with:
- Learning rate: 4e-5
- Batch size: 16
- EOS token appended to all labels
- PAD positions in the labels masked with -100 so the loss ignores them
- Character-level WordLevel tokenizer
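The label handling listed above can be sketched as follows. This is an illustration under stated assumptions, not the training script itself: the function name `encode_label`, the toy vocabulary, and the special-token ids are all hypothetical. It shows the two points from the list: EOS is appended to every label, and padded positions are set to -100, the default ignore index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def encode_label(text: str, vocab: dict[str, int], eos_id: int,
                 unk_id: int, max_length: int) -> list[int]:
    """Map characters to ids, append EOS, and fill the rest with IGNORE_INDEX."""
    # Reserve one slot so EOS always fits, even for over-long text
    ids = [vocab.get(ch, unk_id) for ch in text][: max_length - 1]
    ids.append(eos_id)
    # Padded positions carry IGNORE_INDEX instead of the PAD id,
    # so they contribute nothing to the loss
    return ids + [IGNORE_INDEX] * (max_length - len(ids))


# Hypothetical toy vocabulary; the real ids come from the model's tokenizer
toy_vocab = {"a": 4, "b": 5}
labels = encode_label("ab", toy_vocab, eos_id=2, unk_id=3, max_length=5)
print(labels)  # [4, 5, 2, -100, -100]
```

Keeping EOS inside the unmasked part of the label is what lets the model learn where a line ends.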
## Limitations
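- Optimized for single-line text images (use a text detector such as Surya to split full pages into lines first)
- May truncate very long lines (generation is capped at max_length=128 tokens, roughly 128 characters with the character-level tokenizer)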
- Best results on printed Dhivehi text; handwritten accuracy varies by style
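One way to notice the truncation case is to check whether a generated sequence filled the length budget without ever emitting EOS. The helper below is a hypothetical sketch (the name `hit_length_limit` is not part of any library) and assumes a single, unpadded sequence of token ids.

```python
def hit_length_limit(token_ids: list[int], eos_id: int, max_length: int) -> bool:
    """True if the sequence used the whole length budget and never produced EOS."""
    return len(token_ids) >= max_length and eos_id not in token_ids


# Toy ids for illustration: 2 plays the role of EOS
print(hit_length_limit([1, 5, 6, 7], eos_id=2, max_length=4))  # True
print(hit_length_limit([1, 5, 2], eos_id=2, max_length=4))     # False
```

With the usage snippet above, this could be called as `hit_length_limit(generated_ids[0].tolist(), tokenizer.eos_token_id, 128)`. When it returns true, the line was probably cut off, and splitting the image into shorter segments is worth trying.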