Serialtechlab
/

dhivehi-trocr-base-handwritten

vision-encoder-decoder

Model card Files Files and versions

dhivehi-trocr-base-handwritten / README.md

Serialtechlab's picture

Upload tokenizer

b861262 verified 8 days ago

|

history blame contribute delete

2.06 kB

	---
	language: dv
	license: apache-2.0
	tags:
	- ocr
	- trocr
	- dhivehi
	- maldives
	- thaana
	pipeline_tag: image-to-text
	base_model: microsoft/trocr-base-handwritten
	datasets:
	- alakxender/dhivehi-image-text
	- alakxender/dhivehi-vrd-batch-1-img-questions
	---

	# Dhivehi TrOCR Base V6

	A fine-tuned [TrOCR](https://huggingface.co/microsoft/trocr-base-handwritten) model for Dhivehi (Maldivian) text recognition using Thaana script.

	## Model Details

	- Base model: microsoft/trocr-base-handwritten
	- Parameters: ~334M
	- Training data: ~695K samples (315K dhivehi-image-text + 380K dhivehi-vrd)
	- Best CER: 0.9% (checkpoint-20000)
	- Character tokenizer: WordLevel (character-level) with EOS

	## Usage

	```python
	from transformers import TrOCRProcessor, VisionEncoderDecoderModel, PreTrainedTokenizerFast
	from PIL import Image
	import torch

	processor = TrOCRProcessor.from_pretrained("Serialtechlab/dhivehi-trocr-base-handwritten")
	model = VisionEncoderDecoderModel.from_pretrained("Serialtechlab/dhivehi-trocr-base-handwritten")
	tokenizer = PreTrainedTokenizerFast.from_pretrained("Serialtechlab/dhivehi-trocr-base-handwritten")

	image = Image.open("dhivehi_text.png").convert("RGB")
	pixel_values = processor(image, return_tensors='pt').pixel_values

	with torch.no_grad():
	generated_ids = model.generate(pixel_values, max_length=128, num_beams=4)

	tokens = tokenizer.convert_ids_to_tokens(generated_ids[0])
	special = [tokenizer.pad_token, tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token]
	text = "".join([t for t in tokens if t not in special])
	print(text)
	```

	## Training

	Trained from scratch on Google Colab (A100) for 6 epochs with:

	- Learning rate: 4e-5
	- Batch size: 16
	- EOS token appended to all labels
	- Proper PAD token masking (-100)
	- Character-level WordLevel tokenizer

	## Limitations

	- Optimized for single text line images (use a text detector like Surya for full pages)
	- May truncate very long lines (max_length=128 characters)
	- Best results on printed Dhivehi text; handwritten accuracy varies by style