# TrOCR Fine-tuned for Typewritten Text
A fine-tuned version of microsoft/trocr-base-printed optimized for typewritten historical documents, specifically U.S. County deed record indices.
## Model Description
- Base Model: microsoft/trocr-base-printed
- License: MIT
- Task: Optical Character Recognition (OCR) for typewritten text
This model was fine-tuned on 8,100 manually annotated images of typewritten index entries from historical deed records. It achieves near-perfect accuracy on this domain, with an exact match rate of 99.88%.
## Intended Use
Primary use case: OCR for typewritten documents, particularly historical records with similar formatting to mid-20th century U.S. county indices.
Limitations: This model was trained on a specific document style. Performance on handwritten text, modern fonts, or significantly different layouts has not been evaluated and may be poor.
## Usage

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("m-mjm/trocr-finetuned-typewritten")
model = VisionEncoderDecoderModel.from_pretrained("m-mjm/trocr-finetuned-typewritten")

# Load image (ensure this is a crop of a text line, not a full page)
image = Image.open("your_image.ext").convert("RGB")

pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, num_beams=4)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
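Since the model expects single text-line crops rather than full pages, pages must be segmented into lines first. A minimal sketch of computing crop boxes from known line positions (the helper name, coordinates, and padding here are illustrative, not part of this repository):

```python
def line_crop_boxes(page_width, line_tops, line_height, pad=4):
    """Return (left, top, right, bottom) crop boxes, one per text line.

    line_tops are the y-coordinates of each line's top edge; pad adds a
    small vertical margin so ascenders and descenders are not clipped.
    """
    boxes = []
    for top in line_tops:
        boxes.append((0, max(0, top - pad), page_width, top + line_height + pad))
    return boxes

# Hypothetical example: three index lines on a 1200 px-wide scan,
# each roughly 40 px tall, starting at these vertical offsets.
boxes = line_crop_boxes(1200, [100, 150, 200], 40)
```

Each box can then be passed to `Image.crop(box)` before running the processor, so that every inference call sees exactly one line of text.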
## Training Details

### Data Preparation
The training dataset was created from high-resolution scans of typewritten deed record indices. Each entry was segmented and manually transcribed to create ground truth labels.
To prevent overfitting, two augmentation strategies were applied:
- Column reordering: Text segments were cropped and recombined in different arrangements
- Noise augmentation: Applied the same noise transformations used in the original TrOCR training (see noise.py in the TrOCR repository)
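To illustrate the flavor of the second strategy, here is a minimal salt-and-pepper noise sketch. This is a hypothetical stand-in for documentation purposes, not the actual transformations from the TrOCR repository's noise.py:

```python
import random

def salt_and_pepper(pixels, amount=0.02, rng=None):
    """Flip a fraction of grayscale pixels to black (0) or white (255).

    pixels: flat list of 0-255 grayscale values. Returns a new list;
    the input is left unchanged. A seeded RNG keeps runs reproducible.
    """
    rng = rng or random.Random(0)
    out = list(pixels)
    n = max(1, int(len(out) * amount))
    for idx in rng.sample(range(len(out)), n):
        out[idx] = rng.choice((0, 255))
    return out

# Corrupt 5% of a uniform gray strip (hypothetical 1000-pixel line crop).
noisy = salt_and_pepper([128] * 1000, amount=0.05)
```

Augmentations like this mimic scanner speckle so the model does not overfit to the clean training scans.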
### Hyperparameters
| Parameter | Value |
|---|---|
| Batch size (per device) | 4 |
| Gradient accumulation steps | 8 |
| Effective batch size | 32 |
| Learning rate | 5e-6 |
| Weight decay | 0.03 |
| Epochs | 10 |
| Warmup steps | 200 |
| Precision | FP16 |
| Generation beams | 4 |
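The table above maps onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a hedged sketch: the argument names come from the `transformers` API, but the output directory is hypothetical and the original training script may have configured things differently. Note that the effective batch size of 32 is the product of the per-device batch size (4) and gradient accumulation steps (8).

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the training configuration from the table.
training_args = Seq2SeqTrainingArguments(
    output_dir="./trocr-finetuned-typewritten",  # hypothetical path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size: 4 * 8 = 32
    learning_rate=5e-6,
    weight_decay=0.03,
    num_train_epochs=10,
    warmup_steps=200,
    fp16=True,
    predict_with_generate=True,
    generation_num_beams=4,
)
```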
### Dataset Split
- Training samples: 7,315
- Evaluation samples: 813
## Performance
| Metric | Base Model | Fine-tuned | Relative Improvement |
|---|---|---|---|
| CER | 0.0368 | 0.00004 | 99.9% |
| WER | 0.1371 | 0.0002 | 99.9% |
| Exact Match | — | 0.9988 | — |
Exact match was used as the primary training metric. CER and WER can look excellent in aggregate while individual entries still contain errors that break structured data extraction, so exact match provided a more reliable signal during training.
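The exact-match criterion itself is simple to state in code. A minimal illustration (the example strings below are hypothetical deed-index entries, not real evaluation data):

```python
def exact_match(predictions, references):
    """Fraction of predictions identical to their reference transcription."""
    assert len(predictions) == len(references)
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)

# A single-character error tanks exact match even though CER barely moves:
preds = ["Smith, John  Deed  1947", "Smith, Jane  Deed  1948"]
refs  = ["Smith, John  Deed  1947", "Smith, Jane  Deed  1949"]
score = exact_match(preds, refs)  # 0.5: the second entry is off by one digit
```

This all-or-nothing behavior is exactly why it is the stricter signal for structured records, where a single wrong digit in a year or book number makes the whole entry unusable.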
## Repository
Training notebook and data preparation scripts: GitHub
## Citation
If you use this model, please cite the original TrOCR paper:
```bibtex
@misc{li2021trocr,
      title={TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models},
      author={Minghao Li and Tengchao Lv and Lei Cui and Yijuan Lu and Dinei Florencio and Cha Zhang and Zhoujun Li and Furu Wei},
      year={2021},
      eprint={2109.10282},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```