# TrOCR Fine-tuned for Typewritten Text
A fine-tuned version of microsoft/trocr-base-printed optimized for typewritten historical documents, specifically U.S. County deed record indices.
## Model Description
- Base Model: microsoft/trocr-base-printed
- License: MIT
- Task: Optical Character Recognition (OCR) for typewritten text
This model was fine-tuned on 8,100 manually annotated images of typewritten index entries from historical deed records. It achieves near-perfect accuracy on this domain, with an exact match rate of 99.88%.
## Intended Use
Primary use case: OCR for typewritten documents, particularly historical records with similar formatting to mid-20th century U.S. county indices.
Limitations: This model was trained on a specific document style. Performance on handwritten text, modern fonts, or significantly different layouts has not been evaluated and may be poor.
## Usage

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("m-mjm/trocr-finetuned-typewritten")
model = VisionEncoderDecoderModel.from_pretrained("m-mjm/trocr-finetuned-typewritten")

# Load image (ensure this is a crop of a text line, not a full page)
image = Image.open("your_image.ext").convert("RGB")

pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, num_beams=4)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
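Since the model expects single text-line crops rather than full pages, pages must be segmented into lines first. A minimal sketch of computing crop boxes from known line positions (the helper name, coordinates, and padding here are illustrative, not part of this repository):

```python
def line_crop_boxes(page_width, line_tops, line_height, pad=4):
    """Return (left, top, right, bottom) crop boxes, one per text line.

    line_tops are the y-coordinates of each line's top edge; pad adds a
    small vertical margin so ascenders and descenders are not clipped.
    """
    boxes = []
    for top in line_tops:
        boxes.append((0, max(0, top - pad), page_width, top + line_height + pad))
    return boxes

# Hypothetical example: three index lines on a 1200 px-wide scan,
# each roughly 40 px tall, starting at these vertical offsets.
boxes = line_crop_boxes(1200, [100, 150, 200], 40)
```

Each box can then be passed to `Image.crop(box)` before running the processor, so that every inference call sees exactly one line of text.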
## Training Details

### Data Preparation
The training dataset was created from high-resolution scans of typewritten deed record indices. Each entry was segmented and manually transcribed to create ground truth labels.
To prevent overfitting, two augmentation strategies were applied:
- Column reordering: Text segments were cropped and recombined in different arrangements
- Noise augmentation: Applied the same noise transformations used in the original TrOCR training (see noise.py in the TrOCR repository)
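To illustrate the flavor of the second strategy, here is a minimal salt-and-pepper noise sketch. This is a hypothetical stand-in for documentation purposes, not the actual transformations from the TrOCR repository's noise.py:

```python
import random

def salt_and_pepper(pixels, amount=0.02, rng=None):
    """Flip a fraction of grayscale pixels to black (0) or white (255).

    pixels: flat list of 0-255 grayscale values. Returns a new list;
    the input is left unchanged. A seeded RNG keeps runs reproducible.
    """
    rng = rng or random.Random(0)
    out = list(pixels)
    n = max(1, int(len(out) * amount))
    for idx in rng.sample(range(len(out)), n):
        out[idx] = rng.choice((0, 255))
    return out

# Corrupt 5% of a uniform gray strip (hypothetical 1000-pixel line crop).
noisy = salt_and_pepper([128] * 1000, amount=0.05)
```

Augmentations like this mimic scanner speckle so the model does not overfit to the clean training scans.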
### Hyperparameters
| Parameter | Value |
|---|---|
| Batch size (per device) | 4 |
| Gradient accumulation steps | 8 |
| Effective batch size | 32 |
| Learning rate | 5e-6 |
| Weight decay | 0.03 |
| Epochs | 10 |
| Warmup steps | 200 |
| Precision | FP16 |
| Generation beams | 4 |
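The table above maps onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a hedged sketch: the argument names come from the `transformers` API, but the output directory is hypothetical and the original training script may have configured things differently. Note that the effective batch size of 32 is the product of the per-device batch size (4) and gradient accumulation steps (8).

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the training configuration from the table.
training_args = Seq2SeqTrainingArguments(
    output_dir="./trocr-finetuned-typewritten",  # hypothetical path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size: 4 * 8 = 32
    learning_rate=5e-6,
    weight_decay=0.03,
    num_train_epochs=10,
    warmup_steps=200,
    fp16=True,
    predict_with_generate=True,
    generation_num_beams=4,
)
```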
### Dataset Split
- Training samples: 7,315
- Evaluation samples: 813
## Performance
| Metric | Base Model | Fine-tuned | Relative Improvement |
|---|---|---|---|
| CER | 0.0368 | 0.00004 | 99.9% |
| WER | 0.1371 | 0.0002 | 99.9% |
| Exact Match | — | 0.9988 | — |
Exact match was used as the primary training metric. CER and WER can look excellent in aggregate while individual entries still contain errors that break structured data extraction, so exact match provided a more reliable signal during training.
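The exact-match criterion itself is simple to state in code. A minimal illustration (the example strings below are hypothetical deed-index entries, not real evaluation data):

```python
def exact_match(predictions, references):
    """Fraction of predictions identical to their reference transcription."""
    assert len(predictions) == len(references)
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)

# A single-character error tanks exact match even though CER barely moves:
preds = ["Smith, John  Deed  1947", "Smith, Jane  Deed  1948"]
refs  = ["Smith, John  Deed  1947", "Smith, Jane  Deed  1949"]
score = exact_match(preds, refs)  # 0.5: the second entry is off by one digit
```

This all-or-nothing behavior is exactly why it is the stricter signal for structured records, where a single wrong digit in a year or book number makes the whole entry unusable.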
## Repository
Training notebook and data preparation scripts: GitHub
## Citation
If you use this model, please cite the original TrOCR paper:
```bibtex
@misc{li2021trocr,
      title={TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models},
      author={Minghao Li and Tengchao Lv and Lei Cui and Yijuan Lu and Dinei Florencio and Cha Zhang and Zhoujun Li and Furu Wei},
      year={2021},
      eprint={2109.10282},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```