TrOCR Fine-tuned for Typewritten Text

A fine-tuned version of microsoft/trocr-base-printed optimized for typewritten historical documents, specifically U.S. County deed record indices.

Model Description

  • Base Model: microsoft/trocr-base-printed
  • License: MIT
  • Task: Optical Character Recognition (OCR) for typewritten text

This model was fine-tuned on 8,100 manually annotated images of typewritten index entries from historical deed records. It achieves near-perfect accuracy on this domain, with an exact match rate of 99.88%.

Intended Use

Primary use case: OCR for typewritten documents, particularly historical records with similar formatting to mid-20th century U.S. county indices.

Limitations: This model was trained on a specific document style. Performance on handwritten text, modern fonts, or significantly different layouts has not been evaluated and may be poor.

Segment Example

[Image: example crop of a typewritten index line segment]

Usage

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("m-mjm/trocr-finetuned-typewritten")
model = VisionEncoderDecoderModel.from_pretrained("m-mjm/trocr-finetuned-typewritten")

# Load image (ensure this is a crop of a text line, not a full page)
image = Image.open("your_image.ext").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, num_beams=4)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(generated_text)

Training Details

Data Preparation

The training dataset was created from high-resolution scans of typewritten deed record indices. Each entry was segmented and manually transcribed to create ground truth labels.

To prevent overfitting, two augmentation strategies were applied:

  1. Column reordering: Text segments were cropped and recombined in different arrangements
  2. Noise augmentation: Applied the same noise transformations used in the original TrOCR training (see noise.py in the TrOCR repository)
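The column-reordering idea above can be sketched with Pillow. The helper name, image size, and column boundaries here are illustrative, not taken from the training scripts; in practice the corresponding transcription segments would be reordered together with the image crops so the labels stay aligned.

```python
# Sketch of the "column reordering" augmentation: crop a line image at
# assumed column boundaries and recombine the crops in a shuffled order.
import random
from PIL import Image

def reorder_columns(line_img, boundaries, rng=random):
    """Split the line at the given x-boundaries and paste the resulting
    column crops back together in a shuffled order."""
    xs = [0, *boundaries, line_img.width]
    cols = [line_img.crop((xs[i], 0, xs[i + 1], line_img.height))
            for i in range(len(xs) - 1)]
    rng.shuffle(cols)
    out = Image.new("RGB", (line_img.width, line_img.height), "white")
    x = 0
    for col in cols:
        out.paste(col, (x, 0))
        x += col.width
    return out

# Illustrative call with a blank 300x32 strip and two column boundaries.
strip = Image.new("RGB", (300, 32), "white")
augmented = reorder_columns(strip, boundaries=[100, 220])
```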

Hyperparameters

| Parameter | Value |
|---|---|
| Batch size (per device) | 4 |
| Gradient accumulation steps | 8 |
| Effective batch size | 32 |
| Learning rate | 5e-6 |
| Weight decay | 0.03 |
| Epochs | 10 |
| Warmup steps | 200 |
| Precision | FP16 |
| Generation beams | 4 |
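As a rough guide, these hyperparameters map onto transformers' `Seq2SeqTrainingArguments` as sketched below; the `output_dir` value is an assumption, and the actual training scripts may differ in details.

```python
# Sketch: the hyperparameter table expressed as Seq2SeqTrainingArguments.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="trocr-finetuned-typewritten",  # assumed name
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size: 4 * 8 = 32
    learning_rate=5e-6,
    weight_decay=0.03,
    num_train_epochs=10,
    warmup_steps=200,
    fp16=True,
    predict_with_generate=True,
    generation_num_beams=4,
)
```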

Dataset Split

  • Training samples: 7,315
  • Evaluation samples: 813

Performance

| Metric | Base Model | Fine-tuned | Improvement |
|---|---|---|---|
| CER | 0.0368 | 0.00004 | 99.9% |
| WER | 0.1371 | 0.0002 | 99.9% |
| Exact Match | — | 0.9988 | — |

Exact match was used as the primary training metric: CER and WER can appear low even when a transcription still contains errors that make it unusable for structured data extraction, so exact match provided a more reliable signal during training.
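For reference, the metrics discussed above can be computed with a few lines of plain Python: CER as Levenshtein distance normalized by reference length, and exact match as the fraction of perfectly transcribed entries. The example strings are hypothetical deed-index lines, not taken from the dataset.

```python
# Character error rate via Levenshtein distance, plus exact match.
def levenshtein(a, b):
    """Edit distance between strings a and b (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    return levenshtein(ref, hyp) / max(len(ref), 1)

def exact_match(refs, hyps):
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)

refs = ["DOE JOHN 1947 BK 12 PG 304", "SMITH MARY 1951 BK 14 PG 88"]
hyps = ["DOE JOHN 1947 BK 12 PG 304", "SMITH MARY 1951 BK 14 PG 86"]
print(cer(refs[1], hyps[1]))    # one wrong character out of 27, ~0.037
print(exact_match(refs, hyps))  # 0.5
```

This illustrates the point made above: a single wrong digit yields a CER under 4%, yet the entry is useless for index lookup, which is exactly what exact match penalizes.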

Repository

Training notebook and data preparation scripts: GitHub

Citation

If you use this model, please cite the original TrOCR paper:

@misc{li2021trocr,
  title={TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models},
  author={Minghao Li and Tengchao Lv and Lei Cui and Yijuan Lu and Dinei Florencio and Cha Zhang and Zhoujun Li and Furu Wei},
  year={2021},
  eprint={2109.10282},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}