Model Card for pabloOmega/donut_hw_entities_v2

In this work, we study and train a model based on the Encoder-Decoder architecture (Donut) to generate structured text from handwritten text images. This approach extends beyond conventional Optical Character Recognition (OCR) by enabling the model to not only transcribe the text but also identify structural elements such as titles, subtitles, paragraphs, and equations. To facilitate training, a synthetic image generator was developed, capable of accurately simulating handwritten text using datasets such as IAM and CROHME. Furthermore, data augmentation techniques—including random rotation, Gaussian noise, and brightness and contrast adjustments—were employed to enhance the model’s generalization capabilities. As a result, the model outperformed alternatives such as Nougat in terms of Levenshtein distance, demonstrating improved transcription quality and structural understanding.

Model Details

Model Description

The most recent OCR models are based on complex architectures capable of generating text from images using an encoder-decoder framework, which leverages Transformers to process both image and text embeddings, thereby capturing intricate relationships \cite{Donut, layoutlmv3, nougat, swinVILT}. Embeddings are data vectors that typically represent words or tokens and can be processed by the Transformer, enhancing its ability to capture semantic relationships \cite{Pytorch2024-Embeddings}. To do the same with images, the input is divided into a grid of patches, treating each patch as if it were a word. Each patch is then flattened into a one-dimensional vector \cite{vit}. This approach enables a more effective representation and learning process over images by exploiting the attention mechanisms within a Transformer \cite{Transformers}.
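The patch-based treatment of images described above can be sketched in a few lines. The sizes below are illustrative (the standard ViT configuration), not those of the actual model:

```python
import torch

# Split a 224x224 RGB image into 16x16 patches and flatten each one
# into a vector, so the Transformer can treat patches like word tokens.
image = torch.randn(1, 3, 224, 224)          # (batch, channels, H, W)
patch = 16
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)
print(patches.shape)  # torch.Size([1, 196, 768]): 196 "visual words", each a 768-dim vector
```

In a real ViT, each flattened patch is additionally projected through a learned linear layer and combined with a positional embedding before entering the Transformer.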

Despite the existence of OCR models with high accuracy, the transcription of handwritten text and the detection of entities such as titles, subtitles, formatting, handwritten equations, among others, remain a significant challenge for current models, including both commercial and open-source solutions \cite{TrOCR, googleOCR}. The motivation behind this study lies in the absence of a sufficiently reliable comprehensive solution that, through a single computer vision model equipped with the cutting-edge Transformer architecture \cite{Transformers}, is capable of recognizing the majority of entities within a handwritten document. Such a solution would greatly enhance and streamline the digitization of these documents. This study also aims to pave the way for future research to analyze the behavior of advanced computer vision models and develop novel solutions through Computing and Mathematics.

This work proposes the fine-tuning and evaluation of a multimodal image-to-text model based on the Transformer architecture, designed for the recognition and transcription of handwritten text into LaTeX format. The model should be capable of identifying document layout components, including titles, subtitles, paragraphs, tables, equations, and the textual content itself. Equations will also be processed and converted into LaTeX format. Given the clear need for data to train the model, the study considers the use of a synthetic image generator that simulates handwritten text with sufficient variability and generality, while also including entities such as equations and tables.

  • Developed by: Pablo Navarrete
  • Model type: Image-to-Text
  • Language(s) (NLP): English
  • License: Creative Commons Attribution 4.0 (you may use it for commercial purposes, but you must give credit to the author)
  • Finetuned from model: Donut Base

Model Sources

  • Repository: [More Information Needed]
  • Paper [optional]: [More Information Needed]
  • Demo [optional]: [More Information Needed]

Uses

Direct Use

This model is designed to generate structured text (e.g., paragraphs, titles, subtitles, and equations) from handwritten text images. It leverages the Donut architecture (Swin Transformer encoder + BART decoder) for document understanding and transcription. Its direct applications include:

  • Digitization of handwritten academic documents.
  • Conversion of handwritten notes to LaTeX-style structured text.
  • Extraction of semantic content from images of handwritten material for accessibility and storage.

Intended users are researchers and developers in the fields of computer vision and natural language processing, especially those interested in Optical Character Recognition (OCR) and document structure analysis.

Downstream Use

When fine-tuned or integrated into larger systems, the model can be used for:

  • Smart archival tools for handwritten historical or educational materials.
  • Educational apps that convert handwritten math/science content into digital form.
  • Assisting visually impaired users by converting handwritten notes into readable structured text via screen readers.

Out-of-Scope Use

This model is not suitable for:

  • Recognition of highly stylized handwriting or non-standard writing scripts not represented in the training data.
  • Security-critical applications (e.g., automated check processing, legal document signing).
  • Real-time OCR in resource-constrained edge devices due to its heavy model size and GPU dependence.

Bias, Risks, and Limitations

The model is trained primarily on synthetically generated handwritten data and augmented samples. As such:

  • It may underperform with actual handwritten data that deviates significantly from training patterns (e.g., culturally diverse handwriting).
  • It may fail to generalize to low-light, noisy, or blurry images not represented in synthetic datasets.
  • Structural interpretation (e.g., correct identification of titles vs. subtitles) is prompt-dependent and not guaranteed to be perfect.

Recommendations

Users should:

  • Avoid deploying the model in safety-critical or highly regulated settings without additional fine-tuning and validation.
  • Consider augmenting training with real-world handwritten datasets to improve generalization.
  • Post-process the model outputs to detect and correct potential structural or transcription errors.
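A minimal post-processing sketch, along the lines of the commented-out cleanup in the inference code below (the token strings are illustrative defaults, not guaranteed to match the processor's exact special tokens):

```python
import re

def clean_output(seq, eos_token="</s>", pad_token="<pad>"):
    # Drop pad/eos tokens, then strip any remaining task tags such as
    # <s_hw> or </s_title> before handing the text to downstream checks.
    seq = seq.replace(eos_token, "").replace(pad_token, "")
    return re.sub(r"<.*?>", "", seq).strip()

print(clean_output("<s_hw><s_generate_title>My Title</s_title><pad>"))  # My Title
```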

How to Get Started with the Model

Use the code below to get started with the model. For best accuracy, it is recommended to use 640x480 images containing a title, two subtitles, and two paragraphs with equations.

import re

import requests
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# load image from the IAM database
url = 'https://cdn-uploads.huggingface.co/production/uploads/65bc102dc1a44b6ef18be34b/65VBkKutGZ0WoERpfD8uw.jpeg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

processor = DonutProcessor.from_pretrained("pabloOmega/donut_hw_entities_v2")
model = VisionEncoderDecoderModel.from_pretrained("pabloOmega/donut_hw_entities_v2")
pixel_values = processor(image, return_tensors="pt").pixel_values

def generate_entity(prompt, eos_tag):
    """Generate one document entity (title, subtitles, or paragraphs)."""
    decoder_input_ids = torch.tensor(
        processor.tokenizer(prompt, add_special_tokens=False).input_ids
    )
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids.unsqueeze(0),
        max_length=255,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.convert_tokens_to_ids(eos_tag),
        use_cache=True,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
        repetition_penalty=1.2,
    )
    seq = processor.tokenizer.batch_decode(outputs.sequences)[0]
    seq = seq.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    # seq = re.sub(r"<.*?>", "", seq, count=1).strip()  # optionally remove the task start token
    return seq

# Each entity type has its own task prompt and end-of-sequence tag:
print(generate_entity("<s_hw><s_generate_title>", "</s_title>"))
print(generate_entity("<s_hw><s_generate_subtitles>", "</s_subtitle>"))
print(generate_entity("<s_hw><s_generate_paragraphs>", "</s_paragraph>"))

Training Details

Training Data

The training data consists of synthetically generated handwritten images simulating paragraphs, titles, subtitles, and equations. These were constructed using:

  • IAM dataset (handwriting stroke coordinates).
  • Equation generator based on the CROHME 2014 dataset.
  • Custom generators for layout variation.
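The augmentation techniques mentioned earlier (Gaussian noise plus brightness and contrast jitter) can be sketched in pure PyTorch; this is an illustrative approximation of the described pipeline, not the exact generator used (random rotation, also used in training, would typically be handled by torchvision's RandomRotation):

```python
import torch

def augment(img: torch.Tensor) -> torch.Tensor:
    # img: float tensor in [0, 1], shape (C, H, W)
    brightness = 1.0 + 0.2 * (2 * torch.rand(1) - 1)   # scale in [0.8, 1.2]
    contrast   = 1.0 + 0.2 * (2 * torch.rand(1) - 1)
    mean = img.mean()
    img = (img - mean) * contrast + mean               # contrast jitter around the mean
    img = img * brightness                             # brightness jitter
    img = img + 0.02 * torch.randn_like(img)           # additive Gaussian noise
    return img.clamp(0.0, 1.0)

out = augment(torch.rand(3, 64, 64))
```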

The training dataset can be found here.

Training Procedure

If you want to fine-tune the model, you may want to review this notebook.

Preprocessing

Images were preprocessed with padding and normalization. Text was tokenized into structured prompts representing document elements. You just need to load the processor.

Training Hyperparameters

  • Learning rate: 3e-5
  • Batch size: 4
  • Optimizer: AdamW
  • Epochs: 14
  • Hardware: Google Colab VM with NVIDIA A100 GPU, 90GB CPU RAM.
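A minimal sketch of the reported configuration; the placeholder module stands in for the actual Donut VisionEncoderDecoderModel:

```python
import torch
from torch.optim import AdamW

model = torch.nn.Linear(8, 8)  # placeholder for the fine-tuned model
optimizer = AdamW(model.parameters(), lr=3e-5)
EPOCHS, BATCH_SIZE = 14, 4
```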

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • Synthetic handwritten documents.
  • Real handwritten images from modified IAM dataset.

Factors

Evaluation compared models on printed versus handwritten content, and on their ability to recognize structured elements.

Metrics

  • BLEU Score: Measures overlap between generated and reference tokens.
  • Levenshtein Distance: Measures edit distance between generated and true sequences.

Model    BLEU   Levenshtein   Inference Time (s)
Model    0.02   0.51          19.91
Nougat   0.08   0.68          17.75
TrOCR    0.00   1.00          0.85
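The Levenshtein metric used above is the classic edit distance; a minimal dynamic-programming implementation, with normalization by the longer string length as is typical for OCR evaluation:

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

ref, hyp = "E = mc^2", "E = mc2"
print(levenshtein(ref, hyp) / max(len(ref), len(hyp)))  # 1 edit / 8 chars = 0.125
```

Lower is better for this metric, which is why 0.51 beats Nougat's 0.68 despite the lower BLEU score.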

Results


Technical Specifications

Model Architecture and Objective

  • Encoder: Swin Transformer
  • Decoder: BART
  • Model size: ~0.2B parameters
  • Objective: Structured text generation from handwritten images using OCR and prompt-based decoding.

Citation

BibTeX:

@mastersthesis{navarrete2025donutocr,
  author = {Pablo Steve Navarrete Arroyo},
  title  = {Structured Text Generation from Handwritten Documents using Donut Architecture},
  school = {Universidad Internacional de la Rioja},
  year   = {2025}
}

APA:

Navarrete Arroyo, P. S. (2025). Structured Text Generation from Handwritten Documents using Donut Architecture (Master’s thesis). Máster en Ingeniería Matemática y Computación. Universidad Internacional de la Rioja.

Model Card Contact

E-mail
