🧾🍩 Receipt Donut: A Complete Technical Walkthrough

Welcome! This page explains every technical decision so you can understand (and replicate) the full training pipeline.

This model extracts structured JSON data directly from receipt images without needing a separate OCR engine. It is a fine-tuned version of naver-clova-ix/donut-base-finetuned-cord-v2 trained on 8,615 real-world receipt images.

Try it live: 🚀 Hugging Face Space


📋 Table of Contents

  1. What is Ground Truth?
  2. Training Configuration (YAML Deep Dive)
  3. Dataset & Train/Test/Val Split
  4. Training Performance & Learning Curves
  5. Confusion Matrix & Field-Level Evaluation
  6. How to Use (Python)
  7. Model Architecture
  8. Limitations

What is Ground Truth?

In machine learning, Ground Truth is the "correct answer" we teach the model to predict. For receipts, instead of raw OCR text, we use structured JSON so the model learns to output clean, labeled data.

Example Ground Truth

{
  "merchant": "Starbucks Coffee",
  "date": "2026-03-15",
  "subtotal": "$12.50",
  "tax": "$1.13",
  "total": "$13.63",
  "address": "123 Main St, New York, NY"
}

Why JSON Ground Truth matters

| Approach | Problem | Our Solution |
|---|---|---|
| Raw OCR text | No structure: you get "Starbucks $13.63" | We label keys and values |
| Fixed template | Fails on receipts with different fields | JSON is flexible and self-describing |
| Named Entity Recognition | Requires a post-processing pipeline | Donut outputs JSON directly |

How we normalized different datasets

Receipt datasets use wildly different formats. We wrote _normalize_gt() to unify them:

# WildReceipts uses a list of annotations:
annotations = [
  {"label": "store_name", "transcription": "Walmart"},
  {"label": "total_value", "transcription": "$45.20"}
]

# CORD uses nested JSON:
gt_parse = {
  "menu": [...],
  "total": {"price": "$45.20"}
}

# Our code converts ALL of these into a single normalized format:
{
  "merchant": "Walmart",
  "total": "$45.20"
}

We skip samples with empty ground truth to prevent the model from learning to output {}.
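
Below is a minimal sketch of the idea behind _normalize_gt(); the real function handles all 8 dataset formats and more edge cases, and the label mapping shown here is illustrative rather than exhaustive.

# Sketch only: the actual _normalize_gt() covers every supported dataset.
LABEL_MAP = {"store_name": "merchant", "total_value": "total"}  # illustrative

def _normalize_gt(raw):
    """Convert one dataset's ground truth into the unified schema (or None)."""
    normalized = {}
    if isinstance(raw, list):
        # WildReceipts-style annotation lists
        for ann in raw:
            key = LABEL_MAP.get(ann.get("label"))
            if key and ann.get("transcription"):
                normalized[key] = ann["transcription"]
    elif isinstance(raw, dict):
        # CORD-style nested JSON
        total = raw.get("total")
        if isinstance(total, dict) and total.get("price"):
            normalized["total"] = total["price"]
    # Returning None lets the loader skip empty samples, so the model
    # never learns to output {}.
    return normalized or None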


Training Configuration (YAML Deep Dive)

Here is the exact gcp_l4_enterprise.yaml we used. Each parameter is explained so you understand why we chose it.

model:
  model_name: "naver-clova-ix/donut-base-finetuned-cord-v2"
  max_length: 768
  image_size: [1536, 1152]  # [height, width]: taller than wide, matching portrait receipts

training:
  output_dir: "./outputs/receipt_donut_gcp_enterprise"
  num_train_epochs: 20       # Upper limit; early stopping at epoch 9
  batch_size: 4              # Fits in L4 24GB VRAM
  gradient_accumulation_steps: 16  # Effective batch = 4 × 16 = 64
  learning_rate: 8.0e-5      # Higher LR for larger effective batch
  weight_decay: 0.01         # Prevents overfitting
  warmup_ratio: 0.05         # 5% of steps warm up LR from 0
  bf16: true                 # L4 GPU has native BFloat16 support
  gradient_checkpointing: true  # Trade compute for memory; enables larger batches
  label_smoothing: 0.1       # Softens targets; prevents overconfident predictions
  freeze_encoder_epochs: 1   # Train only decoder first (faster convergence)
  cosine_restart_epochs: 5   # LR schedule restarts every 5 epochs
  grayscale: true            # Reduces domain gap between color/gray receipts
  num_workers: 8             # Parallel data loading (the VM provides 8 vCPUs)

data:
  dataset_root: "./receipt_datasets"
  train_split: 0.95          # 95% training
  val_split: 0.025           # 2.5% validation
  test_split: 0.025          # 2.5% holdout test
  seed: 42
  include_datasets:
    - "Voxel51__scanned_receipts"
    - "naver-clova-ix__cord-v2"
    - "docjay131__receipts-ocr-dataset"
    - "mychen76__invoices-and-receipts_ocr_v1"
    - "mychen76__invoices-and-receipts_ocr_v2"
    - "mychen76__wildreceipts_ocr_v1"
    - "mychen76__receipt_cord_ocr_v2"
    - "mychen76__ds_receipts_v2_train"

augmentation:
  enabled: true
  rotation_limit: 20         # Simulates tilted camera photos
  brightness_limit: 0.3      # Different lighting conditions
  contrast_limit: 0.3
  blur_prob: 0.5             # Camera shake / focus blur
  noise_prob: 0.5            # ISO noise in dark restaurants
  perspective_prob: 0.6      # Receipts photographed at an angle
  quality_lower: 40          # JPEG compression artifacts
  quality_upper: 100
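
These parameter names closely match the albumentations library, so a plausible reconstruction of the pipeline looks like the sketch below. This is our guess at the wiring, not the exact training code; note that ImageCompression's quality_lower/quality_upper arguments were renamed in newer albumentations releases.

import albumentations as A

# Hypothetical augmentation pipeline mirroring the YAML above
# (assumes albumentations 1.x).
train_transforms = A.Compose([
    A.Rotate(limit=20, p=0.5),                       # rotation_limit: 20
    A.RandomBrightnessContrast(brightness_limit=0.3, # brightness_limit
                               contrast_limit=0.3,   # contrast_limit
                               p=0.5),
    A.GaussianBlur(p=0.5),                           # blur_prob: 0.5
    A.GaussNoise(p=0.5),                             # noise_prob: 0.5
    A.Perspective(p=0.6),                            # perspective_prob: 0.6
    A.ImageCompression(quality_lower=40,             # quality_lower: 40
                       quality_upper=100,            # quality_upper: 100
                       p=0.5),
])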

Key Concepts Explained

Gradient Accumulation: We process 4 images at a time, but accumulate gradients over 16 steps before updating weights. This gives us the stability of batch size 64 without needing 64× the GPU memory.

BFloat16 (bf16): A half-precision number format. The L4 GPU has native bf16 hardware, so training is ~2× faster and uses ~half the memory compared to fp32, with almost no accuracy loss.

Gradient Checkpointing: Instead of storing all intermediate activations in memory, we recompute them during backward pass. This lets us fit a bigger model/batch at the cost of ~20% slower training.

Label Smoothing: Normally the model is told "this token is 100% correct." With smoothing, we say "this token is 90% correct, others share the remaining 10%." This prevents the model from becoming overconfident.
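
The toy loop below ties these concepts together: gradient accumulation, bf16 autocasting, and label smoothing. It is a self-contained illustration with a dummy model, not the actual training script.

import torch

model = torch.nn.Linear(32, 10).cuda()               # stand-in for Donut
optimizer = torch.optim.AdamW(model.parameters(), lr=8e-5, weight_decay=0.01)
loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # label_smoothing: 0.1
ACCUM_STEPS = 16                                     # gradient_accumulation_steps

for step in range(64):
    x = torch.randn(4, 32).cuda()                    # micro-batch of 4 (batch_size: 4)
    y = torch.randint(0, 10, (4,)).cuda()
    # bf16 autocast: activations run in BFloat16, master weights stay fp32
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(x), y)
    # Scale so 16 accumulated micro-batches behave like one batch of 64
    (loss / ACCUM_STEPS).backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                             # one update per 16 micro-batches
        optimizer.zero_grad()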


Dataset & Train/Test/Val Split

Data Sources (8 Datasets, ~8,615 labeled samples)

| Dataset | Type | Approx. Samples | Notes |
|---|---|---|---|
| CORD-v2 | Structured | ~800 | Clean, high-quality receipts |
| WildReceipts | List annotations | ~2,000 | Noisy real-world scans |
| Scanned Receipts | Image + OCR | ~1,000 | Voxel51 collection |
| Invoices & Receipts v1/v2 | Mixed | ~2,500 | mychen76 datasets |
| Receipt CORD OCR v2 | OCR pairs | ~1,000 | Double-escaped JSON (we fixed parsing) |
| DS Receipts v2 Train | Synthetic | ~1,000 | Also had double-escaped strings |

Split Ratios

Total: 8,615 samples
├── Train:     8,184  (95%)
├── Val:         215  (2.5%)  → Used to pick the best checkpoint
└── Test:        215  (2.5%)  → Holdout set, never seen during training

We used a single unified dataset loader (UnifiedReceiptDataset) so all 8 datasets are mixed and shuffled together. This prevents the model from overfitting to any one receipt style.

Why these splits?

  • 95% train: With <10k samples, we need as much training data as possible.
  • 2.5% val: Just enough to detect overfitting without wasting data.
  • 2.5% test: Final unbiased evaluation. In practice, we also evaluated visually on unseen real receipts.
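
A deterministic split with seed 42 can be reproduced in a few lines; this sketch (ours, not the UnifiedReceiptDataset source) shows how the ratios above map onto the merged sample list:

import random

samples = list(range(8615))           # stand-in for the merged, unified samples
random.Random(42).shuffle(samples)    # seed: 42 makes the shuffle reproducible

n = len(samples)
n_train = int(n * 0.95)               # train_split: 0.95
n_val = int(n * 0.025)                # val_split: 0.025
train = samples[:n_train]
val = samples[n_train:n_train + n_val]
test = samples[n_train + n_val:]      # everything left over is the holdout
print(len(train), len(val), len(test))  # 8184 215 216 (rounding puts one extra in test)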

Training Performance & Learning Curves

Loss Curve

Learning Curve

The model converged around Epoch 9. Training was stopped early because:

  • Validation loss plateaued
  • No improvement for 3 consecutive epochs
  • Further training risked overfitting

Key Metrics

| Metric | Value |
|---|---|
| Total training samples | 8,615 |
| Effective batch size | 64 |
| Peak learning rate | 8.0e-5 |
| Training precision | bf16 |
| GPU | NVIDIA L4 (24 GB VRAM) |
| Training duration | ~10 hours (plus ~12 hours of trial and error) |
| Early stopping | epoch 9 of 20 |

Sample Visual Results

Below are real model outputs on the validation set (Original Image vs. Predicted JSON).

Example 1: Correctly extracted merchant, date, and total.

Example 2: Handled a partially blurred receipt, with only a minor typo in the date.

Example 3: Multi-line address and tax amount correctly parsed.


Confusion Matrix & Field-Level Evaluation

Since this is a generative text model (not a classifier), a traditional confusion matrix doesn't apply. Instead, we evaluate each extracted field with a Field-Level Confusion Matrix based on string similarity.

Evaluation Categories

| Category | Criteria | Example |
|---|---|---|
| ✅ Correct | 100% character match | $13.63 == $13.63 |
| ⚠️ Minor Typo | < 20% normalized Levenshtein distance | Starbuks vs Starbucks |
| ❌ Incorrect | > 20% distance, or field missing | null vs Walmart |
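
As a concrete illustration, a scorer implementing these rules could look like the sketch below (our reimplementation of the criteria in the table, not the evaluation script itself):

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def categorize(pred, truth):
    if pred == truth:
        return "correct"
    if not pred or not truth:
        return "incorrect"                               # missing field
    dist = levenshtein(pred, truth) / max(len(pred), len(truth))
    return "minor_typo" if dist < 0.20 else "incorrect"

print(categorize("Starbuks", "Starbucks"))  # minor_typo (1 edit / 9 chars ≈ 11%)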

Field-Level Confusion Matrix (Test Set, 597 Samples)

| Field | Correct | Minor Typo | Incorrect | Notes |
|---|---|---|---|---|
| merchant | 70.9% (423/597) | 8.5% (51) | 20.6% (123) | Store names vary wildly in format |
| date | 86.9% (519/597) | 1.0% (6) | 12.1% (72) | Highly consistent format |
| subtotal | 71.7% (428/597) | 2.3% (14) | 26.0% (155) | Often missing on simple receipts |
| tax | 86.4% (516/597) | 0.0% (0) | 13.6% (81) | Usually present when subtotal is |
| total | 47.4% (283/597) | 7.9% (47) | 44.7% (267) | Hardest field; the model confuses it with subtotal |
| address | 100.0% (597/597) | 0.0% (0) | 0.0% (0) | Test set has no address labels; the model correctly abstains |

Field Confusion Matrix

Overall Performance

Exact Match (all fields correct):  32.8% (196/597)
Usable Match (≤1 minor typo):      61.1% (365/597)
Any Incorrect Field:               38.9% (232/597)

Key insight 1: The total field is the model's biggest weakness at 47.4% correct. This is because total and subtotal are visually similar numbers on receipts, and the model sometimes swaps them. Improving this would require stronger positional cues or a post-processing rule (always pick the larger number).
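
That post-processing rule is easy to sketch: since total = subtotal + tax, the total is almost always the larger of the two amounts. The helper below is our illustration of the idea, not shipped code:

import re

def fix_total_subtotal(fields: dict) -> dict:
    def to_float(s):
        m = re.search(r"[\d.]+", s or "")
        return float(m.group()) if m else None

    sub, tot = to_float(fields.get("subtotal")), to_float(fields.get("total"))
    # If the model swapped the two fields, the larger amount is the true total.
    if sub is not None and tot is not None and sub > tot:
        fields["subtotal"], fields["total"] = fields["total"], fields["subtotal"]
    return fields

print(fix_total_subtotal({"subtotal": "$13.63", "total": "$12.50"}))
# {'subtotal': '$12.50', 'total': '$13.63'}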

Key insight 2: address at 100% is not meaningful: address labels are completely absent from the 5 test datasets (CORD, WildReceipts, etc. do not include address). The model correctly learned not to hallucinate the field.

Why is Exact Match only 32.8%? Receipt OCR is genuinely hard. The test datasets (CORD, WildReceipts, etc.) use different JSON schemas and raw output formats. The model learns normalized fields, but the raw ground truth contains keys like total_price, cashprice, and changeprice that don't align perfectly. The model is still useful: 61.1% of receipts are "usable" with at most one small typo.

Generating the Confusion Matrix Yourself

Run this on your Workbench to reproduce the evaluation:

python scripts/evaluate_model.py \
  --model_path outputs/receipt_donut_gcp_enterprise/best_model \
  --dataset_root receipt_datasets \
  --output_dir evaluation_results

This outputs:

  • confusion_matrix.png β€” Visual matrix per field
  • field_accuracy.json β€” Numerical breakdown
  • error_analysis.html β€” Side-by-side failures

How to Use (Python)

Installation

pip install transformers Pillow torch

Single Image Inference

import json

import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

MODEL = "Awarebeyond/receipt-donut"
processor = DonutProcessor.from_pretrained(MODEL)
model = VisionEncoderDecoderModel.from_pretrained(MODEL)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

def extract(image_path):
    img = Image.open(image_path).convert("RGB")
    pixel_values = processor(img, return_tensors="pt").pixel_values.to(device)
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]]).to(device)

    with torch.no_grad():
        outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=512,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True,  # required for outputs.sequences below
        )

    seq = processor.tokenizer.batch_decode(outputs.sequences)[0]
    seq = seq.replace(processor.tokenizer.eos_token, "").replace(
        processor.tokenizer.pad_token, ""
    )
    seq = seq.replace(
        processor.tokenizer.decode([model.config.decoder_start_token_id]), ""
    ).strip()

    return json.loads(seq)

result = extract("my_receipt.jpg")
print(json.dumps(result, indent=2))

Batch Inference

from glob import glob

receipts = glob("receipts/*.jpg")
results = [extract(r) for r in receipts]

# Save to JSON
with open("batch_results.json", "w") as f:
    json.dump(results, f, indent=2)

Model Architecture

Input Image (1536×1152)
    ↓
Swin Transformer Encoder
    ↓
Encoder Hidden States
    ↓
BART Decoder (cross-attention)
    ↓
JSON Text Tokens

  • Encoder: Swin Transformer (hierarchical vision backbone)
  • Decoder: BART (text generation with cross-attention)
  • Vocabulary: ~5,000 tokens (includes special receipt tokens)
  • Parameters: ~200M total

Why Donut?

| Aspect | OCR + NER Pipeline | Donut (End-to-End) |
|---|---|---|
| Error propagation | OCR errors cascade into NER failures | Single model, single optimization |
| Layout handling | Requires a separate layout model | Built into the vision encoder |
| Speed | Multi-stage, slower | One forward pass |
| Maintenance | 3+ models to update | One model, one checkpoint |

Limitations

  1. Resolution: Works best on receipts with text height ≥ 10 pixels. Very low-res images may fail.
  2. Languages: Primarily trained on English receipts. Other languages may produce lower accuracy.
  3. Handwriting: Printed text works best. Cursive handwriting is not well supported.
  4. Field coverage: Only extracts merchant, date, subtotal, tax, total, address. Line items are not extracted.
  5. Currency normalization: Outputs raw strings ($13.63); post-processing may be needed to convert them to floats (see the sketch below).
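
For limitation 5, a small normalizer like the following (ours, not part of the model) can convert the raw strings to floats:

import re

def parse_amount(raw: str):
    """Strip currency symbols and thousands separators; return float or None."""
    if not raw:
        return None
    cleaned = re.sub(r"[^\d.\-]", "", raw.replace(",", ""))
    try:
        return float(cleaned)
    except ValueError:
        return None

print(parse_amount("$13.63"))     # 13.63
print(parse_amount("$1,234.50"))  # 1234.5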

Citation

If you use this model in research, please cite:

@misc{receipt_donut_2026,
  title={Receipt Donut: Fine-tuned Document Understanding for Receipt Extraction},
  author={Awarebeyond},
  year={2026},
  howpublished={\url{https://huggingface.co/Awarebeyond/receipt-donut}}
}

Built with ❤️ by a NAVTTC 🇵🇰 student using Google Cloud Workbench (L4 GPU) and the Hugging Face ecosystem.
