---
language:
- en
tags:
- image-to-text
- document-ai
- donut
- receipt-extraction
- ocr-free
datasets:
- Voxel51/scanned_receipts
- naver-clova-ix/cord-v2
- docjay131/receipts-ocr-dataset
- mychen76/invoices-and-receipts_ocr_v1
- mychen76/invoices-and-receipts_ocr_v2
- mychen76/wildreceipts_ocr_v1
- mychen76/receipt_cord_ocr_v2
- mychen76/ds_receipts_v2_train
pipeline_tag: image-to-text
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/receipt.jpg
example_title: Sample Receipt
---
# 🧾🍩 Receipt Donut: A Complete Guide to the Model
> **Welcome!** This page explains every technical decision so you can understand (and replicate) the full training pipeline.
This model extracts structured JSON data directly from receipt images **without** needing a separate OCR engine. It is a fine-tuned version of `naver-clova-ix/donut-base-finetuned-cord-v2` trained on 8,615 real-world receipt images.
**Try it live:** [🚀 Hugging Face Space](https://huggingface.co/spaces/Awarebeyond/receipt-donut-space)
---
## 📋 Table of Contents
1. [What is Ground Truth?](#what-is-ground-truth)
2. [Training Configuration (YAML Deep Dive)](#training-configuration-yaml-deep-dive)
3. [Dataset & Train/Test/Val Split](#dataset--traintestval-split)
4. [Training Performance & Learning Curves](#training-performance--learning-curves)
5. [Confusion Matrix & Field-Level Evaluation](#confusion-matrix--field-level-evaluation)
6. [How to Use (Python)](#how-to-use-python)
7. [Model Architecture](#model-architecture)
8. [Limitations](#limitations)
---
## What is Ground Truth?
In machine learning, **Ground Truth** is the "correct answer" we teach the model to predict. For receipts, instead of raw OCR text, we use **structured JSON** so the model learns to output clean, labeled data.
### Example Ground Truth
```json
{
  "merchant": "Starbucks Coffee",
  "date": "2026-03-15",
  "subtotal": "$12.50",
  "tax": "$1.13",
  "total": "$13.63",
  "address": "123 Main St, New York, NY"
}
```
### Why JSON Ground Truth matters
| Approach | Problem | Our Solution |
|----------|---------|--------------|
| Raw OCR text | No structure: you get "Starbucks $13.63" | We label **keys** and **values** |
| Fixed template | Fails on receipts with different fields | JSON is flexible and self-describing |
| Named Entity Recognition | Requires post-processing pipeline | Donut outputs JSON **directly** |
### How we normalized different datasets
Receipt datasets use wildly different formats. We wrote `_normalize_gt()` to unify them:
```python
# WildReceipts uses a list of annotations:
annotations = [
    {"label": "store_name", "transcription": "Walmart"},
    {"label": "total_value", "transcription": "$45.20"},
]

# CORD uses nested JSON:
gt_parse = {
    "menu": [...],
    "total": {"price": "$45.20"},
}

# Our code converts ALL of these into a single normalized format:
normalized = {
    "merchant": "Walmart",
    "total": "$45.20",
}
```
We **skip samples with empty ground truth** to prevent the model from learning to output `{}`.
---
## Training Configuration (YAML Deep Dive)
Here is the exact `gcp_l4_enterprise.yaml` we used. Each parameter is explained so you understand **why** we chose it.
```yaml
model:
  model_name: "naver-clova-ix/donut-base-finetuned-cord-v2"
  max_length: 768
  image_size: [1536, 1152]        # [height, width]: taller than wide, matching portrait receipts

training:
  output_dir: "./outputs/receipt_donut_gcp_enterprise"
  num_train_epochs: 20            # Upper limit; early stopping fired at epoch 9
  batch_size: 4                   # Fits in L4 24 GB VRAM
  gradient_accumulation_steps: 16 # Effective batch = 4 × 16 = 64
  learning_rate: 8.0e-5           # Higher LR for the larger effective batch
  weight_decay: 0.01              # Prevents overfitting
  warmup_ratio: 0.05              # 5% of steps warm the LR up from 0
  bf16: true                      # L4 GPU has native BFloat16 support
  gradient_checkpointing: true    # Trade compute for memory; enables larger batches
  label_smoothing: 0.1            # Softens targets; prevents overconfident predictions
  freeze_encoder_epochs: 1        # Train only the decoder first (faster convergence)
  cosine_restart_epochs: 5        # LR schedule restarts every 5 epochs
  grayscale: true                 # Reduces domain gap between color/gray receipts
  num_workers: 8                  # Parallel data loading (the VM has 8 CPU cores)

data:
  dataset_root: "./receipt_datasets"
  train_split: 0.95               # 95% training
  val_split: 0.025                # 2.5% validation
  test_split: 0.025               # 2.5% holdout test
  seed: 42
  include_datasets:
    - "Voxel51__scanned_receipts"
    - "naver-clova-ix__cord-v2"
    - "docjay131__receipts-ocr-dataset"
    - "mychen76__invoices-and-receipts_ocr_v1"
    - "mychen76__invoices-and-receipts_ocr_v2"
    - "mychen76__wildreceipts_ocr_v1"
    - "mychen76__receipt_cord_ocr_v2"
    - "mychen76__ds_receipts_v2_train"

augmentation:
  enabled: true
  rotation_limit: 20              # Simulates tilted camera photos
  brightness_limit: 0.3           # Different lighting conditions
  contrast_limit: 0.3
  blur_prob: 0.5                  # Camera shake / focus blur
  noise_prob: 0.5                 # ISO noise in dark restaurants
  perspective_prob: 0.6           # Receipts photographed at an angle
  quality_lower: 40               # JPEG compression artifacts
  quality_upper: 100
```
### Key Concepts Explained
**Gradient Accumulation:** We process 4 images at a time, but accumulate gradients over 16 steps before updating weights. This gives us the stability of batch size 64 without needing 64× the GPU memory.
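A toy PyTorch sketch (a tiny linear model, not the real Donut trainer) makes the equivalence concrete: 16 accumulated micro-batches of 4 yield the same gradient as one batch of 64.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
data, target = torch.randn(64, 8), torch.randn(64, 1)  # one "effective" batch
loss_fn = torch.nn.MSELoss()
accum_steps = 16

model.zero_grad()
for x, y in zip(data.split(4), target.split(4)):  # micro-batches of 4
    loss = loss_fn(model(x), y) / accum_steps     # scale so gradients average
    loss.backward()                               # grads accumulate in .grad
accum_grad = model.weight.grad.clone()
# (optimizer.step() would run here, once per 16 micro-batches)

# Sanity check: identical to the gradient of a true batch of 64
model.zero_grad()
loss_fn(model(data), target).backward()
assert torch.allclose(accum_grad, model.weight.grad, atol=1e-5)
```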
**BFloat16 (bf16):** A half-precision number format. The L4 GPU has native bf16 hardware, so training is ~2× faster and uses roughly half the memory compared to fp32, with almost no accuracy loss.
**Gradient Checkpointing:** Instead of storing all intermediate activations in memory, we recompute them during the backward pass. This lets us fit a bigger model/batch at the cost of ~20% slower training.
**Label Smoothing:** Normally the model is told "this token is 100% correct." With smoothing, we say "this token is 90% correct, others share the remaining 10%." This prevents the model from becoming overconfident.
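A worked example of the arithmetic, assuming a toy 5-token vocabulary (note that PyTorch's `label_smoothing` argument uses a slightly different convention, spreading the smoothing mass over all classes including the correct one):

```python
eps, vocab_size = 0.1, 5                         # toy 5-token vocabulary
smoothed = [eps / (vocab_size - 1)] * vocab_size
smoothed[2] = 1.0 - eps                          # say index 2 is the true token
# smoothed == [0.025, 0.025, 0.9, 0.025, 0.025], which still sums to 1,
# so the cross-entropy target is a softened distribution, not a one-hot.
assert abs(sum(smoothed) - 1.0) < 1e-9
```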
---
## Dataset & Train/Test/Val Split
### Data Sources (8 Datasets, ~8,615 labeled samples)
| Dataset | Type | Approx. Samples | Notes |
|---------|------|-----------------|-------|
| CORD-v2 | Structured | ~800 | Clean, high-quality receipts |
| WildReceipts | List annotations | ~2,000 | Noisy real-world scans |
| Scanned Receipts | Image + OCR | ~1,000 | Voxel51 collection |
| Invoices & Receipts v1/v2 | Mixed | ~2,500 | mychen76 datasets |
| Receipt CORD OCR v2 | OCR pairs | ~1,000 | Double-escaped JSON (we fixed parsing) |
| DS Receipts v2 Train | Synthetic | ~1,000 | Also had double-escaped strings |
### Split Ratios
```
Total: 8,615 samples
├── Train: 8,184 (95%)
├── Val:     215 (2.5%) → Used to pick the best checkpoint
└── Test:    215 (2.5%) → Holdout set, never seen during training
```
We used a **single unified dataset loader** (`UnifiedReceiptDataset`) so all 8 datasets are mixed and shuffled together. This prevents the model from overfitting to any one receipt style.
### Why these splits?
- **95% train:** With <10k samples, we need as much training data as possible.
- **2.5% val:** Just enough to detect overfitting without wasting data.
- **2.5% test:** Final unbiased evaluation. In practice, we also evaluated visually on unseen real receipts.
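The split is straightforward to reproduce; here is an illustrative sketch with the same seed and ratios (the real `UnifiedReceiptDataset` loader may assign the one-sample rounding remainder differently):

```python
import random

samples = list(range(8_615))        # stand-ins for the pooled receipt samples
random.Random(42).shuffle(samples)  # seeded shuffle => reproducible split

n_train = int(len(samples) * 0.95)
n_val = int(len(samples) * 0.025)
train = samples[:n_train]
val = samples[n_train:n_train + n_val]
test = samples[n_train + n_val:]
# -> 8184 train / 215 val / 216 test (integer truncation leaves the remainder in test)
```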
---
## Training Performance & Learning Curves
### Loss Curve
![Learning Curve](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/learning_curve.png)
The model converged around **Epoch 9**. Training was stopped early because:
- Validation loss plateaued
- No improvement for 3 consecutive epochs
- Further training risked overfitting
### Key Metrics
| Metric | Value |
|--------|-------|
| Total training samples | 8,615 |
| Effective batch size | 64 |
| Peak learning rate | 8.0e-5 |
| Training precision | bf16 |
| GPU | NVIDIA L4 (24 GB VRAM) |
| Training duration | ~10 hours (plus ~12 hours of trial and error) |
| Early stopping epoch | 9 / 20 |
### Sample Visual Results
Below are real model outputs on the validation set (Original Image vs. Predicted JSON).
![Sample 1](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/sample_result_0.png)
*Example 1: Correctly extracted merchant, date, and total.*
![Sample 2](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/sample_result_1.png)
*Example 2: Handled a partially blurred receipt with minor date typo.*
![Sample 3](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/sample_result_2.png)
*Example 3: Multi-line address and tax amount correctly parsed.*
---
## Confusion Matrix & Field-Level Evaluation
Since this is a **generative text model** (not a classifier), a traditional confusion matrix doesn't apply. Instead, we evaluate each extracted field with a **Field-Level Confusion Matrix** based on string similarity.
### Evaluation Categories
| Category | Criteria | Example |
|----------|----------|---------|
| ✅ **Correct** | 100% character match | `$13.63` == `$13.63` |
| ⚠️ **Minor Typo** | < 20% Levenshtein distance | `Starbuks` vs `Starbucks` |
| ❌ **Incorrect** | > 20% distance or missing | `null` vs `Walmart` |
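A minimal sketch of this bucketing, assuming the edit distance is normalized by the longer string's length (the exact normalization used by `evaluate_model.py` may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def classify_field(pred, truth, threshold=0.2):
    """Bucket one extracted field as in the table above."""
    if pred == truth:
        return "correct"
    if not pred or not truth:
        return "incorrect"  # missing on either side
    dist = levenshtein(pred, truth) / max(len(pred), len(truth))
    return "minor_typo" if dist < threshold else "incorrect"
```

For example, `classify_field("Starbuks", "Starbucks")` is a distance of 1 over 9 characters (~0.11), so it lands in the "minor typo" bucket.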
### Field-Level Confusion Matrix (Test Set, 597 Samples)
| Field | Correct | Minor Typo | Incorrect | Notes |
|-------|---------|------------|-----------|-------|
| `merchant` | **70.9%** (423/597) | 8.5% (51) | 20.6% (123) | Store names vary wildly in format |
| `date` | **86.9%** (519/597) | 1.0% (6) | 12.1% (72) | Highly consistent format |
| `subtotal` | **71.7%** (428/597) | 2.3% (14) | 26.0% (155) | Often missing on simple receipts |
| `tax` | **86.4%** (516/597) | 0.0% (0) | 13.6% (81) | Usually present when subtotal is |
| `total` | **47.4%** (283/597) | 7.9% (47) | 44.7% (267) | **Hardest field**: the model confuses it with subtotal |
| `address` | **100.0%** (597/597) | 0.0% (0) | 0.0% (0) | **Test set has 0 address labels**; the model correctly abstains |
![Field Confusion Matrix](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/field_confusion_matrix.png)
### Overall Performance
```
Exact Match (all fields correct): 32.8% (196/597)
Usable Match (≤1 minor typo): 61.1% (365/597)
Any Incorrect Field: 38.9% (232/597)
```
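These headline numbers follow mechanically from the per-field buckets; a sketch of the aggregation (bucket names and the "usable" definition are taken from the tables above):

```python
def summarize(receipts):
    """receipts: one list of field buckets per receipt, e.g. ["correct", "minor_typo", ...]."""
    exact = sum(all(f == "correct" for f in fields) for fields in receipts)
    usable = sum(
        "incorrect" not in fields and fields.count("minor_typo") <= 1
        for fields in receipts
    )
    return exact, usable
```

On a toy batch of three receipts, one perfect, one with a single typo, and one with an incorrect field, this returns `(1, 2)`.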
> **Key insight 1:** The `total` field is the model's biggest weakness at 47.4% correct. This is because `total` and `subtotal` are visually similar numbers on receipts, and the model sometimes swaps them. Improving this would require stronger positional cues or a post-processing rule (always pick the larger number).
> **Key insight 2:** `address` at 100% is **not meaningful**: address labels are completely absent from the test datasets (CORD, WildReceipts, etc. don't include an address field). The model correctly learned not to hallucinate it.
> **Why is Exact Match only 32.8%?** Receipt OCR is genuinely hard. The test datasets (CORD, WildReceipts, etc.) use different JSON schemas and raw output formats. The model learns normalized fields, but the raw ground truth contains keys like `total_price`, `cashprice`, and `changeprice` that don't align perfectly. The model is still useful: **61.1%** of receipts are "usable" with at most one small typo.
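The post-processing rule from the first insight could be sketched as a hypothetical helper (`fix_total_swap` is not part of the released model, and the numeric parsing is deliberately naive):

```python
def _to_float(s):
    """Best-effort numeric parse of strings like "$13.63"; None on failure."""
    try:
        return float(str(s).replace("$", "").replace(",", ""))
    except ValueError:
        return None

def fix_total_swap(fields: dict) -> dict:
    """If both amounts parse and total < subtotal, assume the model swapped them."""
    sub, tot = _to_float(fields.get("subtotal")), _to_float(fields.get("total"))
    if sub is not None and tot is not None and tot < sub:
        return dict(fields, subtotal=fields["total"], total=fields["subtotal"])
    return fields
```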
### Generating the Confusion Matrix Yourself
Run this on your Workbench to reproduce the evaluation:
```bash
python scripts/evaluate_model.py \
--model_path outputs/receipt_donut_gcp_enterprise/best_model \
--dataset_root receipt_datasets \
--output_dir evaluation_results
```
This outputs:
- `confusion_matrix.png`: visual matrix per field
- `field_accuracy.json`: numerical breakdown
- `error_analysis.html`: side-by-side failure examples
---
## How to Use (Python)
### Installation
```bash
pip install transformers Pillow torch
```
### Single Image Inference
```python
import json

import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

MODEL = "Awarebeyond/receipt-donut"
processor = DonutProcessor.from_pretrained(MODEL)
model = VisionEncoderDecoderModel.from_pretrained(MODEL)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

def extract(image_path):
    img = Image.open(image_path).convert("RGB")
    pixel_values = processor(img, return_tensors="pt").pixel_values.to(device)
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]]).to(device)
    with torch.no_grad():
        outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=512,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True,  # needed so outputs.sequences exists
        )
    seq = processor.tokenizer.batch_decode(outputs.sequences)[0]
    seq = seq.replace(processor.tokenizer.eos_token, "").replace(
        processor.tokenizer.pad_token, ""
    )
    seq = seq.replace(
        processor.tokenizer.decode([model.config.decoder_start_token_id]), ""
    ).strip()
    return json.loads(seq)

result = extract("my_receipt.jpg")
print(json.dumps(result, indent=2))
```
### Batch Inference
```python
import json
from glob import glob

# Reuses the extract() helper defined in the single-image snippet above
receipts = glob("receipts/*.jpg")
results = [extract(r) for r in receipts]

# Save to JSON
with open("batch_results.json", "w") as f:
    json.dump(results, f, indent=2)
```
---
## Model Architecture
```
Input Image (1536×1152)
          ↓
Swin Transformer Encoder
          ↓
Encoder Hidden States
          ↓
BART Decoder (cross-attention)
          ↓
JSON Text Tokens
```
- **Encoder:** Swin Transformer (hierarchical vision backbone)
- **Decoder:** BART (text generation with cross-attention)
- **Vocabulary:** ~5,000 tokens (includes special receipt tokens)
- **Parameters:** ~300M total
### Why Donut?
| Feature | OCR + NER Pipeline | Donut (End-to-End) |
|---------|-------------------|-------------------|
| Errors compound | OCR error β†’ NER fails | Single model, single optimization |
| Layout handling | Requires separate layout model | Built into vision encoder |
| Speed | Multi-stage, slower | One forward pass |
| Maintenance | 3+ models to update | One model, one checkpoint |
---
## Limitations
1. **Resolution:** Works best on receipts with text height ≥ 10 pixels. Very low-resolution images may fail.
2. **Languages:** Primarily trained on English receipts. Other languages may produce lower accuracy.
3. **Handwriting:** Printed text works best. Cursive handwriting is not well supported.
4. **Field coverage:** Only extracts `merchant`, `date`, `subtotal`, `tax`, `total`, `address`. Line items are not extracted.
5. **Currency normalization:** Outputs raw strings (`$13.63`); post-processing may be needed to convert them to floats.
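For the currency caveat in point 5, a minimal post-processing sketch (`to_amount` is a hypothetical helper, not part of this repo):

```python
from decimal import Decimal, InvalidOperation

def to_amount(value):
    """Parse strings like "$13.63" into Decimal; return None when unparseable."""
    cleaned = str(value).strip().lstrip("$€£").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
```

`Decimal` avoids the binary rounding surprises of `float` when summing money fields.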
---
## Citation
If you use this model in research, please cite:
```bibtex
@misc{receipt_donut_2026,
title={Receipt Donut: Fine-tuned Document Understanding for Receipt Extraction},
author={Awarebeyond},
year={2026},
howpublished={\url{https://huggingface.co/Awarebeyond/receipt-donut}}
}
```
---
*Built with ❤️ by a NAVTTC 🇵🇰 student using Google Cloud Workbench (L4 GPU) and the Hugging Face ecosystem.*