---
language:
- en
tags:
- image-to-text
- document-ai
- donut
- receipt-extraction
- ocr-free
datasets:
- Voxel51/scanned_receipts
- naver-clova-ix/cord-v2
- docjay131/receipts-ocr-dataset
- mychen76/invoices-and-receipts_ocr_v1
- mychen76/invoices-and-receipts_ocr_v2
- mychen76/wildreceipts_ocr_v1
- mychen76/receipt_cord_ocr_v2
- mychen76/ds_receipts_v2_train
pipeline_tag: image-to-text
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/receipt.jpg
  example_title: Sample Receipt
---
# 🧾🍩 Receipt Donut: Complete Document for Understanding
> **Welcome!** This page explains every technical decision so you can understand (and replicate) the full training pipeline.
This model extracts structured JSON data directly from receipt images **without** needing a separate OCR engine. It is a fine-tuned version of `naver-clova-ix/donut-base-finetuned-cord-v2` trained on 8,615 real-world receipt images.
**Try it live:** [Hugging Face Space](https://huggingface.co/spaces/Awarebeyond/receipt-donut-space)
---
## Table of Contents
1. [What is Ground Truth?](#what-is-ground-truth)
2. [Training Configuration (YAML Deep Dive)](#training-configuration-yaml-deep-dive)
3. [Dataset & Train/Test/Val Split](#dataset--traintestval-split)
4. [Training Performance & Learning Curves](#training-performance--learning-curves)
5. [Confusion Matrix & Field-Level Evaluation](#confusion-matrix--field-level-evaluation)
6. [How to Use (Python)](#how-to-use-python)
7. [Model Architecture](#model-architecture)
8. [Limitations](#limitations)
---
## What is Ground Truth?
In machine learning, **Ground Truth** is the "correct answer" we teach the model to predict. For receipts, instead of raw OCR text, we use **structured JSON** so the model learns to output clean, labeled data.
### Example Ground Truth
```json
{
  "merchant": "Starbucks Coffee",
  "date": "2026-03-15",
  "subtotal": "$12.50",
  "tax": "$1.13",
  "total": "$13.63",
  "address": "123 Main St, New York, NY"
}
```
### Why JSON Ground Truth matters
| Approach | Problem | Our Solution |
|----------|---------|--------------|
| Raw OCR text | No structure: you get "Starbucks $13.63" | We label **keys** and **values** |
| Fixed template | Fails on receipts with different fields | JSON is flexible and self-describing |
| Named Entity Recognition | Requires post-processing pipeline | Donut outputs JSON **directly** |
### How we normalized different datasets
Receipt datasets use wildly different formats. We wrote `_normalize_gt()` to unify them:
```python
# WildReceipts uses a list of annotations:
annotations = [
    {"label": "store_name", "transcription": "Walmart"},
    {"label": "total_value", "transcription": "$45.20"},
]

# CORD uses nested JSON:
gt_parse = {
    "menu": [...],
    "total": {"price": "$45.20"},
}

# Our code converts ALL of these into a single normalized format:
normalized = {
    "merchant": "Walmart",
    "total": "$45.20",
}
```
We **skip samples with empty ground truth** to prevent the model from learning to output `{}`.
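To make the normalization step concrete, here is a minimal sketch of the idea. It is not the project's actual `_normalize_gt()`: the function name `normalize_gt_sketch`, the `WILDRECEIPTS_LABELS` mapping, and the handling of only two fields are assumptions made for illustration.

```python
from typing import Any, Optional

# Hypothetical mapping from list-style annotation labels to normalized fields.
WILDRECEIPTS_LABELS = {"store_name": "merchant", "total_value": "total"}


def normalize_gt_sketch(raw: Any) -> Optional[dict]:
    """Map a raw annotation to a flat {field: value} dict; None if nothing usable."""
    out = {}
    if isinstance(raw, list):  # WildReceipts-style list of annotations
        for ann in raw:
            field = WILDRECEIPTS_LABELS.get(ann.get("label"))
            text = (ann.get("transcription") or "").strip()
            if field and text:
                out.setdefault(field, text)  # keep the first value per field
    elif isinstance(raw, dict):  # CORD-style nested JSON
        total = raw.get("total")
        if isinstance(total, dict):
            price = total.get("price") or total.get("total_price")
            if price:
                out["total"] = str(price).strip()
    # Returning None lets the loader skip samples with empty ground truth.
    return out or None


print(normalize_gt_sketch([
    {"label": "store_name", "transcription": "Walmart"},
    {"label": "total_value", "transcription": "$45.20"},
]))
```

A loader built on this sketch would drop any sample for which the function returns `None`, which matches the rule of skipping empty ground truth.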
---
## Training Configuration (YAML Deep Dive)
Here is the exact `gcp_l4_enterprise.yaml` we used. Each parameter is explained so you understand **why** we chose it.
```yaml
model:
  model_name: "naver-clova-ix/donut-base-finetuned-cord-v2"
  max_length: 768
  image_size: [1536, 1152]          # Wider than tall for typical receipts

training:
  output_dir: "./outputs/receipt_donut_gcp_enterprise"
  num_train_epochs: 20              # Upper limit; early stopping at epoch 9
  batch_size: 4                     # Fits in L4 24GB VRAM
  gradient_accumulation_steps: 16   # Effective batch = 4 × 16 = 64
  learning_rate: 8.0e-5             # Higher LR for larger effective batch
  weight_decay: 0.01                # Regularization to reduce overfitting
  warmup_ratio: 0.05                # 5% of steps warm up LR from 0
  bf16: true                        # L4 GPU has native BFloat16 support
  gradient_checkpointing: true      # Trade compute for memory; enables larger batches
  label_smoothing: 0.1              # Softens targets; prevents overconfident predictions
  freeze_encoder_epochs: 1          # Train only decoder first (faster convergence)
  cosine_restart_epochs: 5          # LR schedule restarts every 5 epochs
  grayscale: true                   # Reduces domain gap between color/gray receipts
  num_workers: 8                    # Parallel data loading (the L4 host VM has 8 CPU cores)

data:
  dataset_root: "./receipt_datasets"
  train_split: 0.95                 # 95% training
  val_split: 0.025                  # 2.5% validation
  test_split: 0.025                 # 2.5% holdout test
  seed: 42
  include_datasets:
    - "Voxel51__scanned_receipts"
    - "naver-clova-ix__cord-v2"
    - "docjay131__receipts-ocr-dataset"
    - "mychen76__invoices-and-receipts_ocr_v1"
    - "mychen76__invoices-and-receipts_ocr_v2"
    - "mychen76__wildreceipts_ocr_v1"
    - "mychen76__receipt_cord_ocr_v2"
    - "mychen76__ds_receipts_v2_train"

augmentation:
  enabled: true
  rotation_limit: 20                # Simulates tilted camera photos
  brightness_limit: 0.3             # Different lighting conditions
  contrast_limit: 0.3
  blur_prob: 0.5                    # Camera shake / focus blur
  noise_prob: 0.5                   # ISO noise in dark restaurants
  perspective_prob: 0.6             # Receipts photographed at an angle
  quality_lower: 40                 # JPEG compression artifacts
  quality_upper: 100
```
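The `warmup_ratio` and `cosine_restart_epochs` settings can be read as a learning-rate schedule: a linear warm-up from 0 to the peak rate, then a cosine decay that restarts every 5 epochs. The sketch below is one plausible reading of those two settings, not the training script's actual scheduler. The function name `lr_at` and the figure of 128 optimizer steps per epoch (8,184 training samples divided by an effective batch of 64, rounded up) are assumptions.

```python
import math


def lr_at(step, total_steps, steps_per_epoch, peak_lr=8.0e-5,
          warmup_ratio=0.05, restart_epochs=5):
    """Linear warm-up followed by cosine decay with periodic restarts."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    cycle_len = restart_epochs * steps_per_epoch
    # Position inside the current cosine cycle, in [0, 1).
    pos = ((step - warmup_steps) % cycle_len) / cycle_len
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * pos))


STEPS_PER_EPOCH = 128            # assumed: ceil(8184 / 64)
TOTAL_STEPS = 20 * STEPS_PER_EPOCH

for step in (0, 64, 128, 448, 768):
    print(step, lr_at(step, TOTAL_STEPS, STEPS_PER_EPOCH))
```

Under these assumptions the rate reaches its peak at step 128, falls to half the peak midway through a cycle, and jumps back to the peak at each restart.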
### Key Concepts Explained
**Gradient Accumulation:** We process 4 images at a time but accumulate gradients over 16 steps before updating the weights. This gives the stability of an effective batch size of 64 while only holding the activations for 4 images in GPU memory at once.
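A small pure-Python check can show why accumulation behaves like a larger batch. The toy model (`y = w * x` with a squared-error loss) and the helper names are invented for illustration and have nothing to do with the Donut training code. The point is only that averaging 16 micro-batch gradients of size 4 reproduces the gradient of one batch of 64.

```python
def grad_mse(w, batch):
    """Mean gradient of (w*x - y)^2 with respect to w over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch=4, accum_steps=16):
    """Sum micro-batch gradients, each scaled by 1/accum_steps."""
    total = 0.0
    for i in range(accum_steps):
        chunk = data[i * micro_batch:(i + 1) * micro_batch]
        # Scaling here mirrors dividing the loss by accum_steps before backward.
        total += grad_mse(w, chunk) / accum_steps
    return total


# 64 toy samples drawn from y = 3x + 1.
data = [(float(i), 3.0 * i + 1.0) for i in range(64)]
full = grad_mse(0.5, data)           # one big batch of 64
accum = accumulated_grad(0.5, data)  # 16 micro-batches of 4
print(full, accum)
```

The two printed values agree up to floating-point rounding, so the weight update is the same while only 4 samples are processed at a time.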
**BFloat16 (bf16):** A 16-bit floating-point format that keeps the exponent range of fp32 but has fewer mantissa bits. The L4 GPU has native bf16 hardware, so training is roughly 2× faster and activations take about half the memory compared to fp32, with almost no accuracy loss.
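To get a feel for what bf16 keeps and what it drops, the sketch below emulates it in plain Python by keeping only the top 16 bits of an fp32 value. This is a simplification: real hardware usually rounds to nearest instead of truncating, and the helper name `to_bf16` is made up for this example.

```python
import struct


def to_bf16(x: float) -> float:
    """Truncate a float to bfloat16 precision (top 16 bits of its fp32 encoding)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Keep sign (1 bit), exponent (8 bits) and the top 7 mantissa bits.
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


print(to_bf16(1.0))      # exactly representable
print(to_bf16(3.14159))  # loses digits after roughly the third significant figure
print(to_bf16(1e38))     # still finite, because the exponent range matches fp32
```

The last line is the reason bf16 is usually easier to train with than fp16: very large and very small values keep their magnitude, and only precision is reduced.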
**Gradient Checkpointing:** Instead of storing all intermediate activations in memory, we recompute them during the backward pass. This lets us fit a bigger model or batch at the cost of roughly 20% slower training.
**Label Smoothing:** Normally the model is told "this token is 100% correct." With smoothing, we say "this token is 90% correct, others share the remaining 10%." This prevents the model from becoming overconfident.
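The "90% / 10%" description can be written out as a target distribution. The sketch below follows the wording above, where the true token keeps `1 - eps` and the other tokens share `eps` equally. The function name is hypothetical, and the 5-token vocabulary is only for readability.

```python
def smoothed_target(true_index, num_classes, eps=0.1):
    """Target distribution: 1 - eps on the true class, eps shared by the rest."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if i == true_index else off_value
            for i in range(num_classes)]


target = smoothed_target(true_index=2, num_classes=5)
print(target)
print(sum(target))
```

Some libraries use a slightly different convention. For example, `torch.nn.CrossEntropyLoss(label_smoothing=eps)` spreads `eps` uniformly over all classes including the true one, so the true class ends up with `1 - eps + eps / K`. With a vocabulary of tens of thousands of tokens the difference is negligible.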
---
## Dataset & Train/Test/Val Split
### Data Sources (8 Datasets, ~8,615 labeled samples)
| Dataset | Type | Approx. Samples | Notes |
|---------|------|-----------------|-------|
| CORD-v2 | Structured | ~800 | Clean, high-quality receipts |
| WildReceipts | List annotations | ~2,000 | Noisy real-world scans |
| Scanned Receipts | Image + OCR | ~1,000 | Voxel51 collection |
| Invoices & Receipts v1/v2 | Mixed | ~2,500 | mychen76 datasets |
| Receipt CORD OCR v2 | OCR pairs | ~1,000 | Double-escaped JSON (we fixed parsing) |
| DS Receipts v2 Train | Synthetic | ~1,000 | Also had double-escaped strings |
### Split Ratios
```
Total: 8,615 samples
├── Train: 8,184 (95%)
├── Val:   215 (2.5%)   ← Used to pick the best checkpoint
└── Test:  215 (2.5%)   ← Holdout set, never seen during training
```
We used a **single unified dataset loader** (`UnifiedReceiptDataset`) so all 8 datasets are mixed and shuffled together. This prevents the model from overfitting to any one receipt style.
### Why these splits?
- **95% train:** With <10k samples, we need as much training data as possible.
- **2.5% val:** Just enough to detect overfitting without wasting data.
- **2.5% test:** Final unbiased evaluation. In practice, we also evaluated visually on unseen real receipts.
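A seeded split along these lines can be sketched as follows. This is an illustration and not the `UnifiedReceiptDataset` implementation: the function name and the rounding rule (truncate train and val, give the remainder to test) are assumptions, so the test count can differ by one from the figures above.

```python
import random


def split_indices(n, train=0.95, val=0.025, seed=42):
    """Shuffle indices 0..n-1 with a fixed seed and cut them into three parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # same seed gives the same split every run
    n_train = int(n * train)
    n_val = int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(8615)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the shuffle happens before the cut, samples from all source datasets end up mixed across the three parts.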
---
## Training Performance & Learning Curves
### Loss Curve

The model converged around **Epoch 9**. Training was stopped early because:
- Validation loss plateaued
- No improvement for 3 consecutive epochs
- Further training risked overfitting
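The stopping rule above (no improvement for 3 consecutive epochs) can be sketched in a few lines. The function name and the loss values are invented for illustration. They are not the losses recorded during this training run.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch); stop_epoch is None if patience never runs out."""
    best = float("inf")
    best_epoch = None
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, best_epoch = loss, epoch    # new best checkpoint
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch          # patience exhausted
    return None, best_epoch


# Made-up validation losses that plateau after epoch 6.
losses = [1.00, 0.80, 0.70, 0.65, 0.60, 0.58, 0.59, 0.60, 0.61, 0.50]
print(early_stop_epoch(losses))
```

With these made-up numbers training halts at epoch 9 and the checkpoint from epoch 6 would be kept as the best model.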
### Key Metrics
| Metric | Value |
|--------|-------|
| Total labeled samples (before split) | 8,615 |
| Effective batch size | 64 |
| Peak learning rate | 8.0e-5 |
| Training precision | bf16 |
| GPU | NVIDIA L4 (24 GB VRAM) |
| Training duration | ~10 hours actual (+ ~12 hours trial/error) |
| Early stopping epoch | 9 / 20 |
### Sample Visual Results
Below are real model outputs on the validation set (Original Image vs. Predicted JSON).

*Example 1: Correctly extracted merchant, date, and total.*

*Example 2: Handled a partially blurred receipt with minor date typo.*

*Example 3: Multi-line address and tax amount correctly parsed.*
---
## Confusion Matrix & Field-Level Evaluation
Since this is a **generative text model** (not a classifier), a traditional confusion matrix doesn't apply. Instead, we evaluate each extracted field with a **Field-Level Confusion Matrix** based on string similarity.
### Evaluation Categories
| Category | Criteria | Example |
|----------|----------|---------|
| ✅ **Correct** | 100% character match | `$13.63` == `$13.63` |
| ⚠️ **Minor Typo** | Normalized Levenshtein distance < 20% | `Starbuks` vs `Starbucks` |
| ❌ **Incorrect** | Distance ≥ 20% or missing | `null` vs `Walmart` |
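The three categories can be expressed as a small classifier. This is a sketch of the criteria in the table, not the code in `scripts/evaluate_model.py`. The function names, the choice to normalize by the longer string, and the rule that two missing values count as correct are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def classify_field(pred, truth, threshold=0.20):
    """Return 'correct', 'minor_typo' or 'incorrect' for one extracted field."""
    if pred == truth:
        return "correct"        # includes the case where both are missing
    if pred is None or truth is None:
        return "incorrect"      # one side is missing
    dist = levenshtein(pred, truth) / max(len(pred), len(truth), 1)
    return "minor_typo" if dist < threshold else "incorrect"


print(classify_field("$13.63", "$13.63"))
print(classify_field("Starbuks", "Starbucks"))
print(classify_field(None, "Walmart"))
```

`Starbuks` differs from `Starbucks` by one edit out of nine characters (about 11%), so it falls under the 20% threshold.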
### Field-Level Confusion Matrix (Test Set, 597 Samples)
| Field | Correct | Minor Typo | Incorrect | Notes |
|-------|---------|------------|-----------|-------|
| `merchant` | **70.9%** (423/597) | 8.5% (51) | 20.6% (123) | Store names vary wildly in format |
| `date` | **86.9%** (519/597) | 1.0% (6) | 12.1% (72) | Highly consistent format |
| `subtotal` | **71.7%** (428/597) | 2.3% (14) | 26.0% (155) | Often missing on simple receipts |
| `tax` | **86.4%** (516/597) | 0.0% (0) | 13.6% (81) | Usually present when subtotal is |
| `total` | **47.4%** (283/597) | 7.9% (47) | 44.7% (267) | **Hardest field**: model confuses it with subtotal |
| `address` | **100.0%** (597/597) | 0.0% (0) | 0.0% (0) | **Test set has 0 address labels**; model correctly abstains |

### Overall Performance
```
Exact Match (all fields correct): 32.8% (196/597)
Usable Match (≤1 minor typo): 61.1% (365/597)
Any Incorrect Field: 38.9% (232/597)
```
> **Key insight 1:** The `total` field is the model's biggest weakness at 47.4% correct. This is because `total` and `subtotal` are visually similar numbers on receipts, and the model sometimes swaps them. Improving this would require stronger positional cues or a post-processing rule (always pick the larger number).
> **Key insight 2:** `address` at 100% is **not meaningful**: address labels are completely absent from the 5 test datasets (CORD, WildReceipts, etc. don't include address). The model correctly learned not to hallucinate it.
> **Why is Exact Match only 32.8%?** Receipt extraction is genuinely hard. The test datasets (CORD, WildReceipts, etc.) use different JSON schemas and raw output formats. The model learns normalized fields, but raw GT contains keys like `total_price`, `cashprice`, `changeprice` that don't align perfectly. The model is still useful: **61.1%** of receipts are "usable" with at most one small typo.
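The post-processing rule mentioned in Key insight 1 (always pick the larger number as the total) could look like the sketch below. It has not been evaluated on this model's outputs. The helper names are hypothetical, and the parser only handles `$` and thousands separators.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Parse a string like '$13.63' into a Decimal, or return None."""
    if not text:
        return None
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def fix_total_subtotal(record):
    """Swap `total` and `subtotal` when the subtotal is the larger amount."""
    fixed = dict(record)  # leave the input untouched
    total = parse_amount(fixed.get("total"))
    subtotal = parse_amount(fixed.get("subtotal"))
    if total is not None and subtotal is not None and subtotal > total:
        fixed["total"], fixed["subtotal"] = fixed["subtotal"], fixed["total"]
    return fixed


print(fix_total_subtotal({"subtotal": "$13.63", "tax": "$1.13", "total": "$12.50"}))
```

The rule assumes the total is never smaller than the subtotal, which breaks for receipts with discounts applied after the subtotal, so it is best treated as a heuristic.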
### Generating the Confusion Matrix Yourself
Run this on your Workbench to reproduce the evaluation:
```bash
python scripts/evaluate_model.py \
--model_path outputs/receipt_donut_gcp_enterprise/best_model \
--dataset_root receipt_datasets \
--output_dir evaluation_results
```
This outputs:
- `confusion_matrix.png`: Visual matrix per field
- `field_accuracy.json`: Numerical breakdown
- `error_analysis.html`: Side-by-side failures
---
## How to Use (Python)
### Installation
```bash
pip install transformers Pillow torch
```
### Single Image Inference
```python
import json

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

MODEL = "Awarebeyond/receipt-donut"
processor = DonutProcessor.from_pretrained(MODEL)
model = VisionEncoderDecoderModel.from_pretrained(MODEL)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()


def extract(image_path):
    """Run the model on one receipt image and return the parsed JSON dict."""
    img = Image.open(image_path).convert("RGB")
    pixel_values = processor(img, return_tensors="pt").pixel_values.to(device)
    # Start decoding from the model's configured start token.
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]]).to(device)
    with torch.no_grad():
        outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=512,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True,  # required for `outputs.sequences`
        )
    # Decode, then strip the special tokens before parsing the JSON text.
    seq = processor.tokenizer.batch_decode(outputs.sequences)[0]
    seq = seq.replace(processor.tokenizer.eos_token, "").replace(
        processor.tokenizer.pad_token, ""
    )
    seq = seq.replace(
        processor.tokenizer.decode([model.config.decoder_start_token_id]), ""
    ).strip()
    return json.loads(seq)


result = extract("my_receipt.jpg")
print(json.dumps(result, indent=2))
```
### Batch Inference
```python
import json
from glob import glob

# Reuses `extract()` from the single-image example.
receipts = glob("receipts/*.jpg")
results = [extract(r) for r in receipts]

# Save to JSON
with open("batch_results.json", "w") as f:
    json.dump(results, f, indent=2)
```
---
## Model Architecture
```
Input Image (1536×1152)
↓
Swin Transformer Encoder
↓
Encoder Hidden States
↓
BART Decoder (cross-attention)
↓
JSON Text Tokens
```
- **Encoder:** Swin Transformer (hierarchical vision backbone)
- **Decoder:** BART (text generation with cross-attention)
- **Vocabulary:** ~5,000 tokens (includes special receipt tokens)
- **Parameters:** ~300M total
### Why Donut?
| Feature | OCR + NER Pipeline | Donut (End-to-End) |
|---------|-------------------|-------------------|
| Errors compound | OCR error → NER fails | Single model, single optimization |
| Layout handling | Requires separate layout model | Built into vision encoder |
| Speed | Multi-stage, slower | Single model call (one encoder pass plus autoregressive decoding) |
| Maintenance | 3+ models to update | One model, one checkpoint |
---
## Limitations
1. **Resolution:** Works best on receipts with text height ≥ 10 pixels. Very low-res images may fail.
2. **Languages:** Primarily trained on English receipts. Other languages may produce lower accuracy.
3. **Handwriting:** Printed text works best. Cursive handwriting is not well supported.
4. **Field coverage:** Only extracts `merchant`, `date`, `subtotal`, `tax`, `total`, `address`. Line items are not extracted.
5. **Currency normalization:** Outputs raw strings (`$13.63`); post-processing may be needed to convert them to floats.
---
## Citation
If you use this model in research, please cite:
```bibtex
@misc{receipt_donut_2026,
title={Receipt Donut: Fine-tuned Document Understanding for Receipt Extraction},
author={Awarebeyond},
year={2026},
howpublished={\url{https://huggingface.co/Awarebeyond/receipt-donut}}
}
```
---
*Built with ❤️ by a NAVTTC 🇵🇰 student using Google Cloud Workbench (L4 GPU) and the Hugging Face ecosystem.*