---
language:
- en
tags:
- image-to-text
- document-ai
- donut
- receipt-extraction
- ocr-free
datasets:
- Voxel51/scanned_receipts
- naver-clova-ix/cord-v2
- docjay131/receipts-ocr-dataset
- mychen76/invoices-and-receipts_ocr_v1
- mychen76/invoices-and-receipts_ocr_v2
- mychen76/wildreceipts_ocr_v1
- mychen76/receipt_cord_ocr_v2
- mychen76/ds_receipts_v2_train
pipeline_tag: image-to-text
widget:
  - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/receipt.jpg
    example_title: Sample Receipt
---

# 🧾🍩 Receipt Donut — Complete Document for Understanding

> **Welcome!** This page explains every technical decision so you can understand (and replicate) the full training pipeline.

This model extracts structured JSON data directly from receipt images **without** needing a separate OCR engine. It is a fine-tuned version of `naver-clova-ix/donut-base-finetuned-cord-v2`, trained on 8,184 receipt images drawn from a pool of 8,615 labeled samples.

**Try it live:** [🚀 Hugging Face Space](https://huggingface.co/spaces/Awarebeyond/receipt-donut-space)

---

## 📋 Table of Contents
1. [What is Ground Truth?](#what-is-ground-truth)
2. [Training Configuration (YAML Deep Dive)](#training-configuration-yaml-deep-dive)
3. [Dataset & Train/Test/Val Split](#dataset--traintestval-split)
4. [Training Performance & Learning Curves](#training-performance--learning-curves)
5. [Confusion Matrix & Field-Level Evaluation](#confusion-matrix--field-level-evaluation)
6. [How to Use (Python)](#how-to-use-python)
7. [Model Architecture](#model-architecture)
8. [Limitations](#limitations)

---

## What is Ground Truth?

In machine learning, **Ground Truth** is the "correct answer" we teach the model to predict. For receipts, instead of raw OCR text, we use **structured JSON** so the model learns to output clean, labeled data.

### Example Ground Truth

```json
{
  "merchant": "Starbucks Coffee",
  "date": "2026-03-15",
  "subtotal": "$12.50",
  "tax": "$1.13",
  "total": "$13.63",
  "address": "123 Main St, New York, NY"
}
```

### Why JSON Ground Truth matters

| Approach | Problem | Our Solution |
|----------|---------|--------------|
| Raw OCR text | No structure — you get "Starbucks $13.63" | We label **keys** and **values** |
| Fixed template | Fails on receipts with different fields | JSON is flexible and self-describing |
| Named Entity Recognition | Requires post-processing pipeline | Donut outputs JSON **directly** |

### How we normalized different datasets

Receipt datasets use wildly different formats. We wrote `_normalize_gt()` to unify them:

```python
# WildReceipts uses a list of annotations:
annotations = [
  {"label": "store_name", "transcription": "Walmart"},
  {"label": "total_value", "transcription": "$45.20"}
]

# CORD uses nested JSON:
gt_parse = {
  "menu": [...],
  "total": {"price": "$45.20"}
}

# Our code converts ALL of these into a single normalized format:
{
  "merchant": "Walmart",
  "total": "$45.20"
}
```

We **skip samples with empty ground truth** to prevent the model from learning to output `{}`.
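
The normalization idea above can be sketched as follows. This is a minimal illustration, not the actual `_normalize_gt()`: the function name `normalize_gt_sketch`, the label mapping, and the handling of double-escaped JSON strings are assumptions based only on the two examples shown in this section.

```python
import json

# Hypothetical label-to-field mapping. Only these two pairs appear in the
# examples above; a real mapping would cover more labels.
WILDRECEIPTS_LABELS = {"store_name": "merchant", "total_value": "total"}


def normalize_gt_sketch(sample):
    """Return a flat {field: value} dict, or None if nothing usable was found."""
    # Some sources store the ground truth as a JSON string, sometimes escaped
    # twice, so keep decoding until we no longer hold a string.
    while isinstance(sample, str):
        try:
            sample = json.loads(sample)
        except json.JSONDecodeError:
            return None

    out = {}
    if isinstance(sample, list):  # WildReceipts-style annotation list
        for ann in sample:
            if not isinstance(ann, dict):
                continue
            field = WILDRECEIPTS_LABELS.get(ann.get("label"))
            text = (ann.get("transcription") or "").strip()
            if field and text and field not in out:
                out[field] = text
    elif isinstance(sample, dict):  # CORD-style nested gt_parse
        total = sample.get("total")
        if isinstance(total, dict) and total.get("price"):
            out["total"] = str(total["price"]).strip()

    return out or None  # None signals "skip this sample"
```

Returning `None` for an empty result gives the loader a single place to drop samples that would otherwise teach the model to emit `{}`.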

---

## Training Configuration (YAML Deep Dive)

Here is the exact `gcp_l4_enterprise.yaml` we used. Each parameter is explained so you understand **why** we chose it.

```yaml
model:
  model_name: "naver-clova-ix/donut-base-finetuned-cord-v2"
  max_length: 768
  image_size: [1536, 1152]  # [height, width]: taller than wide, like most receipts

training:
  output_dir: "./outputs/receipt_donut_gcp_enterprise"
  num_train_epochs: 20       # Upper limit; early stopping at epoch 9
  batch_size: 4              # Fits in L4 24GB VRAM
  gradient_accumulation_steps: 16  # Effective batch = 4 × 16 = 64
  learning_rate: 8.0e-5      # Higher LR for larger effective batch
  weight_decay: 0.01         # Prevents overfitting
  warmup_ratio: 0.05         # 5% of steps warm up LR from 0
  bf16: true                 # L4 GPU has native BFloat16 support
  gradient_checkpointing: true  # Trade compute for memory; enables larger batches
  label_smoothing: 0.1       # Softens targets; prevents overconfident predictions
  freeze_encoder_epochs: 1   # Train only decoder first (faster convergence)
  cosine_restart_epochs: 5   # LR schedule restarts every 5 epochs
  grayscale: true            # Reduces domain gap between color/gray receipts
  num_workers: 8             # Parallel data loading (host VM has 8 CPU cores)

data:
  dataset_root: "./receipt_datasets"
  train_split: 0.95          # 95% training
  val_split: 0.025           # 2.5% validation
  test_split: 0.025          # 2.5% holdout test
  seed: 42
  include_datasets:
    - "Voxel51__scanned_receipts"
    - "naver-clova-ix__cord-v2"
    - "docjay131__receipts-ocr-dataset"
    - "mychen76__invoices-and-receipts_ocr_v1"
    - "mychen76__invoices-and-receipts_ocr_v2"
    - "mychen76__wildreceipts_ocr_v1"
    - "mychen76__receipt_cord_ocr_v2"
    - "mychen76__ds_receipts_v2_train"

augmentation:
  enabled: true
  rotation_limit: 20         # Simulates tilted camera photos
  brightness_limit: 0.3      # Different lighting conditions
  contrast_limit: 0.3
  blur_prob: 0.5             # Camera shake / focus blur
  noise_prob: 0.5            # ISO noise in dark restaurants
  perspective_prob: 0.6      # Receipts photographed at an angle
  quality_lower: 40          # JPEG compression artifacts
  quality_upper: 100
```
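
To make `warmup_ratio` and `cosine_restart_epochs` concrete, here is one way such a schedule could be computed. It is a sketch under assumptions: the function `lr_at` is hypothetical, warmup is assumed to be linear, each cycle is assumed to decay to zero before restarting at the peak, and 128 optimizer steps per epoch is assumed (8,184 samples at an effective batch of 64, rounded up). The actual training script may differ.

```python
import math


def lr_at(step, *, total_steps, steps_per_cycle, peak_lr=8.0e-5, warmup_ratio=0.05):
    """Learning rate at a given optimizer step (0-based)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if warmup_steps and step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Position inside the current cosine cycle; wraps to 0 at each restart.
    pos = (step - warmup_steps) % steps_per_cycle
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * pos / steps_per_cycle))


STEPS_PER_EPOCH = 128                    # assumed, see lead-in
TOTAL_STEPS = 20 * STEPS_PER_EPOCH       # num_train_epochs: 20
STEPS_PER_CYCLE = 5 * STEPS_PER_EPOCH    # cosine_restart_epochs: 5

for s in (0, 64, 128, 448, 768):
    print(s, lr_at(s, total_steps=TOTAL_STEPS, steps_per_cycle=STEPS_PER_CYCLE))
```

Under these assumptions the rate reaches its peak after 128 steps, falls to half the peak midway through a cycle, and jumps back to the peak when the next cycle begins.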

### Key Concepts Explained

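**Gradient Accumulation:** We process 4 images at a time but accumulate gradients over 16 such micro-batches before updating the weights. This gives the optimization stability of batch size 64 while activation memory stays at the level of a 4-image batch, roughly 16× less than a true batch of 64 would need.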
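
The arithmetic behind this can be checked with a few lines. The helper `accumulation_plan` is hypothetical, and it assumes that a final, partially filled accumulation window still triggers a weight update, which depends on the trainer.

```python
import math


def accumulation_plan(num_samples, micro_batch, accum_steps):
    """Return (effective batch, micro-batches per epoch, optimizer steps per epoch)."""
    effective_batch = micro_batch * accum_steps
    micro_batches = math.ceil(num_samples / micro_batch)
    optimizer_steps = math.ceil(micro_batches / accum_steps)
    return effective_batch, micro_batches, optimizer_steps


# 8,184 training samples, batch_size 4, gradient_accumulation_steps 16
print(accumulation_plan(8184, 4, 16))  # (64, 2046, 128)
```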

**BFloat16 (bf16):** A 16-bit floating-point format that keeps the exponent range of fp32 and gives up mantissa precision instead. The L4 GPU has native bf16 hardware, so training is ~2× faster and uses ~half the memory compared to fp32, with almost no accuracy loss.

**Gradient Checkpointing:** Instead of storing all intermediate activations in memory, we recompute them during the backward pass. This lets us fit a bigger model or batch at the cost of ~20% slower training.

**Label Smoothing:** Normally the model is told "this token is 100% correct." With smoothing, we say "this token is 90% correct, others share the remaining 10%." This prevents the model from becoming overconfident.
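
As a small worked example of label smoothing, the sketch below builds the smoothed target distribution for one token. It assumes the common formulation in which the smoothing mass is spread uniformly over the whole vocabulary, including the correct token; the helper name `smoothed_targets` is illustrative.

```python
def smoothed_targets(correct_index, vocab_size, smoothing=0.1):
    """Target probabilities for one token after label smoothing."""
    uniform = smoothing / vocab_size        # share given to every token
    probs = [uniform] * vocab_size
    probs[correct_index] += 1.0 - smoothing  # the correct token keeps the rest
    return probs


# Tiny 5-token vocabulary so the numbers are easy to read.
print(smoothed_targets(correct_index=2, vocab_size=5))
```

With a 5-token vocabulary the correct token ends up at 0.92 rather than exactly 0.90; with a realistic vocabulary of tens of thousands of tokens the uniform share is negligible and the value is effectively the 90% described above.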

---

## Dataset & Train/Test/Val Split

### Data Sources (8 Datasets, ~8,615 labeled samples)

| Dataset | Type | Approx. Samples | Notes |
|---------|------|-----------------|-------|
| CORD-v2 | Structured | ~800 | Clean, high-quality receipts |
| WildReceipts | List annotations | ~2,000 | Noisy real-world scans |
| Scanned Receipts | Image + OCR | ~1,000 | Voxel51 collection |
| Invoices & Receipts v1/v2 | Mixed | ~2,500 | mychen76 datasets |
| Receipt CORD OCR v2 | OCR pairs | ~1,000 | Double-escaped JSON (we fixed parsing) |
| DS Receipts v2 Train | Synthetic | ~1,000 | Also had double-escaped strings |

### Split Ratios

```
Total: 8,615 samples
├── Train:     8,184  (95%)
├── Val:         215  (2.5%)  → Used to pick the best checkpoint
└── Test:        215  (2.5%)  → Holdout set, never seen during training
```

We used a **single unified dataset loader** (`UnifiedReceiptDataset`) so all 8 datasets are mixed and shuffled together. This prevents the model from overfitting to any one receipt style.

### Why these splits?

- **95% train:** With <10k samples, we need as much training data as possible.
- **2.5% val:** Just enough to detect overfitting without wasting data.
- **2.5% test:** Final unbiased evaluation. In practice, we also evaluated visually on unseen real receipts.
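
A seeded split along these lines can be sketched as below. This is not the `UnifiedReceiptDataset` code: the helper `split_indices` and its rounding rule (truncate train and val, give the remainder to test) are assumptions.

```python
import random


def split_indices(n, train=0.95, val=0.025, seed=42):
    """Shuffle indices 0..n-1 with a fixed seed and cut them into three splits."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train)
    n_val = int(n * val)
    # Whatever is left after train and val becomes the test split.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(8615)
print(len(train_idx), len(val_idx), len(test_idx))
```

With this rounding rule the sizes come out as 8,184 / 215 / 216, because the one leftover sample lands in the test split. The counts listed above may reflect a different rounding choice.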

---

## Training Performance & Learning Curves

### Loss Curve

![Learning Curve](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/learning_curve.png)

The model converged around **Epoch 9**. Training was stopped early because:
- Validation loss plateaued
- No improvement for 3 consecutive epochs
- Further training risked overfitting
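
The stopping rule described above (no improvement for 3 consecutive epochs) can be sketched like this. The helper `early_stop_epoch` is hypothetical, and the loss values in the example are invented purely to exercise the logic; they are not the real training losses.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based; stop_epoch is None if never triggered."""
    best = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Invented validation losses: best at epoch 6, then three epochs with no gain.
made_up_losses = [0.90, 0.70, 0.60, 0.55, 0.52, 0.50, 0.51, 0.50, 0.505]
print(early_stop_epoch(made_up_losses))  # (9, 6)
```

Note that with a patience of 3, stopping at epoch 9 means the checkpoint worth keeping is the one from three epochs earlier.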

### Key Metrics

| Metric | Value |
|--------|-------|
| Training samples | 8,184 (of 8,615 total) |
| Effective batch size | 64 |
| Peak learning rate | 8.0e-5 |
| Training precision | bf16 |
| GPU | NVIDIA L4 (24 GB VRAM) |
| Training duration | ~10 hours for the final run (plus ~12 hours of trial and error) |
| Early stopping epoch | 9 / 20 |

### Sample Visual Results

Below are real model outputs on the validation set (Original Image vs. Predicted JSON).

![Sample 1](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/sample_result_0.png)
*Example 1: Correctly extracted merchant, date, and total.*

![Sample 2](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/sample_result_1.png)
*Example 2: Handled a partially blurred receipt with a minor typo in the date.*

![Sample 3](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/sample_result_2.png)
*Example 3: Multi-line address and tax amount correctly parsed.*

---

## Confusion Matrix & Field-Level Evaluation

Since this is a **generative text model** (not a classifier), a traditional confusion matrix doesn't apply. Instead, we evaluate each extracted field with a **Field-Level Confusion Matrix** based on string similarity.

### Evaluation Categories

| Category | Criteria | Example |
|----------|----------|---------|
| ✅ **Correct** | 100% character match | `$13.63` == `$13.63` |
| ⚠️ **Minor Typo** | Not exact, normalized Levenshtein distance < 20% | `Starbuks` vs `Starbucks` |
| ❌ **Incorrect** | Normalized distance > 20%, or field missing | `null` vs `Walmart` |
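
The three categories can be sketched in plain Python as follows. This is an illustration rather than the code in `scripts/evaluate_model.py`: normalizing the distance by the longer string, counting a distance of exactly 20% as incorrect, and treating "both empty" as correct are all assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def categorize(pred, truth, threshold=0.20):
    """Classify one predicted field against its ground-truth value."""
    pred = (pred or "").strip()
    truth = (truth or "").strip()
    if pred == truth:
        return "correct"       # includes the case where both are empty
    if not pred or not truth:
        return "incorrect"     # missing or hallucinated field
    ratio = levenshtein(pred, truth) / max(len(pred), len(truth))
    return "minor_typo" if ratio < threshold else "incorrect"


print(categorize("Starbuks", "Starbucks"))  # minor_typo (1 edit over 9 characters)
```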

### Field-Level Confusion Matrix (Test Set — 597 Samples)

| Field | Correct | Minor Typo | Incorrect | Notes |
|-------|---------|------------|-----------|-------|
| `merchant` | **70.9%** (423/597) | 8.5% (51) | 20.6% (123) | Store names vary wildly in format |
| `date` | **86.9%** (519/597) | 1.0% (6) | 12.1% (72) | Highly consistent format |
| `subtotal` | **71.7%** (428/597) | 2.3% (14) | 26.0% (155) | Often missing on simple receipts |
| `tax` | **86.4%** (516/597) | 0.0% (0) | 13.6% (81) | Usually present when subtotal is |
| `total` | **47.4%** (283/597) | 7.9% (47) | 44.7% (267) | **Hardest field** — model confuses it with subtotal |
| `address` | **100.0%** (597/597) | 0.0% (0) | 0.0% (0) | **Test set has 0 address labels** — model correctly abstains |

![Field Confusion Matrix](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/field_confusion_matrix.png)

### Overall Performance

```
Exact Match (all fields correct): 32.8% (196/597)
Usable Match (≤1 minor typo):    61.1% (365/597)
Any Incorrect Field:             38.9% (232/597)
```

> **Key insight 1:** The `total` field is the model's biggest weakness at 47.4% correct. `total` and `subtotal` are visually similar numbers on receipts, and the model sometimes swaps them. Improving this would require stronger positional cues or a post-processing rule (for example, treating the larger of the two amounts as the total).

> **Key insight 2:** `address` at 100% is **not meaningful** — address labels are completely absent from the 5 test datasets (CORD, WildReceipts, etc. don't include address). The model correctly learned not to hallucinate it.

> **Why is Exact Match only 32.8%?** Receipt extraction is genuinely hard. The test datasets (CORD, WildReceipts, etc.) use different JSON schemas and raw output formats. The model learns normalized fields, but the raw ground truth contains keys such as `total_price`, `cashprice`, and `changeprice` that do not map cleanly onto them. The model is still useful: **61.1%** of receipts are "usable" with at most one small typo.
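
The post-processing rule mentioned in Key insight 1 could look like the sketch below. The helper `fix_total_subtotal` is hypothetical and not part of the released code. It assumes amounts use `.` as the decimal separator, and it would be wrong for receipts where a discount makes the total smaller than the subtotal.

```python
from decimal import Decimal, InvalidOperation


def _to_decimal(text):
    """Parse strings like '$1,234.50' into a Decimal, or return None."""
    cleaned = "".join(ch for ch in (text or "") if ch in "0123456789.-")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def fix_total_subtotal(fields):
    """Swap `total` and `subtotal` when the subtotal is the larger amount."""
    fixed = dict(fields)  # leave the caller's dict untouched
    total = _to_decimal(fixed.get("total"))
    subtotal = _to_decimal(fixed.get("subtotal"))
    if total is not None and subtotal is not None and subtotal > total:
        fixed["total"], fixed["subtotal"] = fixed["subtotal"], fixed["total"]
    return fixed


print(fix_total_subtotal({"subtotal": "$13.63", "tax": "$1.13", "total": "$12.50"}))
```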

### Generating the Confusion Matrix Yourself

Run this on your Workbench to reproduce the evaluation:

```bash
python scripts/evaluate_model.py \
  --model_path outputs/receipt_donut_gcp_enterprise/best_model \
  --dataset_root receipt_datasets \
  --output_dir evaluation_results
```

This outputs:
- `confusion_matrix.png` — Visual matrix per field
- `field_accuracy.json` — Numerical breakdown
- `error_analysis.html` — Side-by-side failures

---

## How to Use (Python)

### Installation

```bash
pip install transformers Pillow torch
```

### Single Image Inference

```python
import json

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

MODEL = "Awarebeyond/receipt-donut"
processor = DonutProcessor.from_pretrained(MODEL)
model = VisionEncoderDecoderModel.from_pretrained(MODEL)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

def extract(image_path):
    """Run the model on one receipt image and return the parsed JSON fields."""
    img = Image.open(image_path).convert("RGB")
    pixel_values = processor(img, return_tensors="pt").pixel_values.to(device)
    # Start decoding from the model's configured start token.
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]]).to(device)

    with torch.no_grad():
        outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=512,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True,  # required for `outputs.sequences` below
        )

    # Decode, then strip the special tokens that surround the JSON payload.
    seq = processor.tokenizer.batch_decode(outputs.sequences)[0]
    seq = seq.replace(processor.tokenizer.eos_token, "").replace(
        processor.tokenizer.pad_token, ""
    )
    seq = seq.replace(
        processor.tokenizer.decode([model.config.decoder_start_token_id]), ""
    ).strip()

    return json.loads(seq)

result = extract("my_receipt.jpg")
print(json.dumps(result, indent=2))
```

### Batch Inference

```python
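import json
from glob import glob

# Reuses extract() from the single-image example above.
receipts = sorted(glob("receipts/*.jpg"))  # sorted for a stable output order
results = [extract(r) for r in receipts]

# Save to JSON
with open("batch_results.json", "w") as f:
    json.dump(results, f, indent=2)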
```
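
In a larger batch, one unreadable image or one malformed model output would abort the whole list comprehension. A tolerant wrapper, sketched below, records the failure and moves on. The helper `extract_many` is hypothetical; it takes the extraction function as an argument so it can wrap the `extract()` shown above.

```python
import json


def extract_many(paths, extract_fn):
    """Apply extract_fn to each path, recording failures instead of raising."""
    results = []
    for path in paths:
        try:
            fields = extract_fn(path)
            results.append({"file": path, "fields": fields, "error": None})
        except (json.JSONDecodeError, OSError) as exc:
            # JSONDecodeError: the model output was not valid JSON.
            # OSError: the image file was missing or unreadable.
            results.append({"file": path, "fields": None, "error": str(exc)})
    return results
```

Usage would be `results = extract_many(receipts, extract)`, after which the entries with a non-empty `error` can be inspected separately.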

---

## Model Architecture

```
Input Image (1536×1152)
    ↓
Swin Transformer Encoder
    ↓
Encoder Hidden States
    ↓
BART Decoder (cross-attention)
    ↓
JSON Text Tokens
```

- **Encoder:** Swin Transformer (hierarchical vision backbone)
- **Decoder:** BART (text generation with cross-attention)
- **Vocabulary:** ~57,000 tokens, inherited from the base Donut tokenizer (includes special receipt tokens)
- **Parameters:** ~300M total

### Why Donut?

| Feature | OCR + NER Pipeline | Donut (End-to-End) |
|---------|-------------------|-------------------|
| Errors compound | OCR error → NER fails | Single model, single optimization |
| Layout handling | Requires separate layout model | Built into vision encoder |
| Speed | Multi-stage, slower | Single model: one encoder pass plus token-by-token decoding |
| Maintenance | 3+ models to update | One model, one checkpoint |

---

## Limitations

1. **Resolution:** Works best on receipts with text height ≥ 10 pixels. Very low-res images may fail.
2. **Languages:** Primarily trained on English receipts. Other languages may produce lower accuracy.
3. **Handwriting:** Printed text works best. Cursive handwriting is not well supported.
4. **Field coverage:** Only extracts `merchant`, `date`, `subtotal`, `tax`, `total`, `address`. Line items are not extracted.
5. **Currency normalization:** Outputs raw strings (`$13.63`) — post-processing may be needed to convert to floats.
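
For the currency post-processing mentioned in item 5, a minimal sketch is shown below. The helper `parse_amount` is hypothetical and assumes `.` is the decimal separator and `,` a thousands separator; it returns a `Decimal` rather than a float to avoid rounding surprises with money.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Convert a string like '$1,234.50' to Decimal('1234.50'), or None."""
    if not text:
        return None
    # First run of digits, optionally signed, with commas and one decimal part.
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    try:
        return Decimal(match.group(0).replace(",", ""))
    except InvalidOperation:
        return None


print(parse_amount("$13.63"))  # 13.63
```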

---

## Citation

If you use this model in research, please cite:

```bibtex
@misc{receipt_donut_2026,
  title={Receipt Donut: Fine-tuned Document Understanding for Receipt Extraction},
  author={Awarebeyond},
  year={2026},
  howpublished={\url{https://huggingface.co/Awarebeyond/receipt-donut}}
}
```

---

*Built with ❤️ by a NAVTTC 🇵🇰 student using Google Cloud Workbench (L4 GPU) and the Hugging Face ecosystem.*