---
language:
- en
tags:
- image-to-text
- document-ai
- donut
- receipt-extraction
- ocr-free
datasets:
- Voxel51/scanned_receipts
- naver-clova-ix/cord-v2
- docjay131/receipts-ocr-dataset
- mychen76/invoices-and-receipts_ocr_v1
- mychen76/invoices-and-receipts_ocr_v2
- mychen76/wildreceipts_ocr_v1
- mychen76/receipt_cord_ocr_v2
- mychen76/ds_receipts_v2_train
pipeline_tag: image-to-text
widget:
  - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/receipt.jpg
    example_title: Sample Receipt
---

# 🧾🍩 Receipt Donut: Complete Document for Understanding

> **Welcome!** This page explains every technical decision so you can understand (and replicate) the full training pipeline.

This model extracts structured JSON data directly from receipt images **without** needing a separate OCR engine. It is a fine-tuned version of `naver-clova-ix/donut-base-finetuned-cord-v2` trained on 8,615 real-world receipt images.

**Try it live:** [🚀 Hugging Face Space](https://huggingface.co/spaces/Awarebeyond/receipt-donut-space)

---

## 📋 Table of Contents
1. [What is Ground Truth?](#what-is-ground-truth)
2. [Training Configuration (YAML Deep Dive)](#training-configuration-yaml-deep-dive)
3. [Dataset & Train/Test/Val Split](#dataset--traintestval-split)
4. [Training Performance & Learning Curves](#training-performance--learning-curves)
5. [Confusion Matrix & Field-Level Evaluation](#confusion-matrix--field-level-evaluation)
6. [How to Use (Python)](#how-to-use-python)
7. [Model Architecture](#model-architecture)
8. [Limitations](#limitations)

---

## What is Ground Truth?

In machine learning, **Ground Truth** is the "correct answer" we teach the model to predict. For receipts, instead of raw OCR text, we use **structured JSON** so the model learns to output clean, labeled data.

### Example Ground Truth

```json
{
  "merchant": "Starbucks Coffee",
  "date": "2026-03-15",
  "subtotal": "$12.50",
  "tax": "$1.13",
  "total": "$13.63",
  "address": "123 Main St, New York, NY"
}
```

### Why JSON Ground Truth matters

| Approach | Problem | Our Solution |
|----------|---------|--------------|
| Raw OCR text | No structure: you get "Starbucks $13.63" | We label **keys** and **values** |
| Fixed template | Fails on receipts with different fields | JSON is flexible and self-describing |
| Named Entity Recognition | Requires post-processing pipeline | Donut outputs JSON **directly** |

### How we normalized different datasets

Receipt datasets use wildly different formats. We wrote `_normalize_gt()` to unify them:

```python
# WildReceipts uses a list of annotations:
annotations = [
  {"label": "store_name", "transcription": "Walmart"},
  {"label": "total_value", "transcription": "$45.20"}
]

# CORD uses nested JSON:
gt_parse = {
  "menu": [...],
  "total": {"price": "$45.20"}
}

# Our code converts ALL of these into a single normalized format:
normalized = {
  "merchant": "Walmart",
  "total": "$45.20"
}
```

We **skip samples with empty ground truth** to prevent the model from learning to output `{}`.
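
The normalization step can be sketched roughly as follows. This is a minimal illustration, not the actual `_normalize_gt()` implementation: the function name `normalize_gt_sketch`, the `WILDRECEIPT_LABELS` mapping, and the two-pass decoding of double-escaped strings are assumptions built only from the two example formats shown above.

```python
import json

# Hypothetical label-to-field mapping; the real _normalize_gt() may use other labels.
WILDRECEIPT_LABELS = {"store_name": "merchant", "total_value": "total"}


def normalize_gt_sketch(raw):
    """Map a WildReceipts-style list or a CORD-style dict to flat fields.

    Returns None when nothing usable was found, so the caller can skip the sample.
    """
    # Some sources store the annotation as a JSON string, sometimes escaped twice.
    for _ in range(2):
        if isinstance(raw, str):
            raw = json.loads(raw)

    out = {}
    if isinstance(raw, list):  # WildReceipts-style list of annotations
        for ann in raw:
            field = WILDRECEIPT_LABELS.get(ann.get("label"))
            text = (ann.get("transcription") or "").strip()
            if field and text:
                out[field] = text
    elif isinstance(raw, dict):  # CORD-style nested JSON
        total = raw.get("total")
        if isinstance(total, dict) and total.get("price"):
            out["total"] = total["price"]
    return out or None


wild = [
    {"label": "store_name", "transcription": "Walmart"},
    {"label": "total_value", "transcription": "$45.20"},
]
print(normalize_gt_sketch(wild))
```

Returning `None` for an empty result gives the dataset loader a single place to drop samples that would otherwise teach the model to emit `{}`.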

---

## Training Configuration (YAML Deep Dive)

Here is the exact `gcp_l4_enterprise.yaml` we used. Each parameter is explained so you understand **why** we chose it.

```yaml
model:
  model_name: "naver-clova-ix/donut-base-finetuned-cord-v2"
  max_length: 768
  image_size: [1536, 1152]  # Wider than tall for typical receipts

training:
  output_dir: "./outputs/receipt_donut_gcp_enterprise"
  num_train_epochs: 20       # Upper limit; early stopping at epoch 9
  batch_size: 4              # Fits in L4 24GB VRAM
  gradient_accumulation_steps: 16  # Effective batch = 4 × 16 = 64
  learning_rate: 8.0e-5      # Higher LR for larger effective batch
  weight_decay: 0.01         # Prevents overfitting
  warmup_ratio: 0.05         # 5% of steps warm up LR from 0
  bf16: true                 # L4 GPU has native BFloat16 support
  gradient_checkpointing: true  # Trade compute for memory; enables larger batches
  label_smoothing: 0.1       # Softens targets; prevents overconfident predictions
  freeze_encoder_epochs: 1   # Train only decoder first (faster convergence)
  cosine_restart_epochs: 5   # LR schedule restarts every 5 epochs
  grayscale: true            # Reduces domain gap between color/gray receipts
  num_workers: 8             # Parallel data loading (the L4 host VM has 8 CPU cores)

data:
  dataset_root: "./receipt_datasets"
  train_split: 0.95          # 95% training
  val_split: 0.025           # 2.5% validation
  test_split: 0.025          # 2.5% holdout test
  seed: 42
  include_datasets:
    - "Voxel51__scanned_receipts"
    - "naver-clova-ix__cord-v2"
    - "docjay131__receipts-ocr-dataset"
    - "mychen76__invoices-and-receipts_ocr_v1"
    - "mychen76__invoices-and-receipts_ocr_v2"
    - "mychen76__wildreceipts_ocr_v1"
    - "mychen76__receipt_cord_ocr_v2"
    - "mychen76__ds_receipts_v2_train"

augmentation:
  enabled: true
  rotation_limit: 20         # Simulates tilted camera photos
  brightness_limit: 0.3      # Different lighting conditions
  contrast_limit: 0.3
  blur_prob: 0.5             # Camera shake / focus blur
  noise_prob: 0.5            # ISO noise in dark restaurants
  perspective_prob: 0.6      # Receipts photographed at an angle
  quality_lower: 40          # JPEG compression artifacts
  quality_upper: 100
```
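
To make `warmup_ratio` and `cosine_restart_epochs` more concrete, here is one plausible reading of how they combine: a linear warmup to the peak learning rate, followed by cosine decay that restarts every 5 epochs. The function `lr_at_step` and the choice of 128 optimizer steps per epoch (8,184 training samples divided by an effective batch of 64, rounded up) are assumptions for illustration; the actual training script may shape the schedule differently.

```python
import math


def lr_at_step(step, steps_per_epoch, num_epochs=20, peak_lr=8.0e-5,
               warmup_ratio=0.05, restart_epochs=5):
    """Learning rate at a given optimizer step (hypothetical schedule)."""
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay that jumps back to the peak at the start of each cycle.
    cycle_len = steps_per_epoch * restart_epochs
    pos = ((step - warmup_steps) % cycle_len) / cycle_len
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * pos))


for s in (0, 64, 128, 448, 768):
    print(s, lr_at_step(s, steps_per_epoch=128))
```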

### Key Concepts Explained

**Gradient Accumulation:** We process 4 images at a time, but accumulate gradients over 16 steps before updating weights. This gives us the gradient stability of batch size 64 while only ever holding the activations for 4 images in GPU memory.
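
The equivalence can be checked on a toy problem. The sketch below uses a one-parameter linear model and made-up data (neither comes from the training code) to show that averaging the gradients of 16 micro-batches of 4 samples gives the same gradient as one batch of 64.

```python
def grad_mse(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Toy (x, y) pairs, purely illustrative.
data = [(float(i % 7 + 1), float(3 * (i % 7 + 1) + i % 3)) for i in range(64)]
w = 0.5
micro_batch, accum_steps = 4, 16

accumulated = 0.0
for k in range(accum_steps):
    chunk = data[k * micro_batch:(k + 1) * micro_batch]
    # Scale each micro-batch gradient by 1/accum_steps, as when the loss
    # is divided by the number of accumulation steps before backward().
    accumulated += grad_mse(w, chunk) / accum_steps

full = grad_mse(w, data)
print(accumulated, full)
```

The match is exact here because every micro-batch has the same size; with a ragged final batch the scaling would need to account for the sample counts.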

**BFloat16 (bf16):** A 16-bit floating-point format that keeps the exponent range of fp32 but stores fewer mantissa bits. The L4 GPU has native bf16 hardware, so training is ~2× faster and uses roughly half the memory compared to fp32, with almost no accuracy loss.

**Gradient Checkpointing:** Instead of storing all intermediate activations in memory, we recompute them during the backward pass. This lets us fit a bigger model or batch at the cost of ~20% slower training.

**Label Smoothing:** Normally the model is told "this token is 100% correct." With smoothing, we say "this token is 90% correct, others share the remaining 10%." This prevents the model from becoming overconfident.
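
As a small worked example of the description above, the sketch below builds a smoothed target over a toy 5-token vocabulary and compares the loss of an overconfident prediction with a calibrated one. The helper names and the toy probabilities are illustrative assumptions. Note that some library implementations spread the smoothing mass over all tokens including the correct one, which gives slightly different numbers.

```python
import math


def smoothed_target(correct_index, vocab_size, smoothing=0.1):
    """Put 1 - smoothing on the correct token and share the rest among the others."""
    off_value = smoothing / (vocab_size - 1)
    return [1.0 - smoothing if i == correct_index else off_value
            for i in range(vocab_size)]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


target = smoothed_target(correct_index=2, vocab_size=5)
overconfident = [0.001, 0.001, 0.996, 0.001, 0.001]
calibrated = list(target)

print(target)
print(cross_entropy(target, overconfident), cross_entropy(target, calibrated))
```

With smoothing, pushing the correct token's probability toward 1.0 is penalized, because the small targets on the other tokens make near-zero probabilities expensive.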

---

## Dataset & Train/Test/Val Split

### Data Sources (8 Datasets, ~8,615 labeled samples)

| Dataset | Type | Approx. Samples | Notes |
|---------|------|-----------------|-------|
| CORD-v2 | Structured | ~800 | Clean, high-quality receipts |
| WildReceipts | List annotations | ~2,000 | Noisy real-world scans |
| Scanned Receipts | Image + OCR | ~1,000 | Voxel51 collection |
| Invoices & Receipts v1/v2 | Mixed | ~2,500 | mychen76 datasets |
| Receipt CORD OCR v2 | OCR pairs | ~1,000 | Double-escaped JSON (we fixed parsing) |
| DS Receipts v2 Train | Synthetic | ~1,000 | Also had double-escaped strings |

### Split Ratios

```
Total: 8,615 samples
├── Train:     8,184  (95%)
├── Val:         215  (2.5%)  → Used to pick the best checkpoint
└── Test:        215  (2.5%)  → Holdout set, never seen during training
```

We used a **single unified dataset loader** (`UnifiedReceiptDataset`) so all 8 datasets are mixed and shuffled together. This prevents the model from overfitting to any one receipt style.

### Why these splits?

- **95% train:** With <10k samples, we need as much training data as possible.
- **2.5% val:** Just enough to detect overfitting without wasting data.
- **2.5% test:** Final unbiased evaluation. In practice, we also evaluated visually on unseen real receipts.
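
A seeded split with these ratios could look like the sketch below. It is an assumption about the mechanics, not the code inside `UnifiedReceiptDataset`: the function `split_indices` is hypothetical, and it assigns the rounding remainder to the test set, so its test count can differ by one from the figures listed above.

```python
import random


def split_indices(n, train=0.95, val=0.025, seed=42):
    """Shuffle indices reproducibly and cut them into train/val/test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # local RNG, global random state untouched
    n_train = int(n * train)
    n_val = int(n * val)
    # Whatever is left after train and val becomes the holdout test set.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(8615)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed keeps the holdout set identical across runs, which matters when comparing checkpoints trained at different times.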

---

## Training Performance & Learning Curves

### Loss Curve

![Learning Curve](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/learning_curve.png)

The model converged around **Epoch 9**. Training was stopped early because:
- Validation loss plateaued
- No improvement for 3 consecutive epochs
- Further training risked overfitting
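
The stopping rule described above can be sketched as a small patience counter. The function `early_stop_epoch` and the validation losses in the example are made up for illustration; they are not the logged values from this training run.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None if the patience limit is never reached.
    """
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Illustrative losses only: improvement stalls after epoch 6.
losses = [1.00, 0.80, 0.70, 0.65, 0.62, 0.60, 0.61, 0.60, 0.62]
print(early_stop_epoch(losses))
```

The checkpoint worth keeping is the one from `best_epoch`, not from the epoch where training stopped.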

### Key Metrics

| Metric | Value |
|--------|-------|
| Total labeled samples | 8,615 (8,184 used for training) |
| Effective batch size | 64 |
| Peak learning rate | 8.0e-5 |
| Training precision | bf16 |
| GPU | NVIDIA L4 (24 GB VRAM) |
| Training duration | ~10 hours actual (+ ~12 hours trial/error) |
| Early stopping epoch | 9 / 20 |

### Sample Visual Results

Below are real model outputs on the validation set (Original Image vs. Predicted JSON).

![Sample 1](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/sample_result_0.png)
*Example 1: Correctly extracted merchant, date, and total.*

![Sample 2](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/sample_result_1.png)
*Example 2: Handled a partially blurred receipt with a minor date typo.*

![Sample 3](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/sample_result_2.png)
*Example 3: Multi-line address and tax amount correctly parsed.*

---

## Confusion Matrix & Field-Level Evaluation

Since this is a **generative text model** (not a classifier), a traditional confusion matrix doesn't apply. Instead, we evaluate each extracted field with a **Field-Level Confusion Matrix** based on string similarity.

### Evaluation Categories

| Category | Criteria | Example |
|----------|----------|---------|
| ✅ **Correct** | 100% character match | `$13.63` == `$13.63` |
| ⚠️ **Minor Typo** | < 20% Levenshtein distance | `Starbuks` vs `Starbucks` |
| ❌ **Incorrect** | > 20% distance or missing | `null` vs `Walmart` |

### Field-Level Confusion Matrix (Test Set, 597 Samples)

| Field | Correct | Minor Typo | Incorrect | Notes |
|-------|---------|------------|-----------|-------|
| `merchant` | **70.9%** (423/597) | 8.5% (51) | 20.6% (123) | Store names vary wildly in format |
| `date` | **86.9%** (519/597) | 1.0% (6) | 12.1% (72) | Highly consistent format |
| `subtotal` | **71.7%** (428/597) | 2.3% (14) | 26.0% (155) | Often missing on simple receipts |
| `tax` | **86.4%** (516/597) | 0.0% (0) | 13.6% (81) | Usually present when subtotal is |
| `total` | **47.4%** (283/597) | 7.9% (47) | 44.7% (267) | **Hardest field**: model confuses it with subtotal |
| `address` | **100.0%** (597/597) | 0.0% (0) | 0.0% (0) | **Test set has 0 address labels**: model correctly abstains |

![Field Confusion Matrix](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/field_confusion_matrix.png)

### Overall Performance

```
Exact Match (all fields correct): 32.8% (196/597)
Usable Match (≤1 minor typo):    61.1% (365/597)
Any Incorrect Field:             38.9% (232/597)
```

> **Key insight 1:** The `total` field is the model's biggest weakness at 47.4% correct. This is because `total` and `subtotal` are visually similar numbers on receipts, and the model sometimes swaps them. Improving this would require stronger positional cues or a post-processing rule (always pick the larger number).

> **Key insight 2:** `address` at 100% is **not meaningful**: address labels are completely absent from the 5 test datasets (CORD, WildReceipts, etc. don't include address). The model correctly learned not to hallucinate it.

> **Why is Exact Match only 32.8%?** Receipt OCR is genuinely hard. The test datasets (CORD, WildReceipts, etc.) use different JSON schemas and raw output formats. The model learns normalized fields, but raw GT contains keys like `total_price`, `cashprice`, `changeprice` that don't align perfectly. The model is still useful: **61.1%** of receipts are "usable" with at most one small typo.
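
The post-processing rule suggested in Key insight 1 (always pick the larger number as the total) could be sketched like this. The helpers `to_amount` and `fix_total` are hypothetical, and the parser assumes `.` is the decimal separator and that `,` only groups thousands, which will not hold for every locale.

```python
import re
from decimal import Decimal, InvalidOperation


def to_amount(text):
    """Parse a currency string such as '$1,234.50'; return None if unparseable."""
    if not text:
        return None
    cleaned = re.sub(r"[^0-9.\-]", "", text)  # drop symbols, spaces, commas
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def fix_total(fields):
    """Swap subtotal and total when the predicted total is the smaller value."""
    out = dict(fields)
    subtotal, total = to_amount(out.get("subtotal")), to_amount(out.get("total"))
    if subtotal is not None and total is not None and total < subtotal:
        out["subtotal"], out["total"] = out["total"], out["subtotal"]
    return out


print(fix_total({"subtotal": "$13.63", "total": "$12.50"}))
```

The original strings are kept as-is, so the rule only reorders fields and never rewrites what the model read from the receipt.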

### Generating the Confusion Matrix Yourself

Run this on your Workbench to reproduce the evaluation:

```bash
python scripts/evaluate_model.py \
  --model_path outputs/receipt_donut_gcp_enterprise/best_model \
  --dataset_root receipt_datasets \
  --output_dir evaluation_results
```

This outputs:
- `confusion_matrix.png`: Visual matrix per field
- `field_accuracy.json`: Numerical breakdown
- `error_analysis.html`: Side-by-side failures

---

## How to Use (Python)

### Installation

```bash
pip install transformers Pillow torch
```

### Single Image Inference

```python
import json

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

MODEL = "Awarebeyond/receipt-donut"
processor = DonutProcessor.from_pretrained(MODEL)
model = VisionEncoderDecoderModel.from_pretrained(MODEL)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

def extract(image_path):
    img = Image.open(image_path).convert("RGB")
    pixel_values = processor(img, return_tensors="pt").pixel_values.to(device)
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]]).to(device)

    with torch.no_grad():
        outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=512,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            use_cache=True,
            return_dict_in_generate=True,  # required so outputs.sequences exists below
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
        )

    seq = processor.tokenizer.batch_decode(outputs.sequences)[0]
    seq = seq.replace(processor.tokenizer.eos_token, "").replace(
        processor.tokenizer.pad_token, ""
    )
    seq = seq.replace(
        processor.tokenizer.decode([model.config.decoder_start_token_id]), ""
    ).strip()

    return json.loads(seq)

result = extract("my_receipt.jpg")
print(json.dumps(result, indent=2))
```
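
Because the output is generated text, `json.loads` can fail on a truncated or malformed sequence. A defensive wrapper along the following lines may be useful; `safe_parse` and `EXPECTED_FIELDS` are hypothetical names, with the field list taken from the Limitations section.

```python
import json

# The six fields this model is documented to extract.
EXPECTED_FIELDS = ("merchant", "date", "subtotal", "tax", "total", "address")


def safe_parse(seq):
    """Parse generated text as JSON; return None instead of raising on bad output."""
    try:
        data = json.loads(seq)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Keep only the known fields and fill missing ones with None.
    return {key: data.get(key) for key in EXPECTED_FIELDS}


print(safe_parse('{"merchant": "Walmart", "total": "$45.20"}'))
print(safe_parse('{"merchant": "Wal'))
```

Returning `None` lets a batch job log the failing image and continue instead of aborting on the first bad receipt.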

### Batch Inference

```python
import json
from glob import glob

# Reuses extract() from the single-image example above.

receipts = glob("receipts/*.jpg")
results = [extract(r) for r in receipts]

# Save to JSON
with open("batch_results.json", "w") as f:
    json.dump(results, f, indent=2)
```

---

## Model Architecture

```
Input Image (1536×1152)
    ↓
Swin Transformer Encoder
    ↓
Encoder Hidden States
    ↓
BART Decoder (cross-attention)
    ↓
JSON Text Tokens
```

- **Encoder:** Swin Transformer (hierarchical vision backbone)
- **Decoder:** BART (text generation with cross-attention)
- **Vocabulary:** inherited from the base checkpoint's tokenizer, roughly 57,000 tokens (includes special receipt tokens)
- **Parameters:** ~300M total

### Why Donut?

| Feature | OCR + NER Pipeline | Donut (End-to-End) |
|---------|-------------------|-------------------|
| Errors compound | OCR error → NER fails | Single model, single optimization |
| Layout handling | Requires separate layout model | Built into vision encoder |
| Speed | Multi-stage, slower | One forward pass |
| Maintenance | 3+ models to update | One model, one checkpoint |

---

## Limitations

1. **Resolution:** Works best on receipts with text height ≥ 10 pixels. Very low-res images may fail.
2. **Languages:** Primarily trained on English receipts. Other languages may produce lower accuracy.
3. **Handwriting:** Printed text works best. Cursive handwriting is not well supported.
4. **Field coverage:** Only extracts `merchant`, `date`, `subtotal`, `tax`, `total`, `address`. Line items are not extracted.
5. **Currency normalization:** Outputs raw strings (`$13.63`), so post-processing may be needed to convert to floats.

---

## Citation

If you use this model in research, please cite:

```bibtex
@misc{receipt_donut_2026,
  title={Receipt Donut: Fine-tuned Document Understanding for Receipt Extraction},
  author={Awarebeyond},
  year={2026},
  howpublished={\url{https://huggingface.co/Awarebeyond/receipt-donut}}
}
```

---

*Built with ❤️ by a NAVTTC 🇵🇰 student using Google Cloud Workbench (L4 GPU) and the Hugging Face ecosystem.*