---
language:
- en
tags:
- image-to-text
- document-ai
- donut
- receipt-extraction
- ocr-free
datasets:
- Voxel51/scanned_receipts
- naver-clova-ix/cord-v2
- docjay131/receipts-ocr-dataset
- mychen76/invoices-and-receipts_ocr_v1
- mychen76/invoices-and-receipts_ocr_v2
- mychen76/wildreceipts_ocr_v1
- mychen76/receipt_cord_ocr_v2
- mychen76/ds_receipts_v2_train
pipeline_tag: image-to-text
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/receipt.jpg
  example_title: Sample Receipt
---
# 🧾🍩 Receipt Donut: Complete Model Documentation

> **Welcome!** This page explains every technical decision so you can understand (and replicate) the full training pipeline.

This model extracts structured JSON data directly from receipt images **without** needing a separate OCR engine. It is a fine-tuned version of `naver-clova-ix/donut-base-finetuned-cord-v2`, trained on 8,615 real-world receipt images.

**Try it live:** [Hugging Face Space](https://huggingface.co/spaces/Awarebeyond/receipt-donut-space)

---
## Table of Contents
1. [What is Ground Truth?](#what-is-ground-truth)
2. [Training Configuration (YAML Deep Dive)](#training-configuration-yaml-deep-dive)
3. [Dataset & Train/Test/Val Split](#dataset--traintestval-split)
4. [Training Performance & Learning Curves](#training-performance--learning-curves)
5. [Confusion Matrix & Field-Level Evaluation](#confusion-matrix--field-level-evaluation)
6. [How to Use (Python)](#how-to-use-python)
7. [Model Architecture](#model-architecture)
8. [Limitations](#limitations)

---
## What is Ground Truth?

In machine learning, **ground truth** is the "correct answer" we teach the model to predict. For receipts, instead of raw OCR text, we use **structured JSON** so the model learns to output clean, labeled data.
### Example Ground Truth

```json
{
  "merchant": "Starbucks Coffee",
  "date": "2026-03-15",
  "subtotal": "$12.50",
  "tax": "$1.13",
  "total": "$13.63",
  "address": "123 Main St, New York, NY"
}
```
### Why JSON Ground Truth matters

| Approach | Problem | Our Solution |
|----------|---------|--------------|
| Raw OCR text | No structure: you just get "Starbucks $13.63" | We label **keys** and **values** |
| Fixed template | Fails on receipts with different fields | JSON is flexible and self-describing |
| Named Entity Recognition | Requires post-processing pipeline | Donut outputs JSON **directly** |
### How we normalized different datasets

Receipt datasets use wildly different formats. We wrote `_normalize_gt()` to unify them:

```python
# WildReceipts uses a flat list of labeled text regions:
annotations = [
    {"label": "store_name", "transcription": "Walmart"},
    {"label": "total_value", "transcription": "$45.20"},
]

# CORD uses nested JSON:
gt_parse = {
    "menu": [...],
    "total": {"price": "$45.20"},
}

# Our code converts ALL of these into a single normalized format:
normalized = {
    "merchant": "Walmart",
    "total": "$45.20",
}
```
We **skip samples with empty ground truth** to prevent the model from learning to output `{}`.
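The actual `_normalize_gt()` lives in the training code; the sketch below only captures its core idea for the two annotation styles above. The `LABEL_MAP` entries and the helper name are illustrative, not the exact implementation:

```python
# Illustrative label mapping (the real one covers many more dataset-specific keys)
LABEL_MAP = {
    "store_name": "merchant",
    "total_value": "total",
}

def normalize_gt(raw):
    """Convert either annotation style into the flat normalized schema."""
    normalized = {}
    if isinstance(raw, list):  # WildReceipts-style list of annotations
        for ann in raw:
            key = LABEL_MAP.get(ann.get("label"))
            if key and ann.get("transcription"):
                normalized[key] = ann["transcription"]
    elif isinstance(raw, dict):  # CORD-style nested gt_parse
        total = raw.get("total")
        if isinstance(total, dict) and total.get("price"):
            normalized["total"] = total["price"]
    return normalized or None  # None -> sample is skipped (empty ground truth)
```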
---

## Training Configuration (YAML Deep Dive)

Here is the exact `gcp_l4_enterprise.yaml` we used. Each parameter is explained so you understand **why** we chose it.
```yaml
model:
  model_name: "naver-clova-ix/donut-base-finetuned-cord-v2"
  max_length: 768
  image_size: [1536, 1152]           # [height, width]: taller than wide, matching portrait receipts

training:
  output_dir: "./outputs/receipt_donut_gcp_enterprise"
  num_train_epochs: 20               # Upper limit; early stopping kicked in at epoch 9
  batch_size: 4                      # Fits in L4 24 GB VRAM
  gradient_accumulation_steps: 16    # Effective batch = 4 × 16 = 64
  learning_rate: 8.0e-5              # Higher LR for the larger effective batch
  weight_decay: 0.01                 # Prevents overfitting
  warmup_ratio: 0.05                 # 5% of steps warm up LR from 0
  bf16: true                         # L4 GPU has native BFloat16 support
  gradient_checkpointing: true       # Trade compute for memory; enables larger batches
  label_smoothing: 0.1               # Softens targets; prevents overconfident predictions
  freeze_encoder_epochs: 1           # Train only the decoder first (faster convergence)
  cosine_restart_epochs: 5           # LR schedule restarts every 5 epochs
  grayscale: true                    # Reduces domain gap between color/gray receipts
  num_workers: 8                     # Parallel data loading (the L4 VM has 8 vCPUs)

data:
  dataset_root: "./receipt_datasets"
  train_split: 0.95                  # 95% training
  val_split: 0.025                   # 2.5% validation
  test_split: 0.025                  # 2.5% holdout test
  seed: 42
  include_datasets:
    - "Voxel51__scanned_receipts"
    - "naver-clova-ix__cord-v2"
    - "docjay131__receipts-ocr-dataset"
    - "mychen76__invoices-and-receipts_ocr_v1"
    - "mychen76__invoices-and-receipts_ocr_v2"
    - "mychen76__wildreceipts_ocr_v1"
    - "mychen76__receipt_cord_ocr_v2"
    - "mychen76__ds_receipts_v2_train"

augmentation:
  enabled: true
  rotation_limit: 20                 # Simulates tilted camera photos
  brightness_limit: 0.3              # Different lighting conditions
  contrast_limit: 0.3
  blur_prob: 0.5                     # Camera shake / focus blur
  noise_prob: 0.5                    # ISO noise in dark restaurants
  perspective_prob: 0.6              # Receipts photographed at an angle
  quality_lower: 40                  # JPEG compression artifacts
  quality_upper: 100
```
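For reference, the `augmentation` block maps almost one-to-one onto an [Albumentations](https://albumentations.ai/) pipeline. The sketch below is an assumed wiring, not the exact training transform (and note that newer Albumentations releases replace `quality_lower`/`quality_upper` with a single `quality_range` argument):

```python
import albumentations as A

# Illustrative mapping from the YAML augmentation block to Albumentations transforms
augment = A.Compose([
    A.Rotate(limit=20, p=0.5),                          # rotation_limit
    A.RandomBrightnessContrast(
        brightness_limit=0.3, contrast_limit=0.3, p=0.5  # brightness/contrast_limit
    ),
    A.GaussianBlur(p=0.5),                              # blur_prob
    A.GaussNoise(p=0.5),                                # noise_prob
    A.Perspective(p=0.6),                               # perspective_prob
    A.ImageCompression(quality_lower=40, quality_upper=100, p=0.5),
])

# Usage on a NumPy image array: augmented = augment(image=np_image)["image"]
```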
### Key Concepts Explained

**Gradient Accumulation:** We process 4 images at a time, but accumulate gradients over 16 steps before updating weights. This gives us the gradient stability of batch size 64 while only ever holding 4 images' activations in GPU memory.
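In PyTorch, the pattern is just a scaled loss plus a deferred `optimizer.step()`. A self-contained toy sketch (a tiny linear model stands in for Donut):

```python
import torch

model = torch.nn.Linear(10, 2)                       # stand-in for the Donut model
optimizer = torch.optim.AdamW(model.parameters(), lr=8e-5)
loss_fn = torch.nn.CrossEntropyLoss()
data = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(32)]

accum_steps = 16                                     # gradient_accumulation_steps
optimizer.zero_grad()
for step, (x, y) in enumerate(data):                 # micro-batches of 4
    loss = loss_fn(model(x), y) / accum_steps        # scale so the update averages
    loss.backward()                                  # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                             # one update per 4 × 16 = 64 samples
        optimizer.zero_grad()
```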
**BFloat16 (bf16):** A half-precision number format. The L4 GPU has native bf16 hardware, so training is ~2× faster and uses roughly half the memory compared to fp32, with almost no accuracy loss.
**Gradient Checkpointing:** Instead of storing all intermediate activations in memory, we recompute them during the backward pass. This lets us fit a bigger model/batch at the cost of ~20% slower training.
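Both features are one-liners with Hugging Face models. A sketch of how they combine in a training step, assuming `model` and `batch` from a loop like the one above and a CUDA device:

```python
import torch

model.gradient_checkpointing_enable()    # recompute activations on the backward pass

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(**batch).loss           # forward pass runs in bf16
loss.backward()                          # gradients flow through the recomputation
```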
**Label Smoothing:** Normally the model is told "this token is 100% correct." With smoothing, we say "this token is 90% correct; the other tokens share the remaining 10%." This prevents the model from becoming overconfident.
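This is exactly what `label_smoothing: 0.1` turns on in the loss. A quick demonstration with PyTorch's built-in support:

```python
import torch

logits = torch.randn(1, 5)           # fake scores over a 5-token vocabulary
target = torch.tensor([2])           # the "correct" token id

hard = torch.nn.CrossEntropyLoss()
soft = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

# The smoothed loss treats the target as roughly 90% correct and spreads the
# remaining mass over the other tokens, so confident-but-wrong predictions
# are penalized less sharply.
print(hard(logits, target).item(), soft(logits, target).item())
```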
---

## Dataset & Train/Test/Val Split

### Data Sources (8 Datasets, ~8,615 labeled samples)

| Dataset | Type | Approx. Samples | Notes |
|---------|------|-----------------|-------|
| CORD-v2 | Structured | ~800 | Clean, high-quality receipts |
| WildReceipts | List annotations | ~2,000 | Noisy real-world scans |
| Scanned Receipts | Image + OCR | ~1,000 | Voxel51 collection |
| Invoices & Receipts v1/v2 | Mixed | ~2,500 | mychen76 datasets |
| Receipt CORD OCR v2 | OCR pairs | ~1,000 | Double-escaped JSON (we fixed parsing) |
| DS Receipts v2 Train | Synthetic | ~1,000 | Also had double-escaped strings |
### Split Ratios

```
Total: 8,615 samples
├── Train: 8,184 (95%)
├── Val:     215 (2.5%)   ← Used to pick the best checkpoint
└── Test:    215 (2.5%)   ← Holdout set, never seen during training
```

We used a **single unified dataset loader** (`UnifiedReceiptDataset`) so all 8 datasets are mixed and shuffled together. This prevents the model from overfitting to any one receipt style.
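A minimal sketch of a seeded 95/2.5/2.5 split (illustrative; the real logic sits inside `UnifiedReceiptDataset`, and the leftover sample from rounding lands in test):

```python
import random

def split_dataset(samples, seed=42, train=0.95, val=0.025):
    """Shuffle once with a fixed seed, then slice into train/val/test."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)          # deterministic shuffle
    n_train = int(len(samples) * train)
    n_val = int(len(samples) * val)
    return (
        samples[:n_train],                        # 95% train
        samples[n_train:n_train + n_val],         # 2.5% val
        samples[n_train + n_val:],                # 2.5% test (the remainder)
    )

train_set, val_set, test_set = split_dataset(range(8615))
print(len(train_set), len(val_set), len(test_set))  # 8184 215 216
```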
### Why these splits?

- **95% train:** With <10k samples, we need as much training data as possible.
- **2.5% val:** Just enough to detect overfitting without wasting data.
- **2.5% test:** Final unbiased evaluation. In practice, we also evaluated visually on unseen real receipts.

---
## Training Performance & Learning Curves

### Loss Curve

![Training and validation loss curves](assets/training_curves.png)

The model converged around **Epoch 9**. Training was stopped early because:
- Validation loss plateaued
- No improvement for 3 consecutive epochs
- Further training risked overfitting
### Key Metrics

| Metric | Value |
|--------|-------|
| Total training samples | 8,615 |
| Effective batch size | 64 |
| Peak learning rate | 8.0e-5 |
| Training precision | bf16 |
| GPU | NVIDIA L4 (24 GB VRAM) |
| Training duration | ~10 hours actual (+ ~12 hours trial/error) |
| Early stopping epoch | 9 / 20 |
### Sample Visual Results

Below are real model outputs on the validation set (original image vs. predicted JSON).

![Sample prediction 1](assets/sample_result_1.png)
*Example 1: Correctly extracted merchant, date, and total.*

![Sample prediction 2](assets/sample_result_2.png)
*Example 2: Handled a partially blurred receipt with a minor date typo.*

![Sample prediction 3](assets/sample_result_3.png)
*Example 3: Multi-line address and tax amount correctly parsed.*

---
## Confusion Matrix & Field-Level Evaluation

Since this is a **generative text model** (not a classifier), a traditional confusion matrix doesn't apply. Instead, we evaluate each extracted field with a **field-level confusion matrix** based on string similarity.
### Evaluation Categories

| Category | Criteria | Example |
|----------|----------|---------|
| ✅ **Correct** | 100% character match | `$13.63` == `$13.63` |
| ⚠️ **Minor Typo** | < 20% normalized Levenshtein distance | `Starbuks` vs `Starbucks` |
| ❌ **Incorrect** | > 20% distance, or missing | `null` vs `Walmart` |
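A sketch of how each field can be bucketed, using a plain-Python Levenshtein distance (`scripts/evaluate_model.py` contains the actual evaluation logic; this is a simplified stand-in):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def classify_field(predicted, expected):
    if predicted == expected:
        return "correct"
    if predicted is None or expected is None:
        return "incorrect"
    distance = levenshtein(predicted, expected) / max(len(expected), 1)
    return "minor_typo" if distance < 0.20 else "incorrect"

print(classify_field("Starbuks", "Starbucks"))  # minor_typo (1 edit / 9 chars ≈ 11%)
```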
### Field-Level Confusion Matrix (Test Set: 597 Samples)

| Field | Correct | Minor Typo | Incorrect | Notes |
|-------|---------|------------|-----------|-------|
| `merchant` | **70.9%** (423/597) | 8.5% (51) | 20.6% (123) | Store names vary wildly in format |
| `date` | **86.9%** (519/597) | 1.0% (6) | 12.1% (72) | Highly consistent format |
| `subtotal` | **71.7%** (428/597) | 2.3% (14) | 26.0% (155) | Often missing on simple receipts |
| `tax` | **86.4%** (516/597) | 0.0% (0) | 13.6% (81) | Usually present when subtotal is |
| `total` | **47.4%** (283/597) | 7.9% (47) | 44.7% (267) | **Hardest field**: the model confuses it with subtotal |
| `address` | **100.0%** (597/597) | 0.0% (0) | 0.0% (0) | **Test set has 0 address labels**; the model correctly abstains |
![Field-level confusion matrix](assets/field_confusion_matrix.png)

### Overall Performance

```
Exact Match (all fields correct): 32.8% (196/597)
Usable Match (≤1 minor typo):     61.1% (365/597)
Any Incorrect Field:              38.9% (232/597)
```
> **Key insight 1:** The `total` field is the model's biggest weakness at 47.4% correct. This is because `total` and `subtotal` are visually similar numbers on receipts, and the model sometimes swaps them. Improving this would require stronger positional cues or a post-processing rule (always pick the larger number).
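That post-processing rule is cheap to bolt on after extraction. A sketch using this model's output schema (the helper name is ours, not part of the model):

```python
def fix_total_subtotal(result: dict) -> dict:
    """If total < subtotal, the two were probably swapped: swap them back."""
    def to_float(value):
        try:
            return float(value.replace("$", "").replace(",", ""))
        except (AttributeError, ValueError):
            return None

    total, subtotal = to_float(result.get("total")), to_float(result.get("subtotal"))
    if total is not None and subtotal is not None and total < subtotal:
        result["total"], result["subtotal"] = result["subtotal"], result["total"]
    return result

print(fix_total_subtotal({"subtotal": "$13.63", "total": "$12.50"}))
# {'subtotal': '$12.50', 'total': '$13.63'}
```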
> **Key insight 2:** `address` at 100% is **not meaningful**: address labels are completely absent from the 5 test datasets (CORD, WildReceipts, etc. don't include addresses). The model correctly learned not to hallucinate the field.

> **Why is Exact Match only 32.8%?** Receipt OCR is genuinely hard. The test datasets (CORD, WildReceipts, etc.) use different JSON schemas and raw output formats. The model learns normalized fields, but the raw ground truth contains keys like `total_price`, `cashprice`, and `changeprice` that don't align perfectly. The model is still useful: **61.1%** of receipts are "usable" with at most one small typo.
### Generating the Confusion Matrix Yourself

Run this on your Workbench to reproduce the evaluation:

```bash
python scripts/evaluate_model.py \
    --model_path outputs/receipt_donut_gcp_enterprise/best_model \
    --dataset_root receipt_datasets \
    --output_dir evaluation_results
```

This outputs:
- `confusion_matrix.png`: visual matrix per field
- `field_accuracy.json`: numerical breakdown
- `error_analysis.html`: side-by-side failures
---

## How to Use (Python)

### Installation

```bash
pip install transformers Pillow torch
```
### Single Image Inference

```python
import json

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

MODEL = "Awarebeyond/receipt-donut"
processor = DonutProcessor.from_pretrained(MODEL)
model = VisionEncoderDecoderModel.from_pretrained(MODEL)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

def extract(image_path):
    img = Image.open(image_path).convert("RGB")
    pixel_values = processor(img, return_tensors="pt").pixel_values.to(device)
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]]).to(device)

    with torch.no_grad():
        outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=512,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True,  # needed so outputs.sequences exists
        )

    # Strip the special tokens (EOS, PAD, and the task-start token) to leave raw JSON
    seq = processor.tokenizer.batch_decode(outputs.sequences)[0]
    seq = seq.replace(processor.tokenizer.eos_token, "").replace(
        processor.tokenizer.pad_token, ""
    )
    seq = seq.replace(
        processor.tokenizer.decode([model.config.decoder_start_token_id]), ""
    ).strip()

    return json.loads(seq)

result = extract("my_receipt.jpg")
print(json.dumps(result, indent=2))
```
### Batch Inference

```python
from glob import glob

# Reuses extract() and the json import from the snippet above
receipts = glob("receipts/*.jpg")
results = [extract(r) for r in receipts]

# Save all extractions to a single JSON file
with open("batch_results.json", "w") as f:
    json.dump(results, f, indent=2)
```
---

## Model Architecture

```
Input Image (1536×1152)
        ↓
Swin Transformer Encoder
        ↓
Encoder Hidden States
        ↓
BART Decoder (cross-attention)
        ↓
JSON Text Tokens
```
- **Encoder:** Swin Transformer (hierarchical vision backbone)
- **Decoder:** BART (text generation with cross-attention)
- **Vocabulary:** ~5,000 tokens (includes special receipt tokens)
- **Parameters:** ~300M total
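If you want to sanity-check the parameter count, you can do it directly on the loaded model (reusing `model` from the inference snippet above):

```python
# Count parameters of the loaded VisionEncoderDecoderModel
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # expect roughly 300M

# Encoder/decoder breakdown
print(f"encoder: {sum(p.numel() for p in model.encoder.parameters()) / 1e6:.0f}M")
print(f"decoder: {sum(p.numel() for p in model.decoder.parameters()) / 1e6:.0f}M")
```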
### Why Donut?

| Feature | OCR + NER Pipeline | Donut (End-to-End) |
|---------|-------------------|-------------------|
| Error compounding | OCR error → NER fails | Single model, single optimization |
| Layout handling | Requires separate layout model | Built into the vision encoder |
| Speed | Multi-stage, slower | One forward pass |
| Maintenance | 3+ models to update | One model, one checkpoint |
---

## Limitations

1. **Resolution:** Works best on receipts with text height ≥ 10 pixels. Very low-res images may fail.
2. **Languages:** Primarily trained on English receipts. Other languages may produce lower accuracy.
3. **Handwriting:** Printed text works best. Cursive handwriting is not well supported.
4. **Field coverage:** Only extracts `merchant`, `date`, `subtotal`, `tax`, `total`, and `address`. Line items are not extracted.
5. **Currency normalization:** Outputs raw strings (`$13.63`); post-processing may be needed to convert them to floats (see the sketch below).
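A minimal sketch of that normalization step (assumes US-style `$1,234.56` formatting; adapt the regex for other locales):

```python
import re

def money_to_float(raw: str) -> float | None:
    """Convert strings like '$13.63' or '1,234.56' to a float."""
    if not raw:
        return None
    match = re.search(r"-?\d[\d,]*\.?\d*", raw.replace("$", ""))
    return float(match.group().replace(",", "")) if match else None

print(money_to_float("$13.63"))    # 13.63
print(money_to_float("1,234.56"))  # 1234.56
```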
---

## Citation

If you use this model in research, please cite:

```bibtex
@misc{receipt_donut_2026,
  title={Receipt Donut: Fine-tuned Document Understanding for Receipt Extraction},
  author={Awarebeyond},
  year={2026},
  howpublished={\url{https://huggingface.co/Awarebeyond/receipt-donut}}
}
```
---

*Built with ❤️ by a NAVTTC 🇵🇰 student using Google Cloud Workbench (L4 GPU) and the Hugging Face ecosystem.*
|