---
language:
- en
tags:
- image-to-text
- document-ai
- donut
- receipt-extraction
- ocr-free
datasets:
- Voxel51/scanned_receipts
- naver-clova-ix/cord-v2
- docjay131/receipts-ocr-dataset
- mychen76/invoices-and-receipts_ocr_v1
- mychen76/invoices-and-receipts_ocr_v2
- mychen76/wildreceipts_ocr_v1
- mychen76/receipt_cord_ocr_v2
- mychen76/ds_receipts_v2_train
pipeline_tag: image-to-text
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/receipt.jpg
  example_title: Sample Receipt
---

# 🧾🍩 Receipt Donut: Complete Document for Understanding

> **Welcome!** This page explains every technical decision so you can understand (and replicate) the full training pipeline.

This model extracts structured JSON data directly from receipt images **without** needing a separate OCR engine. It is a fine-tuned version of `naver-clova-ix/donut-base-finetuned-cord-v2` trained on 8,615 real-world receipt images.

**Try it live:** [🚀 Hugging Face Space](https://huggingface.co/spaces/Awarebeyond/receipt-donut-space)

---

## 📋 Table of Contents

1. [What is Ground Truth?](#what-is-ground-truth)
2. [Training Configuration (YAML Deep Dive)](#training-configuration-yaml-deep-dive)
3. [Dataset & Train/Test/Val Split](#dataset--traintestval-split)
4. [Training Performance & Learning Curves](#training-performance--learning-curves)
5. [Confusion Matrix & Field-Level Evaluation](#confusion-matrix--field-level-evaluation)
6. [How to Use (Python)](#how-to-use-python)
7. [Model Architecture](#model-architecture)
8. [Limitations](#limitations)

---

## What is Ground Truth?

In machine learning, **ground truth** is the "correct answer" we teach the model to predict. For receipts, instead of raw OCR text, we use **structured JSON** so the model learns to output clean, labeled data.
### Example Ground Truth

```json
{
  "merchant": "Starbucks Coffee",
  "date": "2026-03-15",
  "subtotal": "$12.50",
  "tax": "$1.13",
  "total": "$13.63",
  "address": "123 Main St, New York, NY"
}
```

### Why JSON Ground Truth matters

| Approach | Problem | Our Solution |
|----------|---------|--------------|
| Raw OCR text | No structure: you get "Starbucks $13.63" | We label **keys** and **values** |
| Fixed template | Fails on receipts with different fields | JSON is flexible and self-describing |
| Named Entity Recognition | Requires a post-processing pipeline | Donut outputs JSON **directly** |

### How we normalized different datasets

Receipt datasets use wildly different formats. We wrote `_normalize_gt()` to unify them:

```python
# WildReceipts uses a list of annotations:
annotations = [
    {"label": "store_name", "transcription": "Walmart"},
    {"label": "total_value", "transcription": "$45.20"},
]

# CORD uses nested JSON:
gt_parse = {
    "menu": [...],
    "total": {"price": "$45.20"},
}

# Our code converts ALL of these into a single normalized format:
{
    "merchant": "Walmart",
    "total": "$45.20",
}
```

We **skip samples with empty ground truth** to prevent the model from learning to output `{}`.

---

## Training Configuration (YAML Deep Dive)

Here is the exact `gcp_l4_enterprise.yaml` we used. Each parameter is explained so you understand **why** we chose it.
```yaml
model:
  model_name: "naver-clova-ix/donut-base-finetuned-cord-v2"
  max_length: 768
  image_size: [1536, 1152]  # Taller than wide, matching typical receipts

training:
  output_dir: "./outputs/receipt_donut_gcp_enterprise"
  num_train_epochs: 20              # Upper limit; early stopping kicked in at epoch 9
  batch_size: 4                     # Fits in L4 24 GB VRAM
  gradient_accumulation_steps: 16   # Effective batch = 4 × 16 = 64
  learning_rate: 8.0e-5             # Higher LR for the larger effective batch
  weight_decay: 0.01                # Prevents overfitting
  warmup_ratio: 0.05                # 5% of steps warm up LR from 0
  bf16: true                        # L4 GPU has native BFloat16 support
  gradient_checkpointing: true      # Trade compute for memory; enables larger batches
  label_smoothing: 0.1              # Softens targets; prevents overconfident predictions
  freeze_encoder_epochs: 1          # Train only the decoder first (faster convergence)
  cosine_restart_epochs: 5          # LR schedule restarts every 5 epochs
  grayscale: true                   # Reduces domain gap between color/gray receipts
  num_workers: 8                    # Parallel data loading (the L4 VM has 8 CPU cores)

data:
  dataset_root: "./receipt_datasets"
  train_split: 0.95    # 95% training
  val_split: 0.025     # 2.5% validation
  test_split: 0.025    # 2.5% holdout test
  seed: 42
  include_datasets:
    - "Voxel51__scanned_receipts"
    - "naver-clova-ix__cord-v2"
    - "docjay131__receipts-ocr-dataset"
    - "mychen76__invoices-and-receipts_ocr_v1"
    - "mychen76__invoices-and-receipts_ocr_v2"
    - "mychen76__wildreceipts_ocr_v1"
    - "mychen76__receipt_cord_ocr_v2"
    - "mychen76__ds_receipts_v2_train"

augmentation:
  enabled: true
  rotation_limit: 20       # Simulates tilted camera photos
  brightness_limit: 0.3    # Different lighting conditions
  contrast_limit: 0.3
  blur_prob: 0.5           # Camera shake / focus blur
  noise_prob: 0.5          # ISO noise in dark restaurants
  perspective_prob: 0.6    # Receipts photographed at an angle
  quality_lower: 40        # JPEG compression artifacts
  quality_upper: 100
```

### Key Concepts Explained

**Gradient accumulation:** We process 4 images at a time, but accumulate gradients over 16 steps before updating weights.
This gives us the stability of batch size 64 without needing the GPU memory for a 64-image batch.

**BFloat16 (bf16):** A half-precision number format. The L4 GPU has native bf16 hardware, so training is ~2× faster and uses roughly half the memory of fp32, with almost no accuracy loss.

**Gradient checkpointing:** Instead of storing all intermediate activations in memory, we recompute them during the backward pass. This lets us fit a bigger model/batch at the cost of ~20% slower training.

**Label smoothing:** Normally the model is told "this token is 100% correct." With smoothing, we say "this token is 90% correct, the others share the remaining 10%." This prevents the model from becoming overconfident.

---

## Dataset & Train/Test/Val Split

### Data Sources (8 Datasets, ~8,615 labeled samples)

| Dataset | Type | Approx. Samples | Notes |
|---------|------|-----------------|-------|
| CORD-v2 | Structured | ~800 | Clean, high-quality receipts |
| WildReceipts | List annotations | ~2,000 | Noisy real-world scans |
| Scanned Receipts | Image + OCR | ~1,000 | Voxel51 collection |
| Invoices & Receipts v1/v2 | Mixed | ~2,500 | mychen76 datasets |
| Receipt CORD OCR v2 | OCR pairs | ~1,000 | Double-escaped JSON (we fixed parsing) |
| DS Receipts v2 Train | Synthetic | ~1,000 | Also had double-escaped strings |

### Split Ratios

```
Total: 8,615 samples
├── Train: 8,184 (95%)
├── Val:     215 (2.5%)  → Used to pick the best checkpoint
└── Test:    215 (2.5%)  → Holdout set, never seen during training
```

We used a **single unified dataset loader** (`UnifiedReceiptDataset`) so all 8 datasets are mixed and shuffled together. This prevents the model from overfitting to any one receipt style.

### Why these splits?

- **95% train:** With <10k samples, we need as much training data as possible.
- **2.5% val:** Just enough to detect overfitting without wasting data.
- **2.5% test:** Final unbiased evaluation. In practice, we also evaluated visually on unseen real receipts.
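The seeded, shuffled split described above can be sketched as follows. This is an illustrative reimplementation, not the actual `UnifiedReceiptDataset` code, and `split_samples` is a hypothetical helper:

```python
import random


def split_samples(samples, train=0.95, val=0.025, seed=42):
    """Shuffle deterministically, then slice into train/val/test."""
    rng = random.Random(seed)       # seed 42, as in the config
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],                 # 95% train
        shuffled[n_train:n_train + n_val],  # 2.5% val
        shuffled[n_train + n_val:],         # remainder (incl. rounding) = test
    )


train_set, val_set, test_set = split_samples(list(range(8615)))
```

Fixing the seed is what makes the val/test sets stable across reruns, so checkpoints picked on val are comparable between training jobs.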
---

## Training Performance & Learning Curves

### Loss Curve

![Learning Curve](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/learning_curve.png)

The model converged around **epoch 9**. Training was stopped early because:

- Validation loss plateaued
- No improvement for 3 consecutive epochs
- Further training risked overfitting

### Key Metrics

| Metric | Value |
|--------|-------|
| Total training samples | 8,615 |
| Effective batch size | 64 |
| Peak learning rate | 8.0e-5 |
| Training precision | bf16 |
| GPU | NVIDIA L4 (24 GB VRAM) |
| Training duration | ~10 hours actual (+ ~12 hours trial and error) |
| Early stopping epoch | 9 / 20 |

### Sample Visual Results

Below are real model outputs on the validation set (original image vs. predicted JSON).

![Sample 1](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/sample_result_0.png)
*Example 1: Correctly extracted merchant, date, and total.*

![Sample 2](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/sample_result_1.png)
*Example 2: Handled a partially blurred receipt with a minor date typo.*

![Sample 3](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/sample_result_2.png)
*Example 3: Multi-line address and tax amount correctly parsed.*

---

## Confusion Matrix & Field-Level Evaluation

Since this is a **generative text model** (not a classifier), a traditional confusion matrix doesn't apply. Instead, we evaluate each extracted field with a **field-level confusion matrix** based on string similarity.
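The similarity-based field scoring can be sketched like this. It is an illustrative reimplementation (with a pure-Python Levenshtein distance), not the actual evaluation script, but it applies the same 20% minor-typo threshold used in the categories below:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via rolling dynamic-programming rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def score_field(pred, truth):
    """Bucket one extracted field: correct / minor_typo / incorrect."""
    if pred == truth:
        return "correct"
    if pred is None or truth is None:
        return "incorrect"  # missing field counts as incorrect
    dist = levenshtein(pred, truth) / max(len(pred), len(truth), 1)
    return "minor_typo" if dist < 0.2 else "incorrect"


print(score_field("Starbuks", "Starbucks"))  # minor_typo (1 edit over 9 chars)
```

Normalizing by the longer string's length makes the 20% cutoff scale-independent, so a one-character slip in a long merchant name is forgiven while the same slip in `$13.63` is not.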
### Evaluation Categories

| Category | Criteria | Example |
|----------|----------|---------|
| ✅ **Correct** | 100% character match | `$13.63` == `$13.63` |
| ⚠️ **Minor Typo** | < 20% normalized Levenshtein distance | `Starbuks` vs `Starbucks` |
| ❌ **Incorrect** | ≥ 20% distance, or field missing | `null` vs `Walmart` |

### Field-Level Confusion Matrix (Test Set: 597 Samples)

| Field | Correct | Minor Typo | Incorrect | Notes |
|-------|---------|------------|-----------|-------|
| `merchant` | **70.9%** (423/597) | 8.5% (51) | 20.6% (123) | Store names vary wildly in format |
| `date` | **86.9%** (519/597) | 1.0% (6) | 12.1% (72) | Highly consistent format |
| `subtotal` | **71.7%** (428/597) | 2.3% (14) | 26.0% (155) | Often missing on simple receipts |
| `tax` | **86.4%** (516/597) | 0.0% (0) | 13.6% (81) | Usually present when subtotal is |
| `total` | **47.4%** (283/597) | 7.9% (47) | 44.7% (267) | **Hardest field**: the model confuses it with subtotal |
| `address` | **100.0%** (597/597) | 0.0% (0) | 0.0% (0) | **Test set has 0 address labels**: the model correctly abstains |

![Field Confusion Matrix](https://huggingface.co/Awarebeyond/receipt-donut/resolve/main/hub_assets/field_confusion_matrix.png)

### Overall Performance

```
Exact Match (all fields correct): 32.8% (196/597)
Usable Match (≤1 minor typo):     61.1% (365/597)
Any Incorrect Field:              38.9% (232/597)
```

> **Key insight 1:** The `total` field is the model's biggest weakness at 47.4% correct. `total` and `subtotal` are visually similar numbers on receipts, and the model sometimes swaps them. Improving this would require stronger positional cues or a post-processing rule (e.g., always pick the larger number).

> **Key insight 2:** `address` at 100% is **not meaningful**: address labels are completely absent from the 5 test datasets (CORD, WildReceipts, etc. don't include address). The model correctly learned not to hallucinate the field.

> **Why is Exact Match only 32.8%?** Receipt OCR is genuinely hard.
> The test datasets (CORD, WildReceipts, etc.) use different JSON schemas and raw output formats. The model learns normalized fields, but the raw ground truth contains keys like `total_price`, `cashprice`, and `changeprice` that don't align perfectly. The model is still useful: **61.1%** of receipts are "usable" with at most one small typo.

### Generating the Confusion Matrix Yourself

Run this on your Workbench to reproduce the evaluation:

```bash
python scripts/evaluate_model.py \
  --model_path outputs/receipt_donut_gcp_enterprise/best_model \
  --dataset_root receipt_datasets \
  --output_dir evaluation_results
```

This outputs:

- `confusion_matrix.png`: visual matrix per field
- `field_accuracy.json`: numerical breakdown
- `error_analysis.html`: side-by-side failures

---

## How to Use (Python)

### Installation

```bash
pip install transformers Pillow torch
```

### Single Image Inference

```python
import json

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

MODEL = "Awarebeyond/receipt-donut"
processor = DonutProcessor.from_pretrained(MODEL)
model = VisionEncoderDecoderModel.from_pretrained(MODEL)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()


def extract(image_path):
    img = Image.open(image_path).convert("RGB")
    pixel_values = processor(img, return_tensors="pt").pixel_values.to(device)
    decoder_input_ids = torch.tensor(
        [[model.config.decoder_start_token_id]]
    ).to(device)

    with torch.no_grad():
        outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=512,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True,  # so `outputs.sequences` exists below
        )

    seq = processor.tokenizer.batch_decode(outputs.sequences)[0]
    seq = seq.replace(processor.tokenizer.eos_token, "").replace(
        processor.tokenizer.pad_token, ""
    )
    seq = seq.replace(
        processor.tokenizer.decode([model.config.decoder_start_token_id]), ""
    ).strip()
    return json.loads(seq)


result = extract("my_receipt.jpg")
print(json.dumps(result, indent=2))
```

### Batch Inference

```python
import json
from glob import glob

receipts = glob("receipts/*.jpg")
results = [extract(r) for r in receipts]

# Save to JSON
with open("batch_results.json", "w") as f:
    json.dump(results, f, indent=2)
```

---

## Model Architecture

```
Input Image (1536×1152)
        ↓
Swin Transformer Encoder
        ↓
Encoder Hidden States
        ↓
BART Decoder (cross-attention)
        ↓
JSON Text Tokens
```

- **Encoder:** Swin Transformer (hierarchical vision backbone)
- **Decoder:** BART (text generation with cross-attention)
- **Vocabulary:** ~5,000 tokens (includes special receipt tokens)
- **Parameters:** ~300M total

### Why Donut?

| Feature | OCR + NER Pipeline | Donut (End-to-End) |
|---------|-------------------|-------------------|
| Errors compound | OCR error → NER fails | Single model, single optimization |
| Layout handling | Requires a separate layout model | Built into the vision encoder |
| Speed | Multi-stage, slower | One forward pass |
| Maintenance | 3+ models to update | One model, one checkpoint |

---

## Limitations

1. **Resolution:** Works best on receipts with text height ≥ 10 pixels. Very low-res images may fail.
2. **Languages:** Primarily trained on English receipts. Other languages may produce lower accuracy.
3. **Handwriting:** Printed text works best. Cursive handwriting is not well supported.
4. **Field coverage:** Only extracts `merchant`, `date`, `subtotal`, `tax`, `total`, `address`. Line items are not extracted.
5. **Currency normalization:** Outputs raw strings (`$13.63`); post-processing may be needed to convert to floats.
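For limitation 5, a minimal post-processing sketch for converting raw amount strings to floats might look like this. `parse_amount` is a hypothetical helper, not part of the model repo:

```python
import re


def parse_amount(raw):
    """Return the numeric value of a currency string, or None if absent."""
    if not raw:
        return None
    # Find the first number, allowing thousands separators and a sign.
    m = re.search(r"-?\d[\d,]*(?:\.\d+)?", raw.replace(" ", ""))
    if not m:
        return None
    return float(m.group(0).replace(",", ""))


print(parse_amount("$13.63"))    # 13.63
print(parse_amount("1,234.50"))  # 1234.5
```

Returning `None` instead of raising keeps batch pipelines running when a field is missing or the model emitted something non-numeric.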
---

## Citation

If you use this model in research, please cite:

```bibtex
@misc{receipt_donut_2026,
  title={Receipt Donut: Fine-tuned Document Understanding for Receipt Extraction},
  author={Awarebeyond},
  year={2026},
  howpublished={\url{https://huggingface.co/Awarebeyond/receipt-donut}}
}
```

---

*Built with ❤️ by a NAVTTC 🇵🇰 student using Google Cloud Workbench (L4 GPU) and the Hugging Face ecosystem.*