DonutInvoiceCzech (V3 – Full Pipeline with Real Data Fine-Tuning, skipping V1)

This model is a fine-tuned version of naver-clova-ix/donut-base-finetuned-cord-v2 for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:

  • Loss: 0.2358
  • Mean Accuracy: 0.9375
  • F1: 0.8910

Model description

DonutInvoiceCzech (V3) is the final OCR-free generative model in the experimental pipeline.

The model:

  • processes raw document images
  • directly generates structured outputs
  • does not require OCR

It extracts key invoice fields:

  • supplier
  • customer
  • invoice number
  • bank details
  • totals
  • dates

This version achieves strong performance by combining multimodal generative modeling with realistic training data.
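Since the model is OCR-free, inference is a single image-to-sequence generation step. The sketch below uses the standard `transformers` Donut API; the repo id `TomasFAV/DonutInvoiceCzechV0123` and the `<s_cord-v2>` task prompt are assumptions (the prompt is inherited from the `naver-clova-ix/donut-base-finetuned-cord-v2` base and may differ if a custom task token was trained).

```python
import re

TASK_PROMPT = "<s_cord-v2>"  # assumed task start token from the CORD-v2 base

def clean_donut_sequence(sequence: str) -> str:
    """Remove EOS/pad tokens and the leading task token from Donut's raw output."""
    sequence = sequence.replace("</s>", "").replace("<pad>", "")
    return re.sub(re.escape(TASK_PROMPT), "", sequence, count=1).strip()

def extract_invoice(image_path: str) -> dict:
    """Run OCR-free extraction on one invoice image and return a field dict."""
    # Heavy imports kept local so the helper above stays importable on its own.
    import torch
    from PIL import Image
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    model_id = "TomasFAV/DonutInvoiceCzechV0123"  # assumed repo id
    processor = DonutProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    model.eval()

    pixel_values = processor(
        Image.open(image_path).convert("RGB"), return_tensors="pt"
    ).pixel_values
    decoder_input_ids = processor.tokenizer(
        TASK_PROMPT, add_special_tokens=False, return_tensors="pt"
    ).input_ids

    with torch.no_grad():
        outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=768,
        )

    raw = processor.batch_decode(outputs)[0]
    return processor.token2json(clean_donut_sequence(raw))
```

`processor.token2json` converts the generated `<s_field>…</s_field>` sequence into nested JSON, so no separate OCR or layout step is needed.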


Training data

The dataset used in this stage combines:

  1. Synthetic template-based invoices (V0)
  2. Hybrid invoices with real layouts and synthetic content (V2)
  3. Real annotated invoices

⚠️ Important:
The randomized layout stage (V1) was intentionally omitted, as it degraded performance in earlier experiments.

Real data fine-tuning

The final stage introduces:

  • real invoice images
  • natural visual noise and distortions
  • real formatting inconsistencies
  • realistic language variability

This enables:

  • improved generation consistency
  • better alignment with real-world distributions
  • reduced hallucination and formatting errors

Role in the pipeline

This model corresponds to:

V3 – Full pipeline (V0 + V2 + real data, skipping V1)

It represents:

  • the final Donut model
  • the best-performing OCR-free approach
  • a strong end-to-end document understanding solution

Intended uses

  • OCR-free invoice information extraction
  • End-to-end document understanding systems
  • Real-world document AI pipelines
  • Research in generative document models

Limitations

  • Output still requires structured post-processing
  • Sensitive to decoding strategy
  • Computationally expensive
  • Performance tied to visual similarity of training data
  • Less interpretable than token classification models
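The first limitation (structured post-processing) can be illustrated with a minimal parser that turns Donut's flat XML-like tag sequence into a Python dict. This is a sketch for non-nested output only; the field names shown are hypothetical, not the model's actual tag set.

```python
import re

def parse_donut_tags(sequence: str) -> dict:
    """Parse a flat <s_field>value</s_field> sequence into {field: value}.

    Sketch only: does not handle nested groups or repeated line items,
    which a production post-processor would need to support.
    """
    fields = {}
    for match in re.finditer(r"<s_([^>/]+)>(.*?)</s_\1>", sequence, re.DOTALL):
        fields[match.group(1)] = match.group(2).strip()
    return fields
```

A validation layer (e.g. date and amount regexes for Czech formats) would typically follow this step to catch hallucinated or malformed values.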

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 9e-05
  • train_batch_size: 4
  • eval_batch_size: 1
  • seed: 42
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 10
  • mixed_precision_training: Native AMP
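The hyperparameters above map onto a `Seq2SeqTrainingArguments` configuration roughly as follows. This is a sketch: the output directory is a placeholder, and "Native AMP" is expressed here as `fp16=True`.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="donut-invoice-czech-v3",  # placeholder path
    learning_rate=9e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=1,
    seed=42,
    optim="adamw_torch_fused",  # AdamW with fused torch kernels
    lr_scheduler_type="linear",
    num_train_epochs=10,
    fp16=True,  # Native AMP mixed precision
)
```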

Training results

| Training Loss | Epoch | Step | Validation Loss | Mean Accuracy | F1 |
|---------------|-------|------|-----------------|---------------|--------|
| 0.2117 | 1.0 | 46 | 0.1976 | 0.9488 | 0.8749 |
| 0.1087 | 2.0 | 92 | 0.1964 | 0.9254 | 0.8671 |
| 0.0606 | 3.0 | 138 | 0.2166 | 0.9214 | 0.8608 |
| 0.0319 | 4.0 | 184 | 0.2053 | 0.9365 | 0.8801 |
| 0.0142 | 5.0 | 230 | 0.2109 | 0.9439 | 0.8868 |
| 0.0049 | 6.0 | 276 | 0.2254 | 0.9284 | 0.8814 |
| 0.0065 | 7.0 | 322 | 0.2313 | 0.9369 | 0.8886 |
| 0.0118 | 8.0 | 368 | 0.2358 | 0.9375 | 0.8910 |
| 0.0035 | 9.0 | 414 | 0.2398 | 0.9371 | 0.8910 |
| 0.0063 | 10.0 | 460 | 0.2414 | 0.9336 | 0.8889 |

Framework versions

  • Transformers 5.0.0
  • PyTorch 2.10.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.2