BERTInvoiceCzechR (V3 – Full Pipeline with Real Data Fine-Tuning)

This model is a fine-tuned version of google-bert/bert-base-multilingual-cased for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:

  • Loss: 0.0630
  • Precision: 0.8620
  • Recall: 0.9072
  • F1: 0.8840
  • Accuracy: 0.9830

Model description

BERTInvoiceCzechR (V3) is the final model in a multi-stage training pipeline designed for invoice understanding.

The model performs token-level classification to extract structured invoice fields:

  • supplier
  • customer
  • invoice number
  • bank details
  • totals
  • dates
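Since the model works at the token level, its raw output is a per-token tag sequence that must be grouped into fields. Below is a minimal sketch of that post-processing step, assuming a BIO tagging scheme; the label names (SUPPLIER, INVOICE_NUMBER, ...) and the toy tokens are illustrative assumptions, not the model's actual label set.

```python
# Illustrative sketch: group token-level BIO predictions into invoice fields.
# The label names (SUPPLIER, INVOICE_NUMBER, ...) are assumptions for this
# example; the model's actual label inventory may differ.

def group_entities(tokens, labels):
    """Merge consecutive B-/I- tagged tokens into (field, text) spans."""
    fields = []
    current_label, current_tokens = None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current_label is not None:
                fields.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_label == label[2:]:
            current_tokens.append(token)
        else:  # "O" tag or an I- tag that does not continue the open span
            if current_label is not None:
                fields.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label is not None:
        fields.append((current_label, " ".join(current_tokens)))
    return fields

tokens = ["Faktura", "c.", "2024001", "ACME", "s.r.o."]
labels = ["O", "O", "B-INVOICE_NUMBER", "B-SUPPLIER", "I-SUPPLIER"]
print(group_entities(tokens, labels))
# [('INVOICE_NUMBER', '2024001'), ('SUPPLIER', 'ACME s.r.o.')]
```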

This version combines synthetic data, layout augmentation, hybrid data, and real annotated invoices, resulting in the highest performance across all variants.


Training data

The dataset used in this stage is a combination of:

  1. Synthetic template-based invoices (V0)
  2. Synthetic invoices with randomized layouts (V1)
  3. Hybrid invoices with real layouts and synthetic content (V2)
  4. Real annotated invoices

Real data fine-tuning

The final stage introduces:

  • real invoice documents
  • manually or semi-automatically annotated fields
  • natural linguistic variability
  • real formatting inconsistencies

This allows the model to:

  • adapt to real-world distributions
  • learn domain-specific patterns
  • improve robustness and generalization

Role in the pipeline

This model corresponds to:

V3 – Full pipeline (synthetic + hybrid + real data fine-tuning)

It represents:

  • the final production-ready model
  • the culmination of the proposed data generation strategy
  • the best-performing configuration in the experimental setup

Intended uses

  • Real-world invoice information extraction
  • Document AI systems in production environments
  • OCR post-processing pipelines
  • Research benchmarking against synthetic-only approaches

Limitations

  • Performance depends on OCR quality (the model consumes text, so upstream OCR errors propagate)
  • May still struggle with:
    • highly unusual invoice formats
    • extreme noise or low-resolution scans
  • Requires tokenized text input (not end-to-end from images)
  • Domain-specific (Czech invoices)

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 16
  • eval_batch_size: 2
  • seed: 42
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1 (fraction of total training steps)
  • num_epochs: 10
  • mixed_precision_training: Native AMP
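The hyperparameters above can be expressed as a `transformers` `TrainingArguments` configuration. This is a sketch, not the authors' actual training script; `output_dir` is a placeholder, and the warmup value of 0.1 is interpreted here as a warmup ratio.

```python
from transformers import TrainingArguments

# Sketch mirroring the hyperparameters listed above. output_dir is a
# placeholder, and warmup_ratio=0.1 is an assumption about how the listed
# warmup value of 0.1 was applied.
training_args = TrainingArguments(
    output_dir="bert-invoice-czech-v3",  # placeholder path
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=2,
    seed=42,
    optim="adamw_torch_fused",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=10,
    fp16=True,  # native AMP mixed precision
)
```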

Training results

Training Loss  Epoch  Step  Validation Loss  Precision  Recall  F1      Accuracy
No log         1.0    20    0.0919           0.7824     0.8645  0.8214  0.9730
No log         2.0    40    0.0701           0.8581     0.8874  0.8725  0.9810
No log         3.0    60    0.0684           0.8481     0.8951  0.8710  0.9809
No log         4.0    80    0.0709           0.8311     0.9060  0.8670  0.9802
No log         5.0    100   0.0634           0.8680     0.8913  0.8795  0.9826
No log         6.0    120   0.0666           0.8479     0.9091  0.8774  0.9818
No log         7.0    140   0.0670           0.8454     0.8983  0.8710  0.9812
No log         8.0    160   0.0632           0.8604     0.9045  0.8819  0.9825
No log         9.0    180   0.0644           0.8593     0.9083  0.8831  0.9828
No log         10.0   200   0.0630           0.8620     0.9072  0.8840  0.9830

("No log" indicates the training loss was not logged at these evaluation steps.)
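The precision, recall, and F1 values reported above are entity-level scores in the style of seqeval: a predicted field counts as correct only if both its type and its span match the gold annotation exactly. A minimal sketch of that computation, using made-up toy entity sets:

```python
# Minimal sketch of entity-level precision/recall/F1 (seqeval-style):
# an entity is correct only if its type and span match exactly.
# The gold/pred sets below are made up for illustration.

def entity_f1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # exact (type, start, end) matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("SUPPLIER", 0, 2), ("INVOICE_NUMBER", 5, 6), ("TOTAL", 9, 10)]
pred = [("SUPPLIER", 0, 2), ("INVOICE_NUMBER", 5, 7)]  # wrong span -> not counted
p, r, f = entity_f1(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))
# 0.5 0.333 0.4
```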

Framework versions

  • Transformers 5.0.0
  • PyTorch 2.10.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.2
Model size: 0.2B parameters (F32, Safetensors format)
Model tree

This model (TomasFAV/BERTInvoiceCzechV0123WORSEF1) was fine-tuned from google-bert/bert-base-multilingual-cased.