BERTInvoiceCzechR (V3 – Full Pipeline with Real Data Fine-Tuning)

This model is a fine-tuned version of google-bert/bert-base-multilingual-cased for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:

  • Loss: 0.0630
  • Precision: 0.8620
  • Recall: 0.9072
  • F1: 0.8840
  • Accuracy: 0.9830

Model description

BERTInvoiceCzechR (V3) is the final model in a multi-stage training pipeline designed for invoice understanding.

The model performs token-level classification to extract structured invoice fields:

  • supplier
  • customer
  • invoice number
  • bank details
  • totals
  • dates
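Since the model works at the token level, its raw output is a per-token tag sequence that must be grouped into fields. Below is a minimal sketch of that post-processing step, assuming a BIO tagging scheme; the label names (SUPPLIER, INVOICE_NUMBER, ...) and the toy tokens are illustrative assumptions, not the model's actual label set.

```python
# Illustrative sketch: group token-level BIO predictions into invoice fields.
# The label names (SUPPLIER, INVOICE_NUMBER, ...) are assumptions for this
# example; the model's actual label inventory may differ.

def group_entities(tokens, labels):
    """Merge consecutive B-/I- tagged tokens into (field, text) spans."""
    fields = []
    current_label, current_tokens = None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current_label is not None:
                fields.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_label == label[2:]:
            current_tokens.append(token)
        else:  # "O" tag or an I- tag that does not continue the open span
            if current_label is not None:
                fields.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label is not None:
        fields.append((current_label, " ".join(current_tokens)))
    return fields

tokens = ["Faktura", "c.", "2024001", "ACME", "s.r.o."]
labels = ["O", "O", "B-INVOICE_NUMBER", "B-SUPPLIER", "I-SUPPLIER"]
print(group_entities(tokens, labels))
# [('INVOICE_NUMBER', '2024001'), ('SUPPLIER', 'ACME s.r.o.')]
```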

This version combines synthetic data, layout augmentation, hybrid data, and real annotated invoices, resulting in the highest performance across all variants.


Training data

The dataset used in this stage is a combination of:

  1. Synthetic template-based invoices (V0)
  2. Synthetic invoices with randomized layouts (V1)
  3. Hybrid invoices with real layouts and synthetic content (V2)
  4. Real annotated invoices

Real data fine-tuning

The final stage introduces:

  • real invoice documents
  • manually or semi-automatically annotated fields
  • natural linguistic variability
  • real formatting inconsistencies

This allows the model to:

  • adapt to real-world distributions
  • learn domain-specific patterns
  • improve robustness and generalization

Role in the pipeline

This model corresponds to:

V3 – Full pipeline (synthetic + hybrid + real data fine-tuning)

It represents:

  • the final production-ready model
  • the culmination of the proposed data generation strategy
  • the best-performing configuration in the experimental setup

Intended uses

  • Real-world invoice information extraction
  • Document AI systems in production environments
  • OCR post-processing pipelines
  • Research benchmarking against synthetic-only approaches

Limitations

  • Performance depends on OCR quality (the model consumes text, so upstream OCR errors propagate)
  • May still struggle with:
    • highly unusual invoice formats
    • extreme noise or low-resolution scans
  • Requires tokenized text input (not end-to-end from images)
  • Domain-specific (Czech invoices)

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 16
  • eval_batch_size: 2
  • seed: 42
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1 (fraction of total training steps)
  • num_epochs: 10
  • mixed_precision_training: Native AMP
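The hyperparameters above can be expressed as a `transformers` `TrainingArguments` configuration. This is a sketch, not the authors' actual training script; `output_dir` is a placeholder, and the warmup value of 0.1 is interpreted here as a warmup ratio.

```python
from transformers import TrainingArguments

# Sketch mirroring the hyperparameters listed above. output_dir is a
# placeholder, and warmup_ratio=0.1 is an assumption about how the listed
# warmup value of 0.1 was applied.
training_args = TrainingArguments(
    output_dir="bert-invoice-czech-v3",  # placeholder path
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=2,
    seed=42,
    optim="adamw_torch_fused",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=10,
    fp16=True,  # native AMP mixed precision
)
```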

Training results

Training Loss  Epoch  Step  Validation Loss  Precision  Recall  F1      Accuracy
No log         1.0    20    0.0919           0.7824     0.8645  0.8214  0.9730
No log         2.0    40    0.0701           0.8581     0.8874  0.8725  0.9810
No log         3.0    60    0.0684           0.8481     0.8951  0.8710  0.9809
No log         4.0    80    0.0709           0.8311     0.9060  0.8670  0.9802
No log         5.0    100   0.0634           0.8680     0.8913  0.8795  0.9826
No log         6.0    120   0.0666           0.8479     0.9091  0.8774  0.9818
No log         7.0    140   0.0670           0.8454     0.8983  0.8710  0.9812
No log         8.0    160   0.0632           0.8604     0.9045  0.8819  0.9825
No log         9.0    180   0.0644           0.8593     0.9083  0.8831  0.9828
No log         10.0   200   0.0630           0.8620     0.9072  0.8840  0.9830

("No log" indicates the training loss was not logged at these evaluation steps.)
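The precision, recall, and F1 values reported above are entity-level scores in the style of seqeval: a predicted field counts as correct only if both its type and its span match the gold annotation exactly. A minimal sketch of that computation, using made-up toy entity sets:

```python
# Minimal sketch of entity-level precision/recall/F1 (seqeval-style):
# an entity is correct only if its type and span match exactly.
# The gold/pred sets below are made up for illustration.

def entity_f1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # exact (type, start, end) matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("SUPPLIER", 0, 2), ("INVOICE_NUMBER", 5, 6), ("TOTAL", 9, 10)]
pred = [("SUPPLIER", 0, 2), ("INVOICE_NUMBER", 5, 7)]  # wrong span -> not counted
p, r, f = entity_f1(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))
# 0.5 0.333 0.4
```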

Framework versions

  • Transformers 5.0.0
  • PyTorch 2.10.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.2
Model size: 0.2B parameters (F32, Safetensors format)
Model tree

This model (TomasFAV/BERTInvoiceCzechV0123WORSEF1) was fine-tuned from google-bert/bert-base-multilingual-cased.