# BERTInvoiceCzechR (V3 – Full Pipeline with Real Data Fine-Tuning)
This model is a fine-tuned version of google-bert/bert-base-multilingual-cased for structured information extraction from Czech invoices.
It achieves the following results on the evaluation set:
- Loss: 0.0630
- Precision: 0.8620
- Recall: 0.9072
- F1: 0.8840
- Accuracy: 0.9830
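As a quick consistency check, the reported F1 is the harmonic mean of the reported precision and recall:

```python
# Sanity check: F1 is the harmonic mean of the reported precision and recall.
precision = 0.8620
recall = 0.9072

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # → 0.884, matching the reported F1 of 0.8840
```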
## Model description
BERTInvoiceCzechR (V3) is the final model in a multi-stage training pipeline designed for invoice understanding.
The model performs token-level classification to extract structured invoice fields:
- supplier
- customer
- invoice number
- bank details
- totals
- dates
This version combines synthetic data, layout augmentation, hybrid data, and real annotated invoices, resulting in the highest performance across all variants.
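To illustrate how token-level predictions become structured fields, here is a minimal sketch of collapsing BIO-style tags into field values. The tag names and the grouping helper are illustrative assumptions, not this model's actual label inventory:

```python
# Minimal sketch: collapse token-level BIO tags into structured invoice fields.
# The tag names (B-SUPPLIER, B-INVOICE_NUMBER, ...) are illustrative assumptions.
def group_entities(tokens, tags):
    """Group (token, BIO-tag) pairs into a {field: [values]} dictionary."""
    fields, current_field, current_tokens = {}, None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):  # a new entity starts
            if current_field:
                fields.setdefault(current_field, []).append(" ".join(current_tokens))
            current_field, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_field == tag[2:]:
            current_tokens.append(token)  # continuation of the current entity
        else:  # "O" or a mismatched I- tag ends the current entity
            if current_field:
                fields.setdefault(current_field, []).append(" ".join(current_tokens))
            current_field, current_tokens = None, []
    if current_field:
        fields.setdefault(current_field, []).append(" ".join(current_tokens))
    return fields

tokens = ["Faktura", "č.", "2023001", "ABC", "s.r.o."]
tags   = ["O", "O", "B-INVOICE_NUMBER", "B-SUPPLIER", "I-SUPPLIER"]
print(group_entities(tokens, tags))
# → {'INVOICE_NUMBER': ['2023001'], 'SUPPLIER': ['ABC s.r.o.']}
```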
## Training data
The dataset used in this stage is a combination of:
- Synthetic template-based invoices (V0)
- Synthetic invoices with randomized layouts (V1)
- Hybrid invoices with real layouts and synthetic content (V2)
- Real annotated invoices
### Real data fine-tuning
The final stage introduces:
- real invoice documents
- manually or semi-automatically annotated fields
- natural linguistic variability
- real formatting inconsistencies
This allows the model to:
- adapt to real-world distributions
- learn domain-specific patterns
- improve robustness and generalization
## Role in the pipeline
This model corresponds to:
V3 – Full pipeline (synthetic + hybrid + real data fine-tuning)
It represents:
- the final production-ready model
- the culmination of the proposed data generation strategy
- the best-performing configuration in the experimental setup
## Intended uses
- Real-world invoice information extraction
- Document AI systems in production environments
- OCR post-processing pipelines
- Research benchmarking against synthetic-only approaches
## Limitations
- Performance depends on OCR quality (the model assumes text has already been extracted from the document)
- May still struggle with:
  - highly unusual invoice formats
  - extreme noise or low-resolution scans
- Requires tokenized text input (not end-to-end from images)
- Domain-specific (Czech invoices)
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 2
- seed: 42
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 0.1 (a fractional value, i.e. a warmup ratio of 10% of total training steps)
- num_epochs: 10
- mixed_precision_training: Native AMP
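A pure-Python sketch of the linear schedule with warmup implied by these settings. Interpreting the 0.1 warmup value as a ratio of the 200 total steps is an assumption; the step count per epoch is taken from the results table below:

```python
# Sketch of a linear LR schedule with warmup, mirroring the listed hyperparameters.
# Interpreting warmup = 0.1 as a ratio of total steps (an assumption): 0.1 * 200 = 20 steps.
BASE_LR = 1e-5
TOTAL_STEPS = 200        # 10 epochs x 20 steps per epoch, per the results table
WARMUP_STEPS = int(0.1 * TOTAL_STEPS)

def lr_at_step(step):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)

print(lr_at_step(10))    # halfway through warmup → 5e-06
print(lr_at_step(20))    # warmup complete → 1e-05
print(lr_at_step(200))   # end of training → 0.0
```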
### Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| No log | 1.0 | 20 | 0.0919 | 0.7824 | 0.8645 | 0.8214 | 0.9730 |
| No log | 2.0 | 40 | 0.0701 | 0.8581 | 0.8874 | 0.8725 | 0.9810 |
| No log | 3.0 | 60 | 0.0684 | 0.8481 | 0.8951 | 0.8710 | 0.9809 |
| No log | 4.0 | 80 | 0.0709 | 0.8311 | 0.9060 | 0.8670 | 0.9802 |
| No log | 5.0 | 100 | 0.0634 | 0.8680 | 0.8913 | 0.8795 | 0.9826 |
| No log | 6.0 | 120 | 0.0666 | 0.8479 | 0.9091 | 0.8774 | 0.9818 |
| No log | 7.0 | 140 | 0.0670 | 0.8454 | 0.8983 | 0.8710 | 0.9812 |
| No log | 8.0 | 160 | 0.0632 | 0.8604 | 0.9045 | 0.8819 | 0.9825 |
| No log | 9.0 | 180 | 0.0644 | 0.8593 | 0.9083 | 0.8831 | 0.9828 |
| No log | 10.0 | 200 | 0.0630 | 0.8620 | 0.9072 | 0.8840 | 0.9830 |
### Framework versions
- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2
### Model tree

- Model: TomasFAV/BERTInvoiceCzechV0123WORSEF1
- Base model: google-bert/bert-base-multilingual-cased