---
library_name: transformers
license: apache-2.0
base_model: google-bert/bert-base-multilingual-cased
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- synthetic-data
- hybrid-data
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: BERTInvoiceCzechR-V2
  results: []
---

# BERTInvoiceCzechR (V2 – Synthetic + Random Layout + Real Layout Injection)

This model is a fine-tuned version of [google-bert/bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:

- Loss: 0.1326
- Precision: 0.8120
- Recall: 0.7868
- F1: 0.7992
- Accuracy: 0.9700

---

## Model description

BERTInvoiceCzechR (V2) represents an advanced stage in the training pipeline, combining synthetic data with realistic document layouts.

The model performs token-level classification to extract structured invoice fields:

- supplier
- customer
- invoice number
- bank details
- totals
- dates

This version introduces a key improvement: **real invoice layouts with synthetic content**, bridging the gap between artificial and real-world data.

---

## Training data

The dataset is composed of three main components:

1. **Synthetic template-based invoices**
2. **Synthetic invoices with randomized layouts**
3. **Hybrid invoices with real layouts and synthetic content**

### Real layout injection

In the hybrid dataset:

- real invoice documents are used as layout templates
- original textual content is removed
- fields (e.g., supplier, customer, bank details) are replaced with synthetic data
- new content is rendered into the original spatial structure

This approach preserves:

- realistic spacing
- typography patterns
- structural complexity

while maintaining:

- full control over annotations
- label consistency

---

## Role in the pipeline

This model corresponds to:

**V2 – Synthetic + layout augmentation + real layout injection**

It is designed to:

- reduce the domain gap between synthetic and real invoices
- evaluate the impact of realistic spatial distributions
- serve as a bridge between purely synthetic training (V0–V1) and real data fine-tuning (V3)

---

## Intended uses

- Advanced research in document AI
- Evaluation of hybrid synthetic-real training strategies
- Invoice information extraction in semi-realistic conditions
- Benchmarking generalization improvements

---

## Limitations

- Still does not use fully real textual content
- Synthetic text may not capture all linguistic variability
- OCR noise and scanning artifacts are not fully represented
- Performance may still drop on unseen real-world edge cases

---

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 2
- seed: 42
- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 0.1
- num_epochs: 10
- mixed_precision_training: Native AMP

---

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log        | 1.0   | 87   | 0.1326          | 0.7356    | 0.7270 | 0.7312 | 0.9636   |
| No log        | 2.0   | 174  | 0.1226          | 0.7985    | 0.7604 | 0.7790 | 0.9704   |
| No log        | 3.0   | 261  | 0.1224          | 0.7880    | 0.7852 | 0.7866 | 0.9689   |
| No log        | 4.0   | 348  | 0.1325          | 0.7557    | 0.7783 | 0.7668 | 0.9657   |
| No log        | 5.0   | 435  | 0.1390          | 0.7655    | 0.8229 | 0.7932 | 0.9674   |
| 0.0733        | 6.0   | 522  | 0.1324          | 0.7709    | 0.8155 | 0.7926 | 0.9682   |
| 0.0733        | 7.0   | 609  | 0.1326          | 0.8123    | 0.7868 | 0.7994 | 0.9700   |
| 0.0733        | 8.0   | 696  | 0.1366          | 0.8109    | 0.7775 | 0.7938 | 0.9697   |
| 0.0733        | 9.0   | 783  | 0.1385          | 0.7893    | 0.7930 | 0.7912 | 0.9686   |
| 0.0733        | 10.0  | 870  | 0.1393          | 0.8044    | 0.7938 | 0.7991 | 0.9696   |

---

## Framework versions

- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2
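
The real layout injection steps listed under "Training data" can be pictured with a small sketch. Everything here is hypothetical and for illustration only: the `Region` schema, the `SYNTHETIC_VALUES` pools, and the BIO-style tags are assumptions, not the actual data generator used to build the hybrid dataset. The sketch keeps each bounding box from the real layout, swaps the text for synthetic content, and derives token-level labels directly from the known field type.

```python
import random
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Region:
    """One text region from a real invoice layout (hypothetical schema)."""
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page coordinates
    field: str                      # field name, or "O" for unlabeled text
    text: str                       # original text, discarded on injection


# Hypothetical pools of synthetic content, one per field type.
SYNTHETIC_VALUES = {
    "O": ["Faktura", "Datum splatnosti:"],
    "supplier": ["Novák s.r.o.", "Alfa Stavby a.s."],
    "invoice_number": ["2024-00017", "FV-10293"],
}


def inject(regions, rng):
    """Replace every region's text with synthetic content, keeping its box."""
    return [
        replace(region, text=rng.choice(SYNTHETIC_VALUES[region.field]))
        for region in regions
    ]


def to_token_labels(regions):
    """Split region text into words and assign BIO-style labels (assumed scheme)."""
    tokens, boxes, labels = [], [], []
    for region in regions:
        for i, word in enumerate(region.text.split()):
            tokens.append(word)
            boxes.append(region.box)  # every word inherits its region's box
            if region.field == "O":
                labels.append("O")
            else:
                labels.append(("B-" if i == 0 else "I-") + region.field)
    return tokens, boxes, labels


# Layout extracted from a real document; the texts are placeholders.
layout = [
    Region((40, 30, 300, 60), "O", "ORIGINAL CAPTION"),
    Region((40, 80, 300, 120), "supplier", "ORIGINAL SUPPLIER NAME"),
    Region((400, 30, 560, 60), "invoice_number", "ORIGINAL NUMBER"),
]

synthetic = inject(layout, random.Random(42))
tokens, boxes, labels = to_token_labels(synthetic)
```

Because the field type of each region is known before any text is generated, the labels come for free and stay consistent, which is the annotation control the hybrid dataset relies on.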
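
Since the model performs token-level classification, its per-token predictions still have to be merged into invoice fields. A minimal sketch of that post-processing step follows, assuming word-level BIO tags such as `B-supplier` and `I-supplier`. The tag names and the `group_entities` helper are assumptions for illustration; the label set actually used by the model is not documented in this card, and real predictions would first need WordPiece sub-tokens merged back into words.

```python
def group_entities(tokens, tags):
    """Merge B-/I- tagged tokens into (field, text) pairs."""
    entities = []
    current_field, current_tokens = None, []

    def flush():
        # Emit the entity collected so far, if any.
        if current_field is not None:
            entities.append((current_field, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        prefix, _, field = tag.partition("-")
        if prefix == "B" or (prefix == "I" and field != current_field):
            # A new entity starts (an orphan I- tag is treated like B-).
            flush()
            current_field, current_tokens = field, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O" closes any open entity.
            flush()
            current_field, current_tokens = None, []
    flush()
    return entities


tokens = ["Dodavatel:", "Novák", "s.r.o.", "Faktura", "č.", "2024-00017"]
tags = ["O", "B-supplier", "I-supplier", "O", "O", "B-invoice_number"]

fields = group_entities(tokens, tags)
# fields == [("supplier", "Novák s.r.o."), ("invoice_number", "2024-00017")]
```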
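
The linear schedule from the hyperparameter list can be made concrete with a short calculation. This sketch assumes that the warmup value of 0.1 is interpreted as a fraction of the total training steps, and it takes the total of 870 steps from the last row of the results table. The `linear_lr` function is a hypothetical stand-alone reimplementation of a linear warmup followed by linear decay, not the scheduler code that was run.

```python
TOTAL_STEPS = 870                       # final step in the results table
WARMUP_STEPS = int(0.1 * TOTAL_STEPS)   # assumption: 0.1 is a ratio -> 87 steps
BASE_LR = 1e-5                          # learning_rate from the hyperparameters


def linear_lr(step, base_lr=BASE_LR, total=TOTAL_STEPS, warmup=WARMUP_STEPS):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))


# Learning rate at the end of each epoch (87 steps per epoch in the table).
per_epoch = [linear_lr(87 * epoch) for epoch in range(1, 11)]
```

Under these assumptions the learning rate would peak at the end of the first epoch and decay to zero by step 870.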