---
library_name: transformers
license: apache-2.0
base_model: google-bert/bert-base-multilingual-cased
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- synthetic-data
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: BERTInvoiceCzechR-V0
  results: []
---
# BERTInvoiceCzechR (V0 – Synthetic Templates Only)
This model is a fine-tuned version of [google-bert/bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) for the task of structured information extraction from Czech invoices.
It achieves the following results on the evaluation set:
- Loss: 0.3291
- Precision: 0.5188
- Recall: 0.6917
- F1: 0.5929
- Accuracy: 0.9335
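
As a quick consistency check, the reported F1 can be recomputed from the reported precision and recall. This is a minimal sketch assuming the standard harmonic-mean definition of F1; the helper name `f1_score` is illustrative and not part of the training code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported above for the evaluation set.
reported_precision = 0.5188
reported_recall = 0.6917

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 4))  # 0.5929
```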
---
## Model description
BERTInvoiceCzechR (V0) is the baseline model in a multi-stage experimental pipeline focused on invoice understanding.
The model performs token-level classification to extract structured fields from invoice text, such as:
- supplier
- customer
- invoice number
- bank details
- totals
- dates
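
Token-level predictions usually need to be merged into field values before they are useful. The sketch below shows one common way to do that, assuming a BIO labeling scheme; the card does not state the actual label set, so the tag names (`INVOICE_NUMBER`, `SUPPLIER`), the sample tokens, and the helper `group_bio_tags` are all hypothetical.

```python
from typing import List, Optional, Tuple


def group_bio_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive B-/I- tagged tokens into (field, text) pairs."""
    spans: List[Tuple[str, List[str]]] = []
    current_field: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span.
            current_field = tag[2:]
            spans.append((current_field, [token]))
        elif tag.startswith("I-") and current_field == tag[2:]:
            # An I- tag continues the span only if the field matches.
            spans[-1][1].append(token)
        else:
            # "O" or an inconsistent I- tag closes any open span.
            current_field = None
    return [(field, " ".join(words)) for field, words in spans]


# Hypothetical label names and invoice text, for illustration only.
tokens = ["Faktura", "č.", "2024001", "Dodavatel", ":", "ACME", "s.r.o."]
tags = ["O", "O", "B-INVOICE_NUMBER", "O", "O", "B-SUPPLIER", "I-SUPPLIER"]

fields = group_bio_tags(tokens, tags)
print(fields)
```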
This version (V0) is trained **exclusively on synthetically generated invoices created from predefined templates**, without any layout randomization or real-world data.
---
## Training data
The dataset consists purely of:
- synthetically generated invoices
- fixed template structures
- controlled field placement and formatting
Characteristics:
- consistent layout across samples
- fully controlled annotations
- no noise or OCR artifacts
- no real invoice data
- added synthetic image augmentations
This dataset represents the **simplest training scenario** in the pipeline and serves as a baseline for comparison with more complex data variants.
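
To make the idea of fixed templates with fully controlled annotations concrete, the sketch below fills a single hard-coded template with random values and emits a label for every token. It is not the generator used for this model, which the card does not describe: the template, field names, value ranges, and BIO label format are all assumptions.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical fixed template: (text or placeholder, field label or None).
TEMPLATE: List[Tuple[str, Optional[str]]] = [
    ("Faktura č.", None),
    ("{invoice_number}", "INVOICE_NUMBER"),
    ("Dodavatel:", None),
    ("{supplier}", "SUPPLIER"),
    ("Celkem:", None),
    ("{total}", "TOTAL"),
]


def generate_invoice(rng: random.Random) -> Tuple[List[str], List[str]]:
    """Fill the template with random values and label every token."""
    values = {
        "invoice_number": str(rng.randint(2024000, 2024999)),
        "supplier": rng.choice(["ACME s.r.o.", "Beta a.s."]),
        "total": f"{rng.randint(100, 9999)} Kč",
    }
    tokens: List[str] = []
    labels: List[str] = []
    for text, field in TEMPLATE:
        words = text.format(**values).split()
        for i, word in enumerate(words):
            tokens.append(word)
            if field is None:
                labels.append("O")  # static template text
            else:
                labels.append(("B-" if i == 0 else "I-") + field)
    return tokens, labels


sample_tokens, sample_labels = generate_invoice(random.Random(42))
for token, label in zip(sample_tokens, sample_labels):
    print(f"{token}\t{label}")
```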
---
## Role in the pipeline
This model corresponds to:
**V0 – Synthetic template-based dataset only**
It is used as:
- a baseline for evaluating the impact of:
  - layout variability
  - synthetic-real hybrid data
  - real annotated invoices
- a reference point for measuring generalization gap
---
## Intended uses
- Baseline model for document AI experiments
- Evaluation of synthetic data usefulness
- Comparison with more advanced dataset variants (V1–V3)
- Research in Czech invoice information extraction
---
## Limitations
- Strong dependency on template structure
- May have poor generalization to:
  - unseen layouts
  - real-world invoices
  - noisy OCR outputs
- Does not capture layout variability
- Trained only on clean synthetic data
---
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 2
- seed: 42
- optimizer: AdamW (`OptimizerNames.ADAMW_TORCH_FUSED`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 0.1
- num_epochs: 10
- mixed_precision_training: Native AMP
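
The learning rate implied by a linear scheduler with warmup can be sketched in plain Python. This is an illustration, not the trainer's own code. It assumes the warmup value of 0.1 is a fraction of the total number of optimizer steps, and it takes the total of 870 steps from the training results table (10 epochs of 87 steps each). The function name `linear_warmup_lr` is hypothetical.

```python
def linear_warmup_lr(step: int, base_lr: float, total_steps: int, warmup_steps: int) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


BASE_LR = 1e-05    # learning_rate from the list above
TOTAL_STEPS = 870  # 10 epochs x 87 steps per epoch
# Assumption: 0.1 is interpreted as a fraction of the total steps.
WARMUP_STEPS = round(0.1 * TOTAL_STEPS)

for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, linear_warmup_lr(step, BASE_LR, TOTAL_STEPS, WARMUP_STEPS))
```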
---
### Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log | 1.0 | 87 | 0.3944 | 0.1965 | 0.2233 | 0.2091 | 0.8997 |
| No log | 2.0 | 174 | 0.2951 | 0.4152 | 0.4517 | 0.4327 | 0.9241 |
| No log | 3.0 | 261 | 0.2896 | 0.4790 | 0.5810 | 0.5251 | 0.9314 |
| No log | 4.0 | 348 | 0.3295 | 0.4549 | 0.6443 | 0.5333 | 0.9226 |
| No log | 5.0 | 435 | 0.3249 | 0.4908 | 0.6866 | 0.5724 | 0.9281 |
| 0.3757 | 6.0 | 522 | 0.3615 | 0.4646 | 0.6827 | 0.5529 | 0.9216 |
| 0.3757 | 7.0 | 609 | 0.3376 | 0.4913 | 0.6579 | 0.5625 | 0.9299 |
| 0.3757 | 8.0 | 696 | 0.3290 | 0.5194 | 0.6924 | 0.5935 | 0.9336 |
| 0.3757 | 9.0 | 783 | 0.3604 | 0.4906 | 0.6858 | 0.5720 | 0.9279 |
| 0.3757 | 10.0 | 870 | 0.3515 | 0.5011 | 0.6944 | 0.5821 | 0.9296 |
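
The table can be scanned programmatically to see which epoch is best under different criteria. The sketch below copies the rows above into a list and picks the epoch with the highest F1 and the one with the lowest validation loss. The two criteria disagree here, which may matter when choosing a checkpoint; how the published checkpoint was actually selected is not stated in this card.

```python
# (epoch, validation_loss, precision, recall, f1, accuracy), copied from the table above.
RESULTS = [
    (1, 0.3944, 0.1965, 0.2233, 0.2091, 0.8997),
    (2, 0.2951, 0.4152, 0.4517, 0.4327, 0.9241),
    (3, 0.2896, 0.4790, 0.5810, 0.5251, 0.9314),
    (4, 0.3295, 0.4549, 0.6443, 0.5333, 0.9226),
    (5, 0.3249, 0.4908, 0.6866, 0.5724, 0.9281),
    (6, 0.3615, 0.4646, 0.6827, 0.5529, 0.9216),
    (7, 0.3376, 0.4913, 0.6579, 0.5625, 0.9299),
    (8, 0.3290, 0.5194, 0.6924, 0.5935, 0.9336),
    (9, 0.3604, 0.4906, 0.6858, 0.5720, 0.9279),
    (10, 0.3515, 0.5011, 0.6944, 0.5821, 0.9296),
]

best_by_f1 = max(RESULTS, key=lambda row: row[4])
best_by_loss = min(RESULTS, key=lambda row: row[1])

print("best F1:", best_by_f1[0], best_by_f1[4])                    # best F1: 8 0.5935
print("lowest validation loss:", best_by_loss[0], best_by_loss[1])  # lowest validation loss: 3 0.2896
```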
---
## Framework versions
- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2