---
library_name: transformers
license: apache-2.0
base_model: google-bert/bert-base-multilingual-cased
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- synthetic-data
- hybrid-data
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: BERTInvoiceCzechR-V2
  results: []
---

# BERTInvoiceCzechR (V2 – Synthetic + Random Layout + Real Layout Injection)

This model is a fine-tuned version of [google-bert/bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:
- Loss: 0.1326  
- Precision: 0.8120  
- Recall: 0.7868  
- F1: 0.7992  
- Accuracy: 0.9700  
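
As a quick consistency check on the figures above, F1 can be recomputed as the harmonic mean of precision and recall. This is a minimal sketch; it assumes the reported precision, recall, and F1 are aggregated the same way (for example, micro-averaged over entities), which the card does not state explicitly. The helper name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation results listed above.
reported_f1 = f1_score(0.8120, 0.7868)
print(round(reported_f1, 4))  # 0.7992
```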

---

## Model description

BERTInvoiceCzechR (V2) is the V2 stage of the training pipeline. It combines synthetic data with realistic document layouts.

The model performs token-level classification to extract structured invoice fields:
- supplier  
- customer  
- invoice number  
- bank details  
- totals  
- dates  
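
Token-level predictions have to be merged back into field values before they are useful. The sketch below shows one way to do that, assuming a BIO tagging scheme and hypothetical label names such as `SUPPLIER` and `INVOICE_NUMBER`; the actual label set of this model is not listed in this card and should be read from the model's `id2label` config.

```python
def group_entities(words, tags):
    """Merge word-level BIO tags into {field_label: [text spans]}.

    Assumes tags look like "B-SUPPLIER", "I-SUPPLIER", or "O" (hypothetical names).
    """
    fields = {}
    label, span = None, []
    # A trailing ("", "O") sentinel flushes the last open span.
    for word, tag in list(zip(words, tags)) + [("", "O")]:
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if not continues:
            if label is not None:
                fields.setdefault(label, []).append(" ".join(span))
            # Start a new span on B- (or a stray I-); otherwise close it.
            label, span = (name, []) if prefix in ("B", "I") else (None, [])
        if label is not None:
            span.append(word)
    return fields


words = ["Faktura", "č.", "2024001", "Dodavatel", ":", "ACME", "s.r.o."]
tags = ["O", "O", "B-INVOICE_NUMBER", "O", "O", "B-SUPPLIER", "I-SUPPLIER"]
print(group_entities(words, tags))
```

Returning a list per field keeps repeated fields (for example, several dates) instead of silently overwriting them.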

This version introduces a key improvement: **real invoice layouts with synthetic content**, bridging the gap between artificial and real-world data.

---

## Training data

The dataset is composed of three main components:

1. **Synthetic template-based invoices**  
2. **Synthetic invoices with randomized layouts**  
3. **Hybrid invoices with real layouts and synthetic content**  

### Real layout injection

In the hybrid dataset:
- real invoice documents are used as layout templates  
- original textual content is removed  
- fields (e.g., supplier, customer, bank details) are replaced with synthetic data  
- new content is rendered into the original spatial structure  

This approach preserves:
- realistic spacing  
- typography patterns  
- structural complexity  

while maintaining:
- full control over annotations  
- label consistency  
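
The injection steps above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: the `Box` structure, the `inject_synthetic` helper, and the label names are hypothetical, and rendering the result back into an image or PDF is left out.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box:
    text: str     # textual content of the region
    bbox: tuple   # (x0, y0, x1, y1) position taken from the real invoice
    label: str    # field label, "O" for non-field text (hypothetical scheme)


def inject_synthetic(template, synthetic):
    """Keep geometry and labels from the real layout, replace all text."""
    # Remove the original textual content.
    stripped = [replace(box, text="") for box in template]
    # Fill labelled fields with synthetic values; other boxes stay empty here.
    return [replace(box, text=synthetic.get(box.label, "")) for box in stripped]


template = [
    Box("Real Supplier a.s.", (40, 60, 260, 80), "SUPPLIER"),
    Box("CZ65 0800 0000 1920 0014 5399", (40, 300, 320, 318), "BANK_DETAILS"),
]
synthetic = {"SUPPLIER": "Testovací firma s.r.o.", "BANK_DETAILS": "CZ00 0000 0000 0000 0000 0000"}
hybrid = inject_synthetic(template, synthetic)
```

Because labels travel with the boxes, the annotations of the hybrid document are known by construction and do not need manual review.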

---

## Role in the pipeline

This model corresponds to:

**V2 – Synthetic + layout augmentation + real layout injection**

It is designed to:
- reduce the domain gap between synthetic and real invoices  
- evaluate the impact of realistic spatial distributions  
- serve as a bridge between purely synthetic training (V0–V1) and real data fine-tuning (V3)  

---

## Intended uses

- Advanced research in document AI  
- Evaluation of hybrid synthetic-real training strategies  
- Invoice information extraction in semi-realistic conditions  
- Benchmarking generalization improvements  

---

## Limitations

- Training text is still synthetic; no real invoice text is used  
- Synthetic text may not capture all linguistic variability  
- OCR noise and scanning artifacts are not fully represented  
- Performance may still drop on unseen real-world edge cases  

---

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 2
- seed: 42
- optimizer: AdamW (`OptimizerNames.ADAMW_TORCH_FUSED`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
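- lr_scheduler_warmup_steps: 0.1 (a fractional value, presumably interpreted as a share of total training steps rather than a step count)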
- num_epochs: 10
- mixed_precision_training: Native AMP
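
To make the learning-rate settings concrete, here is a plain-Python sketch of a linear schedule with warmup. It assumes 870 total optimizer steps (10 epochs of 87 steps, as in the training log) and treats the warmup value of 0.1 as a fraction of total steps, giving 87 warmup steps; that interpretation is an assumption, and the function name `linear_lr` is illustrative.

```python
def linear_lr(step, base_lr=1e-5, total_steps=870, warmup_steps=87):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start, at the end of warmup, mid-training, and at the last step.
for step in (0, 87, 435, 870):
    print(step, linear_lr(step))
```

Under these assumptions the peak rate of 1e-05 is reached at the end of the first epoch and decays to zero by the final step.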

---

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log        | 1.0   | 87   | 0.1326          | 0.7356    | 0.7270 | 0.7312 | 0.9636   |
| No log        | 2.0   | 174  | 0.1226          | 0.7985    | 0.7604 | 0.7790 | 0.9704   |
| No log        | 3.0   | 261  | 0.1224          | 0.7880    | 0.7852 | 0.7866 | 0.9689   |
| No log        | 4.0   | 348  | 0.1325          | 0.7557    | 0.7783 | 0.7668 | 0.9657   |
| No log        | 5.0   | 435  | 0.1390          | 0.7655    | 0.8229 | 0.7932 | 0.9674   |
| 0.0733        | 6.0   | 522  | 0.1324          | 0.7709    | 0.8155 | 0.7926 | 0.9682   |
| 0.0733        | 7.0   | 609  | 0.1326          | 0.8123    | 0.7868 | 0.7994 | 0.9700   |
| 0.0733        | 8.0   | 696  | 0.1366          | 0.8109    | 0.7775 | 0.7938 | 0.9697   |
| 0.0733        | 9.0   | 783  | 0.1385          | 0.7893    | 0.7930 | 0.7912 | 0.9686   |
| 0.0733        | 10.0  | 870  | 0.1393          | 0.8044    | 0.7938 | 0.7991 | 0.9696   |
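
The table can be scanned programmatically to compare checkpoint-selection criteria. The sketch below copies the rows above into a list and picks the epoch with the highest F1 and the epoch with the lowest validation loss; which criterion was actually used to select the published checkpoint is not stated in this card.

```python
# (epoch, validation_loss, precision, recall, f1, accuracy), copied from the table above.
results = [
    (1, 0.1326, 0.7356, 0.7270, 0.7312, 0.9636),
    (2, 0.1226, 0.7985, 0.7604, 0.7790, 0.9704),
    (3, 0.1224, 0.7880, 0.7852, 0.7866, 0.9689),
    (4, 0.1325, 0.7557, 0.7783, 0.7668, 0.9657),
    (5, 0.1390, 0.7655, 0.8229, 0.7932, 0.9674),
    (6, 0.1324, 0.7709, 0.8155, 0.7926, 0.9682),
    (7, 0.1326, 0.8123, 0.7868, 0.7994, 0.9700),
    (8, 0.1366, 0.8109, 0.7775, 0.7938, 0.9697),
    (9, 0.1385, 0.7893, 0.7930, 0.7912, 0.9686),
    (10, 0.1393, 0.8044, 0.7938, 0.7991, 0.9696),
]

best_f1 = max(results, key=lambda row: row[4])
lowest_loss = min(results, key=lambda row: row[1])
print(best_f1[0], lowest_loss[0])  # 7 3
```

The two criteria disagree here: F1 peaks at epoch 7, while validation loss is lowest at epoch 3 and rises afterwards.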

---

## Framework versions

- Transformers 5.0.0  
- PyTorch 2.10.0+cu128  
- Datasets 4.0.0  
- Tokenizers 0.22.2