---
library_name: transformers
license: apache-2.0
base_model: google-bert/bert-base-multilingual-cased
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- synthetic-data
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: BERTInvoiceCzechR-V0
  results: []
---

# BERTInvoiceCzechR (V0 – Synthetic Templates Only)

This model is a fine-tuned version of [google-bert/bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) for the task of structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:
- Loss: 0.3291  
- Precision: 0.5188  
- Recall: 0.6917  
- F1: 0.5929  
- Accuracy: 0.9335  
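
The reported F1 is the harmonic mean of precision and recall, which can be checked directly against the numbers above:

```python
# Sanity check: F1 as the harmonic mean of the reported precision and recall.
precision = 0.5188
recall = 0.6917

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.5929, matching the reported F1
```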

---

## Model description

BERTInvoiceCzechR (V0) is the baseline model in a multi-stage experimental pipeline focused on invoice understanding.

The model performs token-level classification to extract structured fields from invoice text, such as:
- supplier  
- customer  
- invoice number  
- bank details  
- totals  
- dates  

This version (V0) is trained **exclusively on synthetically generated invoices created from predefined templates**, without any layout randomization or real-world data.
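
As a minimal sketch of how token-level predictions become structured fields, the snippet below groups BIO-style tags into field spans. The label names (`B-SUPPLIER`, `B-INVOICE_NUMBER`, etc.) and the `decode_fields` helper are illustrative assumptions; the card does not specify the model's actual tag set.

```python
# Illustrative sketch: collecting token-level BIO predictions into invoice
# fields. Label names here are assumptions, not the model's actual tag set.

def decode_fields(tokens, labels):
    """Collect consecutive B-/I- tagged tokens into (field, text) spans."""
    fields = []
    current_field, current_tokens = None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current_field:
                fields.append((current_field, " ".join(current_tokens)))
            current_field, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_field == label[2:]:
            current_tokens.append(token)
        else:  # "O" or an inconsistent I- tag ends the current span
            if current_field:
                fields.append((current_field, " ".join(current_tokens)))
            current_field, current_tokens = None, []
    if current_field:
        fields.append((current_field, " ".join(current_tokens)))
    return fields

tokens = ["Faktura", "č.", "2024001", "Dodavatel", "Acme", "s.r.o."]
labels = ["O", "O", "B-INVOICE_NUMBER", "O", "B-SUPPLIER", "I-SUPPLIER"]
print(decode_fields(tokens, labels))
# [('INVOICE_NUMBER', '2024001'), ('SUPPLIER', 'Acme s.r.o.')]
```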

---

## Training data

The dataset consists purely of:

- synthetically generated invoices  
- fixed template structures  
- controlled field placement and formatting  

Characteristics:
- consistent layout across samples  
- fully controlled annotations  
- no noise or OCR artifacts  
- no real invoice data  
- synthetic image augmentations applied to the rendered invoices

This dataset represents the **simplest training scenario** in the pipeline and serves as a baseline for comparison with more complex data variants.

---

## Role in the pipeline

This model corresponds to:

**V0 – Synthetic template-based dataset only**

It is used as:
- a baseline for evaluating the impact of:
  - layout variability  
  - synthetic-real hybrid data  
  - real annotated invoices  
- a reference point for measuring the generalization gap  

---

## Intended uses

- Baseline model for document AI experiments  
- Evaluation of synthetic data usefulness  
- Comparison with more advanced dataset variants (V1–V3)  
- Research in Czech invoice information extraction  

---

## Limitations

- Strong dependency on template structure  
- May generalize poorly to:
  - unseen layouts  
  - real-world invoices  
  - noisy OCR outputs  
- Does not capture layout variability  
- Trained only on clean synthetic data  

---

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 2
- seed: 42
- optimizer: AdamW (torch fused, `adamw_torch_fused`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10
- mixed_precision_training: Native AMP
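
The learning-rate trajectory implied by these settings can be sketched in plain Python. This assumes the 0.1 warmup value is a fraction of the 870 total training steps shown in the results table (10 epochs × 87 steps); the function name is illustrative.

```python
# Sketch of a linear schedule with warmup under the hyperparameters above.
# Assumption: the 0.1 warmup value is a fraction of 870 total steps.
def linear_schedule_lr(step, base_lr=1e-5, total_steps=870, warmup_frac=0.1):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    warmup_steps = int(total_steps * warmup_frac)  # 87 steps
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

print(linear_schedule_lr(87))   # peak learning rate, reached after warmup
print(linear_schedule_lr(870))  # 0.0 at the end of training
```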

---

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log        | 1.0   | 87   | 0.3944          | 0.1965    | 0.2233 | 0.2091 | 0.8997   |
| No log        | 2.0   | 174  | 0.2951          | 0.4152    | 0.4517 | 0.4327 | 0.9241   |
| No log        | 3.0   | 261  | 0.2896          | 0.4790    | 0.5810 | 0.5251 | 0.9314   |
| No log        | 4.0   | 348  | 0.3295          | 0.4549    | 0.6443 | 0.5333 | 0.9226   |
| No log        | 5.0   | 435  | 0.3249          | 0.4908    | 0.6866 | 0.5724 | 0.9281   |
| 0.3757        | 6.0   | 522  | 0.3615          | 0.4646    | 0.6827 | 0.5529 | 0.9216   |
| 0.3757        | 7.0   | 609  | 0.3376          | 0.4913    | 0.6579 | 0.5625 | 0.9299   |
| 0.3757        | 8.0   | 696  | 0.3290          | 0.5194    | 0.6924 | 0.5935 | 0.9336   |
| 0.3757        | 9.0   | 783  | 0.3604          | 0.4906    | 0.6858 | 0.5720 | 0.9279   |
| 0.3757        | 10.0  | 870  | 0.3515          | 0.5011    | 0.6944 | 0.5821 | 0.9296   |

---

## Framework versions

- Transformers 5.0.0  
- PyTorch 2.10.0+cu128  
- Datasets 4.0.0  
- Tokenizers 0.22.2