---
library_name: transformers
license: apache-2.0
base_model: google/pix2struct-docvqa-base
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- document-ai
- multimodal-model
- generative-model
- synthetic-data
metrics:
- f1
model-index:
- name: Pix2StructCzechInvoice-V0
  results: []
---

# Pix2StructCzechInvoice (V0 – Synthetic Templates Only)

This model is a fine-tuned version of [google/pix2struct-docvqa-base](https://huggingface.co/google/pix2struct-docvqa-base) for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:
- Loss: 0.5022 (final epoch)  
- F1: 0.5907 (best, reached at epoch 8)  

---

## Model description

Pix2StructCzechInvoice (V0) is a generative multimodal model designed for document understanding.

Unlike token classification models (e.g., BERT, LiLT, LayoutLMv3), this model:
- processes the entire document image  
- generates structured outputs as text sequences  

The model is trained to extract key invoice fields such as:
- supplier  
- customer  
- invoice number  
- bank details  
- totals  
- dates  
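
Since the model emits all fields as a single generated text sequence, that sequence has to be parsed back into structured fields downstream. The card does not document the exact serialization; the sketch below assumes a simple `key: value | key: value` format (the separator and key names are illustrative, not confirmed by the source):

```python
def parse_invoice_sequence(sequence: str) -> dict:
    """Parse a generated 'key: value | key: value' sequence into a dict.

    The '|' separator and the key names are assumptions for illustration;
    the actual target format used in training is not documented here.
    """
    fields = {}
    for part in sequence.split("|"):
        part = part.strip()
        if ":" not in part:
            continue  # skip malformed fragments the decoder may emit
        key, value = part.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields
```

A parser like this also doubles as a guard against generation artifacts, since fragments without a `key:` prefix are simply dropped.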

---

## Training data

The dataset consists of:

- synthetically generated invoice images  
- fixed template layouts  
- corresponding target text sequences representing structured fields  

Key properties:
- clean and consistent visual structure  
- no OCR noise (end-to-end image input)  
- controlled output formatting  
- no real-world documents  

This represents the **baseline dataset for generative multimodal models**.
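
The target sequence paired with each synthetic image can be produced by a small serializer. The exact format used to build this dataset is not documented; a minimal sketch, assuming the same flat `key: value | ...` style:

```python
def serialize_fields(fields: dict) -> str:
    """Flatten structured invoice fields into a single target sequence.

    The 'key: value | ...' format is an assumption for illustration;
    any consistent, controlled format serves the same purpose.
    """
    return " | ".join(f"{key}: {value}" for key, value in fields.items())
```

Consistency matters here: the decoder learns the serialization verbatim, which is why the dataset keeps output formatting controlled.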

---

## Role in the pipeline

This model corresponds to:

**V0 – Synthetic template-based dataset only**

It is used to:
- establish a baseline for generative document models  
- compare with:
  - token classification approaches (BERT, LiLT)  
  - multimodal encoders (LayoutLMv3)  
- evaluate feasibility of end-to-end extraction  

---

## Intended uses

- End-to-end invoice information extraction from images  
- Document VQA-style tasks  
- Research in generative document understanding  
- Comparison with structured prediction approaches  

---

## Limitations

- Trained only on synthetic data  
- Sensitive to output formatting inconsistencies  
- Lower stability compared to token classification models  
- Requires careful evaluation (string matching vs structured metrics)  
- Performance depends on generation quality  
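
One way to reconcile string matching with a structured view is a field-level F1, where each key–value pair counts as one prediction and only an exact match scores. This is an illustrative sketch, not necessarily the metric behind the scores reported above:

```python
def field_f1(predicted: dict, gold: dict) -> float:
    """Field-level F1: a field is correct only on an exact key and
    value match (pure string matching, no normalization)."""
    if not predicted or not gold:
        return 0.0
    true_pos = sum(1 for key, value in predicted.items() if gold.get(key) == value)
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Exact matching is deliberately strict: a stray space or reordered digit in a generated value scores zero, which is one reason generative models need careful evaluation.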

---

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 1
- seed: 42
- optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine_with_restarts
- lr_scheduler_warmup_steps: 0.1 (a fraction of total steps, i.e. a warmup ratio)
- num_epochs: 10
- mixed_precision_training: Native AMP
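
With the 3,000 total steps shown in the results table, a warmup value of 0.1 plausibly means about 300 warmup steps. The resulting learning-rate shape, modelled on the `cosine_with_restarts` schedule (sketched with a single cycle, which reduces to plain cosine decay after linear warmup; the actual number of cycles is not documented):

```python
import math

def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-4,
               warmup_ratio: float = 0.1, num_cycles: int = 1) -> float:
    """Learning rate under linear warmup plus cosine decay with hard
    restarts, modelled on transformers' 'cosine_with_restarts'.
    num_cycles=1 (assumed) reduces to a single cosine decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # each cycle restarts the cosine from its peak
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0))))
```

Under these assumptions the rate climbs linearly to 1e-4 over the first ~300 steps and then decays along a cosine curve toward zero at step 3,000.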

---

### Training results

| Training Loss | Epoch | Step | Validation Loss | F1     |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 3.1072        | 1.0   | 300  | 2.9769          | 0.0    |
| 2.6572        | 2.0   | 600  | 2.8684          | 0.0    |
| 2.4810        | 3.0   | 900  | 2.6349          | 0.0    |
| 1.7941        | 4.0   | 1200 | 1.6395          | 0.0    |
| 0.8458        | 5.0   | 1500 | 1.0680          | 0.2173 |
| 0.6198        | 6.0   | 1800 | 0.7713          | 0.4835 |
| 0.1999        | 7.0   | 2100 | 0.4331          | 0.5700 |
| 0.0946        | 8.0   | 2400 | 0.3844          | 0.5907 |
| 0.1020        | 9.0   | 2700 | 0.4066          | 0.4294 |
| 0.0842        | 10.0  | 3000 | 0.5022          | 0.4665 |

---

## Framework versions

- Transformers 5.0.0  
- PyTorch 2.10.0+cu128  
- Datasets 4.0.0  
- Tokenizers 0.22.2