---
library_name: transformers
license: apache-2.0
base_model: google/pix2struct-docvqa-base
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- document-ai
- multimodal-model
- generative-model
- synthetic-data
metrics:
- f1
model-index:
- name: Pix2StructCzechInvoice-V0
  results: []
---
# Pix2StructCzechInvoice (V0 – Synthetic Templates Only)
This model is a fine-tuned version of [google/pix2struct-docvqa-base](https://huggingface.co/google/pix2struct-docvqa-base) for structured information extraction from Czech invoices.
It achieves the following results on the evaluation set:
- Loss: 0.5022 (final epoch, epoch 10)
- F1: 0.5907 (best value, reached at epoch 8)
---
## Model description
Pix2StructCzechInvoice (V0) is a generative multimodal model designed for document understanding.
Unlike token classification models (e.g., BERT, LiLT, LayoutLMv3), this model:
- processes the entire document image
- generates structured outputs as text sequences
The model is trained to extract key invoice fields such as:
- supplier
- customer
- invoice number
- bank details
- totals
- dates
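
Because the model emits the fields as one text sequence, downstream code has to parse that sequence back into a record. The target serialization is not documented in this card, so the sketch below assumes a hypothetical `key: value | key: value` format; the separators, the field names, and the sample string are illustrative only and are not taken from the training data.

```python
# Assumed (hypothetical) serialization: "key: value | key: value | ..."
# The real target format used for training is not documented in this card.
FIELD_SEPARATOR = "|"
KEY_VALUE_SEPARATOR = ":"


def parse_generated_fields(text: str) -> dict[str, str]:
    """Parse a generated sequence into a field dictionary.

    Segments without a key/value separator or with an empty key are skipped,
    and the first occurrence of a repeated key wins.
    """
    fields: dict[str, str] = {}
    for segment in text.split(FIELD_SEPARATOR):
        # Split on the first separator only, so values may contain ":".
        key, sep, value = segment.partition(KEY_VALUE_SEPARATOR)
        key, value = key.strip().lower(), value.strip()
        if not sep or not key:
            continue
        fields.setdefault(key, value)
    return fields


# Made-up sample output, including one malformed trailing segment.
sample = "supplier: ACME s.r.o. | invoice_number: 2024-0015 | broken segment"
print(parse_generated_fields(sample))
```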
---
## Training data
The dataset consists of:
- synthetically generated invoice images
- fixed template layouts
- corresponding target text sequences representing structured fields
Key properties:
- clean and consistent visual structure
- no OCR noise (end-to-end image input)
- controlled output formatting
- no real-world documents
This represents the **baseline dataset for generative multimodal models**.
---
## Role in the pipeline
This model corresponds to:
**V0 – Synthetic template-based dataset only**
It is used to:
- establish a baseline for generative document models
- compare with:
  - token classification approaches (BERT, LiLT)
  - multimodal encoders (LayoutLMv3)
- evaluate feasibility of end-to-end extraction
---
## Intended uses
- End-to-end invoice information extraction from images
- Document VQA-style tasks
- Research in generative document understanding
- Comparison with structured prediction approaches
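
For the end-to-end extraction use case, inference could look roughly like the sketch below. It uses the Pix2Struct classes from `transformers` in the DocVQA style of the base model, where the question is passed as `text`. The repository id, the image path, and the question wording are placeholders: this card does not state the published model id or the prompt used during fine-tuning, so treat all three as assumptions.

```python
def extract_fields(image_path: str, question: str, model_id: str, max_new_tokens: int = 256) -> str:
    """Run one image/question pair through a Pix2Struct checkpoint and return the decoded text."""
    # Imported lazily so that defining this function does not require the heavy dependencies.
    from PIL import Image
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    processor = Pix2StructProcessor.from_pretrained(model_id)
    model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    # DocVQA-style Pix2Struct checkpoints take the question as `text`;
    # the processor renders it as a header on top of the page image.
    inputs = processor(images=image, text=question, return_tensors="pt")
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(generated_ids[0], skip_special_tokens=True)


# Example call (all three arguments are placeholders, not values from this card):
# extract_fields("invoice.png", "What is the invoice number?", "your-namespace/your-model-id")
```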
---
## Limitations
- Trained only on synthetic data
- Sensitive to output formatting inconsistencies
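- Less stable than token classification models (validation F1 falls from 0.5907 at epoch 8 to 0.4294 at epoch 9)
- Requires careful evaluation: exact string matching and field-level structured metrics can give different scores for the same output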
- Performance depends on generation quality
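
To make the evaluation point concrete, here is a minimal sketch of one possible structured metric: micro-averaged F1 over exact `(field, value)` matches after light normalization. This is an assumption for illustration; the card does not say how its reported F1 is computed, and the function names and sample records are hypothetical.

```python
def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are not counted as errors."""
    return " ".join(value.lower().split())


def field_f1(predictions: list[dict[str, str]], references: list[dict[str, str]]) -> float:
    """Micro-averaged F1 over exact (field, normalized value) matches across all documents."""
    true_positives = predicted_total = reference_total = 0
    for predicted, reference in zip(predictions, references):
        predicted_pairs = {(k, normalize(v)) for k, v in predicted.items()}
        reference_pairs = {(k, normalize(v)) for k, v in reference.items()}
        true_positives += len(predicted_pairs & reference_pairs)
        predicted_total += len(predicted_pairs)
        reference_total += len(reference_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / predicted_total
    recall = true_positives / reference_total
    return 2 * precision * recall / (precision + recall)


# Made-up example: one correct field, one wrong value, one missing field.
predicted = [{"supplier": "ACME  s.r.o.", "total": "100"}]
reference = [{"supplier": "acme s.r.o.", "total": "110", "date": "2024-01-01"}]
print(round(field_f1(predicted, reference), 4))
```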
---
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 1
- seed: 42
- optimizer: AdamW (`adamw_torch_fused`) with betas=(0.9, 0.999) and epsilon=1e-08, no additional optimizer arguments
- lr_scheduler_type: cosine_with_restarts
- lr_scheduler_warmup_steps: 0.1
- num_epochs: 10
- mixed_precision_training: Native AMP
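
The schedule implied by these settings can be sketched in plain Python. The formula mirrors linear warmup followed by cosine decay with hard restarts. Two values are assumptions rather than facts from this card: the warmup value of 0.1 is read as a fraction of the 3000 total steps (300 steps), and the cycle count is taken to be 1 because none is reported. With a single cycle the curve is a plain cosine decay without an actual restart.

```python
import math

BASE_LR = 1e-4        # learning_rate from the list above
TOTAL_STEPS = 3000    # final step in the training results table
WARMUP_STEPS = 300    # assumption: 0.1 read as a fraction of TOTAL_STEPS
NUM_CYCLES = 1        # assumption: no cycle count is reported in this card


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay with hard restarts."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    if progress >= 1.0:
        return 0.0
    # Position inside the current cycle, in [0, 1).
    cycle_position = (NUM_CYCLES * progress) % 1.0
    return BASE_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_position)))


for step in (0, 150, 300, 1650, 3000):
    print(step, f"{lr_at(step):.2e}")
```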
---
### Training results
| Training Loss | Epoch | Step | Validation Loss | F1 |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 3.1072 | 1.0 | 300 | 2.9769 | 0.0 |
| 2.6572 | 2.0 | 600 | 2.8684 | 0.0 |
| 2.4810 | 3.0 | 900 | 2.6349 | 0.0 |
| 1.7941 | 4.0 | 1200 | 1.6395 | 0.0 |
| 0.8458 | 5.0 | 1500 | 1.0680 | 0.2173 |
| 0.6198 | 6.0 | 1800 | 0.7713 | 0.4835 |
| 0.1999 | 7.0 | 2100 | 0.4331 | 0.5700 |
| 0.0946 | 8.0 | 2400 | 0.3844 | 0.5907 |
| 0.1020 | 9.0 | 2700 | 0.4066 | 0.4294 |
| 0.0842 | 10.0 | 3000 | 0.5022 | 0.4665 |
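
F1 peaks at epoch 8 and drops afterwards, so the choice of checkpoint matters. The card does not state which epoch the published weights come from. As a small sketch, the rows of the table can be compared programmatically; the tuple layout and variable names are illustrative. In a `Trainer` run, the `load_best_model_at_end` and `metric_for_best_model` training arguments serve the same purpose.

```python
# (epoch, validation_loss, f1) rows copied from the table above.
RESULTS = [
    (1, 2.9769, 0.0),
    (2, 2.8684, 0.0),
    (3, 2.6349, 0.0),
    (4, 1.6395, 0.0),
    (5, 1.0680, 0.2173),
    (6, 0.7713, 0.4835),
    (7, 0.4331, 0.5700),
    (8, 0.3844, 0.5907),
    (9, 0.4066, 0.4294),
    (10, 0.5022, 0.4665),
]

best_by_f1 = max(RESULTS, key=lambda row: row[2])
best_by_loss = min(RESULTS, key=lambda row: row[1])
final = RESULTS[-1]

print(f"best F1:   epoch {best_by_f1[0]} (F1={best_by_f1[2]:.4f})")
print(f"best loss: epoch {best_by_loss[0]} (loss={best_by_loss[1]:.4f})")
print(f"final:     epoch {final[0]} (loss={final[1]:.4f}, F1={final[2]:.4f})")
```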
---
## Framework versions
- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2