---
library_name: transformers
license: apache-2.0
base_model: google/pix2struct-docvqa-base
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- document-ai
- multimodal-model
- generative-model
- synthetic-data
- hybrid-data
metrics:
- f1
model-index:
- name: Pix2StructCzechInvoice-V2
  results: []
---

# Pix2StructCzechInvoice (V2 – Synthetic + Random Layout + Real Layout Injection)

This model is a fine-tuned version of [google/pix2struct-docvqa-base](https://huggingface.co/google/pix2struct-docvqa-base) for structured information extraction from Czech invoices.

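It achieves the following results on the evaluation set. These values match the epoch 3 row of the training results table, which has both the lowest validation loss and the highest F1:
- Loss: 0.2521  
- F1: 0.7311  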

---

## Model description

Pix2StructCzechInvoice (V2) is the stage of the generative document understanding pipeline that follows the synthetic-only V0 and V1 models.

The model:
- processes full document images  
- generates structured outputs as text sequences  

It is trained to extract key invoice fields:
- supplier  
- customer  
- invoice number  
- bank details  
- totals  
- dates  

This version introduces **real layout injection**, which is intended to make the training images more visually realistic and to improve generalization to real invoices.

---

## Training data

The dataset consists of three components:

1. **Synthetic template-based invoices**  
2. **Synthetic invoices with randomized layouts**  
3. **Hybrid invoices with real layouts and synthetic content**  

### Real layout injection

In the hybrid dataset:
- real invoice layouts are used as templates  
- original content is replaced with synthetic data  
- new content is rendered into realistic visual structures  

This preserves:
- real-world layout complexity  
- visual patterns and formatting  
- document structure variability  

while maintaining:
- full control over annotations  
- consistent output format  
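
The injection steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Region` structure, the field names, the placeholder value generators, and the JSON target format are all hypothetical and are not taken from the actual data pipeline. The sketch stops at producing draw instructions and leaves the image rendering itself out.

```python
import json
import random
from dataclasses import dataclass


@dataclass
class Region:
    """One field slot measured on a real invoice layout (hypothetical structure)."""
    name: str    # field name, e.g. "invoice_number"
    bbox: tuple  # (x0, y0, x1, y1) in pixels on the real scan


def synth_value(name, rng):
    """Return a synthetic value for a field. Placeholder generators only."""
    if name == "invoice_number":
        return f"{rng.randint(2020, 2025)}{rng.randint(0, 9999):04d}"
    if name == "total":
        return f"{rng.randint(100, 99999)},00 CZK"
    return f"synthetic {name}"


def inject_synthetic_content(layout, seed):
    """Replace the content of every region with synthetic text.

    Returns the draw instructions for a renderer and the target sequence.
    Both come from the same values, so the annotation always matches
    what would be drawn on the page.
    """
    rng = random.Random(seed)
    draw_ops, labels = [], {}
    for region in layout:
        value = synth_value(region.name, rng)
        draw_ops.append({"bbox": region.bbox, "text": value})
        labels[region.name] = value
    # Assumption: the target is serialized as JSON; the real format may differ.
    target = json.dumps(labels, ensure_ascii=False)
    return draw_ops, target
```

Because the label dictionary is built from the same values that are handed to the renderer, annotations stay exact even though the page geometry comes from a real document.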

---

## Role in the pipeline

This model corresponds to:

**V2 – Synthetic + layout augmentation + real layout injection**

It is used to:
- reduce the domain gap between synthetic and real documents  
- evaluate the effect of realistic layouts on generative models  
- compare with:
  - V0–V1 (synthetic-only training)  
  - V3 (real data fine-tuning)  

---

## Intended uses

- End-to-end invoice extraction from images  
- Document VQA-style tasks  
- Research in generative document understanding  
- Evaluation of hybrid training strategies  
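
A minimal inference sketch for the end-to-end extraction use case, assuming the checkpoint is loaded the same way as its base model `google/pix2struct-docvqa-base` and that the repository ships a processor configuration. The repository ID and the prompt are not stated in this card, so both are left as required arguments; the function name and the default decoding settings are illustrative choices, not the settings used for evaluation.

```python
def extract_invoice_text(image_path, model_id, prompt, max_new_tokens=512, num_beams=1):
    """Generate the raw output sequence for one invoice image.

    model_id and prompt must be supplied by the caller: neither the
    repository ID nor the training prompt is documented in this card.
    """
    # Imported lazily so this module can be imported without the heavy dependencies.
    from PIL import Image
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    processor = Pix2StructProcessor.from_pretrained(model_id)
    model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    # DocVQA-style Pix2Struct checkpoints receive the prompt through `text`;
    # the processor renders it onto the image as a header.
    inputs = processor(images=image, text=prompt, return_tensors="pt")

    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
    )
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

The function returns the decoded text unchanged. Since generated outputs may contain formatting errors, any downstream parsing should validate the sequence rather than assume it is well formed.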

---

## Limitations

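- Generated outputs may contain formatting errors  
- Output quality is sensitive to the decoding strategy and tokenization  
- All training text is synthetic, so the model has limited exposure to the linguistic variability of real invoices  
- Training is less stable than for classification-based models: after epoch 3, validation F1 fluctuates between 0.6762 and 0.7277 while validation loss trends upward  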

---

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 1
- seed: 42
- optimizer: AdamW, fused PyTorch implementation (`OptimizerNames.ADAMW_TORCH_FUSED`), with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine_with_restarts
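- lr_scheduler_warmup_steps: 0.1 (a fractional value, presumably a proportion of the total training steps rather than an absolute step count)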
- num_epochs: 10
- mixed_precision_training: Native AMP
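
To make the learning-rate settings concrete, the sketch below evaluates a `cosine_with_restarts` schedule in plain Python. It follows the formula used by the Transformers cosine schedule with hard restarts, under two assumptions that the card does not confirm: the warmup value 0.1 is read as 10% of the 1150 total steps (115 steps), and the number of cycles is left at the library default of 1.

```python
import math

BASE_LR = 1e-4                  # learning_rate from the list above
STEPS_PER_EPOCH = 115           # from the training results table
NUM_EPOCHS = 10
TOTAL_STEPS = STEPS_PER_EPOCH * NUM_EPOCHS  # 1150
WARMUP_STEPS = TOTAL_STEPS // 10            # assumption: 0.1 means 10% of all steps
NUM_CYCLES = 1                              # assumption: library default


def lr_at(step):
    """Learning rate at an optimizer step: linear warmup, then cosine with hard restarts."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    if progress >= 1.0:
        return 0.0
    cosine = 0.5 * (1.0 + math.cos(math.pi * ((NUM_CYCLES * progress) % 1.0)))
    return BASE_LR * max(0.0, cosine)


if __name__ == "__main__":
    # Print the learning rate at the end of warmup and at a few later epoch boundaries.
    for step in (0, 115, 345, 690, 1150):
        print(step, lr_at(step))
```

With a single cycle the schedule never actually restarts and behaves like a plain cosine decay after warmup. Under these assumptions the peak learning rate is reached at step 115, the end of the first epoch.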

---

### Training results

| Training Loss | Epoch | Step | Validation Loss | F1     |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.3432        | 1.0   | 115  | 0.2771          | 0.6644 |
| 0.1942        | 2.0   | 230  | 0.2611          | 0.6745 |
| 0.1934        | 3.0   | 345  | 0.2521          | 0.7311 |
| 0.1325        | 4.0   | 460  | 0.2665          | 0.7133 |
| 0.1131        | 5.0   | 575  | 0.2686          | 0.6762 |
| 0.1125        | 6.0   | 690  | 0.2601          | 0.7277 |
| 0.1011        | 7.0   | 805  | 0.2962          | 0.7118 |
| 0.1229        | 8.0   | 920  | 0.2893          | 0.7095 |
| 0.0861        | 9.0   | 1035 | 0.3019          | 0.6931 |
| 0.0860        | 10.0  | 1150 | 0.3167          | 0.7186 |
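
The table can be checked programmatically to see which checkpoint is best under each criterion. The snippet below copies the rows above and selects the epoch with the lowest validation loss and the epoch with the highest F1; the variable names are illustrative.

```python
# (epoch, step, training_loss, validation_loss, f1), copied from the table above
ROWS = [
    (1, 115, 0.3432, 0.2771, 0.6644),
    (2, 230, 0.1942, 0.2611, 0.6745),
    (3, 345, 0.1934, 0.2521, 0.7311),
    (4, 460, 0.1325, 0.2665, 0.7133),
    (5, 575, 0.1131, 0.2686, 0.6762),
    (6, 690, 0.1125, 0.2601, 0.7277),
    (7, 805, 0.1011, 0.2962, 0.7118),
    (8, 920, 0.1229, 0.2893, 0.7095),
    (9, 1035, 0.0861, 0.3019, 0.6931),
    (10, 1150, 0.0860, 0.3167, 0.7186),
]

best_by_loss = min(ROWS, key=lambda row: row[3])  # lowest validation loss
best_by_f1 = max(ROWS, key=lambda row: row[4])    # highest F1

print("best by validation loss: epoch", best_by_loss[0], "step", best_by_loss[1])
print("best by F1: epoch", best_by_f1[0], "step", best_by_f1[1])
```

Both criteria select epoch 3 (step 345). Training loss keeps falling after that point while validation loss rises, which is consistent with overfitting in the later epochs.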

---

## Framework versions

- Transformers 5.0.0  
- PyTorch 2.10.0+cu128  
- Datasets 4.0.0  
- Tokenizers 0.22.2