---
library_name: transformers
license: apache-2.0
base_model: google/pix2struct-docvqa-base
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- document-ai
- multimodal-model
- generative-model
- synthetic-data
- hybrid-data
- real-data
metrics:
- f1
model-index:
- name: Pix2StructCzechInvoice-V3
  results: []
---

# Pix2StructCzechInvoice (V3 – Full Pipeline with Real Data Fine-Tuning)

This model is a fine-tuned version of [google/pix2struct-docvqa-base](https://huggingface.co/google/pix2struct-docvqa-base) for structured information extraction from Czech invoices.

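It achieves the following results on the evaluation set (these values match the epoch 6 row of the training results table below, the epoch with the highest F1):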
- Loss: 0.1542  
- F1: 0.8404  
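
The card does not state how F1 is computed. As one plausible reading, the sketch below computes a micro-averaged, field-level F1 with exact match after whitespace and case normalization. The function names and the dictionary-per-document layout are assumptions made for illustration, not the evaluation code used for the numbers above.

```python
from typing import Dict, List, Optional


def _normalize(value: Optional[str]) -> str:
    """Lowercase and collapse whitespace; a missing value becomes an empty string."""
    return " ".join((value or "").lower().split())


def field_f1(predictions: List[Dict[str, str]], references: List[Dict[str, str]]) -> float:
    """Micro-averaged F1 over (document, field) pairs, exact match after normalization.

    Assumption: each document is a dict mapping a field name to its text value.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        for key in set(pred) | set(ref):
            p = _normalize(pred.get(key))
            r = _normalize(ref.get(key))
            if p and p == r:
                tp += 1  # predicted value matches the reference
            else:
                if p:
                    fp += 1  # a value was predicted but it is wrong or spurious
                if r:
                    fn += 1  # a reference value was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A stricter or looser matching rule (for example token-level overlap) would give different numbers, so this is only a reference point for reproducing the metric.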

---

## Model description

Pix2StructCzechInvoice (V3) is the final generative model in the experimental pipeline.

Unlike token classification approaches, this model:
- processes full document images  
- generates structured outputs as text sequences  

It extracts key invoice fields such as:
- supplier  
- customer  
- invoice number  
- bank details  
- totals  
- dates  

By combining synthetic, hybrid, and real data, this version improves on the earlier Pix2Struct variants in the pipeline in both extraction performance and output stability.

---

## Training data

The dataset used in this stage combines:

1. **Synthetic template-based invoices (V0)**  
2. **Synthetic invoices with randomized layouts (V1)**  
3. **Hybrid invoices with real layouts and synthetic content (V2)**  
4. **Real annotated invoices**  

### Real data fine-tuning

The final stage introduces:
- real invoice images  
- realistic visual noise and distortions  
- natural language variability  
- real formatting inconsistencies  

This allows the model to:
- better align generated outputs with real-world distributions  
- improve robustness of sequence generation  
- reduce hallucinations and formatting errors  

---

## Role in the pipeline

This model corresponds to:

**V3 – Full pipeline (synthetic + hybrid + real data fine-tuning)**

It represents:
- the final generative model  
- the best-performing Pix2Struct variant  
- an end-to-end extraction approach  

---

## Intended uses

- End-to-end invoice information extraction from images  
- Document VQA and generative document understanding  
- OCR-free document processing pipelines  
- Research comparing generative and structured (token classification) extraction approaches  
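
For the first use case, a minimal inference sketch with the standard Pix2Struct classes from `transformers` might look as follows. The repository id of this checkpoint is not stated in the card, so `MODEL_ID` falls back to the base model as a stand-in, and the question strings in `FIELD_QUESTIONS` are hypothetical since the prompt wording used during fine-tuning is not documented.

```python
from typing import Dict

# Hypothetical repository id: the card does not say where the V3 checkpoint is
# published, so the base model id is used here as a stand-in.
MODEL_ID = "google/pix2struct-docvqa-base"

# Hypothetical prompts: the question wording used in fine-tuning is not documented.
FIELD_QUESTIONS: Dict[str, str] = {
    "invoice_number": "What is the invoice number?",
    "total": "What is the total amount?",
}


def extract_fields(image_path: str, model_id: str = MODEL_ID,
                   max_new_tokens: int = 64) -> Dict[str, str]:
    """Ask one question per field and return the decoded answers."""
    # Heavy dependencies are imported lazily so the module loads without them.
    import torch
    from PIL import Image
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    processor = Pix2StructProcessor.from_pretrained(model_id)
    model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    answers: Dict[str, str] = {}
    for field, question in FIELD_QUESTIONS.items():
        # For VQA-style Pix2Struct checkpoints the question is passed as `text`
        # and rendered onto the image by the processor.
        inputs = processor(images=image, text=question, return_tensors="pt")
        with torch.no_grad():
            generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
        answers[field] = processor.decode(generated[0], skip_special_tokens=True)
    return answers
```

Calling `extract_fields("invoice.png")` would return a dictionary of raw answer strings; the image path is a placeholder, and greedy decoding is used only because the card does not specify a decoding strategy.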

---

## Limitations

- The format of the generated output may still be inconsistent across documents  
- Sensitive to decoding strategy and prompt structure  
- Less interpretable than token classification models  
- Requires post-processing to turn generated text into structured outputs  
- Computationally more expensive than token classification models  
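
As an illustration of the post-processing mentioned above, the sketch below normalizes two kinds of generated values. It assumes the common Czech conventions of a decimal comma with space-separated thousands (`12 345,60 Kč`) and day-first dates (`15. 3. 2024`); the helper names are hypothetical and the actual output format of the model may differ.

```python
import re
from datetime import date
from decimal import Decimal
from typing import Optional

# Either space-grouped thousands ("12 345,60") or a plain number ("1500,5").
_AMOUNT_RE = re.compile(r"-?\d{1,3}(?: \d{3})+(?:,\d+)?|-?\d+(?:,\d+)?")
# Day-first date with optional spaces after the dots ("15. 3. 2024").
_DATE_RE = re.compile(r"(\d{1,2})\.\s*(\d{1,2})\.\s*(\d{4})")


def normalize_amount(text: str) -> Optional[Decimal]:
    """Return the first amount found in the text, or None if there is none."""
    cleaned = text.replace("\u00a0", " ")  # treat non-breaking spaces as spaces
    match = _AMOUNT_RE.search(cleaned)
    if match is None:
        return None
    return Decimal(match.group(0).replace(" ", "").replace(",", "."))


def normalize_date(text: str) -> Optional[date]:
    """Return the first valid day-first date found in the text, or None."""
    match = _DATE_RE.search(text)
    if match is None:
        return None
    day, month, year = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. 31.02.2024 is not a real calendar date
        return None
```

Returning `None` instead of raising keeps malformed generations visible to the caller, which can then flag the field for manual review.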

---

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 1
- seed: 42
- optimizer: AdamW (`adamw_torch_fused`) with betas=(0.9, 0.999) and epsilon=1e-08, no additional optimizer arguments
- lr_scheduler_type: cosine_with_restarts
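- lr_scheduler_warmup_steps: 0.1 (a fractional value, presumably interpreted as a ratio of the total training steps, i.e. about 23 of 230 steps)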
- num_epochs: 10
- mixed_precision_training: Native AMP
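
To make the scheduler setting concrete, the following pure-Python sketch reproduces the shape of a cosine schedule with hard restarts and linear warmup. It assumes 230 total steps (from the results table), 23 warmup steps (reading the `0.1` value as a ratio), and a single cycle; the number of cycles is not listed in the card, so `num_cycles=1` is an assumption.

```python
import math


def lr_at(step: int, base_lr: float = 1e-4, total_steps: int = 230,
          warmup_steps: int = 23, num_cycles: int = 1) -> float:
    """Learning rate at a given optimizer step.

    Linear warmup from 0 to base_lr, then cosine decay that restarts
    `num_cycles` times over the remaining steps.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # The modulo makes the cosine start over at the beginning of each cycle.
    cycle_position = (num_cycles * progress) % 1.0
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_position)))


if __name__ == "__main__":
    for step in (0, 23, 115, 230):
        print(step, lr_at(step))
```

With a single cycle the curve is identical to a plain cosine decay; restarts only become visible when `num_cycles` is greater than 1.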

---

### Training results

| Training Loss | Epoch | Step | Validation Loss | F1     |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.3277        | 1.0   | 23   | 0.1958          | 0.7239 |
| 0.2366        | 2.0   | 46   | 0.1446          | 0.8037 |
| 0.1780        | 3.0   | 69   | 0.1247          | 0.8060 |
| 0.1153        | 4.0   | 92   | 0.1178          | 0.8316 |
| 0.0895        | 5.0   | 115  | 0.1279          | 0.8312 |
| 0.0774        | 6.0   | 138  | 0.1542          | 0.8404 |
| 0.0766        | 7.0   | 161  | 0.1530          | 0.7972 |
| 0.0697        | 8.0   | 184  | 0.1385          | 0.8372 |
| 0.0804        | 9.0   | 207  | 0.1433          | 0.7963 |
| 0.0664        | 10.0  | 230  | 0.1614          | 0.7991 |
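
The table can be read programmatically to see which epoch a given selection criterion would pick. The sketch below copies the rows above into a small structure and selects the best epoch by F1 and by validation loss; the `EpochResult` type is introduced only for this illustration.

```python
from typing import List, NamedTuple


class EpochResult(NamedTuple):
    epoch: int
    train_loss: float
    val_loss: float
    f1: float


# Values copied from the training results table.
RESULTS: List[EpochResult] = [
    EpochResult(1, 0.3277, 0.1958, 0.7239),
    EpochResult(2, 0.2366, 0.1446, 0.8037),
    EpochResult(3, 0.1780, 0.1247, 0.8060),
    EpochResult(4, 0.1153, 0.1178, 0.8316),
    EpochResult(5, 0.0895, 0.1279, 0.8312),
    EpochResult(6, 0.0774, 0.1542, 0.8404),
    EpochResult(7, 0.0766, 0.1530, 0.7972),
    EpochResult(8, 0.0697, 0.1385, 0.8372),
    EpochResult(9, 0.0804, 0.1433, 0.7963),
    EpochResult(10, 0.0664, 0.1614, 0.7991),
]

best_by_f1 = max(RESULTS, key=lambda row: row.f1)
best_by_loss = min(RESULTS, key=lambda row: row.val_loss)

if __name__ == "__main__":
    print("highest F1:", best_by_f1)
    print("lowest validation loss:", best_by_loss)
```

The two criteria disagree here: F1 peaks at epoch 6 while validation loss is lowest at epoch 4, and training loss keeps falling afterwards, which suggests some overfitting in later epochs. With the `Trainer` API, `load_best_model_at_end=True` together with `metric_for_best_model` is one way to automate such a choice; whether it was used for this model is not stated.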

---

## Framework versions

- Transformers 5.0.0  
- PyTorch 2.10.0+cu128  
- Datasets 4.0.0  
- Tokenizers 0.22.2