| --- |
| library_name: transformers |
| license: apache-2.0 |
| base_model: google/pix2struct-docvqa-base |
| tags: |
| - generated_from_trainer |
| - invoice-processing |
| - information-extraction |
| - czech-language |
| - document-ai |
| - multimodal-model |
| - generative-model |
| - synthetic-data |
| - hybrid-data |
| metrics: |
| - f1 |
| model-index: |
| - name: Pix2StructCzechInvoice-V2 |
|   results: [] |
| --- |
| |
| # Pix2StructCzechInvoice (V2 – Synthetic + Random Layout + Real Layout Injection) |
|
|
| This model is a fine-tuned version of [google/pix2struct-docvqa-base](https://huggingface.co/google/pix2struct-docvqa-base) for structured information extraction from Czech invoices. |
|
|
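| It achieves the following results on the evaluation set, matching the epoch 3 checkpoint (step 345) in the training results table, which has both the lowest validation loss and the highest F1: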
| - Loss: 0.2521 |
| - F1: 0.7311 |
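
The card does not state how F1 is computed. As an illustration only, the sketch below shows one common choice for extraction tasks: micro-averaged, field-level F1 with exact string match. The function name `field_f1` and the dict-per-document representation are assumptions, not the author's evaluation code.

```python
def field_f1(predictions, references):
    """Micro-averaged F1 over (field, value) pairs using exact match.

    predictions, references: lists of dicts mapping field name -> string value,
    one dict per document. This representation is an assumption.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        # Empty values are ignored so that "field not extracted" is not a match.
        pred_items = {(k, v.strip()) for k, v in pred.items() if v.strip()}
        ref_items = {(k, v.strip()) for k, v in ref.items() if v.strip()}
        tp += len(pred_items & ref_items)   # correct field values
        fp += len(pred_items - ref_items)   # wrong or spurious values
        fn += len(ref_items - pred_items)   # missed values
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Exact match is strict for a generative model: a single wrong character in a total or an account number counts as both a false positive and a false negative, so a softer string similarity may be preferable in practice.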
|
|
| --- |
|
|
| ## Model description |
|
|
| Pix2StructCzechInvoice (V2) is the V2 stage of a generative document understanding pipeline, positioned between synthetic-only training (V0–V1) and real-data fine-tuning (V3).
|
|
| The model: |
| - processes full document images |
| - generates structured outputs as text sequences |
|
|
| It is trained to extract key invoice fields: |
| - supplier |
| - customer |
| - invoice number |
| - bank details |
| - totals |
| - dates |
|
|
| This version introduces **real layout injection**, which brings the training images visually closer to real invoices and is intended to improve generalization to them.
|
|
| --- |
|
|
| ## Training data |
|
|
| The dataset consists of three components: |
|
|
| 1. **Synthetic template-based invoices** |
| 2. **Synthetic invoices with randomized layouts** |
| 3. **Hybrid invoices with real layouts and synthetic content** |
|
|
| ### Real layout injection |
|
|
| In the hybrid dataset: |
| - real invoice layouts are used as templates |
| - original content is replaced with synthetic data |
| - new content is rendered into realistic visual structures |
|
|
| This preserves: |
| - real-world layout complexity |
| - visual patterns and formatting |
| - document structure variability |
|
|
| while maintaining: |
| - full control over annotations |
| - consistent output format |
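
The injection step described above can be sketched as a pairing of layout regions with synthetic values. This is a hypothetical illustration: `FieldBox`, `inject`, and the box and field names are invented for the example, and the actual data generator is not documented in this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldBox:
    name: str    # field name, e.g. "invoice_number" (hypothetical naming)
    bbox: tuple  # (left, top, right, bottom) in pixels on the real layout


def inject(layout, synthetic_values):
    """Pair each region of a real layout with synthetic content.

    Returns (render_ops, annotation): render_ops says what text to draw in
    which region, annotation is the ground truth kept for training.
    """
    render_ops, annotation = [], {}
    for box in layout:
        value = synthetic_values.get(box.name, "")
        # Every region is overwritten, so no original content survives;
        # regions without a synthetic value are simply cleared.
        render_ops.append({"bbox": box.bbox, "text": value})
        if value:
            annotation[box.name] = value
    return render_ops, annotation


layout = [
    FieldBox("invoice_number", (400, 40, 560, 60)),
    FieldBox("total", (400, 700, 560, 720)),
]
ops, labels = inject(layout, {"invoice_number": "2024001"})
```

Because the annotation is built from the same values that are rendered, the labels stay exact even though the geometry comes from a real document.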
|
|
| --- |
|
|
| ## Role in the pipeline |
|
|
| This model corresponds to: |
|
|
| **V2 – Synthetic + layout augmentation + real layout injection** |
|
|
| It is used to: |
| - reduce the domain gap between synthetic and real documents |
| - evaluate the effect of realistic layouts on generative models |
| - compare with:
|   - V0–V1 (synthetic-only training)
|   - V3 (real data fine-tuning)
|
|
| --- |
|
|
| ## Intended uses |
|
|
| - End-to-end invoice extraction from images |
| - Document VQA-style tasks |
| - Research in generative document understanding |
| - Evaluation of hybrid training strategies |
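
A minimal inference sketch follows, using the `Pix2StructProcessor` and `Pix2StructForConditionalGeneration` classes that the base checkpoint is loaded with in `transformers`. The repository id, image path, and prompt wording are placeholders: this card does not state the published model id or the prompt format used during fine-tuning, so treat them as assumptions.

```python
def extract_invoice_field(image_path, question, model_id, max_new_tokens=128):
    """Run one generation pass over an invoice image and return the decoded text."""
    # Local imports keep this module importable without the heavy dependencies.
    from PIL import Image
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    processor = Pix2StructProcessor.from_pretrained(model_id)
    model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    # DocVQA-style Pix2Struct checkpoints take the prompt together with the image.
    inputs = processor(images=image, text=question, return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(generated[0], skip_special_tokens=True)


# Example call (all three arguments are placeholders, not values from this card):
# extract_invoice_field(
#     "invoice.png",
#     "What is the invoice number?",
#     "your-namespace/Pix2StructCzechInvoice-V2",
# )
```

The function reloads the model on every call for brevity; for batch processing, load the processor and model once and reuse them.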
|
|
| --- |
|
|
| ## Limitations |
|
|
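| - Generated outputs may contain formatting errors
| - Results are sensitive to the decoding strategy and tokenization
| - All textual content is still synthetic, so the model lacks full exposure to real linguistic variability
| - Training is less stable than for classification-based models: after epoch 3, validation loss drifts up from 0.2521 to 0.3167 while F1 fluctuates between 0.6762 and 0.7311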
|
|
| --- |
|
|
| ## Training procedure |
|
|
| ### Training hyperparameters |
|
|
| The following hyperparameters were used during training: |
| - learning_rate: 0.0001 |
| - train_batch_size: 8 |
| - eval_batch_size: 1 |
| - seed: 42 |
| - optimizer: AdamW (`OptimizerNames.ADAMW_TORCH_FUSED`, the fused PyTorch implementation) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
| - lr_scheduler_type: cosine_with_restarts |
| - lr_scheduler_warmup_steps: 0.1 (a fractional value, so presumably a warmup ratio, i.e. about 115 of the 1150 total steps)
| - num_epochs: 10 |
| - mixed_precision_training: Native AMP |
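
To show what these settings imply for the learning rate over time, here is a small pure-Python sketch of linear warmup followed by cosine decay with hard restarts, the shape produced by `get_cosine_with_hard_restarts_schedule_with_warmup` in `transformers`. Two values are assumptions: the warmup length (0.1 read as a fraction of the 1150 total steps) and the number of cycles, which the card does not report.

```python
import math

BASE_LR = 1e-4       # learning_rate from the list above
TOTAL_STEPS = 1150   # 10 epochs x 115 steps per epoch
WARMUP_STEPS = 115   # assumption: 0.1 interpreted as a fraction of TOTAL_STEPS
NUM_CYCLES = 1       # assumption: the cycle count is not reported


def lr_at(step):
    """Learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 to BASE_LR.
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    if progress >= 1.0:
        return 0.0
    # Cosine decay that restarts at the start of each cycle.
    cycle_pos = (NUM_CYCLES * progress) % 1.0
    return BASE_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_pos)))


for step in (0, 115, 632, 1150):
    print(step, f"{lr_at(step):.2e}")
```

With a single cycle this coincides with a plain cosine schedule; restarts only become visible when `NUM_CYCLES` is greater than 1.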
|
|
| --- |
|
|
| ### Training results |
|
|
| | Training Loss | Epoch | Step | Validation Loss | F1 | |
| |:-------------:|:-----:|:----:|:---------------:|:------:| |
| | 0.3432 | 1.0 | 115 | 0.2771 | 0.6644 | |
| | 0.1942 | 2.0 | 230 | 0.2611 | 0.6745 | |
| | 0.1934 | 3.0 | 345 | 0.2521 | 0.7311 | |
| | 0.1325 | 4.0 | 460 | 0.2665 | 0.7133 | |
| | 0.1131 | 5.0 | 575 | 0.2686 | 0.6762 | |
| | 0.1125 | 6.0 | 690 | 0.2601 | 0.7277 | |
| | 0.1011 | 7.0 | 805 | 0.2962 | 0.7118 | |
| | 0.1229 | 8.0 | 920 | 0.2893 | 0.7095 | |
| | 0.0861 | 9.0 | 1035 | 0.3019 | 0.6931 | |
| | 0.0860 | 10.0 | 1150 | 0.3167 | 0.7186 | |
|
|
| --- |
|
|
| ## Framework versions |
|
|
| - Transformers 5.0.0 |
| - PyTorch 2.10.0+cu128 |
| - Datasets 4.0.0 |
| - Tokenizers 0.22.2 |