Instructions for using TomasFAV/Pix2StructCzechInvoiceV0 with libraries and local apps.

Libraries

How to use TomasFAV/Pix2StructCzechInvoiceV0 with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="TomasFAV/Pix2StructCzechInvoiceV0")

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("TomasFAV/Pix2StructCzechInvoiceV0")
model = AutoModelForMultimodalLM.from_pretrained("TomasFAV/Pix2StructCzechInvoiceV0")
```

Local Apps

vLLM

How to use TomasFAV/Pix2StructCzechInvoiceV0 with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "TomasFAV/Pix2StructCzechInvoiceV0"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TomasFAV/Pix2StructCzechInvoiceV0",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
```

SGLang

How to use TomasFAV/Pix2StructCzechInvoiceV0 with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "TomasFAV/Pix2StructCzechInvoiceV0" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TomasFAV/Pix2StructCzechInvoiceV0",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
```

Use Docker images

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "TomasFAV/Pix2StructCzechInvoiceV0" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TomasFAV/Pix2StructCzechInvoiceV0",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
```

Docker Model Runner
How to use TomasFAV/Pix2StructCzechInvoiceV0 with Docker Model Runner:
```
docker model run hf.co/TomasFAV/Pix2StructCzechInvoiceV0
```

Pix2StructCzechInvoice (V0 – Synthetic Templates Only)

This model is a fine-tuned version of google/pix2struct-docvqa-base for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:

- Loss: 0.5022 (final epoch, epoch 10)
- F1: 0.5907 (best value, reached at epoch 8; the final-epoch F1 is 0.4665)

Model description

Pix2StructCzechInvoice (V0) is a generative multimodal model designed for document understanding.

Unlike token classification models (e.g., BERT, LiLT, LayoutLMv3), this model:

- processes the entire document image
- generates structured outputs as text sequences
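A minimal sketch of what image-in, text-out generation could look like with the Pix2Struct classes in Transformers. The function name, the `prompt` argument, and the `max_new_tokens` default are assumptions made for illustration. The base checkpoint is a DocVQA model, so a text prompt is passed to the processor here, but the prompt used during fine-tuning is not documented in this card.

```python
def generate_invoice_text(
    image_path,
    prompt,
    model_id="TomasFAV/Pix2StructCzechInvoiceV0",
    max_new_tokens=256,  # assumed budget, not taken from the training setup
):
    """Run one invoice image through the model and return the generated text."""
    # Heavy dependencies are imported lazily, so defining the function is cheap.
    import torch
    from PIL import Image
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    processor = Pix2StructProcessor.from_pretrained(model_id)
    model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    # The whole page is passed as an image; no separate OCR step is involved.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=prompt, return_tensors="pt")

    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # The structured fields come back as one decoded text sequence.
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Calling `generate_invoice_text("invoice.png", prompt)` with a local image path (a placeholder here) downloads the weights on first use and returns the raw generated string, which still has to be parsed into fields.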

The model is trained to extract key invoice fields such as:

- supplier
- customer
- invoice number
- bank details
- totals
- dates
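Because the fields arrive as generated text, some parsing step is needed to turn them into a record. The sketch below assumes a hypothetical output format of one `name: value` pair per line and hypothetical field names; the actual target format of this model is not specified in this card, so both would need to be adapted.

```python
from typing import Dict

# Hypothetical field names, loosely following the list above.
FIELDS = ("supplier", "customer", "invoice_number", "bank_details", "total", "date")


def parse_generated(text: str) -> Dict[str, str]:
    """Parse 'name: value' lines into a dict, ignoring anything unrecognised."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons themselves.
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key in FIELDS:
            # Keep the first occurrence if the model repeats a field.
            fields.setdefault(key, value.strip())
    return fields
```

Lines that do not match a known field are dropped rather than raising, since a generative model can emit malformed or extra text.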

Training data

The dataset consists of:

- synthetically generated invoice images
- fixed template layouts
- corresponding target text sequences representing structured fields

Key properties:

- clean and consistent visual structure
- no OCR noise (end-to-end image input)
- controlled output formatting
- no real-world documents

This represents the baseline dataset for generative multimodal models.
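To illustrate what "controlled output formatting" can mean for the target sequences, here is a small sketch that serialises a field record into a deterministic text target. The `name: value` line format, the field names, and the helper name `build_target` are assumptions for illustration, not the format used to train this model.

```python
from typing import Dict, Sequence

# Hypothetical fixed field order; a fixed order keeps targets consistent across samples.
FIELD_ORDER = ("supplier", "customer", "invoice_number", "bank_details", "total", "date")


def build_target(record: Dict[str, str], field_order: Sequence[str] = FIELD_ORDER) -> str:
    """Serialise a record into one 'name: value' line per field, in a fixed order."""
    lines = []
    for name in field_order:
        # Missing fields are emitted with an empty value so every target has the same shape.
        value = record.get(name, "")
        # Collapse runs of whitespace so formatting noise does not leak into the target.
        value = " ".join(value.split())
        lines.append(f"{name}: {value}")
    return "\n".join(lines)
```

Pairing each rendered template image with a string built this way gives the model a single, predictable output layout to learn.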

Role in the pipeline

This model corresponds to:

V0 – Synthetic template-based dataset only

It is used to:

- establish a baseline for generative document models
- compare with:
  - token classification approaches (BERT, LiLT)
  - multimodal encoders (LayoutLMv3)
- evaluate feasibility of end-to-end extraction

Intended uses

- End-to-end invoice information extraction from images
- Document VQA-style tasks
- Research in generative document understanding
- Comparison with structured prediction approaches

Limitations

- Trained only on synthetic data
- Sensitive to output formatting inconsistencies
- Lower stability compared to token classification models
- Requires careful evaluation (string matching vs structured metrics)
- Performance depends on generation quality
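The point about string matching versus structured metrics can be made concrete with a field-level F1 that accepts a pluggable canonicalisation function. This is a sketch under assumed conventions (micro-averaged, a wrong value counts as both a false positive and a false negative); it is not necessarily how the F1 reported in this card was computed.

```python
from typing import Callable, Dict, List


def normalize(value: str) -> str:
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(value.lower().split())


def field_f1(
    predictions: List[Dict[str, str]],
    references: List[Dict[str, str]],
    canon: Callable[[str], str] = lambda v: v,  # identity means exact string match
) -> float:
    """Micro-averaged F1 over (field, value) pairs across all documents."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        correct = sum(
            1 for name, value in pred.items()
            if name in ref and canon(value) == canon(ref[name])
        )
        tp += correct
        fp += len(pred) - correct  # predicted fields that are wrong or spurious
        fn += len(ref) - correct   # reference fields that were missed or wrong
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With a prediction that differs from the reference only in casing or spacing, `field_f1(preds, refs)` penalises the field while `field_f1(preds, refs, canon=normalize)` accepts it, which is why the choice of matching rule needs to be reported alongside the score.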

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 1
- seed: 42
- optimizer: AdamW, fused PyTorch implementation (OptimizerNames.ADAMW_TORCH_FUSED), with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine_with_restarts
- lr_scheduler_warmup_steps: 0.1 (a fractional value, so presumably a warmup ratio rather than a step count)
- num_epochs: 10
- mixed_precision_training: Native AMP
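To make the learning-rate settings easier to picture, the sketch below computes the rate at a given step for a cosine schedule with hard restarts and linear warmup. It rests on several assumptions: the warmup value 0.1 is read as a fraction of the 3000 total steps shown in the training results, the number of cycles is 1 (the Transformers default, as far as I know), and the formula follows my understanding of `get_cosine_with_hard_restarts_schedule_with_warmup` rather than being copied from the training script.

```python
import math


def lr_at_step(step, total_steps=3000, base_lr=1e-4, warmup=0.1, num_cycles=1):
    """Learning rate at `step` for linear warmup followed by cosine with hard restarts."""
    # Assumption: values below 1 are a ratio of total steps, otherwise a step count.
    warmup_steps = round(warmup * total_steps) if warmup < 1 else int(warmup)

    if step < warmup_steps:
        # Linear ramp from 0 up to the base learning rate.
        return base_lr * step / max(1, warmup_steps)

    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0

    # Each cycle restarts the cosine decay from the base learning rate.
    cycle_position = (num_cycles * progress) % 1.0
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_position)))
```

Under these assumptions the rate climbs linearly over the first 300 steps (the first epoch), peaks at 0.0001, and then decays along a single cosine curve to zero at step 3000.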

Training results

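| Training Loss | Epoch | Step | Validation Loss | F1     |
|---------------|-------|------|-----------------|--------|
| 3.1072        | 1.0   | 300  | 2.9769          | 0.0    |
| 2.6572        | 2.0   | 600  | 2.8684          | 0.0    |
| 2.4810        | 3.0   | 900  | 2.6349          | 0.0    |
| 1.7941        | 4.0   | 1200 | 1.6395          | 0.0    |
| 0.8458        | 5.0   | 1500 | 1.0680          | 0.2173 |
| 0.6198        | 6.0   | 1800 | 0.7713          | 0.4835 |
| 0.1999        | 7.0   | 2100 | 0.4331          | 0.5700 |
| 0.0946        | 8.0   | 2400 | 0.3844          | 0.5907 |
| 0.1020        | 9.0   | 2700 | 0.4066          | 0.4294 |
| 0.0842        | 10.0  | 3000 | 0.5022          | 0.4665 |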
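The results show validation F1 peaking at epoch 8 (0.5907) and falling in the last two epochs, so the final checkpoint is not the strongest one by either metric. A small sketch of selecting a checkpoint from this history follows; the `EvalRow` type and helper names are illustrative, and whether the published weights come from epoch 8 or epoch 10 is not stated in this card.

```python
from typing import List, NamedTuple


class EvalRow(NamedTuple):
    epoch: int
    step: int
    train_loss: float
    val_loss: float
    f1: float


# Values copied from the training results above.
HISTORY = [
    EvalRow(1, 300, 3.1072, 2.9769, 0.0),
    EvalRow(2, 600, 2.6572, 2.8684, 0.0),
    EvalRow(3, 900, 2.4810, 2.6349, 0.0),
    EvalRow(4, 1200, 1.7941, 1.6395, 0.0),
    EvalRow(5, 1500, 0.8458, 1.0680, 0.2173),
    EvalRow(6, 1800, 0.6198, 0.7713, 0.4835),
    EvalRow(7, 2100, 0.1999, 0.4331, 0.5700),
    EvalRow(8, 2400, 0.0946, 0.3844, 0.5907),
    EvalRow(9, 2700, 0.1020, 0.4066, 0.4294),
    EvalRow(10, 3000, 0.0842, 0.5022, 0.4665),
]


def best_by_f1(history: List[EvalRow]) -> EvalRow:
    """Row with the highest validation F1."""
    return max(history, key=lambda row: row.f1)


def best_by_val_loss(history: List[EvalRow]) -> EvalRow:
    """Row with the lowest validation loss."""
    return min(history, key=lambda row: row.val_loss)
```

Both criteria select the same row here, epoch 8 at step 2400, while training loss keeps falling afterwards, a pattern consistent with overfitting to the synthetic templates.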