---
library_name: transformers
license: mit
base_model: SCUT-DLVCLab/lilt-roberta-en-base
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- document-ai
- layout-aware-model
- synthetic-data
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: LiLTInvoiceCzech-V0
  results: []
---

# LiLTInvoiceCzech (V0 – Synthetic Templates Only)

This model is a fine-tuned version of [SCUT-DLVCLab/lilt-roberta-en-base](https://huggingface.co/SCUT-DLVCLab/lilt-roberta-en-base) for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set (precision, recall, F1, and accuracy match the epoch 8 row of the training results table):
- Loss: 0.1929  
- Precision: 0.6036  
- Recall: 0.7355  
- F1: 0.6631  
- Accuracy: 0.9645  
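
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported value can be reproduced from the two numbers above. This is a minimal sketch; the helper name `f1_score` is illustrative and not part of the training code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation results listed above.
reported_precision = 0.6036
reported_recall = 0.7355

print(round(f1_score(reported_precision, reported_recall), 4))  # 0.6631
```

The large gap between accuracy and F1 would be expected if accuracy is computed over all tokens, most of which carry no field label, while precision, recall, and F1 are computed over extracted fields. The card does not state how the metrics were computed, so treat this reading as an assumption.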

---

## Model description

LiLTInvoiceCzech (V0) is a layout-aware model based on the LiLT architecture, designed for document understanding tasks.

The model performs token-level classification with explicit use of layout information (bounding boxes), allowing it to better capture spatial relationships between invoice fields such as:
- supplier  
- customer  
- invoice number  
- bank details  
- totals  
- dates  

This version is trained exclusively on synthetically generated invoice templates.
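
LiLT, like the LayoutLM family, expects one bounding box per token with coordinates scaled to the 0 to 1000 range relative to the page size. Below is a minimal sketch of that normalization, assuming pixel boxes in `(x0, y0, x1, y1)` order. The function name and the sample page size are illustrative and not taken from the training pipeline.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 grid."""
    x0, y0, x1, y1 = bbox

    def scale(value, size):
        # Clamp so that boxes slightly outside the page stay in range.
        return max(0, min(1000, int(1000 * value / size)))

    return [
        scale(x0, page_width),
        scale(y0, page_height),
        scale(x1, page_width),
        scale(y1, page_height),
    ]


# Hypothetical word box on a 1240 x 1754 pixel page (A4 at 150 dpi).
print(normalize_bbox((124, 175, 620, 210), page_width=1240, page_height=1754))
# [100, 99, 500, 119]
```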

---

## Training data

The dataset consists of:

- synthetically generated invoices  
- fixed template layouts  
- associated bounding box annotations for each token  

Key properties:
- consistent spatial structure  
- clean and noise-free data  
- precise alignment between text and layout  
- no real-world documents  

This represents the **baseline dataset** for layout-aware models in the pipeline.

---

## Role in the pipeline

This model corresponds to:

**V0 – Synthetic template-based dataset only**

It is used to:
- establish a baseline for the LiLT architecture  
- compare layout-aware vs text-only models (e.g., BERT)  
- evaluate the benefit of spatial features in a controlled setting  

---

## Intended uses

- Document AI research with layout-aware models  
- Benchmarking LiLT on structured documents  
- Comparison with other architectures (BERT, LayoutLMv3, etc.)  
- Czech invoice information extraction  

---

## Limitations

- Trained only on synthetic data with fixed layouts  
- Limited robustness to layout variability  
- No exposure to real-world noise (OCR errors, distortions)  
- Synthetic layouts may not reflect real invoice diversity  

---

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 16
- eval_batch_size: 2
- seed: 42
- optimizer: AdamW (fused PyTorch implementation, `OptimizerNames.ADAMW_TORCH_FUSED`) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 0.1 (a fractional value, presumably a warmup ratio, i.e. 10% of the training steps)
- num_epochs: 10
- mixed_precision_training: Native AMP
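
To make the schedule concrete, the sketch below computes the learning rate of a linear schedule with warmup. It rests on two assumptions that the list above does not confirm: that the warmup value 0.1 is interpreted as a fraction of the total steps, and that training ran for 750 optimizer steps (the last step reported in the training results). The function name `linear_lr` is illustrative.

```python
def linear_lr(step, base_lr=3e-5, total_steps=750, warmup_fraction=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    # Assumption: 0.1 is a fraction of the total steps, giving 75 warmup steps.
    warmup_steps = round(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 75, 375, 750):
    print(step, linear_lr(step))
```

Under these assumptions the peak rate of 3e-05 is reached at the end of the first epoch (step 75) and decays to zero at step 750.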

---

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log        | 1.0   | 75   | 0.2174          | 0.2653    | 0.3038 | 0.2832 | 0.9430   |
| No log        | 2.0   | 150  | 0.1504          | 0.5052    | 0.5751 | 0.5379 | 0.9642   |
| No log        | 3.0   | 225  | 0.1508          | 0.5626    | 0.6365 | 0.5973 | 0.9650   |
| No log        | 4.0   | 300  | 0.1742          | 0.5192    | 0.6689 | 0.5846 | 0.9593   |
| No log        | 5.0   | 375  | 0.1863          | 0.5153    | 0.6877 | 0.5892 | 0.9579   |
| No log        | 6.0   | 450  | 0.1878          | 0.5557    | 0.7065 | 0.6221 | 0.9605   |
| 0.1991        | 7.0   | 525  | 0.2189          | 0.5435    | 0.7253 | 0.6213 | 0.9578   |
| 0.1991        | 8.0   | 600  | 0.1927          | 0.6036    | 0.7355 | 0.6631 | 0.9645   |
| 0.1991        | 9.0   | 675  | 0.2133          | 0.5357    | 0.7167 | 0.6131 | 0.9583   |
| 0.1991        | 10.0  | 750  | 0.2198          | 0.5235    | 0.7235 | 0.6074 | 0.9569   |
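
The table shows validation loss bottoming out at epoch 2 while F1 keeps improving until epoch 8, so the choice of selection metric changes which checkpoint counts as best. The small sketch below reads this off the table values. It only restates the numbers above and makes no claim about how the released checkpoint was actually chosen.

```python
# (epoch, validation_loss, f1) copied from the training results table.
results = [
    (1, 0.2174, 0.2832),
    (2, 0.1504, 0.5379),
    (3, 0.1508, 0.5973),
    (4, 0.1742, 0.5846),
    (5, 0.1863, 0.5892),
    (6, 0.1878, 0.6221),
    (7, 0.2189, 0.6213),
    (8, 0.1927, 0.6631),
    (9, 0.2133, 0.6131),
    (10, 0.2198, 0.6074),
]

best_by_f1 = max(results, key=lambda row: row[2])
best_by_loss = min(results, key=lambda row: row[1])

print("best epoch by F1:", best_by_f1[0])                  # 8
print("best epoch by validation loss:", best_by_loss[0])   # 2
```

With the Hugging Face `Trainer`, selection by F1 is typically configured through `load_best_model_at_end=True` and `metric_for_best_model="f1"` in `TrainingArguments`; whether these were set for this run is not stated in the card.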

---

## Framework versions

- Transformers 5.0.0  
- PyTorch 2.10.0+cu128  
- Datasets 4.0.0  
- Tokenizers 0.22.2