---
library_name: transformers
license: mit
base_model: SCUT-DLVCLab/lilt-roberta-en-base
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- document-ai
- layout-aware-model
- synthetic-data
- hybrid-data
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: LiLTInvoiceCzech-V2
  results: []
---

# LiLTInvoiceCzech (V2 – Synthetic + Random Layout + Real Layout Injection)

This model is a fine-tuned version of [SCUT-DLVCLab/lilt-roberta-en-base](https://huggingface.co/SCUT-DLVCLab/lilt-roberta-en-base) for structured information extraction from Czech invoices.

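It achieves the following results on the evaluation set. The metrics match the epoch 7 row of the training results table, which has the highest F1 of the run: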
- Loss: 0.1123  
- Precision: 0.7716  
- Recall: 0.7782  
- F1: 0.7749  
- Accuracy: 0.9783  
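
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported value can be recomputed from the two numbers above. The snippet below is a small sketch; the function name `f1_score` is illustrative and not part of the training code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the evaluation set.
f1 = f1_score(0.7716, 0.7782)
print(f"{f1:.4f}")  # prints 0.7749
```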

---

## Model description

LiLTInvoiceCzech (V2) is the stage of the training pipeline that combines layout-aware modeling with layouts taken from real invoices.

The model performs token-level classification using both textual and spatial (bounding box) features to extract invoice fields:
- supplier  
- customer  
- invoice number  
- bank details  
- totals  
- dates  
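
The spatial input mentioned above is one bounding box per token. LiLT, like the LayoutLM family, expects each box as integer `[x0, y0, x1, y1]` coordinates scaled to a 0-1000 grid relative to the page size. The following is a minimal sketch of that normalization, assuming pixel-space boxes from an OCR engine; the helper name `normalize_bbox`, the sample words, and the page size are illustrative and do not come from the training pipeline.

```python
from typing import List, Sequence


def normalize_bbox(box: Sequence[float], page_width: float, page_height: float) -> List[int]:
    """Scale a pixel-space [x0, y0, x1, y1] box to the 0-1000 integer grid."""
    x0, y0, x1, y1 = box
    scaled = [
        1000 * x0 / page_width,
        1000 * y0 / page_height,
        1000 * x1 / page_width,
        1000 * y1 / page_height,
    ]
    # Clamp to guard against OCR boxes that slightly overshoot the page.
    return [max(0, min(1000, int(v))) for v in scaled]


# Hypothetical word boxes on a 1240 x 1754 px page (A4 at 150 dpi).
words = ["Faktura", "2024001"]
pixel_boxes = [(100, 80, 260, 110), (280, 80, 420, 110)]
boxes = [normalize_bbox(b, 1240, 1754) for b in pixel_boxes]
print(list(zip(words, boxes)))
```

The words and their normalized boxes would then be passed together to the tokenizer and model, so that every token carries both its text and its position.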

This version introduces **real layout injection**: synthetic content is rendered into layouts taken from real invoices, so the training data is spatially closer to real documents.

---

## Training data

The dataset consists of three components:

1. **Synthetic template-based invoices**  
2. **Synthetic invoices with randomized layouts**  
3. **Hybrid invoices with real layouts and synthetic content**  

### Real layout injection

In the hybrid dataset:
- real invoice documents are used as layout templates  
- original content is replaced with synthetic data  
- new content is rendered into authentic spatial structures  

This preserves:
- real-world layout complexity  
- spacing and alignment patterns  
- document-specific structure  

while maintaining:
- full annotation control  
- label consistency  
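
The idea can be sketched in a few lines: each field region of a real invoice keeps its position and label, and only the text is replaced by generated content. This is a simplified illustration, not the pipeline used to build the dataset; `LayoutSlot`, `AnnotatedField`, `inject_content`, the generators, and the sample coordinates are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

BBox = Tuple[int, int, int, int]


@dataclass
class LayoutSlot:
    """One field region taken from a real invoice: where it sits and what it means."""
    bbox: BBox
    label: str


@dataclass
class AnnotatedField:
    """Synthetic text placed into a real-layout slot, with its label attached."""
    text: str
    bbox: BBox
    label: str


def inject_content(
    slots: List[LayoutSlot],
    generators: Dict[str, Callable[[random.Random], str]],
    rng: random.Random,
) -> List[AnnotatedField]:
    """Fill each slot with generated text; position and label are kept unchanged."""
    return [
        AnnotatedField(text=generators[slot.label](rng), bbox=slot.bbox, label=slot.label)
        for slot in slots
    ]


# Hypothetical generators and template; real ones would cover all invoice fields.
generators = {
    "invoice_number": lambda r: f"{r.randint(2020, 2025)}{r.randint(0, 9999):04d}",
    "total": lambda r: f"{r.randint(100, 99999)},00 Kč",
}
template = [
    LayoutSlot(bbox=(620, 40, 900, 70), label="invoice_number"),
    LayoutSlot(bbox=(700, 880, 940, 910), label="total"),
]
fields = inject_content(template, generators, random.Random(42))
for field in fields:
    print(field.label, field.bbox, field.text)
```

Because the label travels with the slot rather than being inferred from the text, annotations stay consistent no matter what content is generated.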

---

## Role in the pipeline

This model corresponds to:

**V2 – Synthetic + layout augmentation + real layout injection**

It is used to:
- bridge the gap between synthetic and real data  
- evaluate the impact of realistic layouts on a layout-aware model  
- compare with:
  - V0–V1 (fully synthetic)  
  - V3 (real data fine-tuning)  

---

## Intended uses

- Advanced document AI research  
- Evaluation of hybrid synthetic-real datasets  
- Benchmarking layout-aware architectures  
- Czech invoice information extraction  

---

## Limitations

- Text content is still synthetic  
- Does not fully capture linguistic variability of real invoices  
- Limited exposure to OCR noise and scanning artifacts  
- May still struggle with rare real-world edge cases  

---

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 16
- eval_batch_size: 2
- seed: 42
- optimizer: AdamW, fused PyTorch implementation (`OptimizerNames.ADAMW_TORCH_FUSED`), with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
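- lr_scheduler_warmup_steps: 0.1 (a fractional value, presumably interpreted as a warmup ratio, i.e. about 58 of the 580 training steps)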
- num_epochs: 10
- mixed_precision_training: Native AMP
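
To make the learning-rate settings concrete, the sketch below computes a linear schedule with warmup in plain Python. It rests on two assumptions that the card does not confirm: that the warmup value of 0.1 means 10% of the 580 total steps (58 steps), and that the rate decays linearly to zero afterwards. The function name `linear_schedule_lr` is illustrative.

```python
def linear_schedule_lr(
    step: int,
    peak_lr: float = 3e-5,
    total_steps: int = 580,   # 10 epochs x 58 steps, as in the results table
    warmup_steps: int = 58,   # assumption: 0.1 of the total steps
) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 29, 58, 319, 580):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

Under these assumptions the rate peaks at 3e-05 at the end of the first epoch and is back to half that value at step 319, midway through the decay phase.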

---

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log        | 1.0   | 58   | 0.1026          | 0.5982    | 0.6758 | 0.6346 | 0.9680   |
| No log        | 2.0   | 116  | 0.0993          | 0.7140    | 0.6775 | 0.6953 | 0.9745   |
| No log        | 3.0   | 174  | 0.1024          | 0.7227    | 0.7116 | 0.7171 | 0.9756   |
| No log        | 4.0   | 232  | 0.1198          | 0.6538    | 0.7543 | 0.7005 | 0.9708   |
| No log        | 5.0   | 290  | 0.1150          | 0.7157    | 0.7218 | 0.7188 | 0.9749   |
| No log        | 6.0   | 348  | 0.1133          | 0.7095    | 0.7628 | 0.7352 | 0.9750   |
| No log        | 7.0   | 406  | 0.1122          | 0.7716    | 0.7782 | 0.7749 | 0.9783   |
| No log        | 8.0   | 464  | 0.1168          | 0.7311    | 0.7747 | 0.7523 | 0.9762   |
| 0.0341        | 9.0   | 522  | 0.1237          | 0.7249    | 0.7645 | 0.7442 | 0.9757   |
| 0.0341        | 10.0  | 580  | 0.1218          | 0.7447    | 0.7867 | 0.7651 | 0.9768   |
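
The table can be read programmatically to see which epoch is best under which criterion. The snippet below copies the rows above into a list and picks the epoch with the highest F1 and the one with the lowest validation loss; the variable names are illustrative.

```python
# (epoch, validation_loss, precision, recall, f1, accuracy), copied from the table above.
results = [
    (1, 0.1026, 0.5982, 0.6758, 0.6346, 0.9680),
    (2, 0.0993, 0.7140, 0.6775, 0.6953, 0.9745),
    (3, 0.1024, 0.7227, 0.7116, 0.7171, 0.9756),
    (4, 0.1198, 0.6538, 0.7543, 0.7005, 0.9708),
    (5, 0.1150, 0.7157, 0.7218, 0.7188, 0.9749),
    (6, 0.1133, 0.7095, 0.7628, 0.7352, 0.9750),
    (7, 0.1122, 0.7716, 0.7782, 0.7749, 0.9783),
    (8, 0.1168, 0.7311, 0.7747, 0.7523, 0.9762),
    (9, 0.1237, 0.7249, 0.7645, 0.7442, 0.9757),
    (10, 0.1218, 0.7447, 0.7867, 0.7651, 0.9768),
]

best_f1 = max(results, key=lambda row: row[4])
lowest_loss = min(results, key=lambda row: row[1])
print("best F1:", best_f1[0], best_f1[4])                   # prints: best F1: 7 0.7749
print("lowest loss:", lowest_loss[0], lowest_loss[1])       # prints: lowest loss: 2 0.0993
```

Validation loss is lowest at epoch 2, while F1 peaks at epoch 7, so the choice of checkpoint depends on which metric is used for selection.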

---

## Framework versions

- Transformers 5.0.0  
- PyTorch 2.10.0+cu128  
- Datasets 4.0.0  
- Tokenizers 0.22.2