---
library_name: transformers
license: mit
base_model: SCUT-DLVCLab/lilt-roberta-en-base
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- document-ai
- layout-aware-model
- synthetic-data
- layout-augmentation
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: LiLTInvoiceCzech-V1
  results: []
---

# LiLTInvoiceCzech (V1 – Synthetic + Random Layout)

This model is a fine-tuned version of [SCUT-DLVCLab/lilt-roberta-en-base](https://huggingface.co/SCUT-DLVCLab/lilt-roberta-en-base) for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:
- Loss: 0.1907  
- Precision: 0.6326  
- Recall: 0.7491  
- F1: 0.6859  
- Accuracy: 0.9660  
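
As a quick consistency check, the reported F1 can be reproduced from the reported precision and recall, since F1 is their harmonic mean. The helper below is a minimal sketch. The function name is illustrative and is not part of the training code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation summary above.
reported = f1_score(0.6326, 0.7491)
print(round(reported, 4))  # 0.6859
```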

---

## Model description

LiLTInvoiceCzech (V1) extends the V0 baseline by introducing layout variability into the training data.

The model performs token-level classification using both textual and spatial (bounding box) information to extract structured invoice fields:
- supplier  
- customer  
- invoice number  
- bank details  
- totals  
- dates  

Compared to V0, this version is trained on synthetically generated invoices with **randomized layouts**, improving robustness to spatial variations.
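
The token-level classification described above can be sketched as follows. This is an illustration under several assumptions, not a verified recipe: it assumes the checkpoint is published under the Hub ID `TomasFAV/LiLTInvoiceCzechV01`, that the repository ships a layout-aware tokenizer accepting `words` and `boxes` (as the base LiLT checkpoint does), and that words and pixel-space boxes come from your own OCR step. The helper names `normalize_box` and `predict_labels` are hypothetical. LiLT expects bounding boxes scaled to the 0-1000 range.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def normalize_box(box: Box, page_width: int, page_height: int) -> List[int]:
    """Scale a pixel-space box to the 0-1000 coordinate range LiLT expects."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


def predict_labels(
    words: Sequence[str],
    boxes: Sequence[Box],
    page_size: Tuple[int, int],
    model_id: str = "TomasFAV/LiLTInvoiceCzechV01",  # assumed Hub ID
) -> List[Tuple[str, str]]:
    """Return one (word, predicted label) pair per input word."""
    # Heavy imports stay local so normalize_box is usable without them.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    model.eval()

    width, height = page_size
    norm_boxes = [normalize_box(b, width, height) for b in boxes]
    encoding = tokenizer(
        list(words), boxes=norm_boxes, return_tensors="pt", truncation=True
    )
    with torch.no_grad():
        logits = model(**encoding).logits
    pred_ids = logits.argmax(-1)[0].tolist()

    # Keep the prediction of the first sub-token of each word.
    labels = {}
    for idx, word_id in enumerate(encoding.word_ids(0)):
        if word_id is not None and word_id not in labels:
            labels[word_id] = model.config.id2label[pred_ids[idx]]
    return [(words[i], labels[i]) for i in sorted(labels)]
```

The label names come from `model.config.id2label`, so the sketch does not hard-code a tag set.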

---

## Training data

The dataset consists of:

- synthetically generated invoices based on templates  
- augmented variants with randomized layout structures  
- corresponding bounding box annotations  

Key properties:
- variable positioning of fields  
- layout perturbations (shifts, spacing, ordering)  
- preserved label consistency  
- fully synthetic data  

This dataset introduces **layout diversity**, which is especially important for layout-aware models.
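
To make the idea of layout perturbation concrete, here is a minimal sketch of one possible augmentation step: each field is shifted by a random offset, the field order is shuffled, and the labels are left untouched. It is not the generator used to build this dataset. The `Field` type, the `perturb_layout` function, and the `max_shift` parameter are all hypothetical.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Field:
    text: str
    label: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def perturb_layout(
    fields: Sequence[Field],
    page_size: Tuple[int, int],
    max_shift: int = 40,
    seed: Optional[int] = None,
) -> List[Field]:
    """Shift every field randomly and shuffle their order; labels are preserved.

    Assumes each box is no larger than the page.
    """
    rng = random.Random(seed)
    width, height = page_size
    shifted = []
    for field in fields:
        x0, y0, x1, y1 = field.box
        w, h = x1 - x0, y1 - y0
        # Clamp the new top-left corner so the box stays on the page.
        nx0 = min(max(x0 + rng.randint(-max_shift, max_shift), 0), width - w)
        ny0 = min(max(y0 + rng.randint(-max_shift, max_shift), 0), height - h)
        shifted.append(replace(field, box=(nx0, ny0, nx0 + w, ny0 + h)))
    rng.shuffle(shifted)  # vary the order in which fields appear
    return shifted


example = [
    Field("FA-2024-001", "invoice_number", (50, 50, 200, 70)),
    Field("12500 CZK", "total", (400, 700, 520, 720)),
]
augmented = perturb_layout(example, page_size=(600, 800), seed=0)
```

Because only the boxes and ordering change, the text-to-label mapping of the original template carries over to every augmented variant.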

---

## Role in the pipeline

This model corresponds to:

**V1 – Synthetic templates + randomized layouts**

It is used to:
- evaluate the effect of layout variability on LiLT  
- compare against:
  - V0 (fixed layouts)  
  - later stages with hybrid and real data (V2, V3)  
- analyze how layout-aware models benefit from synthetic augmentation  

---

## Intended uses

- Research in layout-aware document understanding  
- Evaluation of spatial robustness in NLP models  
- Benchmarking LiLT against text-only models (e.g. BERT)  
- Czech invoice information extraction  

---

## Limitations

- Still trained only on synthetic data  
- Layout variability is artificial  
- No real-world noise (OCR errors, distortions)  
- May not fully generalize to real invoice distributions  

---

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 16
- eval_batch_size: 2
- seed: 42
- optimizer: AdamW (fused PyTorch implementation, `adamw_torch_fused`) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
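- lr_scheduler_warmup_steps: 0.1 (a fractional value, presumably a fraction of the total training steps rather than an absolute step count)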
- num_epochs: 10
- mixed_precision_training: Native AMP
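
The learning-rate trajectory implied by these settings can be sketched in plain Python. This is an illustrative reimplementation of a linear warmup followed by linear decay, not the library's scheduler. It assumes the warmup value of 0.1 means 10% of the run, and it takes the total of 380 optimizer steps from the training results table (38 steps per epoch over 10 epochs). The function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(
    step: int,
    base_lr: float = 3e-5,    # learning_rate from the list above
    total_steps: int = 380,   # 38 steps/epoch * 10 epochs
    warmup_steps: int = 38,   # assumed: 10% of total_steps
) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 19, 38, 209, 380):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the peak rate of 3e-05 is reached at the end of the first epoch, and the rate is half of that at step 209.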

---

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log        | 1.0   | 38   | 0.1676          | 0.5917    | 0.6826 | 0.6339 | 0.9639   |
| No log        | 2.0   | 76   | 0.1810          | 0.6123    | 0.6604 | 0.6355 | 0.9643   |
| No log        | 3.0   | 114  | 0.1906          | 0.6317    | 0.7491 | 0.6854 | 0.9660   |
| No log        | 4.0   | 152  | 0.1764          | 0.6380    | 0.6587 | 0.6482 | 0.9659   |
| No log        | 5.0   | 190  | 0.1737          | 0.6544    | 0.6689 | 0.6616 | 0.9696   |
| No log        | 6.0   | 228  | 0.1752          | 0.6728    | 0.6911 | 0.6818 | 0.9695   |
| No log        | 7.0   | 266  | 0.1951          | 0.6083    | 0.6758 | 0.6403 | 0.9658   |
| No log        | 8.0   | 304  | 0.1962          | 0.6162    | 0.6741 | 0.6438 | 0.9656   |
| No log        | 9.0   | 342  | 0.1939          | 0.6700    | 0.6962 | 0.6828 | 0.9701   |
| No log        | 10.0  | 380  | 0.1931          | 0.6645    | 0.6928 | 0.6784 | 0.9696   |
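
The table can also be read programmatically, for example to see which epoch would be selected under different criteria. The sketch below copies the rows above into a small structure. `EpochResult` is an illustrative type. The card does not state which criterion was used to pick the published checkpoint, although the headline metrics are close to the epoch 3 row.

```python
from typing import NamedTuple


class EpochResult(NamedTuple):
    epoch: int
    val_loss: float
    precision: float
    recall: float
    f1: float
    accuracy: float


# Rows copied from the training results table.
RESULTS = [
    EpochResult(1, 0.1676, 0.5917, 0.6826, 0.6339, 0.9639),
    EpochResult(2, 0.1810, 0.6123, 0.6604, 0.6355, 0.9643),
    EpochResult(3, 0.1906, 0.6317, 0.7491, 0.6854, 0.9660),
    EpochResult(4, 0.1764, 0.6380, 0.6587, 0.6482, 0.9659),
    EpochResult(5, 0.1737, 0.6544, 0.6689, 0.6616, 0.9696),
    EpochResult(6, 0.1752, 0.6728, 0.6911, 0.6818, 0.9695),
    EpochResult(7, 0.1951, 0.6083, 0.6758, 0.6403, 0.9658),
    EpochResult(8, 0.1962, 0.6162, 0.6741, 0.6438, 0.9656),
    EpochResult(9, 0.1939, 0.6700, 0.6962, 0.6828, 0.9701),
    EpochResult(10, 0.1931, 0.6645, 0.6928, 0.6784, 0.9696),
]

best_f1 = max(RESULTS, key=lambda r: r.f1)
lowest_loss = min(RESULTS, key=lambda r: r.val_loss)
print(best_f1.epoch, lowest_loss.epoch)  # 3 1
```

F1 peaks at epoch 3 while validation loss is lowest at epoch 1, so the two criteria would select different checkpoints.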

---

## Framework versions

- Transformers 5.0.0  
- PyTorch 2.10.0+cu128  
- Datasets 4.0.0  
- Tokenizers 0.22.2