---
library_name: transformers
license: mit
base_model: SCUT-DLVCLab/lilt-roberta-en-base
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- document-ai
- layout-aware-model
- synthetic-data
- hybrid-data
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: LiLTInvoiceCzech-V2
  results: []
---
# LiLTInvoiceCzech (V2 – Synthetic + Random Layout + Real Layout Injection)
This model is a fine-tuned version of [SCUT-DLVCLab/lilt-roberta-en-base](https://huggingface.co/SCUT-DLVCLab/lilt-roberta-en-base) for structured information extraction from Czech invoices.
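It achieves the following results on the evaluation set (the precision, recall, F1, and accuracy values correspond to epoch 7, the row with the highest F1 in the training results table):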
- Loss: 0.1123
- Precision: 0.7716
- Recall: 0.7782
- F1: 0.7749
- Accuracy: 0.9783
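
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the two values above. A minimal sketch; the function name `f1_score` is illustrative and not taken from the training code:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the evaluation set in this model card.
recomputed_f1 = f1_score(0.7716, 0.7782)
print(round(recomputed_f1, 4))  # 0.7749
```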
---
## Model description
LiLTInvoiceCzech (V2) is the pipeline stage that follows the fully synthetic V0–V1 models; it combines layout-aware modeling with realistic document structures.
The model performs token-level classification using both textual and spatial (bounding box) features to extract invoice fields:
- supplier
- customer
- invoice number
- bank details
- totals
- dates
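
The spatial features are word bounding boxes. LiLT, like the LayoutLM family, conventionally expects each box as `(x0, y0, x1, y1)` scaled to a 0–1000 range relative to the page size. A minimal sketch of that normalization, assuming pixel coordinates from an OCR engine; the helper name, the page size, and the sample words are illustrative and not part of this model's released code:

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = bbox
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp in case the OCR engine reports a box slightly outside the page.
    return [min(1000, max(0, v)) for v in scaled]


# Hypothetical OCR output for a page rendered at 1240 x 1754 pixels.
words = ["Faktura", "č.", "2024001"]
pixel_boxes = [(100, 80, 260, 110), (270, 80, 300, 110), (310, 80, 450, 110)]
boxes = [normalize_bbox(b, 1240, 1754) for b in pixel_boxes]
```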
This version introduces **real layout injection**: part of the training set reuses the layouts of real invoices, which makes the training data more realistic.

---
## Training data
The dataset consists of three components:
1. **Synthetic template-based invoices**
2. **Synthetic invoices with randomized layouts**
3. **Hybrid invoices with real layouts and synthetic content**
### Real layout injection
In the hybrid dataset:

- real invoice documents are used as layout templates
- original content is replaced with synthetic data
- new content is rendered into authentic spatial structures

This preserves:

- real-world layout complexity
- spacing and alignment patterns
- document-specific structure

while maintaining:

- full annotation control
- label consistency
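
The idea can be sketched in a few lines: keep the slot positions of a real document and fill them with generated values, so that every word comes with a known label. This is only an illustration under simplifying assumptions; the template, the field names, the value generators, and the even split of a slot among its words are all hypothetical and do not reproduce the actual data pipeline:

```python
import random

# Hypothetical layout template: field slots whose boxes come from a real invoice.
TEMPLATE = [
    {"field": "supplier", "bbox": (60, 40, 400, 70)},
    {"field": "invoice_number", "bbox": (600, 40, 900, 70)},
    {"field": "total", "bbox": (600, 900, 900, 930)},
]


def synth_value(field, rng):
    """Generate a synthetic value for a field (illustrative generators only)."""
    if field == "supplier":
        return rng.choice(["Alfa s.r.o.", "Beta Trade a.s."])
    if field == "invoice_number":
        return str(rng.randint(2024000001, 2024999999))
    if field == "total":
        return f"{rng.randint(100, 99999)},00 Kč"
    raise ValueError(f"unknown field: {field}")


def inject(template, rng):
    """Fill each real slot with synthetic words and return words, boxes, labels."""
    words, boxes, labels = [], [], []
    for slot in template:
        tokens = synth_value(slot["field"], rng).split()
        x0, y0, x1, y1 = slot["bbox"]
        width = (x1 - x0) / len(tokens)  # split the slot evenly among its words
        for i, token in enumerate(tokens):
            words.append(token)
            boxes.append((round(x0 + i * width), y0, round(x0 + (i + 1) * width), y1))
            labels.append(slot["field"])  # labels are known by construction
    return words, boxes, labels


words, boxes, labels = inject(TEMPLATE, random.Random(0))
```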
---
## Role in the pipeline
This model corresponds to:

**V2 – Synthetic + layout augmentation + real layout injection**

It is used to:

- bridge the gap between synthetic and real data
- evaluate the impact of realistic layouts on a layout-aware model
- compare with:
  - V0–V1 (fully synthetic)
  - V3 (real data fine-tuning)
---
## Intended uses
- Advanced document AI research
- Evaluation of hybrid synthetic-real datasets
- Benchmarking layout-aware architectures
- Czech invoice information extraction
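
For the extraction use case, word-level predictions have to be merged into field values. The sketch below assumes a BIO tagging scheme with label names such as `B-SUPPLIER`; both the scheme and the label names are assumptions, so check the `id2label` mapping in the model configuration before relying on them:

```python
def group_entities(words, tags):
    """Merge word-level BIO tags into (field, text) pairs."""
    entities = []
    current_field, current_words = None, []

    def flush():
        if current_field is not None:
            entities.append((current_field, " ".join(current_words)))

    for word, tag in zip(words, tags):
        prefix, _, field = tag.partition("-")
        if prefix == "B" or (prefix == "I" and field != current_field):
            # A new entity starts (an orphan I- tag is treated like B-).
            flush()
            current_field, current_words = field, [word]
        elif prefix == "I":
            current_words.append(word)
        else:
            # "O" or anything unexpected closes the open entity.
            flush()
            current_field, current_words = None, []
    flush()
    return entities


# Hypothetical predictions for a short invoice fragment.
words = ["Faktura", "č.", "2024001", "Dodavatel:", "Alfa", "s.r.o."]
tags = ["O", "O", "B-INVOICE_NUMBER", "O", "B-SUPPLIER", "I-SUPPLIER"]
fields = group_entities(words, tags)
```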
---
## Limitations
- Text content is still synthetic
- Does not fully capture linguistic variability of real invoices
- Limited exposure to OCR noise and scanning artifacts
- May still struggle with rare real-world edge cases
---
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 16
- eval_batch_size: 2
- seed: 42
- optimizer: AdamW (`OptimizerNames.ADAMW_TORCH_FUSED`) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
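- lr_scheduler_warmup_steps: 0.1 (a fractional value, presumably interpreted as a share of the total training steps, i.e. 10% warmup)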
- num_epochs: 10
- mixed_precision_training: Native AMP
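
To make the schedule concrete, the sketch below computes the learning rate of a linear schedule with warmup in plain Python. It assumes that the warmup value of 0.1 denotes a fraction of the total steps and takes the total of 580 steps (58 per epoch over 10 epochs) from the training results table; the function name is illustrative and this is not the Trainer's own code:

```python
def linear_schedule_lr(step, total_steps, base_lr=3e-5, warmup_fraction=0.1):
    """Learning rate at `step`: linear warmup to base_lr, then linear decay to 0."""
    # Assumption: 0.1 is a fraction of the total number of steps.
    warmup_steps = round(total_steps * warmup_fraction)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 580  # 58 optimizer steps per epoch x 10 epochs

for step in (0, 29, 58, 319, 580):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```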
---
### Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log | 1.0 | 58 | 0.1026 | 0.5982 | 0.6758 | 0.6346 | 0.9680 |
| No log | 2.0 | 116 | 0.0993 | 0.7140 | 0.6775 | 0.6953 | 0.9745 |
| No log | 3.0 | 174 | 0.1024 | 0.7227 | 0.7116 | 0.7171 | 0.9756 |
| No log | 4.0 | 232 | 0.1198 | 0.6538 | 0.7543 | 0.7005 | 0.9708 |
| No log | 5.0 | 290 | 0.1150 | 0.7157 | 0.7218 | 0.7188 | 0.9749 |
| No log | 6.0 | 348 | 0.1133 | 0.7095 | 0.7628 | 0.7352 | 0.9750 |
| No log | 7.0 | 406 | 0.1122 | 0.7716 | 0.7782 | 0.7749 | 0.9783 |
| No log | 8.0 | 464 | 0.1168 | 0.7311 | 0.7747 | 0.7523 | 0.9762 |
| 0.0341 | 9.0 | 522 | 0.1237 | 0.7249 | 0.7645 | 0.7442 | 0.9757 |
| 0.0341 | 10.0 | 580 | 0.1218 | 0.7447 | 0.7867 | 0.7651 | 0.9768 |
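
The table can be read programmatically to see which epoch is best under each criterion. The snippet below copies the rows above and selects the epoch with the highest F1 and the one with the lowest validation loss; it is a reading aid only and makes no claim about how the published checkpoint was selected:

```python
# (epoch, validation_loss, precision, recall, f1, accuracy), copied from the table.
RESULTS = [
    (1, 0.1026, 0.5982, 0.6758, 0.6346, 0.9680),
    (2, 0.0993, 0.7140, 0.6775, 0.6953, 0.9745),
    (3, 0.1024, 0.7227, 0.7116, 0.7171, 0.9756),
    (4, 0.1198, 0.6538, 0.7543, 0.7005, 0.9708),
    (5, 0.1150, 0.7157, 0.7218, 0.7188, 0.9749),
    (6, 0.1133, 0.7095, 0.7628, 0.7352, 0.9750),
    (7, 0.1122, 0.7716, 0.7782, 0.7749, 0.9783),
    (8, 0.1168, 0.7311, 0.7747, 0.7523, 0.9762),
    (9, 0.1237, 0.7249, 0.7645, 0.7442, 0.9757),
    (10, 0.1218, 0.7447, 0.7867, 0.7651, 0.9768),
]

best_by_f1 = max(RESULTS, key=lambda row: row[4])
best_by_loss = min(RESULTS, key=lambda row: row[1])

print("highest F1:", best_by_f1[0], best_by_f1[4])          # epoch 7, F1 0.7749
print("lowest val. loss:", best_by_loss[0], best_by_loss[1])  # epoch 2, loss 0.0993
```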
---
## Framework versions
- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2