---
library_name: transformers
license: mit
base_model: SCUT-DLVCLab/lilt-roberta-en-base
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- document-ai
- layout-aware-model
- synthetic-data
- hybrid-data
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: LiLTInvoiceCzech-V2
  results: []
---

# LiLTInvoiceCzech (V2 – Synthetic + Random Layout + Real Layout Injection)

This model is a fine-tuned version of [SCUT-DLVCLab/lilt-roberta-en-base](https://huggingface.co/SCUT-DLVCLab/lilt-roberta-en-base) for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:
- Loss: 0.1123
- Precision: 0.7716
- Recall: 0.7782
- F1: 0.7749
- Accuracy: 0.9783

---

## Model description

LiLTInvoiceCzech (V2) represents an advanced stage in the pipeline, combining layout-aware modeling with realistic document structures.

The model performs token-level classification using both textual and spatial (bounding box) features to extract invoice fields:
- supplier
- customer
- invoice number
- bank details
- totals
- dates

This version introduces **real layout injection**, significantly improving the realism of training data.

---

## Training data

The dataset consists of three components:

1. **Synthetic template-based invoices**
2. **Synthetic invoices with randomized layouts**
3. **Hybrid invoices with real layouts and synthetic content**

### Real layout injection

In the hybrid dataset:
- real invoice documents are used as layout templates
- original content is replaced with synthetic data
- new content is rendered into authentic spatial structures

This preserves:
- real-world layout complexity
- spacing and alignment patterns
- document-specific structure

while maintaining:
- full annotation control
- label consistency

---

## Role in the pipeline

This model corresponds to:

**V2 – Synthetic + layout augmentation + real layout injection**

It is used to:
- bridge the gap between synthetic and real data
- evaluate the impact of realistic layouts on a layout-aware model
- compare with:
  - V0–V1 (fully synthetic)
  - V3 (real data fine-tuning)

---

## Intended uses

- Advanced document AI research
- Evaluation of hybrid synthetic-real datasets
- Benchmarking layout-aware architectures
- Czech invoice information extraction

---

## Limitations

- Text content is still synthetic
- Does not fully capture linguistic variability of real invoices
- Limited exposure to OCR noise and scanning artifacts
- May still struggle with rare real-world edge cases

---

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 16
- eval_batch_size: 2
- seed: 42
- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 0.1
- num_epochs: 10
- mixed_precision_training: Native AMP

---

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log        | 1.0   | 58   | 0.1026          | 0.5982    | 0.6758 | 0.6346 | 0.9680   |
| No log        | 2.0   | 116  | 0.0993          | 0.7140    | 0.6775 | 0.6953 | 0.9745   |
| No log        | 3.0   | 174  | 0.1024          | 0.7227    | 0.7116 | 0.7171 | 0.9756   |
| No log        | 4.0   | 232  | 0.1198          | 0.6538    | 0.7543 | 0.7005 | 0.9708   |
| No log        | 5.0   | 290  | 0.1150          | 0.7157    | 0.7218 | 0.7188 | 0.9749   |
| No log        | 6.0   | 348  | 0.1133          | 0.7095    | 0.7628 | 0.7352 | 0.9750   |
| No log        | 7.0   | 406  | 0.1122          | 0.7716    | 0.7782 | 0.7749 | 0.9783   |
| No log        | 8.0   | 464  | 0.1168          | 0.7311    | 0.7747 | 0.7523 | 0.9762   |
| 0.0341        | 9.0   | 522  | 0.1237          | 0.7249    | 0.7645 | 0.7442 | 0.9757   |
| 0.0341        | 10.0  | 580  | 0.1218          | 0.7447    | 0.7867 | 0.7651 | 0.9768   |

---

## Framework versions

- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2
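
---

The model description notes that the classifier consumes bounding boxes alongside text. LiLT, like other LayoutLM-family models, expects each box as integers on a 0-1000 grid relative to the page size. The following is a minimal sketch of that preprocessing step; the helper name `normalize_bbox`, the sample words, and the page dimensions are illustrative assumptions, not part of the original training code.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 integer grid."""
    x0, y0, x1, y1 = bbox
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp so OCR boxes that slightly overshoot the page stay in range.
    return [min(1000, max(0, v)) for v in scaled]


# Hypothetical OCR output for one invoice page (values are illustrative only).
words = ["Faktura", "2024001"]
pixel_boxes = [(100, 50, 300, 90), (320, 50, 480, 90)]
page_width, page_height = 1240, 1754

boxes = [normalize_bbox(b, page_width, page_height) for b in pixel_boxes]
```

The resulting `boxes` would typically be aligned to sub-word tokens and passed to the model as the `bbox` input next to `input_ids`; how the OCR words and boxes are obtained is outside the scope of this card.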
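
---

The real layout injection step described under "Training data" can be pictured with a small sketch: box positions and field labels come from a real invoice, while the text is regenerated synthetically. Everything below (the `LayoutSlot` class, the `GENERATORS` table, the `inject` function, and the coordinates) is hypothetical and only illustrates the idea; it is not the pipeline that produced this model's dataset.

```python
import random
from dataclasses import dataclass


@dataclass
class LayoutSlot:
    """One text region taken from a real invoice; the original text is discarded."""
    label: str   # field type, e.g. "invoice_number"
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates


# Hypothetical generators for synthetic field values.
GENERATORS = {
    "invoice_number": lambda rng: str(rng.randint(2024000001, 2024999999)),
    "total": lambda rng: f"{rng.randint(100, 99999)},00 Kč",
}


def inject(slots, seed=42):
    """Fill each real layout slot with synthetic text, keeping its box and label."""
    rng = random.Random(seed)
    return [
        {"text": GENERATORS[s.label](rng), "bbox": s.bbox, "label": s.label}
        for s in slots
    ]


# Illustrative template: two slots with made-up coordinates.
template = [
    LayoutSlot("invoice_number", (620, 80, 900, 110)),
    LayoutSlot("total", (700, 1500, 900, 1535)),
]
sample = inject(template)
```

Because the label travels with the slot, every generated word keeps a known annotation, which is the "full annotation control" property listed above.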
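
---

As a quick sanity check on the reported metrics, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the evaluation-set precision and recall (the epoch 7 row of the results table); the function name `f1_score` is just a local helper, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the evaluation set.
f1 = f1_score(0.7716, 0.7782)
print(round(f1, 4))  # 0.7749
```

The recomputed value agrees with the reported F1 of 0.7749 to four decimal places.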