LiLTInvoiceCzech (V3 – Full Pipeline with Real Data Fine-Tuning)

This model is a fine-tuned version of SCUT-DLVCLab/lilt-roberta-en-base for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:

  • Loss: 0.0358
  • Precision: 0.8840
  • Recall: 0.8976
  • F1: 0.8908
  • Accuracy: 0.9910
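
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the figures above can be recomputed from one another. The sketch below is illustrative only: the helper name `f1_from_pr` is hypothetical and not part of the training code. Because the reported precision and recall are themselves rounded to four decimals, the recomputed value should agree with the reported F1 only to within rounding.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the evaluation set.
f1 = f1_from_pr(0.8840, 0.8976)
print(f"recomputed F1: {f1:.4f} (reported: 0.8908)")
```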

Model description

LiLTInvoiceCzech (V3) is the final and best-performing model in the experimental pipeline.

The model performs token-level classification using both textual and spatial (bounding box) information to extract structured invoice fields:

  • supplier
  • customer
  • invoice number
  • bank details
  • totals
  • dates

By combining a layout-aware architecture with progressively more realistic training data, this version achieves strong performance on realistic invoice documents (F1 of 0.8908 on the evaluation set).
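
Token-level predictions usually need to be merged into field values before they are useful downstream. The sketch below shows one way to do that, assuming the model emits BIO-style tags. This card does not state the label scheme, so the tag names used here (`B-SUPPLIER`, `I-SUPPLIER`, `B-INVOICE_NUMBER`) and the helper `group_entities` are hypothetical; check the model's `id2label` mapping for the real labels.

```python
from typing import Dict, List, Optional


def group_entities(words: List[str], labels: List[str]) -> Dict[str, List[str]]:
    """Merge consecutive B-/I- tagged words into one string per entity span."""
    fields: Dict[str, List[str]] = {}
    span_type: Optional[str] = None
    span_words: List[str] = []

    # A trailing "O" sentinel guarantees the last open span is flushed.
    for word, label in zip(words + [""], labels + ["O"]):
        prefix, _, entity = label.partition("-")

        # An I- tag of the same type extends the span that is currently open.
        if prefix == "I" and entity == span_type:
            span_words.append(word)
            continue

        # Anything else closes the open span, if there is one.
        if span_type is not None:
            fields.setdefault(span_type, []).append(" ".join(span_words))

        # B- (or a stray I- of a different type) opens a new span.
        if prefix in ("B", "I"):
            span_type, span_words = entity, [word]
        else:
            span_type, span_words = None, []

    return fields


# Hypothetical words and tags, for illustration only.
words = ["Dodavatel:", "ACME", "s.r.o.", "Faktura", "č.", "2024001"]
labels = ["O", "B-SUPPLIER", "I-SUPPLIER", "O", "O", "B-INVOICE_NUMBER"]
print(group_entities(words, labels))
```

Treating a stray `I-` tag as the start of a new span is a lenient design choice; a stricter pipeline might discard such tokens instead.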


Training data

The dataset used in this stage combines:

  1. Synthetic template-based invoices (V0)
  2. Synthetic invoices with randomized layouts (V1)
  3. Hybrid invoices with real layouts and synthetic content (V2)
  4. Real annotated invoices

Real data fine-tuning

The final stage introduces:

  • real invoice documents
  • annotated entity spans
  • natural linguistic variability
  • real formatting inconsistencies and layout noise

This enables the model to:

  • adapt to real-world distributions
  • leverage both spatial and textual patterns
  • become more robust and generalize better to unseen invoices

Role in the pipeline

This model corresponds to:

V3 – Full pipeline (synthetic + hybrid + real data fine-tuning)

It represents:

  • the final model in the LiLT branch
  • the best-performing configuration
  • a production-ready layout-aware solution

Intended uses

  • Real-world invoice information extraction
  • Document AI systems with layout awareness
  • OCR post-processing pipelines with spatial features
  • Benchmarking layout-aware architectures

Limitations

  • Depends on the quality of OCR and bounding box extraction
  • May struggle with:
    • extremely noisy scans
    • highly non-standard invoice formats
  • Domain-specific (Czech invoices)
  • Requires structured input (tokens + bounding boxes)
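
To illustrate the structured input requirement, the sketch below converts pixel-space OCR boxes into the integer 0 to 1000 coordinate range that LayoutLM-style models, including LiLT as described in the Transformers documentation, conventionally expect. Whether this model was trained with exactly this normalization is an assumption, and the helper name `normalize_box` and the page size in the example are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[int, int, int, int]


def normalize_box(box: Sequence[float], page_width: float, page_height: float) -> Box:
    """Scale an (x0, y0, x1, y1) pixel box to integers in the 0..1000 range."""
    x0, y0, x1, y1 = box

    def scale(value: float, size: float) -> int:
        # Clamp so OCR boxes that spill past the page edge stay in range.
        return max(0, min(1000, int(1000 * value / size)))

    return (
        scale(x0, page_width),
        scale(y0, page_height),
        scale(x1, page_width),
        scale(y1, page_height),
    )


# Hypothetical OCR output: one word box on a 1000 x 2000 pixel page image.
print(normalize_box((100, 200, 300, 400), page_width=1000, page_height=2000))
```

Each word then contributes one token string and one normalized box; how boxes are propagated to sub-word tokens depends on the tokenizer and preprocessing used.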

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 16
  • eval_batch_size: 2
  • seed: 42
  • optimizer: AdamW, fused PyTorch implementation (OptimizerNames.ADAMW_TORCH_FUSED), with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 0.1 (a fractional value, presumably interpreted as a warmup over 10% of the training steps)
  • num_epochs: 10
  • mixed_precision_training: Native AMP
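
The learning-rate settings above can be made concrete with a small sketch of a linear schedule with warmup. It assumes that the warmup value of 0.1 is treated as a fraction of the total steps, and it takes the total of 120 optimizer steps from the step counts reported under Training results (10 epochs of 12 steps). The function `linear_schedule_lr` is a hypothetical re-implementation for illustration, not the scheduler code used in training.

```python
def linear_schedule_lr(
    step: int,
    total_steps: int = 120,   # 10 epochs x 12 steps, from the results table
    warmup_steps: int = 12,   # assumed: 0.1 x 120 steps
    peak_lr: float = 3e-5,    # learning_rate from the list above
) -> float:
    """Learning rate at a given optimizer step for linear warmup then decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Linear decay from the peak down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 6, 12, 66, 120):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

Under these assumptions the whole first epoch is warmup, which may partly explain why the first-epoch metrics are noticeably lower than the rest.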

Training results

Training Loss  Epoch  Step  Validation Loss  Precision  Recall  F1      Accuracy
No log         1.0    12    0.0636           0.7820     0.8140  0.7977  0.9819
No log         2.0    24    0.0472           0.8499     0.8020  0.8253  0.9855
No log         3.0    36    0.0446           0.8293     0.8874  0.8574  0.9873
No log         4.0    48    0.0393           0.8555     0.8788  0.8670  0.9891
No log         5.0    60    0.0359           0.8872     0.8720  0.8795  0.9905
No log         6.0    72    0.0366           0.8870     0.8840  0.8855  0.9905
No log         7.0    84    0.0358           0.8826     0.8976  0.8900  0.9909
No log         8.0    96    0.0360           0.8822     0.8942  0.8881  0.9907
No log         9.0    108   0.0374           0.8696     0.8993  0.8842  0.9904
No log         10.0   120   0.0366           0.8783     0.8993  0.8887  0.9908
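
The per-epoch results can be scanned programmatically to find the best checkpoint. The sketch below copies the rows from the table and picks the epoch with the lowest validation loss and the one with the highest F1; the variable names are illustrative. Both criteria select epoch 7, whose validation loss (0.0358) and recall (0.8976) match the headline figures at the top of this card, which suggests, although the card does not state it, that this checkpoint was the one kept.

```python
# (epoch, step, validation_loss, precision, recall, f1, accuracy)
RESULTS = [
    (1, 12, 0.0636, 0.7820, 0.8140, 0.7977, 0.9819),
    (2, 24, 0.0472, 0.8499, 0.8020, 0.8253, 0.9855),
    (3, 36, 0.0446, 0.8293, 0.8874, 0.8574, 0.9873),
    (4, 48, 0.0393, 0.8555, 0.8788, 0.8670, 0.9891),
    (5, 60, 0.0359, 0.8872, 0.8720, 0.8795, 0.9905),
    (6, 72, 0.0366, 0.8870, 0.8840, 0.8855, 0.9905),
    (7, 84, 0.0358, 0.8826, 0.8976, 0.8900, 0.9909),
    (8, 96, 0.0360, 0.8822, 0.8942, 0.8881, 0.9907),
    (9, 108, 0.0374, 0.8696, 0.8993, 0.8842, 0.9904),
    (10, 120, 0.0366, 0.8783, 0.8993, 0.8887, 0.9908),
]

# Lowest validation loss and highest F1 are two common selection criteria.
best_by_loss = min(RESULTS, key=lambda row: row[2])
best_by_f1 = max(RESULTS, key=lambda row: row[5])

print("best epoch by validation loss:", best_by_loss[0])
print("best epoch by F1:", best_by_f1[0])
```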

Framework versions

  • Transformers 5.0.0
  • PyTorch 2.10.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.2

Model size

  • 0.1B params
  • Tensor type: F32 (Safetensors)