LiLT SER: Document Key/Value Extraction

Fine-tuned LiLT for Semantic Entity Recognition on documents.

Given the words and bounding boxes of a document (for example, from OCR), the model labels each word as key, value, header, or other.

Results

Benchmark               Metric           Score
SROIE (626 receipts)    Recall           98.8%
SROIE Total field       Recall           96.3%
SROIE Company field     Recall           100%
FUNSD (50 forms)        Token Accuracy   79.7%
Val set (3,577 docs)    Entity F1        97.3%
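
The table mixes token-level and entity-level metrics. The card does not say how the scores were computed, so the following is only a minimal sketch of the conventional exact-match definition of entity precision, recall, and F1. The function name `entity_f1` and the `(type, start, end)` span format are assumptions made for illustration.

```python
def entity_f1(gold_spans, pred_spans):
    """Exact-match entity precision, recall, and F1.

    Each span is a hashable tuple such as (entity_type, start_word, end_word).
    A predicted span counts as correct only if it matches a gold span exactly.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans for a five-word line: one predicted value span is too long.
gold = [("key", 0, 2), ("value", 2, 3), ("key", 3, 4), ("value", 4, 5)]
pred = [("key", 0, 2), ("value", 2, 3), ("key", 3, 4), ("value", 4, 6)]
precision, recall, f1 = entity_f1(gold, pred)
```

Under this definition a span with one wrong boundary counts as a full miss, which is why entity F1 and token accuracy can differ noticeably on the same predictions.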

Training Data

Trained on 17,881 documents from 8 sources:

  • FATURA (10K invoices), CORD (1K receipts), WildReceipt (1.7K receipts)
  • VRDU Registration + Ad-Buy forms (2.6K government documents)
  • Multi-type OCR forms + invoices (1.2K)
  • Invoice OCR (1.4K multi-layout invoices)

Dataset: bluecopa/samyx-document-ser

Labels

ID   Label      Description
0    O          Background / other
1    B-key      Beginning of key/label
2    I-key      Inside key/label
3    B-value    Beginning of value
4    I-value    Inside value
5    B-header   Beginning of header
6    I-header   Inside header
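
The B-/I- prefixes follow the usual BIO convention, so consecutive words can be grouped into entity spans. The sketch below shows one way to do that from word-level label IDs. The helper name `decode_entities` and the example labels are hypothetical, not actual model output.

```python
# Label mapping copied from the table above.
ID2LABEL = {
    0: "O",
    1: "B-key", 2: "I-key",
    3: "B-value", 4: "I-value",
    5: "B-header", 6: "I-header",
}


def decode_entities(words, label_ids):
    """Group word-level BIO label IDs into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []
    for word, label_id in zip(words, label_ids):
        prefix, _, entity_type = ID2LABEL[label_id].partition("-")
        # An I- tag of the same type continues the open span.
        if prefix == "I" and entity_type == current_type:
            current_words.append(word)
            continue
        # Anything else closes the open span, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_words)))
        if prefix == "O":
            current_type, current_words = None, []
        else:
            # A B- tag (or a stray I- tag) starts a new span.
            current_type, current_words = entity_type, [word]
    if current_type is not None:
        entities.append((current_type, " ".join(current_words)))
    return entities


# Hypothetical word-level labels for a short invoice line.
words = ["Invoice", "No:", "12345", "Date:", "2024-01-15"]
label_ids = [1, 2, 3, 1, 3]
entities = decode_entities(words, label_ids)
```

Treating a stray I- tag as the start of a new span is a lenient design choice. A stricter decoder could discard such tags instead.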

Usage

import torch
from transformers import AutoTokenizer, LiltForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bluecopa/lilt-ser-document-extraction")
model = LiltForTokenClassification.from_pretrained("bluecopa/lilt-ser-document-extraction")

# Words and boxes from OCR; each box is [x0, y0, x1, y1], normalized to 0-1000
words = ["Invoice", "No:", "12345", "Date:", "2024-01-15"]
boxes = [[100, 50, 200, 70], [210, 50, 250, 70], [260, 50, 350, 70],
         [100, 80, 180, 100], [190, 80, 320, 100]]

encoding = tokenizer(words, boxes=boxes, is_split_into_words=True,
                     return_tensors="pt", padding=True)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**encoding)
# One label ID per token (sub-word pieces and special tokens included), not per word
predictions = outputs.logits.argmax(-1).squeeze().tolist()
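
The predictions above are per token, so sub-word pieces and special tokens have to be mapped back to words before they line up with the OCR input. A minimal sketch follows, assuming a fast tokenizer whose `encoding.word_ids(0)` returns one word index per token and `None` for special tokens. The helper name `token_to_word_labels` is hypothetical, and keeping the first sub-token's label is a common convention rather than something this card specifies.

```python
def token_to_word_labels(word_ids, token_predictions):
    """Reduce token-level label IDs to one label ID per word.

    word_ids: list with one entry per token; the index of the source word,
              or None for special tokens and padding.
    token_predictions: list of predicted label IDs, one per token.
    """
    word_labels = {}
    for word_id, label_id in zip(word_ids, token_predictions):
        if word_id is None:
            continue  # skip special tokens and padding
        if word_id not in word_labels:
            # Keep the label of the first sub-token of each word.
            word_labels[word_id] = label_id
    return [word_labels[i] for i in sorted(word_labels)]


# Hypothetical tokenization: word 1 splits into two pieces, word 2 into three.
example_word_ids = [None, 0, 1, 1, 2, 2, 2, None]
example_predictions = [0, 1, 2, 2, 3, 4, 4, 0]
word_level = token_to_word_labels(example_word_ids, example_predictions)
```

With the usage snippet, the call would look like `token_to_word_labels(encoding.word_ids(0), predictions)`.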

Architecture

  • Base model: SCUT-DLVCLab/lilt-infoxlm-base (284M params, multilingual)
  • Task: Token classification (7 labels)
  • Input: words + bounding boxes (normalized 0-1000)
  • Context: 1024 tokens (2x the 512-token default, intended to cover a full single page)
  • Inference: ~50ms/page on CPU via ONNX
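
The model expects bounding boxes on a 0-1000 scale, while OCR engines usually report pixel coordinates. The sketch below shows one way to convert them, assuming boxes are given as [x0, y0, x1, y1] in pixels and the page size is known. The helper name `normalize_box` is hypothetical.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space [x0, y0, x1, y1] box to the 0-1000 range."""
    x0, y0, x1, y1 = box

    def scale(value, size):
        # Clamp so that slightly out-of-page OCR boxes stay in range.
        return max(0, min(1000, int(1000 * value / size)))

    return [
        scale(x0, page_width),
        scale(y0, page_height),
        scale(x1, page_width),
        scale(y1, page_height),
    ]


# Hypothetical box covering the lower-right quadrant of a 612x792 page.
normalized = normalize_box([306, 396, 612, 792], 612, 792)
```

Clamping is a defensive choice for noisy OCR output. If out-of-page boxes indicate a real upstream error, raising an exception may be preferable.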