---
language: en
license: cc-by-4.0
tags:
- document-understanding
- layout
- token-classification
- lilt
datasets:
- bluecopa/samyx-document-ser
metrics:
- f1
pipeline_tag: token-classification
---
# LiLT SER — Document Key/Value Extraction

Fine-tuned LiLT for Semantic Entity Recognition (SER) on documents. Given words and bounding boxes from any document, the model labels each word as key, value, header, or other.
## Results
| Benchmark | Metric | Score |
|---|---|---|
| SROIE (626 receipts) | Recall | 98.8% |
| SROIE Total field | Recall | 96.3% |
| SROIE Company field | Recall | 100% |
| FUNSD (50 forms) | Token Accuracy | 79.7% |
| Val set (3,577 docs) | Entity F1 | 97.3% |
## Training Data
Trained on 17,881 documents from 8 sources:
- FATURA (10K invoices), CORD (1K receipts), WildReceipt (1.7K receipts)
- VRDU Registration + Ad-Buy forms (2.6K government documents)
- Multi-type OCR forms + invoices (1.2K)
- Invoice OCR (1.4K multi-layout invoices)
Dataset: `bluecopa/samyx-document-ser`
## Labels
| ID | Label | Description |
|---|---|---|
| 0 | O | Background / other |
| 1 | B-key | Beginning of key/label |
| 2 | I-key | Inside key/label |
| 3 | B-value | Beginning of value |
| 4 | I-value | Inside value |
| 5 | B-header | Beginning of header |
| 6 | I-header | Inside header |
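Downstream, the per-word BIO tags above are typically merged into entity spans. A minimal sketch of that decoding step (a plain helper, not part of this model's API):

```python
def decode_entities(words, labels):
    """Group parallel lists of words and BIO labels (e.g. "B-key",
    "I-key", "O") into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []
    for word, label in zip(words, labels):
        if label.startswith("B-"):
            # A B- tag always starts a new entity; flush the previous one.
            if current_type:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = label[2:], [word]
        elif label.startswith("I-") and current_type == label[2:]:
            # Continuation of the current entity.
            current_words.append(word)
        else:
            # "O" or a mismatched I- tag ends the current entity.
            if current_type:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_type:
        entities.append((current_type, " ".join(current_words)))
    return entities

print(decode_entities(["Invoice", "No:", "12345"],
                      ["B-key", "I-key", "B-value"]))
# -> [('key', 'Invoice No:'), ('value', '12345')]
```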
## Usage

```python
from transformers import AutoTokenizer, LiltForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bluecopa/lilt-ser-document-extraction")
model = LiltForTokenClassification.from_pretrained("bluecopa/lilt-ser-document-extraction")

# Words and bounding boxes from OCR (boxes normalized to 0-1000)
words = ["Invoice", "No:", "12345", "Date:", "2024-01-15"]
boxes = [[100, 50, 200, 70], [210, 50, 250, 70], [260, 50, 350, 70],
         [100, 80, 180, 100], [190, 80, 320, 100]]

encoding = tokenizer(words, boxes=boxes, is_split_into_words=True,
                     return_tensors="pt", padding=True)
outputs = model(**encoding)
predictions = outputs.logits.argmax(-1).squeeze().tolist()
labels = [model.config.id2label[p] for p in predictions]
```
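The predictions come back one per subword token, not one per word. A sketch of collapsing them to word level, shown here as a standalone function on hand-made inputs (the real `word_ids` would come from `encoding.word_ids()`, which fast tokenizers expose):

```python
def token_to_word_labels(word_ids, token_predictions, id2label):
    """Collapse subword predictions to one label per word, keeping
    the first subword's prediction (a common convention)."""
    word_labels = {}
    for word_id, pred in zip(word_ids, token_predictions):
        if word_id is not None and word_id not in word_labels:
            word_labels[word_id] = id2label[pred]
    return [word_labels[i] for i in sorted(word_labels)]

# Hypothetical tokenization: [CLS] Invoice No : 123 45 [SEP]
# (None marks special tokens; repeated ids mark subwords of one word)
id2label = {0: "O", 1: "B-key", 2: "I-key", 3: "B-value", 4: "I-value"}
word_ids = [None, 0, 1, 1, 2, 2, None]
preds = [0, 1, 2, 2, 3, 4, 0]
print(token_to_word_labels(word_ids, preds, id2label))
# -> ['B-key', 'I-key', 'B-value']
```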
## Architecture
- Base model: `SCUT-DLVCLab/lilt-infoxlm-base` (284M params, multilingual)
- Task: token classification (7 labels)
- Input: words + bounding boxes (normalized to 0-1000)
- Context: 1,024 tokens (2× the usual 512; enough for most single pages)
- Inference: ~50 ms/page on CPU via ONNX