---
language: en
license: cc-by-4.0
tags:
- document-understanding
- layout
- token-classification
- lilt
datasets:
- bluecopa/samyx-document-ser
metrics:
- f1
pipeline_tag: token-classification
---
# LiLT SER — Document Key/Value Extraction

Fine-tuned LiLT for Semantic Entity Recognition (SER) on documents. Given words and bounding boxes from any document, the model labels each word as key, value, header, or other.
## Results
| Benchmark | Metric | Score |
|---|---|---|
| SROIE (626 receipts) | Recall | 98.8% |
| SROIE Total field | Recall | 96.3% |
| SROIE Company field | Recall | 100% |
| FUNSD (50 forms) | Token Accuracy | 79.7% |
| Val set (3,577 docs) | Entity F1 | 97.3% |
## Training Data
Trained on 17,881 documents from 8 sources:
- FATURA (10K invoices), CORD (1K receipts), WildReceipt (1.7K receipts)
- VRDU Registration + Ad-Buy forms (2.6K government documents)
- Multi-type OCR forms + invoices (1.2K)
- Invoice OCR (1.4K multi-layout invoices)
Dataset: `bluecopa/samyx-document-ser`
## Labels
| ID | Label | Description |
|---|---|---|
| 0 | O | Background / other |
| 1 | B-key | Beginning of key/label |
| 2 | I-key | Inside key/label |
| 3 | B-value | Beginning of value |
| 4 | I-value | Inside value |
| 5 | B-header | Beginning of header |
| 6 | I-header | Inside header |
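Downstream, the per-word BIO tags above are typically merged into entity spans. A minimal sketch of that decoding step (a plain helper, not part of this model's API):

```python
def decode_entities(words, labels):
    """Group parallel lists of words and BIO labels (e.g. "B-key",
    "I-key", "O") into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []
    for word, label in zip(words, labels):
        if label.startswith("B-"):
            # A B- tag always starts a new entity; flush the previous one.
            if current_type:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = label[2:], [word]
        elif label.startswith("I-") and current_type == label[2:]:
            # Continuation of the current entity.
            current_words.append(word)
        else:
            # "O" or a mismatched I- tag ends the current entity.
            if current_type:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_type:
        entities.append((current_type, " ".join(current_words)))
    return entities

print(decode_entities(["Invoice", "No:", "12345"],
                      ["B-key", "I-key", "B-value"]))
# -> [('key', 'Invoice No:'), ('value', '12345')]
```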
## Usage

```python
from transformers import AutoTokenizer, LiltForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bluecopa/lilt-ser-document-extraction")
model = LiltForTokenClassification.from_pretrained("bluecopa/lilt-ser-document-extraction")

# Words and bounding boxes from OCR (boxes normalized to 0-1000)
words = ["Invoice", "No:", "12345", "Date:", "2024-01-15"]
boxes = [[100, 50, 200, 70], [210, 50, 250, 70], [260, 50, 350, 70],
         [100, 80, 180, 100], [190, 80, 320, 100]]

encoding = tokenizer(words, boxes=boxes, is_split_into_words=True,
                     return_tensors="pt", padding=True)
outputs = model(**encoding)
predictions = outputs.logits.argmax(-1).squeeze().tolist()
labels = [model.config.id2label[p] for p in predictions]
```
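The predictions come back one per subword token, not one per word. A sketch of collapsing them to word level, shown here as a standalone function on hand-made inputs (the real `word_ids` would come from `encoding.word_ids()`, which fast tokenizers expose):

```python
def token_to_word_labels(word_ids, token_predictions, id2label):
    """Collapse subword predictions to one label per word, keeping
    the first subword's prediction (a common convention)."""
    word_labels = {}
    for word_id, pred in zip(word_ids, token_predictions):
        if word_id is not None and word_id not in word_labels:
            word_labels[word_id] = id2label[pred]
    return [word_labels[i] for i in sorted(word_labels)]

# Hypothetical tokenization: [CLS] Invoice No : 123 45 [SEP]
# (None marks special tokens; repeated ids mark subwords of one word)
id2label = {0: "O", 1: "B-key", 2: "I-key", 3: "B-value", 4: "I-value"}
word_ids = [None, 0, 1, 1, 2, 2, None]
preds = [0, 1, 2, 2, 3, 4, 0]
print(token_to_word_labels(word_ids, preds, id2label))
# -> ['B-key', 'I-key', 'B-value']
```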
## Architecture
- Base model: `SCUT-DLVCLab/lilt-infoxlm-base` (284M params, multilingual)
- Task: token classification (7 labels)
- Input: words + bounding boxes (normalized to 0-1000)
- Context: 1,024 tokens (2× the usual 512; enough for most single pages)
- Inference: ~50 ms/page on CPU via ONNX