---
language: en
license: cc-by-4.0
tags:
- document-understanding
- layout
- token-classification
- lilt
datasets:
- bluecopa/samyx-document-ser
metrics:
- f1
pipeline_tag: token-classification
---

# LiLT SER — Document Key/Value Extraction

Fine-tuned [LiLT](https://huggingface.co/SCUT-DLVCLab/lilt-infoxlm-base) for **Semantic Entity Recognition** on documents. Given words and bounding boxes from any document, it labels each word as key, value, header, or other.

## Results

| Benchmark | Metric | Score |
|-----------|--------|-------|
| SROIE (626 receipts) | Recall | **98.8%** |
| SROIE Total field | Recall | **96.3%** |
| SROIE Company field | Recall | **100%** |
| FUNSD (50 forms) | Token Accuracy | **79.7%** |
| Val set (3,577 docs) | Entity F1 | **97.3%** |

## Training Data

Trained on 17,881 documents from 8 sources:

- FATURA (10K invoices), CORD (1K receipts), WildReceipt (1.7K receipts)
- VRDU Registration + Ad-Buy forms (2.6K government documents)
- Multi-type OCR forms + invoices (1.2K)
- Invoice OCR (1.4K multi-layout invoices)

Dataset: [bluecopa/samyx-document-ser](https://huggingface.co/datasets/bluecopa/samyx-document-ser)

## Labels

| ID | Label | Description |
|----|-------|-------------|
| 0 | O | Background / other |
| 1 | B-key | Beginning of key/label |
| 2 | I-key | Inside key/label |
| 3 | B-value | Beginning of value |
| 4 | I-value | Inside value |
| 5 | B-header | Beginning of header |
| 6 | I-header | Inside header |

## Usage

```python
from transformers import AutoTokenizer, LiltForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bluecopa/lilt-ser-document-extraction")
model = LiltForTokenClassification.from_pretrained("bluecopa/lilt-ser-document-extraction")

# Words and boxes from OCR; boxes are [x0, y0, x1, y1] normalized to 0-1000
words = ["Invoice", "No:", "12345", "Date:", "2024-01-15"]
boxes = [[100, 50, 200, 70], [210, 50, 250, 70], [260, 50, 350, 70],
         [100, 80, 180, 100], [190, 80, 320, 100]]

encoding = tokenizer(words, boxes=boxes, return_tensors="pt", padding=True)
outputs = model(**encoding)
predictions = outputs.logits.argmax(-1).squeeze().tolist()
```

## Architecture

- **Base model:** SCUT-DLVCLab/lilt-infoxlm-base (284M params, multilingual)
- **Task:** Token classification (7 labels)
- **Input:** words + bounding boxes (normalized to 0-1000)
- **Context window:** 1024 tokens (2x the 512-token default; enough for a typical single page)
- **Inference:** ~50 ms/page on CPU via ONNX
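Bounding boxes must be scaled into the model's 0-1000 coordinate space before tokenization. A minimal sketch, assuming the OCR engine returns `[x0, y0, x1, y1]` in page pixels (the helper name and page dimensions are illustrative, not part of this repo):

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space [x0, y0, x1, y1] box to LiLT's 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# e.g. a box on an 800x1000-pixel page
print(normalize_box([100, 50, 200, 70], page_width=800, page_height=1000))
# [125, 50, 250, 70]
```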
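The usage snippet yields one prediction per sub-word token, not per word. A common way to collapse these back to word-level labels is to keep the first sub-token's prediction for each word, using the `word_ids()` mapping that fast tokenizers expose. Below is a minimal sketch with hardcoded stand-ins for the tokenizer and model outputs (the `tokens_to_words` helper is illustrative, not part of this repo):

```python
# id2label matches the Labels table above
id2label = {0: "O", 1: "B-key", 2: "I-key", 3: "B-value",
            4: "I-value", 5: "B-header", 6: "I-header"}

def tokens_to_words(token_preds, word_ids):
    """Keep the prediction of the first sub-token of each word.

    `word_ids` is the per-token word index (None for special tokens),
    as returned by `encoding.word_ids()` on a fast tokenizer.
    """
    word_labels = {}
    for pred, wid in zip(token_preds, word_ids):
        if wid is not None and wid not in word_labels:
            word_labels[wid] = id2label[pred]
    return [word_labels[i] for i in sorted(word_labels)]

# Stand-in outputs: "No:" splits into two sub-tokens; first and last
# positions are special tokens (<s>, </s>).
token_preds = [0, 1, 2, 2, 3, 0]
word_ids = [None, 0, 1, 1, 2, None]
print(tokens_to_words(token_preds, word_ids))
# ['B-key', 'I-key', 'B-value']
```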