---
language: en
license: cc-by-4.0
tags:
- document-understanding
- layout
- token-classification
- lilt
datasets:
- bluecopa/samyx-document-ser
metrics:
- f1
pipeline_tag: token-classification
---

# LiLT SER — Document Key/Value Extraction

Fine-tuned [LiLT](https://huggingface.co/SCUT-DLVCLab/lilt-infoxlm-base) for **Semantic Entity Recognition** on documents. Given words and bounding boxes from any document, it labels each word as key, value, header, or other.

## Results

| Benchmark | Metric | Score |
|-----------|--------|-------|
| SROIE (626 receipts) | Recall | **98.8%** |
| SROIE Total field | Recall | **96.3%** |
| SROIE Company field | Recall | **100%** |
| FUNSD (50 forms) | Token Accuracy | **79.7%** |
| Val set (3,577 docs) | Entity F1 | **97.3%** |

## Training Data

Trained on 17,881 documents from 8 sources:

- FATURA (10K invoices), CORD (1K receipts), WildReceipt (1.7K receipts)
- VRDU Registration + Ad-Buy forms (2.6K government documents)
- Multi-type OCR forms + invoices (1.2K)
- Invoice OCR (1.4K multi-layout invoices)

Dataset: [bluecopa/samyx-document-ser](https://huggingface.co/datasets/bluecopa/samyx-document-ser)

## Labels

| ID | Label | Description |
|----|-------|-------------|
| 0 | O | Background / other |
| 1 | B-key | Beginning of key/label |
| 2 | I-key | Inside key/label |
| 3 | B-value | Beginning of value |
| 4 | I-value | Inside value |
| 5 | B-header | Beginning of header |
| 6 | I-header | Inside header |

## Usage

```python
from transformers import AutoTokenizer, LiltForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bluecopa/lilt-ser-document-extraction")
model = LiltForTokenClassification.from_pretrained("bluecopa/lilt-ser-document-extraction")

# Words and boxes from OCR; boxes are [x0, y0, x1, y1] normalized to 0-1000
words = ["Invoice", "No:", "12345", "Date:", "2024-01-15"]
boxes = [[100, 50, 200, 70], [210, 50, 250, 70], [260, 50, 350, 70],
         [100, 80, 180, 100], [190, 80, 320, 100]]

encoding = tokenizer(words, boxes=boxes, return_tensors="pt", padding=True)
outputs = model(**encoding)
predictions = outputs.logits.argmax(-1).squeeze().tolist()
```

## Architecture

- **Base model:** SCUT-DLVCLab/lilt-infoxlm-base (284M params, multilingual)
- **Task:** Token classification (7 labels)
- **Input:** words + bounding boxes (normalized to 0-1000)
- **Context window:** 1024 tokens (2x the 512-token default; enough for a typical single page)
- **Inference:** ~50 ms/page on CPU via ONNX
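Bounding boxes must be scaled into the model's 0-1000 coordinate space before tokenization. A minimal sketch, assuming the OCR engine returns `[x0, y0, x1, y1]` in page pixels (the helper name and page dimensions are illustrative, not part of this repo):

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space [x0, y0, x1, y1] box to LiLT's 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# e.g. a box on an 800x1000-pixel page
print(normalize_box([100, 50, 200, 70], page_width=800, page_height=1000))
# [125, 50, 250, 70]
```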
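The usage snippet yields one prediction per sub-word token, not per word. A common way to collapse these back to word-level labels is to keep the first sub-token's prediction for each word, using the `word_ids()` mapping that fast tokenizers expose. Below is a minimal sketch with hardcoded stand-ins for the tokenizer and model outputs (the `tokens_to_words` helper is illustrative, not part of this repo):

```python
# id2label matches the Labels table above
id2label = {0: "O", 1: "B-key", 2: "I-key", 3: "B-value",
            4: "I-value", 5: "B-header", 6: "I-header"}

def tokens_to_words(token_preds, word_ids):
    """Keep the prediction of the first sub-token of each word.

    `word_ids` is the per-token word index (None for special tokens),
    as returned by `encoding.word_ids()` on a fast tokenizer.
    """
    word_labels = {}
    for pred, wid in zip(token_preds, word_ids):
        if wid is not None and wid not in word_labels:
            word_labels[wid] = id2label[pred]
    return [word_labels[i] for i in sorted(word_labels)]

# Stand-in outputs: "No:" splits into two sub-tokens; first and last
# positions are special tokens (<s>, </s>).
token_preds = [0, 1, 2, 2, 3, 0]
word_ids = [None, 0, 1, 1, 2, None]
print(tokens_to_words(token_preds, word_ids))
# ['B-key', 'I-key', 'B-value']
```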