---
language: en
license: mit
tags:
- layoutlmv3
- invoice-parsing
- document-understanding
- token-classification
- ner
- pdf
base_model: microsoft/layoutlmv3-base
pipeline_tag: token-classification
---

# PDF Invoice Parser: Fine-tuned LayoutLMv3

A fine-tuned [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) model for named entity recognition (NER) on PDF invoices. It extracts structured fields such as invoice number, dates, vendor/customer details, and financial totals directly from document pages using text, layout (bounding boxes), and visual features.

## Model Details

- **Base model:** `microsoft/layoutlmv3-base`
- **Architecture:** `LayoutLMv3ForTokenClassification`
- **Task:** Token classification (NER)
- **Fine-tuned on:** Labeled PDF invoice pages

## Labels

| Label | Description |
|---|---|
| `B/I-INVOICE_NUM` | Invoice number |
| `B/I-INVOICE_DATE` | Invoice date |
| `B/I-DUE_DATE` | Payment due date |
| `B/I-VENDOR_NAME` | Vendor / seller name |
| `B/I-VENDOR_ADDR` | Vendor address |
| `B/I-CUST_NAME` | Customer / buyer name |
| `B/I-CUST_ADDR` | Customer address |
| `B/I-TOTAL` | Total amount |
| `B/I-SUBTOTAL` | Subtotal amount |
| `B/I-TAX` | Tax amount |
| `O` | Outside / no entity |

## Quick Start

```bash
pip install transformers torch Pillow
```

```python
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
import torch
from PIL import Image

processor = LayoutLMv3Processor.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser")
model.eval()

# words and boxes come from your OCR tool (e.g. pytesseract)
encoding = processor(
    image,            # PIL.Image of the invoice page
    words,            # list of word strings
    boxes=boxes,      # list of [x0, y0, x1, y1] normalized to 0–1000
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=512,
)

with torch.no_grad():
    outputs = model(**encoding)

predictions = outputs.logits.argmax(-1).squeeze().tolist()
id2label = model.config.id2label
predicted_labels = [id2label[p] for p in predictions]
```

## Full Pipeline (PDF → JSON)

```python
from invoice_parser import InvoiceParser

parser = InvoiceParser(strategy="finetuned")
result = parser.parse("invoice.pdf")
print(result.to_json())
```

## Output Format

```json
{
  "invoice_number": "INV-2024-0042",
  "invoice_date": "March 15, 2024",
  "due_date": "April 15, 2024",
  "vendor_name": "Acme Corp",
  "vendor_address": "123 Business St, City",
  "customer_name": "Client LLC",
  "customer_address": "456 Client Ave, Town",
  "subtotal": 1200.00,
  "tax": 216.00,
  "total": 1416.00
}
```

## Extraction Strategies (invoice_parser.py)

| Strategy | Speed | Accuracy | Best For |
|---|---|---|---|
| `pdfplumber` | Fast | Good | Digital/typed PDFs |
| `ocr` | Moderate | Good | Scanned PDFs |
| `finetuned` | Moderate | Very Good | Complex layouts (this model) |
| `claude` | Moderate | Excellent | Any PDF (needs API key) |

## Training

Fine-tuned using `train_model.py` on labeled invoice annotations produced by `label_invoices.py`.

```bash
python train_model.py --annotations annotations/ --output trained_model/ --epochs 15
```

## License

MIT
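
## Example: Pre- and Post-processing Helpers

The Quick Start expects boxes normalized to 0–1000 and returns one label per token, not per word. The sketch below shows one possible way to handle both steps in plain Python. The helper names (`normalize_box`, `word_level_labels`, `group_entities`) are hypothetical and not part of this repository, and the code assumes label strings of the form `B-INVOICE_NUM` / `I-INVOICE_NUM`, as the Labels table suggests.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def normalize_box(box: Sequence[float], width: int, height: int) -> List[int]:
    """Scale a pixel-space [x0, y0, x1, y1] box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    scaled = [
        1000 * x0 / width,
        1000 * y0 / height,
        1000 * x1 / width,
        1000 * y1 / height,
    ]
    # Clamp so OCR boxes that spill past the page edge stay in range.
    return [max(0, min(1000, int(value))) for value in scaled]


def word_level_labels(
    token_labels: Sequence[str], word_ids: Sequence[Optional[int]]
) -> Dict[int, str]:
    """Keep the label of the first sub-token of each word.

    Entries whose word id is None (special and padding tokens) are skipped.
    """
    labels: Dict[int, str] = {}
    for label, word_id in zip(token_labels, word_ids):
        if word_id is not None and word_id not in labels:
            labels[word_id] = label
    return labels


def group_entities(
    words: Sequence[str], labels: Dict[int, str]
) -> List[Tuple[str, str]]:
    """Merge consecutive B-/I- tagged words into (field, text) spans."""
    entities: List[Tuple[str, str]] = []
    current_field: Optional[str] = None
    current_words: List[str] = []
    for index, word in enumerate(words):
        prefix, _, field = labels.get(index, "O").partition("-")
        if prefix == "I" and field == current_field:
            # Continuation of the span that is currently open.
            current_words.append(word)
            continue
        # Any other label closes the open span, if there is one.
        if current_field is not None:
            entities.append((current_field, " ".join(current_words)))
        if prefix in ("B", "I"):
            # A stray I- tag with no matching open span starts a new span.
            current_field, current_words = field, [word]
        else:
            current_field, current_words = None, []
    if current_field is not None:
        entities.append((current_field, " ".join(current_words)))
    return entities


# Made-up example data: four words, seven tokens (two special tokens).
words = ["Invoice", "INV-2024-0042", "Total", "1416.00"]
token_labels = ["O", "O", "B-INVOICE_NUM", "I-INVOICE_NUM", "O", "B-TOTAL", "O"]
word_ids = [None, 0, 1, 1, 2, 3, None]

entities = group_entities(words, word_level_labels(token_labels, word_ids))
print(entities)
```

With the Quick Start code, `token_labels` would correspond to `predicted_labels`, and the word mapping can likely be obtained from `encoding.word_ids(batch_index=0)`, assuming the processor loads a fast tokenizer. Taking the first sub-token's label per word is one common convention; majority voting across sub-tokens is an alternative.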