| --- |
| language: en |
| license: mit |
| tags: |
| - layoutlmv3 |
| - invoice-parsing |
| - document-understanding |
| - token-classification |
| - ner |
| - pdf |
| base_model: microsoft/layoutlmv3-base |
| pipeline_tag: token-classification |
| --- |
| |
| # PDF Invoice Parser — Fine-tuned LayoutLMv3 |
|
|
| A fine-tuned [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) model for named entity recognition (NER) on PDF invoices. It extracts structured fields such as invoice number, dates, vendor/customer details, and financial totals directly from document pages using text, layout (bounding boxes), and visual features. |
|
|
| ## Model Details |
|
|
| - **Base model:** `microsoft/layoutlmv3-base` |
| - **Architecture:** `LayoutLMv3ForTokenClassification` |
| - **Task:** Token classification (NER) |
| - **Fine-tuned on:** Labeled PDF invoice pages |
|
|
| ## Labels |
|
|
| | Label | Description | |
| |---|---| |
| | `B/I-INVOICE_NUM` | Invoice number | |
| | `B/I-INVOICE_DATE` | Invoice date | |
| | `B/I-DUE_DATE` | Payment due date | |
| | `B/I-VENDOR_NAME` | Vendor / seller name | |
| | `B/I-VENDOR_ADDR` | Vendor address | |
| | `B/I-CUST_NAME` | Customer / buyer name | |
| | `B/I-CUST_ADDR` | Customer address | |
| | `B/I-TOTAL` | Total amount | |
| | `B/I-SUBTOTAL` | Subtotal amount | |
| | `B/I-TAX` | Tax amount | |
| | `O` | Outside / no entity | |
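
The `B/I-` prefixes follow the usual BIO tagging convention: `B-` marks the first word of a field, `I-` marks its continuation, and `O` marks everything else. Below is a minimal sketch of how per-word labels could be merged into field values. It assumes word-level labels are already available; the helper name `group_entities` and the sample words are illustrative and not part of this repository.

```python
def group_entities(words, labels):
    """Merge BIO-tagged words into (field, text) pairs."""
    entities = []
    current_field, current_words = None, []
    for word, label in zip(words, labels):
        prefix, _, field = label.partition("-")
        # An I- tag that continues the open field extends it
        if prefix == "I" and field == current_field:
            current_words.append(word)
            continue
        # Anything else closes the open field, if there is one
        if current_field is not None:
            entities.append((current_field, " ".join(current_words)))
        if prefix in ("B", "I"):
            # A stray I- tag with no matching B- is treated as a new field
            current_field, current_words = field, [word]
        else:
            current_field, current_words = None, []
    if current_field is not None:
        entities.append((current_field, " ".join(current_words)))
    return entities


# Illustrative input: one label per OCR word
words = ["Invoice", "INV-2024-0042", "Acme", "Corp"]
labels = ["O", "B-INVOICE_NUM", "B-VENDOR_NAME", "I-VENDOR_NAME"]
entities = group_entities(words, labels)
print(entities)
# [('INVOICE_NUM', 'INV-2024-0042'), ('VENDOR_NAME', 'Acme Corp')]
```

Treating a stray `I-` tag as the start of a field is a lenient choice; a stricter decoder could drop such words instead.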
|
|
| ## Quick Start |
|
|
| ```bash |
| pip install transformers torch Pillow |
| ``` |
|
|
| ```python |
| from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification |
| import torch |
| from PIL import Image |
| |
| processor = LayoutLMv3Processor.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser", apply_ocr=False) |
| model = LayoutLMv3ForTokenClassification.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser") |
| model.eval() |
| |
| # image: the invoice page as an RGB PIL.Image (the path below is a placeholder) |
| image = Image.open("invoice_page.png").convert("RGB") |
| |
| # words and boxes come from your OCR tool (e.g. pytesseract), one box per word |
| encoding = processor( |
|     image,                # PIL.Image of the invoice page |
|     words,                # list of word strings |
|     boxes=boxes,          # list of [x0, y0, x1, y1] normalized to 0–1000 |
|     return_tensors="pt", |
|     truncation=True, |
|     padding="max_length", |
|     max_length=512, |
| ) |
| |
| with torch.no_grad(): |
|     outputs = model(**encoding) |
| |
| # One label per token position, including sub-word, special, and padding tokens |
| predictions = outputs.logits.argmax(-1).squeeze(0).tolist() |
| id2label = model.config.id2label |
| predicted_labels = [id2label[p] for p in predictions] |
| ``` |
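
The processor expects each box on a 0 to 1000 scale, while most OCR tools report pixel coordinates. Below is a minimal sketch of that conversion. It assumes boxes arrive as `[x0, y0, x1, y1]` in pixels and that the page image size is known; `normalize_box` is an illustrative helper, not part of this repository.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space [x0, y0, x1, y1] box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    scaled = [
        1000 * x0 / page_width,
        1000 * y0 / page_height,
        1000 * x1 / page_width,
        1000 * y1 / page_height,
    ]
    # Clamp so OCR boxes that spill slightly past the page edge stay valid
    return [max(0, min(1000, int(v))) for v in scaled]


# Example: a word box on a 1700 x 2200 pixel page image
print(normalize_box([170, 220, 850, 1100], 1700, 2200))
# [100, 100, 500, 500]
```

With a PIL image, `page_width, page_height = image.size` supplies the two page dimensions.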
|
|
| ## Full Pipeline (PDF → JSON) |
|
|
| ```python |
| from invoice_parser import InvoiceParser |
| |
| parser = InvoiceParser(strategy="finetuned") |
| result = parser.parse("invoice.pdf") |
| print(result.to_json()) |
| ``` |
|
|
| ## Output Format |
|
|
| ```json |
| { |
| "invoice_number": "INV-2024-0042", |
| "invoice_date": "March 15, 2024", |
| "due_date": "April 15, 2024", |
| "vendor_name": "Acme Corp", |
| "vendor_address": "123 Business St, City", |
| "customer_name": "Client LLC", |
| "customer_address": "456 Client Ave, Town", |
| "subtotal": 1200.00, |
| "tax": 216.00, |
| "total": 1416.00 |
| } |
| ``` |
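
In the sample output, `subtotal + tax` equals `total` (1200.00 + 216.00 = 1416.00). Checking that relationship is one way to flag likely extraction errors. Below is a minimal sketch, assuming the result has been loaded as a dict with the keys shown above. The function name `totals_are_consistent` and its tolerance are illustrative, and invoices with discounts or shipping lines may legitimately fail this check.

```python
import json
import math


def totals_are_consistent(invoice, tolerance=0.01):
    """Return True if subtotal + tax matches total within `tolerance`."""
    try:
        subtotal = float(invoice["subtotal"])
        tax = float(invoice["tax"])
        total = float(invoice["total"])
    except (KeyError, TypeError, ValueError):
        return False  # a missing or non-numeric field cannot be verified
    return math.isclose(subtotal + tax, total, abs_tol=tolerance)


# Amounts taken from the sample output
sample = json.loads('{"subtotal": 1200.00, "tax": 216.00, "total": 1416.00}')
print(totals_are_consistent(sample))  # True
```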
|
|
| ## Extraction Strategies (`invoice_parser.py`) |
| |
| | Strategy | Speed | Accuracy | Best For | |
| |---|---|---|---| |
| | `pdfplumber` | Fast | Good | Digital/typed PDFs | |
| | `ocr` | Moderate | Good | Scanned PDFs | |
| | `finetuned` | Moderate | Very Good | Complex layouts (this model) | |
| | `claude` | Moderate | Excellent | Any PDF (needs API key) | |
| |
| ## Training |
| |
| The model was fine-tuned with `train_model.py` on labeled invoice annotations produced by `label_invoices.py`. |
| |
| ```bash |
| python train_model.py --annotations annotations/ --output trained_model/ --epochs 15 |
| ``` |
| |
| ## License |
| |
| MIT |
| |