+---
+language: en
+license: mit
+tags:
+  - layoutlmv3
+  - invoice-parsing
+  - document-understanding
+  - token-classification
+  - ner
+  - pdf
+base_model: microsoft/layoutlmv3-base
+pipeline_tag: token-classification
+---
+# PDF Invoice Parser — Fine-tuned LayoutLMv3
+A fine-tuned [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) model for named entity recognition (NER) on PDF invoices. It extracts structured fields such as invoice number, dates, vendor/customer details, and financial totals directly from document pages using text, layout (bounding boxes), and visual features.
+## Model Details
+- **Base model:** `microsoft/layoutlmv3-base`
+- **Architecture:** `LayoutLMv3ForTokenClassification`
+- **Task:** Token classification (NER)
+- **Fine-tuned on:** Labeled PDF invoice pages
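+## Labels
+Labels follow the BIO scheme: each `B/I-X` row stands for the pair `B-X` (first word of a field) and `I-X` (continuation words of the same field).
+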
+| Label | Description |
+|---|---|
+| `B/I-INVOICE_NUM` | Invoice number |
+| `B/I-INVOICE_DATE` | Invoice date |
+| `B/I-DUE_DATE` | Payment due date |
+| `B/I-VENDOR_NAME` | Vendor / seller name |
+| `B/I-VENDOR_ADDR` | Vendor address |
+| `B/I-CUST_NAME` | Customer / buyer name |
+| `B/I-CUST_ADDR` | Customer address |
+| `B/I-TOTAL` | Total amount |
+| `B/I-SUBTOTAL` | Subtotal amount |
+| `B/I-TAX` | Tax amount |
+| `O` | Outside / no entity |
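To work with these labels in code, the table can be expanded into a flat BIO label list. The sketch below does that for the ten fields listed. The id ordering it produces is an assumption for illustration only; the mapping actually used by the checkpoint is whatever `model.config.id2label` contains.

```python
# Entity types taken from the label table above.
ENTITY_TYPES = [
    "INVOICE_NUM", "INVOICE_DATE", "DUE_DATE",
    "VENDOR_NAME", "VENDOR_ADDR",
    "CUST_NAME", "CUST_ADDR",
    "TOTAL", "SUBTOTAL", "TAX",
]


def build_label_list(entity_types):
    """Expand entity types into BIO labels: 'O' first, then B-X and I-X per type."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


# NOTE: this id ordering is hypothetical; read model.config.id2label for the real one.
LABELS = build_label_list(ENTITY_TYPES)
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

print(len(LABELS))  # 21 labels: 10 fields x 2 prefixes, plus "O"
```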
+## Quick Start
+```bash
+pip install transformers torch Pillow
+```
+```python
+from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
+import torch
+from PIL import Image
+
+processor = LayoutLMv3Processor.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser", apply_ocr=False)
+model = LayoutLMv3ForTokenClassification.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser")
+model.eval()
+
+# Placeholder inputs: replace with your own page image and the output of your OCR tool (e.g. pytesseract)
+image = Image.open("invoice.png").convert("RGB")  # PIL.Image of the invoice page
+words = ["Invoice", "INV-2024-0042"]              # list of word strings
+boxes = [[80, 40, 180, 60], [190, 40, 330, 60]]   # one [x0, y0, x1, y1] per word, normalized to 0–1000
+
+encoding = processor(
+    image,
+    words,
+    boxes=boxes,
+    return_tensors="pt",
+    truncation=True,
+    padding="max_length",
+    max_length=512,
+)
+with torch.no_grad():
+    outputs = model(**encoding)
+
+# One prediction per token, including special and padding tokens (not one per word)
+predictions = outputs.logits.argmax(-1).squeeze(0).tolist()
+id2label = model.config.id2label
+predicted_labels = [id2label[p] for p in predictions]
+```
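The Quick Start expects boxes already scaled to the 0 to 1000 range and produces one label per token rather than per word. The helpers below are a minimal sketch of both steps. They assume OCR boxes arrive as pixel coordinates `(x0, y0, x1, y1)` and that a per-token list of word indices is available (for example from `encoding.word_ids()` when the processor uses a fast tokenizer). The names `normalize_box` and `first_token_labels` are hypothetical and not part of this repository.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = box

    def scale(value, size):
        # Clamp so OCR boxes slightly outside the page stay within 0..1000.
        return max(0, min(1000, int(1000 * value / size)))

    return [
        scale(x0, page_width), scale(y0, page_height),
        scale(x1, page_width), scale(y1, page_height),
    ]


def first_token_labels(word_ids, token_labels):
    """Collapse per-token labels to one label per word, using each word's first token.

    word_ids holds None for special/padding tokens and the word index otherwise.
    """
    labels = {}
    for word_id, label in zip(word_ids, token_labels):
        if word_id is not None and word_id not in labels:
            labels[word_id] = label
    return [labels[i] for i in sorted(labels)]


# A 1000 x 500 pixel page: x is unchanged, y is doubled.
print(normalize_box((100, 50, 300, 80), 1000, 500))  # [100, 100, 300, 160]

# Word 0 is split into two tokens; only its first token's label is kept.
print(first_token_labels([None, 0, 0, 1, None],
                         ["O", "B-TOTAL", "I-TOTAL", "O", "O"]))  # ['B-TOTAL', 'O']
```

Taking the first token's label is one common convention; majority voting across a word's tokens is an alternative. Words dropped by truncation simply do not appear in the result.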
+## Full Pipeline (PDF → JSON)
+```python
+# InvoiceParser is defined in this project's invoice_parser.py, which must be on the import path
+from invoice_parser import InvoiceParser
+parser = InvoiceParser(strategy="finetuned")
+result = parser.parse("invoice.pdf")
+print(result.to_json())
+```
+## Output Format
+```json
+{
+  "invoice_number": "INV-2024-0042",
+  "invoice_date": "March 15, 2024",
+  "due_date": "April 15, 2024",
+  "vendor_name": "Acme Corp",
+  "vendor_address": "123 Business St, City",
+  "customer_name": "Client LLC",
+  "customer_address": "456 Client Ave, Town",
+  "subtotal": 1200.00,
+  "tax": 216.00,
+  "total": 1416.00
+}
+```
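One way to get from word-level BIO labels to a record shaped like the JSON above is to merge consecutive `B-`/`I-` words per field and parse the amount fields as numbers. This is a sketch under assumptions, not the logic in `invoice_parser.py`: the `FIELD_NAMES` mapping is inferred from the label table and the example output, and `parse_amount` is a hypothetical helper that only handles simple formats such as `$1,416.00`.

```python
import re

# Assumed mapping from label suffix to output key (inferred, not from the source code).
FIELD_NAMES = {
    "INVOICE_NUM": "invoice_number",
    "INVOICE_DATE": "invoice_date",
    "DUE_DATE": "due_date",
    "VENDOR_NAME": "vendor_name",
    "VENDOR_ADDR": "vendor_address",
    "CUST_NAME": "customer_name",
    "CUST_ADDR": "customer_address",
    "SUBTOTAL": "subtotal",
    "TAX": "tax",
    "TOTAL": "total",
}
AMOUNT_FIELDS = {"subtotal", "tax", "total"}


def parse_amount(text):
    """Drop currency symbols and thousands separators; return a float or None."""
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


def group_entities(words, labels):
    """Merge B-/I- tagged words into one value per field, keeping the first span."""
    spans = []      # list of (entity, [words]) in reading order
    current = None  # span still being extended, if any
    for word, label in zip(words, labels):
        prefix, _, entity = label.partition("-")
        if prefix == "B" and entity in FIELD_NAMES:
            current = (entity, [word])
            spans.append(current)
        elif prefix == "I" and current is not None and current[0] == entity:
            current[1].append(word)
        else:
            current = None  # "O" or an inconsistent tag closes the open span

    fields = {}
    for entity, span_words in spans:
        key = FIELD_NAMES[entity]
        if key not in fields:
            text = " ".join(span_words)
            fields[key] = parse_amount(text) if key in AMOUNT_FIELDS else text
    return fields


words = ["Invoice", "INV-2024-0042", "Acme", "Corp", "Total", "$1,416.00"]
labels = ["O", "B-INVOICE_NUM", "B-VENDOR_NAME", "I-VENDOR_NAME", "O", "B-TOTAL"]
print(group_entities(words, labels))
# {'invoice_number': 'INV-2024-0042', 'vendor_name': 'Acme Corp', 'total': 1416.0}
```

Keeping only the first span per field is a simplification; a real pipeline might instead pick the span with the highest model confidence.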
+## Extraction Strategies (invoice_parser.py)
+| Strategy | Speed | Accuracy | Best For |
+|---|---|---|---|
+| `pdfplumber` | Fast | Good | Digital/typed PDFs |
+| `ocr` | Moderate | Good | Scanned PDFs |
+| `finetuned` | Moderate | Very Good | Complex layouts (this model) |
+| `claude` | Moderate | Excellent | Any PDF (needs API key) |
+## Training
+Fine-tuned using `train_model.py` on labeled invoice annotations produced by `label_invoices.py`.
+```bash
+python train_model.py --annotations annotations/ --output trained_model/ --epochs 15
+```
+## License
+MIT