---
language: en
license: mit
tags:
- layoutlmv3
- invoice-parsing
- document-understanding
- token-classification
- ner
- pdf
base_model: microsoft/layoutlmv3-base
pipeline_tag: token-classification
---
# PDF Invoice Parser — Fine-tuned LayoutLMv3
A fine-tuned [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) model for named entity recognition (NER) on PDF invoices. It extracts structured fields such as invoice number, dates, vendor/customer details, and financial totals directly from document pages using text, layout (bounding boxes), and visual features.
## Model Details
- **Base model:** `microsoft/layoutlmv3-base`
- **Architecture:** `LayoutLMv3ForTokenClassification`
- **Task:** Token classification (NER)
- **Fine-tuned on:** Labeled PDF invoice pages
## Labels
| Label | Description |
|---|---|
| `B/I-INVOICE_NUM` | Invoice number |
| `B/I-INVOICE_DATE` | Invoice date |
| `B/I-DUE_DATE` | Payment due date |
| `B/I-VENDOR_NAME` | Vendor / seller name |
| `B/I-VENDOR_ADDR` | Vendor address |
| `B/I-CUST_NAME` | Customer / buyer name |
| `B/I-CUST_ADDR` | Customer address |
| `B/I-TOTAL` | Total amount |
| `B/I-SUBTOTAL` | Subtotal amount |
| `B/I-TAX` | Tax amount |
| `O` | Outside / no entity |
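The `B/I-` prefixes in the table appear to follow the usual BIO tagging scheme: `B-` marks the first word of an entity, `I-` marks a continuation, and `O` marks everything else. The sketch below shows one way to merge word-level labels into entity strings. `group_bio_entities` is a hypothetical helper written for illustration; it is not part of this repository or of `transformers`.

```python
from typing import List, Tuple


def group_bio_entities(words: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge word-level BIO labels into (entity_type, text) pairs."""
    entities: List[Tuple[str, str]] = []
    current_type = None
    current_words: List[str] = []

    def flush() -> None:
        # Emit the entity collected so far, if any.
        if current_type is not None:
            entities.append((current_type, " ".join(current_words)))

    for word, label in zip(words, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A new entity starts; a stray I- tag is treated as a start too.
            flush()
            current_type, current_words = entity_type, [word]
        elif prefix == "I":
            current_words.append(word)
        else:
            # "O" (or anything unexpected) closes the open entity.
            flush()
            current_type, current_words = None, []
    flush()
    return entities


words = ["Acme", "Corp", "Total", "1416.00"]
labels = ["B-VENDOR_NAME", "I-VENDOR_NAME", "O", "B-TOTAL"]
print(group_bio_entities(words, labels))
```

Treating a stray `I-` tag as the start of a new entity is a design choice that tolerates imperfect predictions; a stricter decoder could discard such tags instead.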
## Quick Start
```bash
pip install transformers torch Pillow
```
```python
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
import torch
from PIL import Image
processor = LayoutLMv3Processor.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser")
model.eval()
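# image, words, and boxes come from your own page rendering and OCR step
# (e.g. pytesseract). The values below are illustrative placeholders.
image = Image.open("invoice_page.png").convert("RGB")  # hypothetical path to a rendered invoice page
words = ["Invoice", "INV-2024-0042"]                    # list of word strings
boxes = [[80, 40, 180, 60], [190, 40, 360, 60]]         # one [x0, y0, x1, y1] per word, normalized to 0–1000

encoding = processor(
    image,
    words,
    boxes=boxes,
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=512,
)

with torch.no_grad():
    outputs = model(**encoding)

# One prediction per token (sub-words, special tokens, and padding included), not per word
predictions = outputs.logits.argmax(-1).squeeze().tolist()
id2label = model.config.id2label
predicted_labels = [id2label[p] for p in predictions]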
```
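Two steps around the model call are easy to get wrong: scaling OCR boxes to the 0–1000 range, and turning per-token predictions back into one label per word. The helpers below are a minimal sketch of both. `normalize_box` and `first_subtoken_labels` are hypothetical names, and taking the label of the first sub-token of each word is one common convention, not necessarily the one used during training of this model.

```python
from typing import List, Optional, Sequence


def normalize_box(box: Sequence[float], page_width: float, page_height: float) -> List[int]:
    """Scale an absolute [x0, y0, x1, y1] box (pixels or points) to the 0-1000 range."""
    x0, y0, x1, y1 = box
    scaled = [
        1000 * x0 / page_width,
        1000 * y0 / page_height,
        1000 * x1 / page_width,
        1000 * y1 / page_height,
    ]
    # Clamp so OCR boxes that spill slightly off the page stay in range.
    return [min(1000, max(0, int(v))) for v in scaled]


def first_subtoken_labels(
    word_ids: Sequence[Optional[int]], token_labels: Sequence[str]
) -> List[str]:
    """Keep the label of the first sub-token of each word; skip special/padding tokens."""
    word_labels: List[str] = []
    previous_word_id = None
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None:  # special token or padding
            continue
        if word_id != previous_word_id:
            word_labels.append(label)
        previous_word_id = word_id
    return word_labels


print(normalize_box([50, 100, 150, 200], page_width=500, page_height=1000))
print(first_subtoken_labels([None, 0, 0, 1, None], ["O", "B-TOTAL", "I-TOTAL", "O", "O"]))
```

With a fast tokenizer, the word indices can be obtained from `encoding.word_ids(batch_index=0)` and passed together with `predicted_labels`. Because the call uses `truncation=True`, words beyond the 512-token limit get no label, so the result can be shorter than `words`.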
## Full Pipeline (PDF → JSON)
This example uses the `InvoiceParser` class from `invoice_parser.py`, which is not part of `transformers` and must be importable from your Python path.
```python
from invoice_parser import InvoiceParser
parser = InvoiceParser(strategy="finetuned")
result = parser.parse("invoice.pdf")
print(result.to_json())
```
## Output Format
```json
{
  "invoice_number": "INV-2024-0042",
  "invoice_date": "March 15, 2024",
  "due_date": "April 15, 2024",
  "vendor_name": "Acme Corp",
  "vendor_address": "123 Business St, City",
  "customer_name": "Client LLC",
  "customer_address": "456 Client Ave, Town",
  "subtotal": 1200.00,
  "tax": 216.00,
  "total": 1416.00
}
```
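The JSON fields line up with the model's labels, with the three amounts emitted as numbers and the rest as strings. The sketch below shows one plausible way to build such a record from `(entity_type, text)` pairs. The label-to-field mapping is inferred from the label table and the sample output, and `parse_amount` assumes amounts written like `$1,416.00` (comma as thousands separator, dot as decimal point); neither is confirmed to match what `invoice_parser.py` does.

```python
import json
import re
from typing import Dict, List, Optional, Tuple, Union

# Assumed mapping from entity types to output field names.
LABEL_TO_FIELD = {
    "INVOICE_NUM": "invoice_number",
    "INVOICE_DATE": "invoice_date",
    "DUE_DATE": "due_date",
    "VENDOR_NAME": "vendor_name",
    "VENDOR_ADDR": "vendor_address",
    "CUST_NAME": "customer_name",
    "CUST_ADDR": "customer_address",
    "SUBTOTAL": "subtotal",
    "TAX": "tax",
    "TOTAL": "total",
}
AMOUNT_FIELDS = {"subtotal", "tax", "total"}


def parse_amount(text: str) -> Optional[float]:
    """Strip currency symbols and thousands separators; return None if unparseable."""
    cleaned = re.sub(r"[^\d.\-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


def entities_to_record(
    entities: List[Tuple[str, str]]
) -> Dict[str, Union[str, float, None]]:
    """Build a flat record, keeping the first occurrence of each field."""
    record: Dict[str, Union[str, float, None]] = {}
    for entity_type, text in entities:
        field = LABEL_TO_FIELD.get(entity_type)
        if field is None or field in record:
            continue
        record[field] = parse_amount(text) if field in AMOUNT_FIELDS else text
    return record


entities = [("INVOICE_NUM", "INV-2024-0042"), ("TOTAL", "$1,416.00")]
print(json.dumps(entities_to_record(entities), indent=2))
```

Keeping only the first occurrence of each field is a simplification; invoices that repeat a total on several lines may need a confidence-based or position-based rule instead.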
## Extraction Strategies (`invoice_parser.py`)
| Strategy | Speed | Accuracy | Best For |
|---|---|---|---|
| `pdfplumber` | Fast | Good | Digital/typed PDFs |
| `ocr` | Moderate | Good | Scanned PDFs |
| `finetuned` | Moderate | Very Good | Complex layouts (this model) |
| `claude` | Moderate | Excellent | Any PDF (needs API key) |
## Training
Fine-tuned using `train_model.py` on labeled invoice annotations produced by `label_invoices.py`.
```bash
python train_model.py --annotations annotations/ --output trained_model/ --epochs 15
```
## License
MIT