	---
	language: en
	license: mit
	tags:
	- layoutlmv3
	- invoice-parsing
	- document-understanding
	- token-classification
	- ner
	- pdf
	base_model: microsoft/layoutlmv3-base
	pipeline_tag: token-classification
	---

	# PDF Invoice Parser — Fine-tuned LayoutLMv3

	A fine-tuned [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) model for named entity recognition (NER) on PDF invoices. It extracts structured fields such as invoice number, dates, vendor/customer details, and financial totals directly from document pages using text, layout (bounding boxes), and visual features.

	## Model Details

	- Base model: `microsoft/layoutlmv3-base`
	- Architecture: `LayoutLMv3ForTokenClassification`
	- Task: Token classification (NER)
	- Fine-tuned on: Labeled PDF invoice pages

	## Labels

	| Label | Description |
	|---|---|
	| `B/I-INVOICE_NUM` | Invoice number |
	| `B/I-INVOICE_DATE` | Invoice date |
	| `B/I-DUE_DATE` | Payment due date |
	| `B/I-VENDOR_NAME` | Vendor / seller name |
	| `B/I-VENDOR_ADDR` | Vendor address |
	| `B/I-CUST_NAME` | Customer / buyer name |
	| `B/I-CUST_ADDR` | Customer address |
	| `B/I-TOTAL` | Total amount |
	| `B/I-SUBTOTAL` | Subtotal amount |
	| `B/I-TAX` | Tax amount |
	| `O` | Outside / no entity |
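
The `B/I-` prefixes follow the usual BIO convention: `B-` opens an entity and `I-` continues it. The sketch below shows one way to merge word-level tags into field values. The helper name `group_entities` is hypothetical and not part of this repository, and it assumes you already have one label per word.

```python
from typing import List, Optional, Tuple


def group_entities(words: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge word-level BIO tags into (field, text) pairs."""
    entities: List[Tuple[str, str]] = []
    current_field: Optional[str] = None
    current_words: List[str] = []

    def flush() -> None:
        # Emit the entity collected so far, if any.
        if current_field is not None:
            entities.append((current_field, " ".join(current_words)))

    for word, label in zip(words, labels):
        prefix, _, field = label.partition("-")
        if prefix == "B" or (prefix == "I" and field != current_field):
            # A new entity starts; a stray I- tag is treated like B-.
            flush()
            current_field, current_words = field, [word]
        elif prefix == "I":
            current_words.append(word)
        else:
            # "O" closes whatever entity was open.
            flush()
            current_field, current_words = None, []
    flush()
    return entities
```

For example, the words `["Acme", "Corp"]` tagged `["B-VENDOR_NAME", "I-VENDOR_NAME"]` collapse into a single `("VENDOR_NAME", "Acme Corp")` pair.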

	## Quick Start

	```bash
	pip install transformers torch Pillow
	```

	```python
	from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
	import torch
	from PIL import Image

	processor = LayoutLMv3Processor.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser", apply_ocr=False)
	model = LayoutLMv3ForTokenClassification.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser")
	model.eval()

	# words and boxes come from your OCR tool (e.g. pytesseract).
	# The file name and values below are placeholders; replace them with your own.
	image = Image.open("invoice_page.png").convert("RGB")
	words = ["Invoice", "INV-2024-0042"]
	boxes = [[80, 60, 200, 90], [210, 60, 420, 90]]

	encoding = processor(
	    image,          # PIL.Image of the invoice page
	    words,          # list of word strings
	    boxes=boxes,    # list of [x0, y0, x1, y1] normalized to 0–1000
	    return_tensors="pt",
	    truncation=True,
	    padding="max_length",
	    max_length=512,
	)

	with torch.no_grad():
	    outputs = model(**encoding)

	# One label id per token position, including special and padding tokens.
	predictions = outputs.logits.argmax(-1).squeeze().tolist()
	id2label = model.config.id2label
	predicted_labels = [id2label[p] for p in predictions]
	```
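
The snippet above expects boxes on a 0–1000 scale and returns one prediction per token, not per word. Two small helpers can bridge both gaps. This is a sketch under assumptions: `normalize_box` and `word_level_labels` are hypothetical names, the OCR boxes are assumed to be pixel coordinates, and the first sub-token's label is taken as the label for the whole word.

```python
from typing import List, Optional, Sequence


def normalize_box(box: Sequence[float], page_width: float, page_height: float) -> List[int]:
    """Scale a pixel-space [x0, y0, x1, y1] box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    scaled = [
        1000 * x0 / page_width,
        1000 * y0 / page_height,
        1000 * x1 / page_width,
        1000 * y1 / page_height,
    ]
    # Clamp so OCR boxes that overshoot the page edge stay in range.
    return [max(0, min(1000, int(v))) for v in scaled]


def word_level_labels(token_labels: List[str], word_ids: List[Optional[int]]) -> List[str]:
    """Keep the label of each word's first sub-token; skip special and padding tokens."""
    labels: List[str] = []
    previous: Optional[int] = None
    for label, word_id in zip(token_labels, word_ids):
        # word_id is None for special/padding tokens and repeats for later sub-tokens.
        if word_id is not None and word_id != previous:
            labels.append(label)
        previous = word_id
    return labels
```

With a fast tokenizer, `encoding.word_ids(0)` should supply the `word_ids` argument, and `predicted_labels` from the Quick Start supplies `token_labels`.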

	## Full Pipeline (PDF → JSON)

	```python
	from invoice_parser import InvoiceParser

	parser = InvoiceParser(strategy="finetuned")
	result = parser.parse("invoice.pdf")
	print(result.to_json())
	```

	## Output Format

	```json
	{
	  "invoice_number": "INV-2024-0042",
	  "invoice_date": "March 15, 2024",
	  "due_date": "April 15, 2024",
	  "vendor_name": "Acme Corp",
	  "vendor_address": "123 Business St, City",
	  "customer_name": "Client LLC",
	  "customer_address": "456 Client Ave, Town",
	  "subtotal": 1200.00,
	  "tax": 216.00,
	  "total": 1416.00
	}
	```
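
In the example output, `subtotal + tax` equals `total` (1200.00 + 216.00 = 1416.00), which makes a cheap sanity check on extracted values. The sketch below is illustrative only: `parse_amount` and `totals_consistent` are hypothetical helpers, not part of `invoice_parser.py`, and the parsing assumes US-style amounts with `,` as the thousands separator.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(text: str) -> Optional[Decimal]:
    """Parse strings like '$1,416.00' into a Decimal; return None if not numeric."""
    cleaned = text.replace(",", "").strip().lstrip("$€£").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def totals_consistent(result: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Return True if subtotal + tax matches total within the tolerance."""
    # Go through str() so float inputs such as 1200.0 convert cleanly.
    subtotal = Decimal(str(result["subtotal"]))
    tax = Decimal(str(result["tax"]))
    total = Decimal(str(result["total"]))
    return abs(subtotal + tax - total) <= tolerance
```

A failed check does not say which field is wrong, only that the page deserves a second look.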

	## Extraction Strategies (invoice_parser.py)

	| Strategy | Speed | Accuracy | Best For |
	|---|---|---|---|
	| `pdfplumber` | Fast | Good | Digital/typed PDFs |
	| `ocr` | Moderate | Good | Scanned PDFs |
	| `finetuned` | Moderate | Very Good | Complex layouts (this model) |
	| `claude` | Moderate | Excellent | Any PDF (needs API key) |

	## Training

	The model was fine-tuned with `train_model.py` on labeled invoice annotations produced by `label_invoices.py`.

	```bash
	python train_model.py --annotations annotations/ --output trained_model/ --epochs 15
	```
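
A token-classification fine-tune needs a mapping between label strings and integer ids. The sketch below builds one from the fields in the Labels table. It is not taken from `train_model.py`, and the ordering is an assumption: the mapping that actually ships with the checkpoint is `model.config.id2label`.

```python
# Field names from the Labels table; the order here is an assumption.
FIELDS = [
    "INVOICE_NUM", "INVOICE_DATE", "DUE_DATE",
    "VENDOR_NAME", "VENDOR_ADDR",
    "CUST_NAME", "CUST_ADDR",
    "TOTAL", "SUBTOTAL", "TAX",
]

# "O" first, then a B-/I- pair for every field: 1 + 2 * 10 = 21 labels.
LABELS = ["O"] + [f"{prefix}-{field}" for field in FIELDS for prefix in ("B", "I")]

label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}
```

Maps like these are typically passed as `num_labels`, `id2label`, and `label2id` when loading `microsoft/layoutlmv3-base` with `LayoutLMv3ForTokenClassification.from_pretrained`.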

	## License

	MIT