---
language: en
license: mit
tags:
  - layoutlmv3
  - invoice-parsing
  - document-understanding
  - token-classification
  - ner
  - pdf
base_model: microsoft/layoutlmv3-base
pipeline_tag: token-classification
---

# PDF Invoice Parser — Fine-tuned LayoutLMv3

A fine-tuned [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) model for named entity recognition (NER) on PDF invoices. It extracts structured fields such as invoice number, dates, vendor/customer details, and financial totals directly from document pages using text, layout (bounding boxes), and visual features.

## Model Details

- **Base model:** `microsoft/layoutlmv3-base`
- **Architecture:** `LayoutLMv3ForTokenClassification`
- **Task:** Token classification (NER)
- **Fine-tuned on:** Labeled PDF invoice pages

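## Labels

Labels follow the BIO scheme: each field has a `B-` (beginning) tag and an `I-` (inside) tag, abbreviated as `B/I-` in the table below.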

| Label | Description |
|---|---|
| `B/I-INVOICE_NUM` | Invoice number |
| `B/I-INVOICE_DATE` | Invoice date |
| `B/I-DUE_DATE` | Payment due date |
| `B/I-VENDOR_NAME` | Vendor / seller name |
| `B/I-VENDOR_ADDR` | Vendor address |
| `B/I-CUST_NAME` | Customer / buyer name |
| `B/I-CUST_ADDR` | Customer address |
| `B/I-TOTAL` | Total amount |
| `B/I-SUBTOTAL` | Subtotal amount |
| `B/I-TAX` | Tax amount |
| `O` | Outside / no entity |
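
As a rough illustration of how these tags can be turned into fields, the sketch below merges word-level BIO labels into `(field, text)` spans. It assumes the tags are spelled `B-INVOICE_NUM`, `I-INVOICE_NUM`, and so on; the helper name `group_entities` is hypothetical and not part of this repository, and `model.config.id2label` remains the authoritative source for the label strings and their ids.

```python
from typing import List, Tuple

# Field names taken from the table above; the B-/I- spelling is an assumption.
FIELDS = [
    "INVOICE_NUM", "INVOICE_DATE", "DUE_DATE", "VENDOR_NAME", "VENDOR_ADDR",
    "CUST_NAME", "CUST_ADDR", "TOTAL", "SUBTOTAL", "TAX",
]
# Full tag set: "O" plus a B- and an I- tag per field (order is illustrative only).
LABELS = ["O"] + [f"{prefix}-{field}" for field in FIELDS for prefix in ("B", "I")]


def group_entities(words: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge word-level BIO labels into (field, text) spans (hypothetical helper)."""
    entities: List[Tuple[str, str]] = []
    current_field = None
    current_words: List[str] = []

    def flush() -> None:
        # Close the span that is currently open, if any.
        if current_field is not None:
            entities.append((current_field, " ".join(current_words)))

    for word, label in zip(words, labels):
        prefix, _, field = label.partition("-")
        if prefix == "B" or (prefix == "I" and field != current_field):
            # A B- tag, or an I- tag with no matching open span, starts a new span.
            flush()
            current_field, current_words = field, [word]
        elif prefix == "I":
            current_words.append(word)
        else:
            # "O": close any open span.
            flush()
            current_field, current_words = None, []
    flush()
    return entities
```

For example, the words `["Invoice", "INV-2024-0042", "Acme", "Corp"]` with labels `["O", "B-INVOICE_NUM", "B-VENDOR_NAME", "I-VENDOR_NAME"]` produce one `INVOICE_NUM` span and one `VENDOR_NAME` span.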

## Quick Start

```bash
pip install transformers torch Pillow
```

```python
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
import torch
from PIL import Image

processor = LayoutLMv3Processor.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser")
model.eval()

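# Placeholder inputs: replace with your own page image and OCR output
# (e.g. from pytesseract). The path and values below are illustrative only.
image = Image.open("invoice_page.png").convert("RGB")   # PIL.Image of the invoice page
words = ["Invoice", "INV-2024-0042"]                    # list of word strings
boxes = [[70, 50, 180, 75], [190, 50, 380, 75]]         # [x0, y0, x1, y1] normalized to 0–1000

encoding = processor(
    image,
    words,
    boxes=boxes,
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=512,
)

with torch.no_grad():
    outputs = model(**encoding)

# One prediction per token, including subword pieces, special tokens, and padding
predictions = outputs.logits.argmax(-1).squeeze(0).tolist()
id2label = model.config.id2label
predicted_labels = [id2label[p] for p in predictions]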
```
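
The Quick Start leaves two steps to the reader: producing `words` and `boxes` on the 0–1000 scale, and reducing per-token predictions to one label per word. The sketch below shows one possible way to do both. It assumes the OCR output is a dictionary shaped like the one returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, and that the word ids come from `encoding.word_ids(batch_index=0)`, which requires a fast tokenizer. All three function names are hypothetical helpers, not part of this repository.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def normalize_box(box: Sequence[float], width: int, height: int) -> List[int]:
    """Scale a pixel box [x0, y0, x1, y1] to the 0-1000 range LayoutLMv3 expects."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


def words_and_boxes(ocr: Dict[str, list], width: int, height: int) -> Tuple[List[str], List[List[int]]]:
    """Build words and normalized boxes from an OCR dict with
    'text', 'left', 'top', 'width', 'height' keys (assumed pytesseract layout)."""
    words: List[str] = []
    boxes: List[List[int]] = []
    for text, left, top, w, h in zip(
        ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]
    ):
        if not text.strip():
            continue  # skip empty entries (block- and line-level rows)
        words.append(text)
        boxes.append(normalize_box([left, top, left + w, top + h], width, height))
    return words, boxes


def word_level_labels(token_labels: List[str], word_ids: List[Optional[int]]) -> List[str]:
    """Keep the label of the first token of each word; drop special and padding tokens."""
    labels: List[str] = []
    previous = None
    for label, word_id in zip(token_labels, word_ids):
        if word_id is not None and word_id != previous:
            labels.append(label)
        previous = word_id
    return labels
```

Taking the first subword's label is only one convention; majority voting over a word's tokens is a common alternative. Words cut off by `truncation=True` receive no label at all, so long pages may need to be processed in chunks.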

## Full Pipeline (PDF → JSON)

```python
from invoice_parser import InvoiceParser

parser = InvoiceParser(strategy="finetuned")
result = parser.parse("invoice.pdf")
print(result.to_json())
```

## Output Format

```json
{
  "invoice_number": "INV-2024-0042",
  "invoice_date": "March 15, 2024",
  "due_date": "April 15, 2024",
  "vendor_name": "Acme Corp",
  "vendor_address": "123 Business St, City",
  "customer_name": "Client LLC",
  "customer_address": "456 Client Ave, Town",
  "subtotal": 1200.00,
  "tax": 216.00,
  "total": 1416.00
}
```
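
To make the mapping from entity labels to this JSON shape concrete, here is a minimal sketch. It assumes the extracted entities are available as `(field, text)` pairs, that the label-to-key mapping matches the Labels table and the example above, and that amounts use `.` as the decimal separator. `FIELD_KEYS`, `parse_amount`, `build_record`, and `totals_consistent` are hypothetical names; the actual `invoice_parser.py` may work differently.

```python
import json
import re
from typing import Dict, Iterable, Optional, Tuple, Union

# Assumed mapping from label field to JSON key, inferred from the two tables.
FIELD_KEYS = {
    "INVOICE_NUM": "invoice_number",
    "INVOICE_DATE": "invoice_date",
    "DUE_DATE": "due_date",
    "VENDOR_NAME": "vendor_name",
    "VENDOR_ADDR": "vendor_address",
    "CUST_NAME": "customer_name",
    "CUST_ADDR": "customer_address",
    "SUBTOTAL": "subtotal",
    "TAX": "tax",
    "TOTAL": "total",
}
AMOUNT_KEYS = {"subtotal", "tax", "total"}


def parse_amount(text: str) -> Optional[float]:
    """Strip currency symbols and thousands separators; return None if unparseable."""
    cleaned = re.sub(r"[^\d.\-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


def build_record(entities: Iterable[Tuple[str, str]]) -> Dict[str, Union[str, float, None]]:
    """Keep the first span seen for each field and convert amounts to floats."""
    record: Dict[str, Union[str, float, None]] = {}
    for field, text in entities:
        key = FIELD_KEYS.get(field)
        if key is None or key in record:
            continue
        record[key] = parse_amount(text) if key in AMOUNT_KEYS else text
    return record


def totals_consistent(record: Dict[str, Union[str, float, None]], tolerance: float = 0.005) -> bool:
    """Check that subtotal + tax matches total, within a small tolerance."""
    subtotal, tax, total = record.get("subtotal"), record.get("tax"), record.get("total")
    if not all(isinstance(v, float) for v in (subtotal, tax, total)):
        return False
    return abs(subtotal + tax - total) < tolerance


record = build_record([("SUBTOTAL", "$1,200.00"), ("TAX", "216.00"), ("TOTAL", "$1,416.00")])
print(json.dumps(record, indent=2))
```

In the example output above, 1200.00 + 216.00 = 1416.00, so a check like `totals_consistent` would pass; a mismatch is a cheap signal that one of the amounts was mis-extracted.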

## Extraction Strategies (`invoice_parser.py`)

| Strategy | Speed | Accuracy | Best For |
|---|---|---|---|
| `pdfplumber` | Fast | Good | Digital/typed PDFs |
| `ocr` | Moderate | Good | Scanned PDFs |
| `finetuned` | Moderate | Very Good | Complex layouts (this model) |
| `claude` | Moderate | Excellent | Any PDF (needs API key) |
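
One way to read the "Best For" column is as a simple selection rule. The sketch below is a hypothetical helper, not part of `invoice_parser.py`: it assumes the caller already knows whether the PDF has an extractable text layer and whether the layout is complex, and it leaves out `claude` because that strategy depends on an API key.

```python
def choose_strategy(has_text_layer: bool, complex_layout: bool = False) -> str:
    """Pick a strategy name from the table above (illustrative rule only)."""
    if complex_layout:
        return "finetuned"   # complex layouts: use this model
    if has_text_layer:
        return "pdfplumber"  # digital/typed PDFs
    return "ocr"             # scanned PDFs without a text layer
```

The returned string could then be passed as `InvoiceParser(strategy=...)`.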

## Training

Fine-tuned using `train_model.py` on labeled invoice annotations produced by `label_invoices.py`.

```bash
python train_model.py --annotations annotations/ --output trained_model/ --epochs 15
```

## License

MIT