Kapilydv6 committed on
Commit e464cfc · verified · 1 Parent(s): 47c86d5

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +122 -0
README.md ADDED
@@ -0,0 +1,122 @@
+ ---
+ language: en
+ license: mit
+ tags:
+ - layoutlmv3
+ - invoice-parsing
+ - document-understanding
+ - token-classification
+ - ner
+ - pdf
+ base_model: microsoft/layoutlmv3-base
+ pipeline_tag: token-classification
+ ---
+
+ # PDF Invoice Parser: Fine-tuned LayoutLMv3
+
+ A fine-tuned [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) model for named entity recognition (NER) on PDF invoices. It extracts structured fields such as invoice number, dates, vendor/customer details, and financial totals directly from document pages using text, layout (bounding boxes), and visual features.
+
+ ## Model Details
+
+ - **Base model:** `microsoft/layoutlmv3-base`
+ - **Architecture:** `LayoutLMv3ForTokenClassification`
+ - **Task:** Token classification (NER)
+ - **Fine-tuned on:** Labeled PDF invoice pages
+
+ ## Labels
+
+ | Label | Description |
+ |---|---|
+ | `B/I-INVOICE_NUM` | Invoice number |
+ | `B/I-INVOICE_DATE` | Invoice date |
+ | `B/I-DUE_DATE` | Payment due date |
+ | `B/I-VENDOR_NAME` | Vendor / seller name |
+ | `B/I-VENDOR_ADDR` | Vendor address |
+ | `B/I-CUST_NAME` | Customer / buyer name |
+ | `B/I-CUST_ADDR` | Customer address |
+ | `B/I-TOTAL` | Total amount |
+ | `B/I-SUBTOTAL` | Subtotal amount |
+ | `B/I-TAX` | Tax amount |
+ | `O` | Outside / no entity |
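
The `B/I-` prefixes in the table follow the usual BIO convention: `B-` opens an entity, `I-` continues it, and `O` marks everything else. Below is a minimal sketch of collapsing tagged words into field strings. It assumes the labels are already aligned one per word, and `group_entities` is a hypothetical helper that is not part of this repository.

```python
def group_entities(words, labels):
    """Collapse BIO-tagged words into (field, text) pairs."""
    entities = []      # list of (field, [words...])
    open_field = None  # field of the entity currently being extended, if any
    for word, label in zip(words, labels):
        prefix, _, field = label.partition("-")
        if prefix == "I" and field == open_field:
            # Continuation of the entity that is currently open.
            entities[-1][1].append(word)
        elif prefix in ("B", "I"):
            # B- starts a new entity; a stray I- is treated as a start too.
            entities.append((field, [word]))
            open_field = field
        else:
            # "O" closes whatever entity was open.
            open_field = None
    return [(field, " ".join(parts)) for field, parts in entities]


words = ["Invoice", "No:", "INV-2024-0042", "Vendor:", "Acme", "Corp"]
labels = ["O", "O", "B-INVOICE_NUM", "O", "B-VENDOR_NAME", "I-VENDOR_NAME"]
print(group_entities(words, labels))
# [('INVOICE_NUM', 'INV-2024-0042'), ('VENDOR_NAME', 'Acme Corp')]
```

Treating a stray `I-` tag as the start of a new entity is one possible design choice; a stricter decoder could discard such tags instead.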
+
+ ## Quick Start
+
+ ```bash
+ pip install transformers torch Pillow
+ ```
+
+ ```python
+ from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
+ import torch
+ from PIL import Image
+
+ processor = LayoutLMv3Processor.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser", apply_ocr=False)
+ model = LayoutLMv3ForTokenClassification.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser")
+ model.eval()
+
+ # image is a PIL.Image of the page, e.g. Image.open("page.png").convert("RGB")
+ # ("page.png" is a placeholder path); words and boxes come from your OCR tool
+ # (e.g. pytesseract)
+ encoding = processor(
+     image,         # PIL.Image of the invoice page
+     words,         # list of word strings
+     boxes=boxes,   # list of [x0, y0, x1, y1] normalized to 0-1000
+     return_tensors="pt",
+     truncation=True,
+     padding="max_length",
+     max_length=512,
+ )
+
+ with torch.no_grad():
+     outputs = model(**encoding)
+
+ # One prediction per token (subwords, special tokens, and padding included),
+ # not one per input word
+ predictions = outputs.logits.argmax(-1).squeeze().tolist()
+ id2label = model.config.id2label
+ predicted_labels = [id2label[p] for p in predictions]
+ ```
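
The `boxes` argument expects integer coordinates on a 0 to 1000 scale, independent of the page's pixel size. Below is a minimal sketch of that rescaling. It assumes the OCR tool reports pixel coordinates as `(x0, y0, x1, y1)`, and `normalize_box` is a hypothetical helper that is not part of this repository.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp so OCR boxes that spill past the page edge stay in range.
    return [max(0, min(1000, v)) for v in scaled]


print(normalize_box((85, 110, 425, 165), page_width=850, page_height=1100))
# [100, 100, 500, 150]
```

With a PIL image, `page_width, page_height = image.size` supplies the two page dimensions.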
+
+ ## Full Pipeline (PDF → JSON)
+
+ ```python
+ from invoice_parser import InvoiceParser
+
+ parser = InvoiceParser(strategy="finetuned")
+ result = parser.parse("invoice.pdf")
+ print(result.to_json())
+ ```
+
+ ## Output Format
+
+ ```json
+ {
+   "invoice_number": "INV-2024-0042",
+   "invoice_date": "March 15, 2024",
+   "due_date": "April 15, 2024",
+   "vendor_name": "Acme Corp",
+   "vendor_address": "123 Business St, City",
+   "customer_name": "Client LLC",
+   "customer_address": "456 Client Ave, Town",
+   "subtotal": 1200.00,
+   "tax": 216.00,
+   "total": 1416.00
+ }
+ ```
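
In the sample output the amounts are consistent: 1200.00 + 216.00 = 1416.00. A small check along those lines can flag likely extraction errors. This is a sketch that assumes the JSON keys shown in the sample, and `totals_consistent` is a hypothetical helper that is not part of `invoice_parser.py`.

```python
import json
import math


def totals_consistent(invoice, tolerance=0.01):
    """Return True when subtotal + tax matches total within a tolerance."""
    try:
        subtotal = float(invoice["subtotal"])
        tax = float(invoice["tax"])
        total = float(invoice["total"])
    except (KeyError, TypeError, ValueError):
        # A missing or non-numeric field cannot be verified.
        return False
    return math.isclose(subtotal + tax, total, abs_tol=tolerance)


sample = json.loads('{"subtotal": 1200.00, "tax": 216.00, "total": 1416.00}')
print(totals_consistent(sample))  # True
```

The one-cent tolerance is an arbitrary choice that absorbs rounding; invoices with discounts or shipping lines would need a different rule.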
+
+ ## Extraction Strategies (`invoice_parser.py`)
+
+ | Strategy | Speed | Accuracy | Best For |
+ |---|---|---|---|
+ | `pdfplumber` | Fast | Good | Digital/typed PDFs |
+ | `ocr` | Moderate | Good | Scanned PDFs |
+ | `finetuned` | Moderate | Very Good | Complex layouts (this model) |
+ | `claude` | Moderate | Excellent | Any PDF (needs API key) |
+
+ ## Training
+
+ Fine-tuned using `train_model.py` on labeled invoice annotations produced by `label_invoices.py`.
+
+ ```bash
+ python train_model.py --annotations annotations/ --output trained_model/ --epochs 15
+ ```
+
+ ## License
+
+ MIT