---

title: Invoice Processor Ml
emoji: 
colorFrom: indigo
colorTo: pink
sdk: docker
pinned: false
license: mit
short_description: Hybrid invoice extraction using LayoutLMv3 and Regex
---


# 📄 Smart Invoice Processor

A hybrid invoice extraction system that combines the semantic understanding of LayoutLMv3 with the precision of regex heuristics. Designed for robustness, it uses a dual-engine architecture with automatic fallback logic so that business-critical fields (Invoice #, Date, Total) can still be extracted when the ML model is uncertain.

![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)
![Streamlit](https://img.shields.io/badge/Streamlit-1.51+-red.svg)
![DocTR](https://img.shields.io/badge/DocTR-0.9+-green.svg)
![Transformers](https://img.shields.io/badge/Transformers-4.x-purple.svg)
![PyTorch](https://img.shields.io/badge/PyTorch-2.x-orange.svg)

[![🤗 Live Demo](https://img.shields.io/badge/🤗%20Live%20Demo-Hugging%20Face%20Spaces-yellow.svg)](https://huggingface.co/spaces/GSoumyajit2005/invoice-processor-ml)

---

## 🚀 Try it Live!

> **No installation required!** Try the full application instantly on Hugging Face Spaces:
>
> ### 👉 [**Launch Live Demo**](https://huggingface.co/spaces/GSoumyajit2005/invoice-processor-ml) 👈
>
> Upload any invoice image and watch the hybrid ML+Regex engine extract structured data in real time.

---

## 🎯 Features

### 🧠 Core Intelligence

- **Hybrid Inference Engine:** Automatically triggers a Regex Fallback Engine if the ML model (LayoutLMv3) returns low confidence or leaves critical fields (Invoice #, Date) empty.
- **ML-Based Extraction:** Fine-tuned `LayoutLMv3` Transformer for semantic understanding of complex layouts (SROIE dataset).
- **Rule-Based Fallback:** Deterministic regex patterns ensure 100% coverage for standard fields when ML is uncertain.

### 🛡️ Robustness & Engineering

- **Defensive Data Handling:** Implemented coordinate clamping to prevent model crashes from negative OCR bounding boxes.
- **GPU-Accelerated OCR:** DocTR (Mindee) with automatic CUDA acceleration for faster inference in production.
- **Clean JSON Output:** Normalized schema handling nested entities, line items, and validation flags.
- **Defensive Persistence:** Optional PostgreSQL integration (local Docker or cloud Supabase) that automatically saves extracted data when credentials are present, but gracefully degrades (skips saving) in serverless/demo environments.
- **Async Database Saves:** Background thread processing ensures fast UI response (~5-7s) while database operations happen asynchronously.
- **Duplicate Prevention:** Implemented _Semantic Hashing_ (Vendor + Date + Total + ID) to automatically detect and prevent duplicate invoice entries.

### 💻 Usability

- **Streamlit Web UI:** Interactive dashboard for real-time inference, visualization, and side-by-side comparison (ML vs. Regex).
- **PDF Preview & Overlay:** Visual preview of uploaded PDFs with ML-detected bounding boxes overlay for transparency.
- **CLI & Batch Processing:** Process single files or entire directories via command line with JSON export.
- **Auto-Validation:** Heuristic checks to validate that the extracted "Total Amount" matches the sum of line items.
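
The auto-validation check can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `totals_match`, the item keys `quantity` and `unit_price`, and the tolerance are all assumptions made for the example.

```python
from typing import Optional


def totals_match(total_amount: Optional[float],
                 items: list,
                 tolerance: float = 0.01) -> Optional[bool]:
    """Compare the extracted total with the sum of the line items.

    Returns None when there is nothing to compare (no total or no items).
    The item keys used here are hypothetical, not the project's schema.
    """
    if total_amount is None or not items:
        return None
    line_sum = sum(item["quantity"] * item["unit_price"] for item in items)
    # Allow a small tolerance for rounding differences.
    return abs(line_sum - total_amount) <= tolerance


items = [
    {"quantity": 2, "unit_price": 90.0},
    {"quantity": 1, "unit_price": 13.0},
]
print(totals_match(193.0, items))
```

Returning `None` for "not checkable" is a design choice of this sketch; the real pipeline may treat an empty item list differently.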

> Note on Invoice Numbers: The SROIE dataset used for training does not include "Invoice Number" labels. To solve this, the system uses a Hybrid Fallback Mechanism: if the ML model (LayoutLMv3) returns null for the Invoice Number, the system automatically triggers a targeted Regex extraction to ensure this critical field is captured.

---

## 🛠️ Technical Deep Dive (Why this architecture?)

### 1. The "Safety Net" Fallback Logic

Standard ML models often fail on specific fields like "Invoice Number" if the layout is unseen. This system implements a **priority-based extraction**:

1.  **Primary:** LayoutLMv3 predicts entity labels (context-aware).
2.  **Fallback:** If `Invoice_No` or `Total` is null, the system executes a targeted Regex scan on the raw text.
    _Result:_ Combines the generalization of AI with the determinism of Rules.
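
A minimal sketch of this priority-based merge is shown below. It uses the key names from the JSON output (`receipt_number`, `total_amount`). The regex patterns and the function names `regex_fallback` and `merge_with_fallback` are hypothetical; the project's real patterns live in `src/extraction.py` and may differ.

```python
import re

# Hypothetical patterns, written only for this illustration.
INVOICE_NO_RE = re.compile(
    r"(?:invoice|receipt|bill)\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-/]{3,})",
    re.IGNORECASE,
)
TOTAL_RE = re.compile(
    r"\btotal\b\s*[:.]?\s*(?:RM|\$)?\s*(\d+(?:\.\d{1,2})?)",
    re.IGNORECASE,
)


def regex_fallback(raw_text: str) -> dict:
    """Scan the raw OCR text for the critical fields."""
    found = {}
    match = INVOICE_NO_RE.search(raw_text)
    if match:
        found["receipt_number"] = match.group(1)
    match = TOTAL_RE.search(raw_text)
    if match:
        found["total_amount"] = float(match.group(1))
    return found


def merge_with_fallback(ml_result: dict, raw_text: str) -> dict:
    """Keep ML predictions; fill only the critical fields that are null."""
    merged = dict(ml_result)
    missing = [key for key in ("receipt_number", "total_amount")
               if merged.get(key) is None]
    if missing:
        fallback = regex_fallback(raw_text)
        for key in missing:
            merged[key] = fallback.get(key)
    return merged


raw_text = "OJC MARKETING SDN BHD\nInvoice No: PEGIV-1030765\nTotal: 193.00"
ml_result = {"receipt_number": None, "date": "15/01/2019", "total_amount": 193.0}
print(merge_with_fallback(ml_result, raw_text))
```

The regex scan runs only when a critical field is missing, so ML predictions are never overwritten.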


### 2. Robustness & Error Handling

- **OCR Noise:** Uses DocTR's deep learning-based text recognition for improved accuracy over traditional OCR.
- **Coordinate Normalization:** A custom `clamp()` function ensures all bounding boxes stay strictly within [0, 1000] to prevent Transformer index errors.
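
The clamping step could look roughly like this. The signatures are assumptions for illustration (the project's own `clamp()` may take different arguments), and `normalize_bbox` is a hypothetical helper that scales pixel coordinates to the 0–1000 range LayoutLMv3 expects.

```python
def clamp(value: float, low: int = 0, high: int = 1000) -> int:
    """Round to an integer and force it into [low, high]."""
    return max(low, min(high, int(round(value))))


def normalize_bbox(bbox, page_width: float, page_height: float) -> list:
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range.

    OCR engines can emit slightly negative or out-of-page coordinates;
    clamping keeps every value a valid position-embedding index.
    """
    x0, y0, x1, y1 = bbox
    return [
        clamp(1000 * x0 / page_width),
        clamp(1000 * y0 / page_height),
        clamp(1000 * x1 / page_width),
        clamp(1000 * y1 / page_height),
    ]


# A box that spills outside a 100x100 page on both sides.
print(normalize_bbox((-5, 10, 120, 50), page_width=100, page_height=100))
# -> [0, 100, 1000, 500]
```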

### 3. Dual-Engine Architecture

The system implements a **Dual-Engine Architecture** with automatic fallback logic:

1. **Primary Engine:** LayoutLMv3 predicts entity labels (context-aware).
2. **Fallback Engine:** If `Invoice_No` or `Total` is null, the system executes a targeted Regex scan on the raw text.

### 4. Clean JSON Output

The system outputs a clean JSON with the following fields:

- `receipt_number`: The invoice number (extracted by LayoutLMv3 or Regex).
- `date`: The invoice date (extracted by LayoutLMv3 or Regex).
- `bill_to`: The bill-to information (extracted by LayoutLMv3 or Regex).
- `items`: The list of items (extracted by LayoutLMv3 or Regex).
- `total_amount`: The total amount (extracted by LayoutLMv3 or Regex).
- `extraction_confidence`: The confidence of the extraction (0-100).
- `validation_passed`: Whether the validation passed (true/false).
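
As an illustration, the field list above can be written as a typed structure. The project validates API responses with Pydantic models in `src/schema.py`, but their definitions are not shown here, so this sketch uses standard-library dataclasses; the class names `BillTo` and `InvoiceResult` and the default values are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class BillTo:
    name: Optional[str] = None
    email: Optional[str] = None


@dataclass
class InvoiceResult:
    receipt_number: Optional[str] = None
    date: Optional[str] = None
    bill_to: Optional[BillTo] = None
    items: list = field(default_factory=list)
    total_amount: Optional[float] = None
    extraction_confidence: int = 0      # 0-100
    validation_passed: bool = False


# Values taken from the rule-based example in the Demo section.
result = InvoiceResult(
    receipt_number="PEGIV-1030765",
    date="15/01/2019",
    bill_to=BillTo(name="THE PEAK QUARRY WORKS"),
    total_amount=193.0,
    extraction_confidence=100,
    validation_passed=True,
)
payload = json.dumps(asdict(result), indent=2)
print(payload)
```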

### 5. Defensive Database Architecture

To support both local development (with full persistence) and lightweight cloud demos (without databases), the system uses a **"Soft Fail" Persistence Layer**:

1. **Connection Check:** On startup, the system checks for PostgreSQL credentials. If missing, the database engine is disabled.
2. **Repository Guard:** All CRUD operations check for an active session. If the database is disabled, save operations are skipped silently without crashing the pipeline.
3. **Semantic Hashing:** Before saving, a content-based hash is generated to ensure idempotency.
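
The repository guard and semantic hash can be sketched together as below. This is an in-memory stand-in under stated assumptions: `semantic_hash`, `InvoiceRepository`, the normalization rules, and the choice of SHA-256 are all hypothetical, while the hashed fields (Vendor + Date + Total + ID) follow the description above.

```python
import hashlib


def semantic_hash(vendor, date, total, invoice_id) -> str:
    """Content-based key built from Vendor + Date + Total + ID."""
    parts = [
        (vendor or "").strip().upper(),
        (date or "").strip(),
        "" if total is None else f"{float(total):.2f}",
        (invoice_id or "").strip().upper(),
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


class InvoiceRepository:
    """In-memory stand-in for the database-backed repository."""

    def __init__(self, enabled: bool):
        self.enabled = enabled   # False when no PostgreSQL credentials exist
        self._rows = {}

    def save(self, invoice: dict) -> str:
        if not self.enabled:
            return "skipped"     # soft fail: never crash the pipeline
        key = semantic_hash(
            invoice.get("vendor"),
            invoice.get("date"),
            invoice.get("total_amount"),
            invoice.get("receipt_number"),
        )
        if key in self._rows:
            return "duplicate"   # idempotent: same content, same key
        self._rows[key] = invoice
        return "saved"


invoice = {
    "vendor": "OJC MARKETING SDN BHD",
    "date": "15/01/2019",
    "total_amount": 193.0,
    "receipt_number": "PEGIV-1030765",
}
repo = InvoiceRepository(enabled=True)
print(repo.save(invoice), repo.save(invoice))
```

Normalizing case and whitespace before hashing means small OCR differences in the vendor name do not produce a second row.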

---

## 📊 Demo

### Web Interface

![Homepage](docs/screenshots/homepage.png)
_Clean upload → extract flow with method selector (ML vs Regex)._

### Successful Extraction (ML-based)

![Success Result](docs/screenshots/success_result.png)
_Fields extracted with LayoutLMv3._

### Format Detection (simulated)

![Format Detection](docs/screenshots/format_detection.png)
_UI shows simple format hints and confidence._

### Example JSON (Rule-based)

```json
{
  "receipt_number": "PEGIV-1030765",
  "date": "15/01/2019",
  "bill_to": {
    "name": "THE PEAK QUARRY WORKS",
    "email": null
  },
  "items": [],
  "total_amount": 193.0,
  "extraction_confidence": 100,
  "validation_passed": true,
  "vendor": "OJC MARKETING SDN BHD",
  "address": "NO JALAN BAYU 4, BANDAR SERI ALAM, 81750 MASAI, JOHOR"
}
```

### Example JSON (ML-based)

```json
{
  "receipt_number": null,
  "date": "15/01/2019",
  "bill_to": null,
  "items": [],
  "total_amount": 193.0,
  "vendor": "OJC MARKETING SDN BHD",
  "address": "NO JALAN BAYU 4, BANDAR SERI ALAM, 81750 MASAI, JOHOR",
  "raw_text": "…",
  "raw_ocr_words": ["…"],
  "raw_predictions": {
    "DATE": {"text": "15/01/2019", "bbox": [[…]]},
    "TOTAL": {"text": "193.00", "bbox": [[…]]},
    "COMPANY": {"text": "OJC MARKETING SDN BHD", "bbox": [[…]]},
    "ADDRESS": {"text": "…", "bbox": [[…]]}
  }
}
```

## 🚀 Quick Start

### Prerequisites

- Python 3.10+
- Conda / Miniforge (recommended)
- NVIDIA GPU with CUDA (strongly recommended for usable performance)

⚠️ CPU-only execution is supported but significantly slower
(5–10s per invoice) and intended only for testing.

### Installation (Conda – Recommended)

1. Clone the repository:

```bash
git clone https://github.com/GSoumyajit2005/invoice-processor-ml
cd invoice-processor-ml
```

2. Create and activate the Conda environment:

```bash
conda env create -f environment.yml
conda activate invoice-ml
```

3. Verify CUDA availability (recommended):

```bash
python - <<EOF
import torch
print(torch.cuda.is_available())
EOF
```

4. Run the web app:

```bash
streamlit run app.py
```

> Note: `requirements.txt` is consumed internally by `environment.yml`.
> Do not install it manually with pip.

### Training the Model (Optional)

To retrain the model from scratch using the provided scripts:

```bash
python scripts/train_combined.py
```

> Note: Requires the SROIE dataset in `data/sroie`.

### API Usage (Optional)

To run the API server:

```bash
python src/api.py
```

The API provides endpoints for processing invoices and extracting information.

### Running with Database (Optional)

To enable data persistence, run the included Docker Compose file to spin up PostgreSQL:

```bash
docker-compose up -d
```

The application will automatically detect the database and start saving invoices.

## 💻 Usage

### Web Interface (Recommended)

The easiest way to use the processor is via the web interface.

```bash
streamlit run app.py
```

- Upload an invoice image (PNG/JPG).
- Choose extraction method in sidebar:
  - ML-Based (LayoutLMv3)
  - Rule-Based (Regex)
- View JSON, download results.

### Command-Line Interface (CLI)

You can also process invoices directly from the command line.

#### 1. Processing a Single Invoice

This command processes the provided sample invoice and prints the results to the console.

```bash
python src/pipeline.py data/samples/sample_invoice.jpg --save --method ml
# or
python src/pipeline.py data/samples/sample_invoice.jpg --save --method rules
```

#### 2. Batch Processing a Folder

The CLI can process an entire folder of images at once.

First, place your own invoice images (e.g., `my_invoice1.jpg`, `my_invoice2.png`) into the `data/raw/` folder.

Then, run the following command. It will process all images in `data/raw/`. Saved files are written to `outputs/{stem}_{method}.json`.

```bash
python src/pipeline.py data/raw --save --method ml
```

### Python API

You can integrate the pipeline directly into your own Python scripts.

```python
from src.pipeline import process_invoice
import json

result = process_invoice('data/samples/sample_invoice.jpg', method='ml')
print(json.dumps(result, indent=2))
```
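
Batch processing can also be driven from Python. The sketch below mirrors the CLI's `outputs/{stem}_{method}.json` naming; `process_folder` and `IMAGE_SUFFIXES` are hypothetical names, and the processing function is passed in as a parameter so the example runs without loading the models.

```python
import json
from pathlib import Path
from typing import Callable

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def process_folder(input_dir, output_dir, method: str,
                   process: Callable[..., dict]) -> list:
    """Run `process` on every image in input_dir and save one JSON per file."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(input_dir).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # ignore non-image files
        result = process(str(path), method=method)
        target = out / f"{path.stem}_{method}.json"
        target.write_text(json.dumps(result, indent=2), encoding="utf-8")
        written.append(target)
    return written


# With the real pipeline (assuming the signature shown above):
#   from src.pipeline import process_invoice
#   process_folder("data/raw", "outputs", "ml", process_invoice)
```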

## 🏗️ Architecture

```
                           ┌────────────────┐
                           │  Upload Image  │
                           └───────┬────────┘
                         ┌────────────────────┐
                         │   Preprocessing    │  (OpenCV grayscale/denoise)
                         └────────┬───────────┘
                          ┌───────────────┐
                          │     OCR       │  (DocTR)
                          └───────┬───────┘
                   ┌──────────────┴──────────────┐
                   │                             │
                   ▼                             ▼
         ┌──────────────────┐           ┌────────────────────────┐
         │  Rule-based IE   │           │   ML-based IE (NER)    │
         │  (regex, heur.)  │           │ LayoutLMv3 token-class │
         └────────┬─────────┘           └───────────┬────────────┘
                  │                                 │
                  └──────────────┬──────────────────┘
                         ┌──────────────────┐
                         │   Post-process   │
                         │ validate, scores │
                         └────────┬─────────┘
                   ┌──────────────┴──────────────┐
                   │                             │
                   ▼                             ▼
         ┌──────────────────┐         ┌────────────────────┐
         │    JSON Output   │         │  DB (PostgreSQL)   │
         └──────────────────┘         │   (Optional Save)  │
                                      └────────────────────┘
```

## 📁 Project Structure

```
invoice-processor-ml/
├── data/
│   ├── raw/                    # Input invoice images for processing
│   └── processed/              # (Reserved for future use)
├── data/samples/
│   └── sample_invoice.jpg      # Public sample for quick testing
├── docs/
│   └── screenshots/            # UI Screenshots for the README demo
├── models/
│   └── layoutlmv3-doctr-trained/      # Fine-tuned model (trained with DocTR OCR)
├── outputs/                    # Default folder for saved JSON results
├── scripts/                    # Training and analysis scripts
│   ├── eval_new_dataset.py     # Evaluation scripts
│   ├── explore_new_dataset.py  # Dataset exploration tools
│   ├── prepare_doctr_data.py   # DocTR data alignment for training
│   ├── train_combined.py       # Main training loop (SROIE + Custom Data)
│   └── train_layoutlm.py       # LayoutLMv3 fine-tuning script
├── src/
│   ├── api.py                  # FastAPI REST endpoint for API access
│   ├── data_loader.py          # Unified data loader for training
│   ├── database.py             # Database connection with environment-aware 'soft fail' check
│   ├── extraction.py           # Regex-based information extraction logic
│   ├── ml_extraction.py        # ML-based extraction (LayoutLMv3 + DocTR)
│   ├── models.py               # SQLModel tables (Invoice, LineItem) with schema validation
│   ├── pdf_utils.py            # PDF text extraction and image conversion
│   ├── pipeline.py             # Main orchestrator for the pipeline and CLI
│   ├── preprocessing.py        # Image preprocessing functions (grayscale, denoise)
│   ├── repository.py           # CRUD operations with session safety handling
│   ├── schema.py               # Pydantic models for API response validation
│   ├── sroie_loader.py         # SROIE dataset loading logic
│   └── utils.py                # Utility functions (semantic hashing, etc.)
├── tests/
│   ├── test_extraction.py      # Tests for regex extraction module
│   ├── test_full_pipeline.py   # Full end-to-end integration tests
│   ├── test_pipeline.py        # Pipeline process tests
│   └── test_preprocessing.py   # Tests for the preprocessing module
├── app.py                      # Streamlit web interface
├── requirements.txt            # Python dependencies
├── environment.yml             # Conda environment configuration
├── docker-compose.yml          # Docker Compose configuration for PostgreSQL
├── Dockerfile                  # Dockerfile for building the application container
├── .gitignore                  # Git ignore file
└── README.md                   # You are Here!
```

## 🧠 Model & Training

- **Model**: `microsoft/layoutlmv3-base` (125M params)
- **Task**: Token Classification (NER) with 9 labels: `O, B/I-COMPANY, B/I-ADDRESS, B/I-DATE, B/I-TOTAL`
- **Dataset**: SROIE (ICDAR 2019, English retail receipts), mychen76/invoices-and-receipts_ocr_v1 (English)
- **Training**: RTX 3050 6GB, PyTorch 2.x, Transformers 4.x
- **Result**: F1 Score ≈ 0.83 (Real-world performance on DocTR-aligned validation set)

- **Training script (local)**: `scripts/train_combined.py` (data prep, training loop with validation + model save)
- **Model saved to**: `models/layoutlmv3-doctr-trained/`
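
The 9-label BIO scheme can be illustrated as follows. The label order and the helper `group_entities` are assumptions made for this sketch; the trained model's actual id mapping is stored in its own config and may be ordered differently.

```python
ENTITIES = ["COMPANY", "ADDRESS", "DATE", "TOTAL"]

# "O" plus a B-/I- pair per entity: 1 + 2 * 4 = 9 labels.
LABELS = ["O"] + [f"{prefix}-{entity}"
                  for entity in ENTITIES
                  for prefix in ("B", "I")]
id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}


def group_entities(words, tags) -> dict:
    """Merge BIO-tagged words into {ENTITY: text}.

    Simplification: all words tagged with the same entity are joined
    in reading order, even if they come from separate spans.
    """
    spans = {}
    for word, tag in zip(words, tags):
        if tag == "O":
            continue
        _, entity = tag.split("-", 1)
        spans.setdefault(entity, []).append(word)
    return {entity: " ".join(parts) for entity, parts in spans.items()}


words = ["OJC", "MARKETING", "SDN", "BHD", "15/01/2019", "TOTAL", "193.00"]
tags = ["B-COMPANY", "I-COMPANY", "I-COMPANY", "I-COMPANY", "B-DATE", "O", "B-TOTAL"]
print(len(LABELS), group_entities(words, tags))
```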

## 📈 Performance

- **OCR Precision**: State-of-the-art hierarchical detection using **DocTR (ResNet-50)**. Outperforms Tesseract on complex/noisy layouts.
- **ML-based Extraction**:
  - **Accuracy**: ~83% F1 Score on SROIE + custom invoices
  - **Speed**:
    - **GPU (recommended)**: <1s per invoice
    - **CPU (fallback)**: ~5–7s per invoice

⚠️ CPU-only execution is supported for testing and experimentation but results
in significantly higher latency due to the heavy OCR and layout-aware models.

## ⚠️ Known Limitations

1. **Layout Sensitivity**: The ML model was fine‑tuned on SROIE (retail receipts) and mychen76/invoices-and-receipts_ocr_v1 (English). It may underperform on professional multi-column invoices until it is fine‑tuned on more diverse datasets.
2. **Invoice Number**: SROIE dataset lacks invoice number labels. The system solves this by using the Hybrid Fallback Engine, which successfully extracts invoice numbers using Regex whenever the ML model output is empty.
3. **Line Items/Tables**: Not trained for table extraction yet. Rule-based supports simple totals; table extraction comes later.
4. **Inference Latency**: CPU execution is significantly slower due to heavy OCR and layout-aware models.

## 🔮 Future Enhancements

- [x] Add and fine‑tune on mychen76/invoices-and-receipts_ocr_v1 (English) for broader invoice formats
- [ ] (Optional) Add FATURA (table-focused) for line-item extraction
- [ ] Sliding-window chunking for >512 token documents (to avoid truncation)
- [ ] Table detection (Camelot/Tabula/DeepDeSRT) for line items
- [x] PDF support (pdf2image) for multipage invoices
- [x] FastAPI backend + Docker
- [x] CI/CD pipeline (GitHub Actions → HuggingFace Spaces auto-deploy)
- [ ] Multilingual OCR (PaddleOCR) and multilingual fine‑tuning
- [ ] Confidence calibration and better validation rules
- [x] Database persistence layer (PostgreSQL with SQLModel & Redundancy checks)

## 🛠️ Tech Stack

| Component        | Technology                          |
| ---------------- | ----------------------------------- |
| OCR              | DocTR (Mindee)                      |
| Image Processing | OpenCV, Pillow                      |
| ML/NLP           | PyTorch 2.x, Transformers           |
| Model            | LayoutLMv3 (token class.)           |
| Web Interface    | Streamlit                           |
| Data Format      | JSON                                |
| CI/CD            | GitHub Actions → HuggingFace Spaces |
| Database         | PostgreSQL, SQLModel                |
| Containerization | Docker & Docker Compose             |

## 📚 What I Learned

- OCR challenges (confusable characters, confidence-based filtering)
- Layout-aware NER with LayoutLMv3 (text + bbox + pixels)
- Data normalization (bbox to 0–1000 scale)
- End-to-end pipelines (UI + CLI + JSON output)
- When regex is enough vs when ML is needed
- Evaluation (seqeval F1 for NER)

## 🤝 Contributing

Contributions welcome! Areas needing improvement:

- New patterns for regex extractor
- Better preprocessing for OCR
- New datasets and training configs
- Tests and CI

## 📝 License

MIT License - See LICENSE file for details

## 👨‍💻 Author

**Soumyajit Ghosh** - 3rd Year BTech Student

- Exploring AI/ML and practical applications
- [LinkedIn](https://www.linkedin.com/in/soumyajit-ghosh-tech) | [GitHub](https://github.com/GSoumyajit2005) | [Portfolio](https://soumyajitghosh.vercel.app)

---

**Note**: This is a learning project demonstrating an end-to-end ML pipeline. Not recommended for production use without further validation, retraining on diverse datasets, and security hardening.