GSoumyajit2005 committed on
Commit
65fe9aa
·
1 Parent(s): b477631

docs: Add prominent live demo section, update tech stack with CI/CD, document database scaffold

  1. README.md +62 -19
README.md CHANGED
@@ -19,21 +19,36 @@ A production-grade Hybrid Invoice Extraction System that combines the semantic u
19
  ![Transformers](https://img.shields.io/badge/Transformers-4.x-purple.svg)
20
  ![PyTorch](https://img.shields.io/badge/PyTorch-2.x-orange.svg)
21
 
22
  ---
23
 
24
  ## 🎯 Features
25
 
26
  ### 🧠 Core Intelligence
 
27
  - **Hybrid Inference Engine:** Automatically triggers a Regex Fallback Engine if the ML model (LayoutLMv3) returns low confidence or misses critical fields (Invoice #, Date).
28
  - **ML-Based Extraction:** Fine-tuned `LayoutLMv3` Transformer for semantic understanding of complex layouts (SROIE dataset).
29
  - **Rule-Based Fallback:** Deterministic regex patterns ensure 100% coverage for standard fields when ML is uncertain.
30
 
31
  ### 🛡️ Robustness & Engineering
 
32
  - **Defensive Data Handling:** Implemented coordinate clamping to prevent model crashes from negative OCR bounding boxes.
33
  - **Cross-Platform OCR:** Dynamic Tesseract path discovery that works out-of-the-box on Windows (Local) and Linux (Docker/Production).
34
  - **Clean JSON Output:** Normalized schema handling nested entities, line items, and validation flags.
35
 
36
  ### 💻 Usability
 
37
  - **Streamlit Web UI:** Interactive dashboard for real-time inference, visualization, and side-by-side comparison (ML vs. Regex).
38
  - **CLI & Batch Processing:** Process single files or entire directories via command line with JSON export.
39
  - **Auto-Validation:** Heuristic checks to validate that the extracted "Total Amount" matches the sum of line items.
@@ -45,12 +60,15 @@ A production-grade Hybrid Invoice Extraction System that combines the semantic u
45
  ## 🛠️ Technical Deep Dive (Why this architecture?)
46
 
47
  ### 1. The "Safety Net" Fallback Logic
 
48
  Standard ML models often fail on specific fields like "Invoice Number" if the layout is unseen. This system implements a **priority-based extraction**:
 
49
  1. **Primary:** LayoutLMv3 predicts entity labels (context-aware).
50
  2. **Fallback:** If `Invoice_No` or `Total` is null, the system executes a targeted Regex scan on the raw text.
51
- *Result:* Combines the generalization of AI with the determinism of Rules.
52
 
53
  ### 2. Robustness & Error Handling
 
54
  - **OCR Noise:** Tolerates common Tesseract misreads (e.g., treating "1nvoice" as "Invoice").
55
  - **Coordinate Normalization:** A custom `clamp()` function ensures all bounding boxes stay strictly within [0, 1000] to prevent Transformer index errors.
56
 
@@ -78,18 +96,22 @@ The system outputs a clean JSON with the following fields:
78
  ## 📊 Demo
79
 
80
  ### Web Interface
 
81
  ![Homepage](docs/screenshots/homepage.png)
82
- *Clean upload → extract flow with method selector (ML vs Regex).*
83
 
84
  ### Successful Extraction (ML-based)
 
85
  ![Success Result](docs/screenshots/success_result.png)
86
- *Fields extracted with LayoutLMv3.*
87
 
88
  ### Format Detection (simulated)
 
89
  ![Format Detection](docs/screenshots/format_detection.png)
90
- *UI shows simple format hints and confidence.*
91
 
92
  ### Example JSON (Rule-based)
 
93
  ```json
94
  {
95
  "receipt_number": "PEGIV-1030765",
@@ -108,6 +130,7 @@ The system outputs a clean JSON with the following fields:
108
  ```
109
 
110
  ### Example JSON (ML-based)
 
111
  ```json
112
  {
113
  "receipt_number": null,
@@ -131,6 +154,7 @@ The system outputs a clean JSON with the following fields:
131
  ## 🚀 Quick Start
132
 
133
  ### Prerequisites
 
134
  - Python 3.10+
135
  - Tesseract OCR
136
  - (Optional) CUDA-capable GPU for training/inference speed
@@ -138,6 +162,7 @@ The system outputs a clean JSON with the following fields:
138
  ### Installation
139
 
140
  1. Clone the repository
 
141
  ```bash
142
  git clone https://github.com/GSoumyajit2005/invoice-processor-ml
143
  cd invoice-processor-ml
@@ -146,22 +171,27 @@ cd invoice-processor-ml
146
  2. Create and activate a virtual environment (recommended). This isolates dependencies and ensures the correct Python version is used.
147
 
148
  - **Linux / macOS**:
 
149
  ```bash
150
  python3 -m venv venv
151
  source venv/bin/activate
152
  ```
 
153
  - **Windows**:
 
154
  ```bash
155
  python -m venv venv
156
  .\venv\Scripts\activate
157
  ```
158
 
159
  3. Install dependencies
 
160
  ```bash
161
  pip install -r requirements.txt
162
  ```
163
 
164
  4. Install Tesseract OCR
 
165
  - **Windows**: Download from [UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki)
166
  - **Mac**: `brew install tesseract`
167
  - **Linux**: `sudo apt install tesseract-ocr`
@@ -169,16 +199,19 @@ pip install -r requirements.txt
169
  5. Tesseract configuration (auto-detected): the system automatically detects Tesseract on both Windows (registry/standard paths) and Linux (`/usr/bin/tesseract`), so no manual configuration is required in `src/ocr.py`.
170
 
171
  6. Run the web app
 
172
  ```bash
173
  streamlit run app.py
174
  ```
175
 
176
  ### Training the Model (Optional)
 
177
  To retrain the model from scratch using the provided scripts:
178
 
179
  ```bash
180
  python scripts/train_combined.py
181
  ```
 
182
  (Note: Requires the SROIE dataset in `data/sroie`.)
183
 
184
  ## 💻 Usage
@@ -190,10 +223,11 @@ The easiest way to use the processor is via the web interface.
190
  ```bash
191
  streamlit run app.py
192
  ```
 
193
  - Upload an invoice image (PNG/JPG).
194
  - Choose extraction method in sidebar:
195
- - ML-Based (LayoutLMv3)
196
- - Rule-Based (Regex)
197
  - View JSON, download results.
198
 
199
  ### Command-Line Interface (CLI)
@@ -303,7 +337,10 @@ invoice-processor-ml/
303
  │ ├── ocr.py # Tesseract OCR integration
304
  │ ├── extraction.py # Regex-based information extraction logic
305
  │ ├── ml_extraction.py # ML-based extraction (LayoutLMv3)
306
- │ └── pipeline.py # Main orchestrator for the pipeline and CLI
 
 
 
307
 
308
  ├── tests/
309
  │ ├── test_preprocessing.py # Tests for the preprocessing module
@@ -318,7 +355,7 @@ invoice-processor-ml/
318
  ## 🧠 Model & Training
319
 
320
  - **Model**: `microsoft/layoutlmv3-base` (125M params)
321
- - **Task**: Token Classification (NER) with 9 labels: `O, B/I-COMPANY, B/I-ADDRESS, B/I-DATE, B/I-TOTAL`
322
  - **Dataset**: SROIE (ICDAR 2019, English retail receipts), mychen76/invoices-and-receipts_ocr_v1 (English)
323
  - **Training**: RTX 3050 6GB, PyTorch 2.x, Transformers 4.x
324
  - **Result**: Best F1 ≈ 0.922 on validation (epoch 5 saved)
@@ -332,8 +369,8 @@ invoice-processor-ml/
332
  - **OCR accuracy (clear images)**: High with Tesseract
333
  - **Rule-based extraction**: Strong on simple retail receipts
334
  - **ML-based extraction (SROIE-style)**:
335
- - COMPANY / ADDRESS / DATE / TOTAL: High F1 on simple receipts
336
- - Complex business invoices: Partial extraction unless further fine-tuned
337
 
338
  ## ⚠️ Known Limitations
339
 
@@ -350,19 +387,23 @@ invoice-processor-ml/
350
  - [ ] Table detection (Camelot/Tabula/DeepDeSRT) for line items
351
  - [ ] PDF support (pdf2image) for multipage invoices
352
  - [x] FastAPI backend + Docker
 
353
  - [ ] Multilingual OCR (PaddleOCR) and multilingual fine‑tuning
354
  - [ ] Confidence calibration and better validation rules
 
355
 
356
  ## 🛠️ Tech Stack
357
 
358
- | Component | Technology |
359
- |-----------|------------|
360
- | OCR | Tesseract 5.0+ |
361
- | Image Processing | OpenCV, Pillow |
362
- | ML/NLP | PyTorch 2.x, Transformers |
363
- | Model | LayoutLMv3 (token class.) |
364
- | Web Interface | Streamlit |
365
- | Data Format | JSON |
 
 
366
 
367
  ## 📚 What I Learned
368
 
@@ -376,6 +417,7 @@ invoice-processor-ml/
376
  ## 🤝 Contributing
377
 
378
  Contributions welcome! Areas needing improvement:
 
379
  - New patterns for regex extractor
380
  - Better preprocessing for OCR
381
  - New datasets and training configs
@@ -388,9 +430,10 @@ MIT License - See LICENSE file for details
388
  ## 👨‍💻 Author
389
 
390
  **Soumyajit Ghosh** - 3rd Year BTech Student
 
391
  - Exploring AI/ML and practical applications
392
  - [LinkedIn](https://www.linkedin.com/in/soumyajit-ghosh-tech) | [GitHub](https://github.com/GSoumyajit2005) | [Portfolio](#) (Coming Soon)
393
 
394
  ---
395
 
396
- **Note**: "This is a learning project demonstrating an end-to-end ML pipeline. Not recommended for production use without further validation, retraining on diverse datasets, and security hardening."
 
19
  ![Transformers](https://img.shields.io/badge/Transformers-4.x-purple.svg)
20
  ![PyTorch](https://img.shields.io/badge/PyTorch-2.x-orange.svg)
21
 
22
+ [![🤗 Live Demo](https://img.shields.io/badge/🤗%20Live%20Demo-Hugging%20Face%20Spaces-yellow.svg)](https://huggingface.co/spaces/GSoumyajit2005/invoice-processor-ml)
23
+
24
+ ---
25
+
26
+ ## 🚀 Try it Live!
27
+
28
+ > **No installation required!** Try the full application instantly on Hugging Face Spaces:
29
+ >
30
+ > ### 👉 [**Launch Live Demo**](https://huggingface.co/spaces/GSoumyajit2005/invoice-processor-ml) 👈
31
+ >
32
+ > Upload any invoice image and watch the hybrid ML+Regex engine extract structured data in real-time.
33
+
34
  ---
35
 
36
  ## 🎯 Features
37
 
38
  ### 🧠 Core Intelligence
39
+
40
  - **Hybrid Inference Engine:** Automatically triggers a Regex Fallback Engine if the ML model (LayoutLMv3) returns low confidence or misses critical fields (Invoice #, Date).
41
  - **ML-Based Extraction:** Fine-tuned `LayoutLMv3` Transformer for semantic understanding of complex layouts (SROIE dataset).
42
  - **Rule-Based Fallback:** Deterministic regex patterns ensure 100% coverage for standard fields when ML is uncertain.
43
 
44
  ### 🛡️ Robustness & Engineering
45
+
46
  - **Defensive Data Handling:** Implemented coordinate clamping to prevent model crashes from negative OCR bounding boxes.
47
  - **Cross-Platform OCR:** Dynamic Tesseract path discovery that works out-of-the-box on Windows (Local) and Linux (Docker/Production).
48
  - **Clean JSON Output:** Normalized schema handling nested entities, line items, and validation flags.
49
 
50
  ### 💻 Usability
51
+
52
  - **Streamlit Web UI:** Interactive dashboard for real-time inference, visualization, and side-by-side comparison (ML vs. Regex).
53
  - **CLI & Batch Processing:** Process single files or entire directories via command line with JSON export.
54
  - **Auto-Validation:** Heuristic checks to validate that the extracted "Total Amount" matches the sum of line items.
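
As a rough illustration of the auto-validation idea, the sketch below compares an extracted total with the sum of the line-item amounts. The function name `total_matches_line_items`, the `amount` key, and the 0.01 tolerance are assumptions made for this example; the project's actual schema and thresholds may differ.

```python
from decimal import Decimal


def total_matches_line_items(total, line_items, tolerance="0.01"):
    """Return True if `total` equals the sum of line-item amounts within `tolerance`.

    `line_items` is assumed to be a list of dicts with an "amount" key
    (a hypothetical schema used only for this sketch).
    """
    # Convert through str() so floats such as 10.0 become exact Decimals.
    line_sum = sum(
        (Decimal(str(item["amount"])) for item in line_items),
        Decimal("0"),
    )
    return abs(Decimal(str(total)) - line_sum) <= Decimal(tolerance)
```

Using `Decimal` rather than `float` avoids rounding noise when comparing currency amounts, which is why the tolerance can stay small.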
 
60
  ## 🛠️ Technical Deep Dive (Why this architecture?)
61
 
62
  ### 1. The "Safety Net" Fallback Logic
63
+
64
  Standard ML models often fail on specific fields like "Invoice Number" if the layout is unseen. This system implements a **priority-based extraction**:
65
+
66
  1. **Primary:** LayoutLMv3 predicts entity labels (context-aware).
67
  2. **Fallback:** If `Invoice_No` or `Total` is null, the system executes a targeted Regex scan on the raw text.
68
+ _Result:_ Combines the generalization of AI with the determinism of Rules.
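
The priority-based extraction described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function `extract_with_fallback` and both regex patterns are hypothetical, and the ML predictions are assumed to arrive as a plain dict keyed by field name.

```python
import re

# Hypothetical fallback patterns; the real patterns live in the project's
# regex extractor and are not shown in this README.
FALLBACK_PATTERNS = {
    "Invoice_No": re.compile(
        r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-/]+)",
        re.IGNORECASE,
    ),
    "Total": re.compile(
        r"\btotal\s*[:\-]?\s*\$?\s*(\d+(?:[.,]\d{2})?)",
        re.IGNORECASE,
    ),
}


def extract_with_fallback(ml_fields, raw_text):
    """Keep ML predictions; fill only missing critical fields via regex."""
    result = dict(ml_fields)  # copy so the caller's dict is not mutated
    for field, pattern in FALLBACK_PATTERNS.items():
        if result.get(field) is None:  # field missing or null
            match = pattern.search(raw_text)
            if match:
                result[field] = match.group(1)
    return result
```

The ML output always takes priority here: a regex match is used only when the corresponding field is absent or null.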
69
 
70
  ### 2. Robustness & Error Handling
71
+
72
  - **OCR Noise:** Tolerates common Tesseract misreads (e.g., treating "1nvoice" as "Invoice").
73
  - **Coordinate Normalization:** A custom `clamp()` function ensures all bounding boxes stay strictly within [0, 1000] to prevent Transformer index errors.
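
A minimal sketch of the coordinate clamping idea, assuming OCR boxes arrive as pixel coordinates `(x0, y0, x1, y1)` together with the image size. The helper `normalize_box` is a hypothetical name, and the project's own `clamp()` may have a different signature.

```python
def clamp(value, low=0, high=1000):
    """Force `value` into the inclusive range [low, high]."""
    return max(low, min(high, value))


def normalize_box(box, width, height):
    """Scale a pixel box to the 0-1000 grid and clamp every coordinate.

    Negative or oversized OCR coordinates are pulled back into range so
    they cannot index outside the model's position embeddings.
    """
    x0, y0, x1, y1 = box
    return [
        clamp(int(1000 * x0 / width)),
        clamp(int(1000 * y0 / height)),
        clamp(int(1000 * x1 / width)),
        clamp(int(1000 * y1 / height)),
    ]
```

For example, a box of `(-5, 10, 210, 50)` on a 200x100 image normalizes to `[0, 100, 1000, 500]`.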
74
 
 
96
  ## 📊 Demo
97
 
98
  ### Web Interface
99
+
100
  ![Homepage](docs/screenshots/homepage.png)
101
+ _Clean upload → extract flow with method selector (ML vs Regex)._
102
 
103
  ### Successful Extraction (ML-based)
104
+
105
  ![Success Result](docs/screenshots/success_result.png)
106
+ _Fields extracted with LayoutLMv3._
107
 
108
  ### Format Detection (simulated)
109
+
110
  ![Format Detection](docs/screenshots/format_detection.png)
111
+ _UI shows simple format hints and confidence._
112
 
113
  ### Example JSON (Rule-based)
114
+
115
  ```json
116
  {
117
  "receipt_number": "PEGIV-1030765",
 
130
  ```
131
 
132
  ### Example JSON (ML-based)
133
+
134
  ```json
135
  {
136
  "receipt_number": null,
 
154
  ## 🚀 Quick Start
155
 
156
  ### Prerequisites
157
+
158
  - Python 3.10+
159
  - Tesseract OCR
160
  - (Optional) CUDA-capable GPU for training/inference speed
 
162
  ### Installation
163
 
164
  1. Clone the repository
165
+
166
  ```bash
167
  git clone https://github.com/GSoumyajit2005/invoice-processor-ml
168
  cd invoice-processor-ml
 
171
  2. Create and activate a virtual environment (recommended). This isolates dependencies and ensures the correct Python version is used.
172
 
173
  - **Linux / macOS**:
174
+
175
  ```bash
176
  python3 -m venv venv
177
  source venv/bin/activate
178
  ```
179
+
180
  - **Windows**:
181
+
182
  ```bash
183
  python -m venv venv
184
  .\venv\Scripts\activate
185
  ```
186
 
187
  3. Install dependencies
188
+
189
  ```bash
190
  pip install -r requirements.txt
191
  ```
192
 
193
  4. Install Tesseract OCR
194
+
195
  - **Windows**: Download from [UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki)
196
  - **Mac**: `brew install tesseract`
197
  - **Linux**: `sudo apt install tesseract-ocr`
 
199
  5. Tesseract configuration (auto-detected): the system automatically detects Tesseract on both Windows (registry/standard paths) and Linux (`/usr/bin/tesseract`), so no manual configuration is required in `src/ocr.py`.
200
 
201
  6. Run the web app
202
+
203
  ```bash
204
  streamlit run app.py
205
  ```
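
The Tesseract auto-detection mentioned in step 5 could look roughly like the sketch below. It is an assumption about the approach, not the contents of `src/ocr.py`: the function name `find_tesseract` and the Windows candidate path are illustrative, and registry lookup is omitted.

```python
import os
import shutil

# Assumed fallback locations; only /usr/bin/tesseract is named in this README.
CANDIDATE_PATHS = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/usr/bin/tesseract",
]


def find_tesseract(candidates=CANDIDATE_PATHS):
    """Return a path to the Tesseract binary, or None if none is found."""
    # Prefer whatever is already on PATH (works on Windows and Linux).
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise probe the well-known install locations in order.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

If the project uses `pytesseract`, the returned path could be assigned to `pytesseract.pytesseract.tesseract_cmd` before running OCR.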
206
 
207
  ### Training the Model (Optional)
208
+
209
  To retrain the model from scratch using the provided scripts:
210
 
211
  ```bash
212
  python scripts/train_combined.py
213
  ```
214
+
215
  (Note: Requires the SROIE dataset in `data/sroie`.)
216
 
217
  ## 💻 Usage
 
223
  ```bash
224
  streamlit run app.py
225
  ```
226
+
227
  - Upload an invoice image (PNG/JPG).
228
  - Choose extraction method in sidebar:
229
+ - ML-Based (LayoutLMv3)
230
+ - Rule-Based (Regex)
231
  - View JSON, download results.
232
 
233
  ### Command-Line Interface (CLI)
 
337
  │ ├── ocr.py # Tesseract OCR integration
338
  │ ├── extraction.py # Regex-based information extraction logic
339
  │ ├── ml_extraction.py # ML-based extraction (LayoutLMv3)
340
+ │ ├── pipeline.py # Main orchestrator for the pipeline and CLI
341
+ │ ├── database.py # PostgreSQL connection (scaffolded)
342
+ │ ├── models.py # SQLModel tables for persistence (scaffolded)
343
+ │ └── repository.py # CRUD operations for invoices (scaffolded)
344
 
345
  ├── tests/
346
  │ ├── test_preprocessing.py # Tests for the preprocessing module
 
355
  ## 🧠 Model & Training
356
 
357
  - **Model**: `microsoft/layoutlmv3-base` (125M params)
358
+ - **Task**: Token Classification (NER) with 9 labels: `O, B/I-COMPANY, B/I-ADDRESS, B/I-DATE, B/I-TOTAL`
359
  - **Dataset**: SROIE (ICDAR 2019, English retail receipts), mychen76/invoices-and-receipts_ocr_v1 (English)
360
  - **Training**: RTX 3050 6GB, PyTorch 2.x, Transformers 4.x
361
  - **Result**: Best F1 ≈ 0.922 on validation (epoch 5 saved)
 
369
  - **OCR accuracy (clear images)**: High with Tesseract
370
  - **Rule-based extraction**: Strong on simple retail receipts
371
  - **ML-based extraction (SROIE-style)**:
372
+ - COMPANY / ADDRESS / DATE / TOTAL: High F1 on simple receipts
373
+ - Complex business invoices: Partial extraction unless further fine-tuned
374
 
375
  ## ⚠️ Known Limitations
376
 
 
387
  - [ ] Table detection (Camelot/Tabula/DeepDeSRT) for line items
388
  - [ ] PDF support (pdf2image) for multipage invoices
389
  - [x] FastAPI backend + Docker
390
+ - [x] CI/CD pipeline (GitHub Actions → HuggingFace Spaces auto-deploy)
391
  - [ ] Multilingual OCR (PaddleOCR) and multilingual fine‑tuning
392
  - [ ] Confidence calibration and better validation rules
393
+ - [ ] Database persistence layer (PostgreSQL - scaffolded, ready for implementation)
394
 
395
  ## 🛠️ Tech Stack
396
 
397
+ | Component | Technology |
398
+ | ---------------- | ----------------------------------- |
399
+ | OCR | Tesseract 5.0+ |
400
+ | Image Processing | OpenCV, Pillow |
401
+ | ML/NLP | PyTorch 2.x, Transformers |
402
+ | Model | LayoutLMv3 (token class.) |
403
+ | Web Interface | Streamlit |
404
+ | Data Format | JSON |
405
+ | CI/CD | GitHub Actions → HuggingFace Spaces |
406
+ | Containerization | Docker |
407
 
408
  ## 📚 What I Learned
409
 
 
417
  ## 🤝 Contributing
418
 
419
  Contributions welcome! Areas needing improvement:
420
+
421
  - New patterns for regex extractor
422
  - Better preprocessing for OCR
423
  - New datasets and training configs
 
430
  ## 👨‍💻 Author
431
 
432
  **Soumyajit Ghosh** - 3rd Year BTech Student
433
+
434
  - Exploring AI/ML and practical applications
435
  - [LinkedIn](https://www.linkedin.com/in/soumyajit-ghosh-tech) | [GitHub](https://github.com/GSoumyajit2005) | [Portfolio](#) (Coming Soon)
436
 
437
  ---
438
 
439
+ **Note**: "This is a learning project demonstrating an end-to-end ML pipeline. Not recommended for production use without further validation, retraining on diverse datasets, and security hardening."