Commit 65fe9aa
Parent(s): b477631
docs: Add prominent live demo section, update tech stack with CI/CD, document database scaffold
README.md CHANGED
|
@@ -19,21 +19,36 @@ A production-grade Hybrid Invoice Extraction System that combines the semantic u
|
|
| 19 |

|
| 20 |

|
| 21 |
|
| 22 |
---
|
| 23 |
|
| 24 |
## 🎯 Features
|
| 25 |
|
| 26 |
### 🧠 Core Intelligence
|
|
|
|
| 27 |
- **Hybrid Inference Engine:** Automatically triggers a Regex Fallback Engine if the ML model (LayoutLMv3) returns low confidence or missing critical fields (Invoice #, Date).
|
| 28 |
- **ML-Based Extraction:** Fine-tuned `LayoutLMv3` Transformer for semantic understanding of complex layouts (SROIE dataset).
|
| 29 |
- **Rule-Based Fallback:** Deterministic regex patterns ensure 100% coverage for standard fields when ML is uncertain.
|
| 30 |
|
| 31 |
### 🛡️ Robustness & Engineering
|
|
|
|
| 32 |
- **Defensive Data Handling:** Implemented coordinate clamping to prevent model crashes from negative OCR bounding boxes.
|
| 33 |
- **Cross-Platform OCR:** Dynamic Tesseract path discovery that works out-of-the-box on Windows (Local) and Linux (Docker/Production).
|
| 34 |
- **Clean JSON Output:** Normalized schema handling nested entities, line items, and validation flags.
|
| 35 |
|
| 36 |
### 💻 Usability
|
|
|
|
| 37 |
- **Streamlit Web UI:** Interactive dashboard for real-time inference, visualization, and side-by-side comparison (ML vs. Regex).
|
| 38 |
- **CLI & Batch Processing:** Process single files or entire directories via command line with JSON export.
|
| 39 |
- **Auto-Validation:** Heuristic checks to validate that the extracted "Total Amount" matches the sum of line items.
|
|
@@ -45,12 +60,15 @@ A production-grade Hybrid Invoice Extraction System that combines the semantic u
|
|
| 45 |
## 🛠️ Technical Deep Dive (Why this architecture?)
|
| 46 |
|
| 47 |
### 1. The "Safety Net" Fallback Logic
|
|
|
|
| 48 |
Standard ML models often fail on specific fields like "Invoice Number" if the layout is unseen. This system implements a **priority-based extraction**:
|
|
|
|
| 49 |
1. **Primary:** LayoutLMv3 predicts entity labels (context-aware).
|
| 50 |
2. **Fallback:** If `Invoice_No` or `Total` is null, the system executes a targeted Regex scan on the raw text.
|
| 51 |
-
|
| 52 |
|
| 53 |
### 2. Robustness & Error Handling
|
|
|
|
| 54 |
- **OCR Noise:** Tolerates common Tesseract misreads (e.g., "1nvoice" is still recognized as "Invoice").
|
| 55 |
- **Coordinate Normalization:** A custom `clamp()` function ensures all bounding boxes stay strictly within [0, 1000] to prevent Transformer index errors.
|
| 56 |
|
|
@@ -78,18 +96,22 @@ The system outputs a clean JSON with the following fields:
|
|
| 78 |
## 📊 Demo
|
| 79 |
|
| 80 |
### Web Interface
|
|
|
|
| 81 |

|
| 82 |
-
|
| 83 |
|
| 84 |
### Successful Extraction (ML-based)
|
|
|
|
| 85 |

|
| 86 |
-
|
| 87 |
|
| 88 |
### Format Detection (simulated)
|
|
|
|
| 89 |

|
| 90 |
-
|
| 91 |
|
| 92 |
### Example JSON (Rule-based)
|
|
|
|
| 93 |
```json
|
| 94 |
{
|
| 95 |
"receipt_number": "PEGIV-1030765",
|
|
@@ -108,6 +130,7 @@ The system outputs a clean JSON with the following fields:
|
|
| 108 |
```
|
| 109 |
|
| 110 |
### Example JSON (ML-based)
|
|
|
|
| 111 |
```json
|
| 112 |
{
|
| 113 |
"receipt_number": null,
|
|
@@ -131,6 +154,7 @@ The system outputs a clean JSON with the following fields:
|
|
| 131 |
## 🚀 Quick Start
|
| 132 |
|
| 133 |
### Prerequisites
|
|
|
|
| 134 |
- Python 3.10+
|
| 135 |
- Tesseract OCR
|
| 136 |
- (Optional) CUDA-capable GPU for training/inference speed
|
|
@@ -138,6 +162,7 @@ The system outputs a clean JSON with the following fields:
|
|
| 138 |
### Installation
|
| 139 |
|
| 140 |
1. Clone the repository
|
|
|
|
| 141 |
```bash
|
| 142 |
git clone https://github.com/GSoumyajit2005/invoice-processor-ml
|
| 143 |
cd invoice-processor-ml
|
|
@@ -146,22 +171,27 @@ cd invoice-processor-ml
|
|
| 146 |
2. Create and activate a virtual environment (recommended). This ensures the correct Python version and isolates dependencies.
|
| 147 |
|
| 148 |
- **Linux / macOS**:
|
|
|
|
| 149 |
```bash
|
| 150 |
python3 -m venv venv
|
| 151 |
source venv/bin/activate
|
| 152 |
```
|
|
|
|
| 153 |
- **Windows**:
|
|
|
|
| 154 |
```bash
|
| 155 |
python -m venv venv
|
| 156 |
.\venv\Scripts\activate
|
| 157 |
```
|
| 158 |
|
| 159 |
3. Install dependencies
|
|
|
|
| 160 |
```bash
|
| 161 |
pip install -r requirements.txt
|
| 162 |
```
|
| 163 |
|
| 164 |
4. Install Tesseract OCR
|
|
|
|
| 165 |
- **Windows**: Download from [UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki)
|
| 166 |
- **Mac**: `brew install tesseract`
|
| 167 |
- **Linux**: `sudo apt install tesseract-ocr`
|
|
@@ -169,16 +199,19 @@ pip install -r requirements.txt
|
|
| 169 |
5. Tesseract configuration (auto-detected): the system automatically detects Tesseract on both Windows (registry and standard paths) and Linux (`/usr/bin/tesseract`). No manual configuration is required in `src/ocr.py`.
|
| 170 |
|
| 171 |
6. Run the web app
|
|
|
|
| 172 |
```bash
|
| 173 |
streamlit run app.py
|
| 174 |
```
|
| 175 |
|
| 176 |
### Training the Model (Optional)
|
|
|
|
| 177 |
To retrain the model from scratch using the provided scripts:
|
| 178 |
|
| 179 |
```bash
|
| 180 |
python scripts/train_combined.py
|
| 181 |
```
|
|
|
|
| 182 |
(Note: requires the SROIE dataset in `data/sroie`.)
|
| 183 |
|
| 184 |
## 💻 Usage
|
|
@@ -190,10 +223,11 @@ The easiest way to use the processor is via the web interface.
|
|
| 190 |
```bash
|
| 191 |
streamlit run app.py
|
| 192 |
```
|
|
|
|
| 193 |
- Upload an invoice image (PNG/JPG).
|
| 194 |
- Choose extraction method in sidebar:
|
| 195 |
-
|
| 196 |
-
|
| 197 |
- View JSON, download results.
|
| 198 |
|
| 199 |
### Command-Line Interface (CLI)
|
|
@@ -303,7 +337,10 @@ invoice-processor-ml/
|
|
| 303 |
│ ├── ocr.py # Tesseract OCR integration
|
| 304 |
│ ├── extraction.py # Regex-based information extraction logic
|
| 305 |
│ ├── ml_extraction.py # ML-based extraction (LayoutLMv3)
|
| 306 |
-
│
|
| 307 |
│
|
| 308 |
├── tests/
|
| 309 |
│ ├── test_preprocessing.py # Tests for the preprocessing module
|
|
@@ -318,7 +355,7 @@ invoice-processor-ml/
|
|
| 318 |
## 🧠 Model & Training
|
| 319 |
|
| 320 |
- **Model**: `microsoft/layoutlmv3-base` (125M params)
|
| 321 |
-
- **Task**:
|
| 322 |
- **Dataset**: SROIE (ICDAR 2019, English retail receipts), mychen76/invoices-and-receipts_ocr_v1 (English)
|
| 323 |
- **Training**: RTX 3050 6GB, PyTorch 2.x, Transformers 4.x
|
| 324 |
- **Result**: Best F1 ≈ 0.922 on validation (epoch 5 saved)
|
|
@@ -332,8 +369,8 @@ invoice-processor-ml/
|
|
| 332 |
- **OCR accuracy (clear images)**: High with Tesseract
|
| 333 |
- **Rule-based extraction**: Strong on simple retail receipts
|
| 334 |
- **ML-based extraction (SROIE-style)**:
|
| 335 |
-
|
| 336 |
-
|
| 337 |
|
| 338 |
## ⚠️ Known Limitations
|
| 339 |
|
|
@@ -350,19 +387,23 @@ invoice-processor-ml/
|
|
| 350 |
- [ ] Table detection (Camelot/Tabula/DeepDeSRT) for line items
|
| 351 |
- [ ] PDF support (pdf2image) for multipage invoices
|
| 352 |
- [x] FastAPI backend + Docker
|
|
|
|
| 353 |
- [ ] Multilingual OCR (PaddleOCR) and multilingual fine‑tuning
|
| 354 |
- [ ] Confidence calibration and better validation rules
|
|
|
|
| 355 |
|
| 356 |
## 🛠️ Tech Stack
|
| 357 |
|
| 358 |
-
| Component
|
| 359 |
-
|
| 360 |
-
| OCR
|
| 361 |
-
| Image Processing | OpenCV, Pillow
|
| 362 |
-
| ML/NLP
|
| 363 |
-
| Model
|
| 364 |
-
| Web Interface
|
| 365 |
-
| Data Format
|
|
|
|
|
|
|
| 366 |
|
| 367 |
## 📚 What I Learned
|
| 368 |
|
|
@@ -376,6 +417,7 @@ invoice-processor-ml/
|
|
| 376 |
## 🤝 Contributing
|
| 377 |
|
| 378 |
Contributions welcome! Areas needing improvement:
|
|
|
|
| 379 |
- New patterns for regex extractor
|
| 380 |
- Better preprocessing for OCR
|
| 381 |
- New datasets and training configs
|
|
@@ -388,9 +430,10 @@ MIT License - See LICENSE file for details
|
|
| 388 |
## 👨💻 Author
|
| 389 |
|
| 390 |
**Soumyajit Ghosh** - 3rd Year BTech Student
|
|
|
|
| 391 |
- Exploring AI/ML and practical applications
|
| 392 |
- [LinkedIn](https://www.linkedin.com/in/soumyajit-ghosh-tech) | [GitHub](https://github.com/GSoumyajit2005) | [Portfolio](#) (Coming Soon)
|
| 393 |
|
| 394 |
---
|
| 395 |
|
| 396 |
-
**Note**: "This is a learning project demonstrating an end-to-end ML pipeline. Not recommended for production use without further validation, retraining on diverse datasets, and security hardening."
|
|
|
|
| 19 |

|
| 20 |

|
| 21 |
|
| 22 |
+
[](https://huggingface.co/spaces/GSoumyajit2005/invoice-processor-ml)
|
| 23 |
+
|
| 24 |
+
---
|
| 25 |
+
|
| 26 |
+
## 🚀 Try it Live!
|
| 27 |
+
|
| 28 |
+
> **No installation required!** Try the full application instantly on Hugging Face Spaces:
|
| 29 |
+
>
|
| 30 |
+
> ### 👉 [**Launch Live Demo**](https://huggingface.co/spaces/GSoumyajit2005/invoice-processor-ml) 👈
|
| 31 |
+
>
|
| 32 |
+
> Upload any invoice image and watch the hybrid ML+Regex engine extract structured data in real-time.
|
| 33 |
+
|
| 34 |
---
|
| 35 |
|
| 36 |
## 🎯 Features
|
| 37 |
|
| 38 |
### 🧠 Core Intelligence
|
| 39 |
+
|
| 40 |
- **Hybrid Inference Engine:** Automatically triggers a Regex Fallback Engine if the ML model (LayoutLMv3) returns low confidence or missing critical fields (Invoice #, Date).
|
| 41 |
- **ML-Based Extraction:** Fine-tuned `LayoutLMv3` Transformer for semantic understanding of complex layouts (SROIE dataset).
|
| 42 |
- **Rule-Based Fallback:** Deterministic regex patterns ensure 100% coverage for standard fields when ML is uncertain.
|
| 43 |
|
| 44 |
### 🛡️ Robustness & Engineering
|
| 45 |
+
|
| 46 |
- **Defensive Data Handling:** Implemented coordinate clamping to prevent model crashes from negative OCR bounding boxes.
|
| 47 |
- **Cross-Platform OCR:** Dynamic Tesseract path discovery that works out-of-the-box on Windows (Local) and Linux (Docker/Production).
|
| 48 |
- **Clean JSON Output:** Normalized schema handling nested entities, line items, and validation flags.
|
| 49 |
|
| 50 |
### 💻 Usability
|
| 51 |
+
|
| 52 |
- **Streamlit Web UI:** Interactive dashboard for real-time inference, visualization, and side-by-side comparison (ML vs. Regex).
|
| 53 |
- **CLI & Batch Processing:** Process single files or entire directories via command line with JSON export.
|
| 54 |
- **Auto-Validation:** Heuristic checks to validate that the extracted "Total Amount" matches the sum of line items.
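
The auto-validation check can be pictured with a minimal sketch like the one below. It assumes line items arrive as dictionaries with an `amount` key and that a one-cent tolerance is acceptable; both the field name and the tolerance are illustrative assumptions, not taken from the project's schema.

```python
from decimal import Decimal


def total_matches_line_items(total, line_items, tolerance=Decimal("0.01")):
    """Return True when the extracted total equals the sum of line-item amounts.

    `line_items` is assumed to be a list of dicts with an "amount" key
    (a hypothetical field name used only for this sketch).
    """
    if total is None or not line_items:
        return False  # nothing to compare, so flag the document for review
    # Decimal avoids the rounding drift that float arithmetic adds to money values.
    items_sum = sum(
        (Decimal(str(item["amount"])) for item in line_items), Decimal("0")
    )
    return abs(Decimal(str(total)) - items_sum) <= tolerance


print(total_matches_line_items("12.50", [{"amount": "10.00"}, {"amount": "2.50"}]))
```

A `False` result here would only mean "needs review", since taxes or discounts can legitimately make the total differ from the line-item sum.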
|
|
|
|
| 60 |
## 🛠️ Technical Deep Dive (Why this architecture?)
|
| 61 |
|
| 62 |
### 1. The "Safety Net" Fallback Logic
|
| 63 |
+
|
| 64 |
Standard ML models often fail on specific fields like "Invoice Number" if the layout is unseen. This system implements a **priority-based extraction**:
|
| 65 |
+
|
| 66 |
1. **Primary:** LayoutLMv3 predicts entity labels (context-aware).
|
| 67 |
2. **Fallback:** If `Invoice_No` or `Total` is null, the system executes a targeted Regex scan on the raw text.
|
| 68 |
+
_Result:_ Combines the generalization of AI with the determinism of Rules.
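
The two steps above can be sketched as follows. The regex patterns and the `apply_fallback` helper are hypothetical and only illustrate the priority order (ML output first, regex only for null fields); the field names `Invoice_No` and `Total` are the ones mentioned above.

```python
import re

# Hypothetical patterns; the project's real regexes may differ.
# [I1l] tolerates the common OCR confusion between "I", "1" and "l".
# The lookahead requires at least one digit so words like "date" are not captured.
FALLBACK_PATTERNS = {
    "Invoice_No": re.compile(
        r"[I1l]nvoice\s*(?:No\.?|Number|#)?\s*[:#]?\s*"
        r"((?=[A-Z0-9-]*\d)[A-Z0-9][A-Z0-9-]+)",
        re.IGNORECASE,
    ),
    "Total": re.compile(
        r"\bTotal\b\s*[:=]?\s*\$?\s*(\d+(?:[.,]\d{2})?)", re.IGNORECASE
    ),
}


def apply_fallback(ml_fields, raw_text):
    """Fill null critical fields with regex matches; never overwrite ML output."""
    merged = dict(ml_fields)
    for field, pattern in FALLBACK_PATTERNS.items():
        if merged.get(field) is None:
            match = pattern.search(raw_text)
            if match:
                merged[field] = match.group(1)
    return merged


text = "1nvoice No: PEGIV-1030765\nTotal: 42.00"
print(apply_fallback({"Invoice_No": None, "Total": "42.00"}, text))
# {'Invoice_No': 'PEGIV-1030765', 'Total': '42.00'}
```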
|
| 69 |
|
| 70 |
### 2. Robustness & Error Handling
|
| 71 |
+
|
| 72 |
- **OCR Noise:** Tolerates common Tesseract misreads (e.g., "1nvoice" is still recognized as "Invoice").
|
| 73 |
- **Coordinate Normalization:** A custom `clamp()` function ensures all bounding boxes stay strictly within [0, 1000] to prevent Transformer index errors.
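
As a rough illustration of the coordinate clamping, here is a minimal sketch. The `clamp()` signature and the `normalize_box` helper are assumptions made for this example; the project's own function may take different arguments.

```python
def clamp(value, low=0, high=1000):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def normalize_box(box, page_width, page_height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 grid, then clamp it."""
    x0, y0, x1, y1 = box
    return [
        clamp(int(1000 * x0 / page_width)),
        clamp(int(1000 * y0 / page_height)),
        clamp(int(1000 * x1 / page_width)),
        clamp(int(1000 * y1 / page_height)),
    ]


# A box that starts left of the page and ends past its right edge.
print(normalize_box((-5, 10, 620, 400), page_width=600, page_height=800))
# [0, 12, 1000, 500]
```

Clamping after scaling means a slightly out-of-page OCR box is pulled back to the page border instead of producing an out-of-range position index.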
|
| 74 |
|
|
|
|
| 96 |
## 📊 Demo
|
| 97 |
|
| 98 |
### Web Interface
|
| 99 |
+
|
| 100 |

|
| 101 |
+
_Clean upload → extract flow with method selector (ML vs Regex)._
|
| 102 |
|
| 103 |
### Successful Extraction (ML-based)
|
| 104 |
+
|
| 105 |

|
| 106 |
+
_Fields extracted with LayoutLMv3._
|
| 107 |
|
| 108 |
### Format Detection (simulated)
|
| 109 |
+
|
| 110 |

|
| 111 |
+
_UI shows simple format hints and confidence._
|
| 112 |
|
| 113 |
### Example JSON (Rule-based)
|
| 114 |
+
|
| 115 |
```json
|
| 116 |
{
|
| 117 |
"receipt_number": "PEGIV-1030765",
|
|
|
|
| 130 |
```
|
| 131 |
|
| 132 |
### Example JSON (ML-based)
|
| 133 |
+
|
| 134 |
```json
|
| 135 |
{
|
| 136 |
"receipt_number": null,
|
|
|
|
| 154 |
## 🚀 Quick Start
|
| 155 |
|
| 156 |
### Prerequisites
|
| 157 |
+
|
| 158 |
- Python 3.10+
|
| 159 |
- Tesseract OCR
|
| 160 |
- (Optional) CUDA-capable GPU for training/inference speed
|
|
|
|
| 162 |
### Installation
|
| 163 |
|
| 164 |
1. Clone the repository
|
| 165 |
+
|
| 166 |
```bash
|
| 167 |
git clone https://github.com/GSoumyajit2005/invoice-processor-ml
|
| 168 |
cd invoice-processor-ml
|
|
|
|
| 171 |
2. Create and activate a virtual environment (recommended). This ensures the correct Python version and isolates dependencies.
|
| 172 |
|
| 173 |
- **Linux / macOS**:
|
| 174 |
+
|
| 175 |
```bash
|
| 176 |
python3 -m venv venv
|
| 177 |
source venv/bin/activate
|
| 178 |
```
|
| 179 |
+
|
| 180 |
- **Windows**:
|
| 181 |
+
|
| 182 |
```bash
|
| 183 |
python -m venv venv
|
| 184 |
.\venv\Scripts\activate
|
| 185 |
```
|
| 186 |
|
| 187 |
3. Install dependencies
|
| 188 |
+
|
| 189 |
```bash
|
| 190 |
pip install -r requirements.txt
|
| 191 |
```
|
| 192 |
|
| 193 |
4. Install Tesseract OCR
|
| 194 |
+
|
| 195 |
- **Windows**: Download from [UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki)
|
| 196 |
- **Mac**: `brew install tesseract`
|
| 197 |
- **Linux**: `sudo apt install tesseract-ocr`
|
|
|
|
| 199 |
5. Tesseract configuration (auto-detected): the system automatically detects Tesseract on both Windows (registry and standard paths) and Linux (`/usr/bin/tesseract`). No manual configuration is required in `src/ocr.py`.
|
| 200 |
|
| 201 |
6. Run the web app
|
| 202 |
+
|
| 203 |
```bash
|
| 204 |
streamlit run app.py
|
| 205 |
```
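
The auto-detection described in step 5 could look roughly like the sketch below. It is not the code in `src/ocr.py`: the `find_tesseract` helper and the candidate list are hypothetical, and the Windows registry lookup is left out to keep the example portable.

```python
import os
import shutil

# Hypothetical candidate locations; the project's actual search list may differ.
CANDIDATE_PATHS = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/usr/bin/tesseract",
]


def find_tesseract(candidates=CANDIDATE_PATHS):
    """Return a path to the tesseract binary, or None if it cannot be found."""
    on_path = shutil.which("tesseract")  # checks PATH on every platform
    if on_path:
        return on_path
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found; install it or add it to PATH.")
else:
    # With pytesseract installed, this path would be assigned to
    # pytesseract.pytesseract.tesseract_cmd before running OCR.
    print(f"Using Tesseract at {tesseract_path}")
```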
|
| 206 |
|
| 207 |
### Training the Model (Optional)
|
| 208 |
+
|
| 209 |
To retrain the model from scratch using the provided scripts:
|
| 210 |
|
| 211 |
```bash
|
| 212 |
python scripts/train_combined.py
|
| 213 |
```
|
| 214 |
+
|
| 215 |
(Note: requires the SROIE dataset in `data/sroie`.)
|
| 216 |
|
| 217 |
## 💻 Usage
|
|
|
|
| 223 |
```bash
|
| 224 |
streamlit run app.py
|
| 225 |
```
|
| 226 |
+
|
| 227 |
- Upload an invoice image (PNG/JPG).
|
| 228 |
- Choose extraction method in sidebar:
|
| 229 |
+
- ML-Based (LayoutLMv3)
|
| 230 |
+
- Rule-Based (Regex)
|
| 231 |
- View JSON, download results.
|
| 232 |
|
| 233 |
### Command-Line Interface (CLI)
|
|
|
|
| 337 |
│ ├── ocr.py # Tesseract OCR integration
|
| 338 |
│ ├── extraction.py # Regex-based information extraction logic
|
| 339 |
│ ├── ml_extraction.py # ML-based extraction (LayoutLMv3)
|
| 340 |
+
│ ├── pipeline.py # Main orchestrator for the pipeline and CLI
|
| 341 |
+
│ ├── database.py # PostgreSQL connection (scaffolded)
|
| 342 |
+
│ ├── models.py # SQLModel tables for persistence (scaffolded)
|
| 343 |
+
│ └── repository.py # CRUD operations for invoices (scaffolded)
|
| 344 |
│
|
| 345 |
├── tests/
|
| 346 |
│ ├── test_preprocessing.py # Tests for the preprocessing module
|
|
|
|
| 355 |
## 🧠 Model & Training
|
| 356 |
|
| 357 |
- **Model**: `microsoft/layoutlmv3-base` (125M params)
|
| 358 |
+
- **Task**: Token Classification (NER) with 9 labels: `O, B/I-COMPANY, B/I-ADDRESS, B/I-DATE, B/I-TOTAL`
|
| 359 |
- **Dataset**: SROIE (ICDAR 2019, English retail receipts), mychen76/invoices-and-receipts_ocr_v1 (English)
|
| 360 |
- **Training**: RTX 3050 6GB, PyTorch 2.x, Transformers 4.x
|
| 361 |
- **Result**: Best F1 ≈ 0.922 on validation (epoch 5 saved)
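
To make the 9-label scheme concrete, the sketch below builds the label list from the four entity types. The ordering of the labels is an assumption; the training script may assign ids differently.

```python
ENTITIES = ["COMPANY", "ADDRESS", "DATE", "TOTAL"]

# "O" marks tokens outside any entity; each entity type gets a
# B- (begin) and an I- (inside) tag, giving 1 + 4 * 2 = 9 labels.
LABELS = ["O"] + [f"{prefix}-{name}" for name in ENTITIES for prefix in ("B", "I")]

id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}

print(len(LABELS))  # 9
print(LABELS)
```

Mappings like `id2label` and `label2id` are what Hugging Face token-classification models typically receive at load time, so predictions come back as readable tags rather than integer ids.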
|
|
|
|
| 369 |
- **OCR accuracy (clear images)**: High with Tesseract
|
| 370 |
- **Rule-based extraction**: Strong on simple retail receipts
|
| 371 |
- **ML-based extraction (SROIE-style)**:
|
| 372 |
+
- COMPANY / ADDRESS / DATE / TOTAL: High F1 on simple receipts
|
| 373 |
+
- Complex business invoices: Partial extraction unless further fine-tuned
|
| 374 |
|
| 375 |
## ⚠️ Known Limitations
|
| 376 |
|
|
|
|
| 387 |
- [ ] Table detection (Camelot/Tabula/DeepDeSRT) for line items
|
| 388 |
- [ ] PDF support (pdf2image) for multipage invoices
|
| 389 |
- [x] FastAPI backend + Docker
|
| 390 |
+
- [x] CI/CD pipeline (GitHub Actions → HuggingFace Spaces auto-deploy)
|
| 391 |
- [ ] Multilingual OCR (PaddleOCR) and multilingual fine‑tuning
|
| 392 |
- [ ] Confidence calibration and better validation rules
|
| 393 |
+
- [ ] Database persistence layer (PostgreSQL - scaffolded, ready for implementation)
|
| 394 |
|
| 395 |
## 🛠️ Tech Stack
|
| 396 |
|
| 397 |
+
| Component | Technology |
|
| 398 |
+
| ---------------- | ----------------------------------- |
|
| 399 |
+
| OCR | Tesseract 5.0+ |
|
| 400 |
+
| Image Processing | OpenCV, Pillow |
|
| 401 |
+
| ML/NLP | PyTorch 2.x, Transformers |
|
| 402 |
+
| Model | LayoutLMv3 (token class.) |
|
| 403 |
+
| Web Interface | Streamlit |
|
| 404 |
+
| Data Format | JSON |
|
| 405 |
+
| CI/CD | GitHub Actions → HuggingFace Spaces |
|
| 406 |
+
| Containerization | Docker |
|
| 407 |
|
| 408 |
## 📚 What I Learned
|
| 409 |
|
|
|
|
| 417 |
## 🤝 Contributing
|
| 418 |
|
| 419 |
Contributions welcome! Areas needing improvement:
|
| 420 |
+
|
| 421 |
- New patterns for regex extractor
|
| 422 |
- Better preprocessing for OCR
|
| 423 |
- New datasets and training configs
|
|
|
|
| 430 |
## 👨💻 Author
|
| 431 |
|
| 432 |
**Soumyajit Ghosh** - 3rd Year BTech Student
|
| 433 |
+
|
| 434 |
- Exploring AI/ML and practical applications
|
| 435 |
- [LinkedIn](https://www.linkedin.com/in/soumyajit-ghosh-tech) | [GitHub](https://github.com/GSoumyajit2005) | [Portfolio](#) (Coming Soon)
|
| 436 |
|
| 437 |
---
|
| 438 |
|
| 439 |
+
**Note**: "This is a learning project demonstrating an end-to-end ML pipeline. Not recommended for production use without further validation, retraining on diverse datasets, and security hardening."
|