---
title: Invoice Processor Ml
emoji: ⚡
colorFrom: indigo
colorTo: pink
sdk: docker
pinned: false
license: mit
short_description: Hybrid invoice extraction using LayoutLMv3 and Regex
---
# 📄 Smart Invoice Processor
A production-grade Hybrid Invoice Extraction System that combines the semantic understanding of LayoutLMv3 with the precision of Regex Heuristics. Designed for robustness, it features a Dual-Engine Architecture with automatic fallback logic to ensure 100% extraction coverage for business-critical fields (Invoice #, Date, Total) even when the AI model is uncertain.





---
## 🚀 Try it Live!
> **No installation required!** Try the full application instantly on Hugging Face Spaces:
>
> ### 👉 [**Launch Live Demo**](https://huggingface.co/spaces/GSoumyajit2005/invoice-processor-ml) 👈
>
> Upload any invoice image and watch the hybrid ML+Regex engine extract structured data in real-time.
---
## 🎯 Features
### 🧠 Core Intelligence
- **Hybrid Inference Engine:** Automatically triggers a Regex Fallback Engine if the ML model (LayoutLMv3) returns low confidence or missing critical fields (Invoice #, Date).
- **ML-Based Extraction:** Fine-tuned `LayoutLMv3` Transformer for semantic understanding of complex layouts (SROIE dataset).
- **Rule-Based Fallback:** Deterministic regex patterns ensure 100% coverage for standard fields when ML is uncertain.
### 🛡️ Robustness & Engineering
- **Defensive Data Handling:** Implemented coordinate clamping to prevent model crashes from negative OCR bounding boxes.
- **GPU-Accelerated OCR:** DocTR (Mindee) with automatic CUDA acceleration for faster inference in production.
- **Clean JSON Output:** Normalized schema handling nested entities, line items, and validation flags.
- **Defensive Persistence:** Optional PostgreSQL integration (local Docker or cloud Supabase) that automatically saves extracted data when credentials are present, but gracefully degrades (skips saving) in serverless/demo environments.
- **Async Database Saves:** Background thread processing ensures fast UI response (~5-7s) while database operations happen asynchronously.
- **Duplicate Prevention:** Implemented _Semantic Hashing_ (Vendor + Date + Total + ID) to automatically detect and prevent duplicate invoice entries.
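
The "Async Database Saves" bullet above can be illustrated with a small sketch. It assumes the save step is an ordinary callable that takes the extracted record; `save_in_background`, the logger name, and the callable signature are hypothetical and not taken from the project's code.

```python
import logging
import threading
from typing import Callable

logger = logging.getLogger("invoice_db")  # hypothetical logger name


def save_in_background(save_fn: Callable[[dict], None], record: dict) -> threading.Thread:
    """Run save_fn(record) on a daemon thread so the caller can return immediately."""

    def _worker() -> None:
        try:
            save_fn(record)
        except Exception:
            # A failed save is logged and never raised into the calling (UI) thread.
            logger.exception("Background save failed")

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    return thread
```

The caller gets the thread back, so a test or a CLI run can still `join()` it, while a UI can ignore it and render results right away.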
### 💻 Usability
- **Streamlit Web UI:** Interactive dashboard for real-time inference, visualization, and side-by-side comparison (ML vs. Regex).
- **PDF Preview & Overlay:** Visual preview of uploaded PDFs with ML-detected bounding boxes overlay for transparency.
- **CLI & Batch Processing:** Process single files or entire directories via command line with JSON export.
- **Auto-Validation:** Heuristic checks to validate that the extracted "Total Amount" matches the sum of line items.
> Note on Invoice Numbers: The SROIE dataset used for training does not include "Invoice Number" labels. To solve this, the system uses a Hybrid Fallback Mechanism: if the ML model (LayoutLMv3) returns null for the Invoice Number, the system automatically triggers a targeted Regex extraction to ensure this critical field is captured.
---
## 🛠️ Technical Deep Dive (Why this architecture?)
### 1. The "Safety Net" Fallback Logic
Standard ML models often fail on specific fields like "Invoice Number" if the layout is unseen. This system implements a **priority-based extraction**:
1. **Primary:** LayoutLMv3 predicts entity labels (context-aware).
2. **Fallback:** If `Invoice_No` or `Total` is null, the system executes a targeted Regex scan on the raw text.
_Result:_ Combines the generalization of AI with the determinism of Rules.
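
A minimal sketch of the priority-based extraction described above, assuming the ML engine returns a dict keyed like the JSON output. The two regex patterns and the function names are illustrative assumptions, not the project's actual patterns.

```python
import re

# Hypothetical patterns for illustration; the real extractor's patterns may differ.
INVOICE_NO_RE = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-/]+)", re.IGNORECASE
)
TOTAL_RE = re.compile(r"\btotal\b\s*[:\-]?\s*(\d+(?:\.\d{1,2})?)", re.IGNORECASE)


def regex_fallback(raw_text: str) -> dict:
    """Scan the raw OCR text for the two critical fields."""
    inv = INVOICE_NO_RE.search(raw_text)
    tot = TOTAL_RE.search(raw_text)
    return {
        "receipt_number": inv.group(1) if inv else None,
        "total_amount": float(tot.group(1)) if tot else None,
    }


def merge_with_fallback(ml_result: dict, raw_text: str) -> dict:
    """Keep ML values; fill only the critical fields the model left empty."""
    merged = dict(ml_result)
    missing = [k for k in ("receipt_number", "total_amount") if merged.get(k) is None]
    if missing:
        fallback = regex_fallback(raw_text)
        for key in missing:
            merged[key] = fallback[key]
    return merged
```

ML predictions always win when present; the regex scan only runs when a critical field is null, so the common path pays no extra cost.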
### 2. Robustness & Error Handling
- **OCR Noise:** Uses DocTR's deep learning-based text recognition for improved accuracy over traditional OCR.
- **Coordinate Normalization:** A custom `clamp()` function ensures all bounding boxes stay strictly within [0, 1000] to prevent Transformer index errors.
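
The coordinate handling above could look roughly like this. The README names a custom `clamp()` but does not show its signature, so the signature here and the `normalize_bbox` helper are assumptions for illustration.

```python
def clamp(value: int, low: int = 0, high: int = 1000) -> int:
    """Force value into the closed range [low, high]."""
    return max(low, min(high, value))


def normalize_bbox(bbox, page_width: float, page_height: float) -> list:
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 grid, then clamp it."""
    x0, y0, x1, y1 = bbox
    return [
        clamp(int(1000 * x0 / page_width)),
        clamp(int(1000 * y0 / page_height)),
        clamp(int(1000 * x1 / page_width)),
        clamp(int(1000 * y1 / page_height)),
    ]


# A box that spills past the page on two sides is pulled back inside the grid.
print(normalize_bbox((-5, 10, 250, 1200), page_width=500, page_height=1000))
# [0, 10, 500, 1000]
```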
### 3. Dual-Engine Architecture
The system implements a **Dual-Engine Architecture** with automatic fallback logic:
1. **Primary Engine:** LayoutLMv3 predicts entity labels (context-aware).
2. **Fallback Engine:** If `Invoice_No` or `Total` is null, the system executes a targeted Regex scan on the raw text.
### 4. Clean JSON Output
The system outputs a clean JSON with the following fields:
- `receipt_number`: The invoice number (extracted by LayoutLMv3 or Regex).
- `date`: The invoice date (extracted by LayoutLMv3 or Regex).
- `bill_to`: The bill-to information (extracted by LayoutLMv3 or Regex).
- `items`: The list of items (extracted by LayoutLMv3 or Regex).
- `total_amount`: The total amount (extracted by LayoutLMv3 or Regex).
- `extraction_confidence`: The confidence of the extraction (0-100).
- `validation_passed`: Whether the validation passed (true/false).
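
As a sketch of how a flag like `validation_passed` might be derived, the check below compares the total against the sum of line items. The `amount` key on each item, the tolerance, and the rule that an empty item list passes are assumptions. The last one mirrors the rule-based example output in this README, where `items` is empty and `validation_passed` is true.

```python
from decimal import Decimal
from typing import Optional


def validate_total(items: list, total_amount: Optional[float], tolerance: str = "0.01") -> bool:
    """Return True when the stated total agrees with the line items."""
    if total_amount is None:
        return False
    if not items:
        # Nothing to cross-check against, so the total is accepted as-is (assumption).
        return True
    # Decimal(str(...)) avoids binary float noise such as 0.1 + 0.2 != 0.3.
    line_sum = sum(Decimal(str(item["amount"])) for item in items)
    return abs(line_sum - Decimal(str(total_amount))) <= Decimal(tolerance)
```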
### 5. Defensive Database Architecture
To support both local development (with full persistence) and lightweight cloud demos (without databases), the system uses a **"Soft Fail" Persistence Layer**:
1. **Connection Check:** On startup, the system checks for PostgreSQL credentials. If missing, the database engine is disabled.
2. **Repository Guard:** All CRUD operations check for an active session. If the database is disabled, save operations are skipped silently without crashing the pipeline.
3. **Semantic Hashing:** Before saving, a content-based hash is generated to ensure idempotency.
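
The three steps above can be sketched together as follows. This is not the project's `database.py` or `repository.py`: the `DATABASE_URL` variable name, the SHA-256 choice, the field normalization, and the in-memory `seen_hashes` set (standing in for a real table) are all assumptions.

```python
import hashlib
import os
from typing import Mapping, Optional


def database_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Connection check: persistence is on only when credentials are present."""
    env = os.environ if env is None else env
    return bool(env.get("DATABASE_URL"))  # hypothetical variable name


def semantic_hash(vendor, date, total, invoice_id) -> str:
    """Content-based key built from Vendor + Date + Total + ID."""
    parts = [vendor, date, total, invoice_id]
    canonical = "|".join("" if p is None else str(p).strip().lower() for p in parts)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def save_invoice(record: dict, seen_hashes: set, env: Optional[Mapping[str, str]] = None) -> str:
    """Repository guard: skip quietly when disabled, refuse duplicates otherwise."""
    if not database_enabled(env):
        return "skipped"
    digest = semantic_hash(
        record.get("vendor"),
        record.get("date"),
        record.get("total_amount"),
        record.get("receipt_number"),
    )
    if digest in seen_hashes:
        return "duplicate"
    seen_hashes.add(digest)  # stand-in for the real INSERT
    return "saved"
```

Returning a status string instead of raising keeps the extraction pipeline independent of whether a database is configured.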
---
## 📊 Demo
### Web Interface

_Clean upload → extract flow with method selector (ML vs Regex)._
### Successful Extraction (ML-based)

_Fields extracted with LayoutLMv3._
### Format Detection (simulated)

_UI shows simple format hints and confidence._
### Example JSON (Rule-based)
```json
{
  "receipt_number": "PEGIV-1030765",
  "date": "15/01/2019",
  "bill_to": {
    "name": "THE PEAK QUARRY WORKS",
    "email": null
  },
  "items": [],
  "total_amount": 193.0,
  "extraction_confidence": 100,
  "validation_passed": true,
  "vendor": "OJC MARKETING SDN BHD",
  "address": "NO JALAN BAYU 4, BANDAR SERI ALAM, 81750 MASAI, JOHOR"
}
```
### Example JSON (ML-based)
```json
{
  "receipt_number": null,
  "date": "15/01/2019",
  "bill_to": null,
  "items": [],
  "total_amount": 193.0,
  "vendor": "OJC MARKETING SDN BHD",
  "address": "NO JALAN BAYU 4, BANDAR SERI ALAM, 81750 MASAI, JOHOR",
  "raw_text": "…",
  "raw_ocr_words": ["…"],
  "raw_predictions": {
    "DATE": {"text": "15/01/2019", "bbox": [[…]]},
    "TOTAL": {"text": "193.00", "bbox": [[…]]},
    "COMPANY": {"text": "OJC MARKETING SDN BHD", "bbox": [[…]]},
    "ADDRESS": {"text": "…", "bbox": [[…]]}
  }
}
```
## 🚀 Quick Start
### Prerequisites
- Python 3.10+
- Conda / Miniforge (recommended)
- NVIDIA GPU with CUDA (strongly recommended for usable performance)

> ⚠️ CPU-only execution is supported but significantly slower (5–10s per invoice) and intended only for testing.
### Installation (Conda – Recommended)
1. Clone the repository:
```bash
git clone https://github.com/GSoumyajit2005/invoice-processor-ml
cd invoice-processor-ml
```
2. Create and activate the Conda environment:
```bash
conda env create -f environment.yml
conda activate invoice-ml
```
3. Verify CUDA availability (recommended):
```bash
python - <<EOF
import torch
print(torch.cuda.is_available())
EOF
```
4. Run the web app:
```bash
streamlit run app.py
```
> Note: `requirements.txt` is consumed internally by `environment.yml`.
> Do not install it manually with pip.
### Training the Model (Optional)
To retrain the model from scratch using the provided scripts:
```bash
python scripts/train_combined.py
```
> Note: Requires the SROIE dataset in `data/sroie`.
### API Usage (Optional)
To run the API server:
```bash
python src/api.py
```
The API provides endpoints for processing invoices and extracting information.
### Running with Database (Optional)
To enable data persistence, run the included Docker Compose file to spin up PostgreSQL:
```bash
docker-compose up -d
```
The application will automatically detect the database and start saving invoices.
## 💻 Usage
### Web Interface (Recommended)
The easiest way to use the processor is via the web interface.
```bash
streamlit run app.py
```
- Upload an invoice image (PNG/JPG).
- Choose the extraction method in the sidebar:
  - ML-Based (LayoutLMv3)
  - Rule-Based (Regex)
- View the JSON and download the results.
### Command-Line Interface (CLI)
You can also process invoices directly from the command line.
#### 1. Processing a Single Invoice
Either command below processes the provided sample invoice with the chosen method and prints the results to the console. With `--save`, the JSON is also written to the `outputs/` folder.
```bash
python src/pipeline.py data/samples/sample_invoice.jpg --save --method ml
# or
python src/pipeline.py data/samples/sample_invoice.jpg --save --method rules
```
#### 2. Batch Processing a Folder
The CLI can process an entire folder of images at once.
First, place your own invoice images (e.g., `my_invoice1.jpg`, `my_invoice2.png`) into the `data/raw/` folder.
Then, run the following command. It will process all images in `data/raw/`. Saved files are written to `outputs/{stem}_{method}.json`.
```bash
python src/pipeline.py data/raw --save --method ml
```
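
For scripts that need to locate the saved results, the `outputs/{stem}_{method}.json` naming rule can be reproduced with `pathlib`. The helper name `output_path` and its `out_dir` parameter are hypothetical; only the naming pattern comes from the text above.

```python
from pathlib import Path


def output_path(image_path: str, method: str, out_dir: str = "outputs") -> Path:
    """Build outputs/{stem}_{method}.json for a given input image."""
    return Path(out_dir) / f"{Path(image_path).stem}_{method}.json"


print(output_path("data/raw/my_invoice1.jpg", "ml").as_posix())
# outputs/my_invoice1_ml.json
```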
### Python API
You can integrate the pipeline directly into your own Python scripts.
```python
import json

from src.pipeline import process_invoice

# Run the ML engine on the bundled sample and pretty-print the extracted fields.
result = process_invoice('data/samples/sample_invoice.jpg', method='ml')
print(json.dumps(result, indent=2))
```
## 🏗️ Architecture
```
                ┌────────────────┐
                │  Upload Image  │
                └───────┬────────┘
                        │
                        ▼
               ┌────────────────────┐
               │   Preprocessing    │  (OpenCV grayscale/denoise)
               └────────┬───────────┘
                        │
                        ▼
                ┌───────────────┐
                │      OCR      │  (DocTR)
                └───────┬───────┘
                        │
         ┌──────────────┴──────────────┐
         │                             │
         ▼                             ▼
┌──────────────────┐       ┌────────────────────────┐
│  Rule-based IE   │       │   ML-based IE (NER)    │
│  (regex, heur.)  │       │ LayoutLMv3 token-class │
└────────┬─────────┘       └───────────┬────────────┘
         │                             │
         └──────────────┬──────────────┘
                        ▼
               ┌──────────────────┐
               │   Post-process   │
               │ validate, scores │
               └────────┬─────────┘
                        │
         ┌──────────────┴──────────────┐
         │                             │
         ▼                             ▼
┌──────────────────┐        ┌────────────────────┐
│   JSON Output    │        │  DB (PostgreSQL)   │
└──────────────────┘        │  (Optional Save)   │
                            └────────────────────┘
```
## 📁 Project Structure
```
invoice-processor-ml/
│
├── data/
│ ├── raw/ # Input invoice images for processing
│ └── processed/ # (Reserved for future use)
│
├── data/samples/
│ └── sample_invoice.jpg # Public sample for quick testing
│
├── docs/
│ └── screenshots/ # UI Screenshots for the README demo
│
├── models/
│ └── layoutlmv3-doctr-trained/ # Fine-tuned model (trained with DocTR OCR)
│
├── outputs/ # Default folder for saved JSON results
│
├── scripts/ # Training and analysis scripts
│ ├── eval_new_dataset.py # Evaluation scripts
│ ├── explore_new_dataset.py # Dataset exploration tools
│ ├── prepare_doctr_data.py # DocTR data alignment for training
│ ├── train_combined.py # Main training loop (SROIE + Custom Data)
│ └── train_layoutlm.py # LayoutLMv3 fine-tuning script
│
├── src/
│ ├── api.py # FastAPI REST endpoint for API access
│ ├── data_loader.py # Unified data loader for training
│ ├── database.py # Database connection with environment-aware 'soft fail' check
│ ├── extraction.py # Regex-based information extraction logic
│ ├── ml_extraction.py # ML-based extraction (LayoutLMv3 + DocTR)
│ ├── models.py # SQLModel tables (Invoice, LineItem) with schema validation
│ ├── pdf_utils.py # PDF text extraction and image conversion
│ ├── pipeline.py # Main orchestrator for the pipeline and CLI
│ ├── preprocessing.py # Image preprocessing functions (grayscale, denoise)
│ ├── repository.py # CRUD operations with session safety handling
│ ├── schema.py # Pydantic models for API response validation
│ ├── sroie_loader.py # SROIE dataset loading logic
│ └── utils.py # Utility functions (semantic hashing, etc.)
│
├── tests/
│ ├── test_extraction.py # Tests for regex extraction module
│ ├── test_full_pipeline.py # Full end-to-end integration tests
│ ├── test_pipeline.py # Pipeline process tests
│ └── test_preprocessing.py # Tests for the preprocessing module
│
├── app.py # Streamlit web interface
├── requirements.txt # Python dependencies
├── environment.yml # Conda environment configuration
├── docker-compose.yml # Docker Compose configuration for PostgreSQL
├── Dockerfile # Dockerfile for building the application container
├── .gitignore # Git ignore file
└── README.md # You are Here!
```
## 🧠 Model & Training
- **Model**: `microsoft/layoutlmv3-base` (125M params)
- **Task**: Token Classification (NER) with 9 labels: `O, B/I-COMPANY, B/I-ADDRESS, B/I-DATE, B/I-TOTAL`
- **Dataset**: SROIE (ICDAR 2019, English retail receipts), mychen76/invoices-and-receipts_ocr_v1 (English)
- **Training**: RTX 3050 6GB, PyTorch 2.x, Transformers 4.x
- **Result**: F1 Score ≈ 0.83 (Real-world performance on DocTR-aligned validation set)
- **Training scripts (local)**:
  - `scripts/train_combined.py` (data prep, training loop with validation + model save)
  - Model saved to: `models/layoutlmv3-doctr-trained/`
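
The 9-label scheme listed above (one `O` tag plus `B-`/`I-` tags for four entity types) can be built programmatically. The label ordering and the helper name are assumptions. Maps of this shape are what Transformers token-classification configs accept as `id2label` and `label2id`.

```python
ENTITY_TYPES = ["COMPANY", "ADDRESS", "DATE", "TOTAL"]


def build_label_maps(entity_types):
    """Return (labels, id2label, label2id) for a BIO tagging scheme."""
    labels = ["O"]
    for entity in entity_types:
        labels += [f"B-{entity}", f"I-{entity}"]
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return labels, id2label, label2id


labels, id2label, label2id = build_label_maps(ENTITY_TYPES)
print(len(labels))  # 9
```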
## 📈 Performance
- **OCR Precision**: State-of-the-art hierarchical detection using **DocTR (ResNet-50)**. Outperforms Tesseract on complex/noisy layouts.
- **ML-based Extraction**:
  - **Accuracy**: ~83% F1 Score on SROIE + custom invoices
  - **Speed**:
    - **GPU (recommended)**: <1s per invoice
    - **CPU (fallback)**: ~5–7s per invoice

> ⚠️ CPU-only execution is supported for testing and experimentation but results in significantly higher latency due to the heavy OCR and layout-aware models.
## ⚠️ Known Limitations
1. **Layout Sensitivity**: The ML model was fine‑tuned on SROIE (retail receipts) and mychen76/invoices-and-receipts_ocr_v1 (English). Professional multi-column invoices may underperform until you fine‑tune on more diverse datasets.
2. **Invoice Number**: The SROIE dataset lacks invoice number labels. The system works around this with the Hybrid Fallback Engine, which runs a targeted Regex extraction for the invoice number whenever the ML model returns nothing for that field.
3. **Line Items/Tables**: The model is not yet trained for table extraction. The rule-based engine handles simple totals only; table extraction is planned (see Future Enhancements).
4. **Inference Latency**: CPU execution is significantly slower due to heavy OCR and layout-aware models.
## 🔮 Future Enhancements
- [x] Add and fine‑tune on mychen76/invoices-and-receipts_ocr_v1 (English) for broader invoice formats
- [ ] (Optional) Add FATURA (table-focused) for line-item extraction
- [ ] Sliding-window chunking for >512 token documents (to avoid truncation)
- [ ] Table detection (Camelot/Tabula/DeepDeSRT) for line items
- [x] PDF support (pdf2image) for multipage invoices
- [x] FastAPI backend + Docker
- [x] CI/CD pipeline (GitHub Actions → HuggingFace Spaces auto-deploy)
- [ ] Multilingual OCR (PaddleOCR) and multilingual fine‑tuning
- [ ] Confidence calibration and better validation rules
- [x] Database persistence layer (PostgreSQL with SQLModel & Redundancy checks)
## 🛠️ Tech Stack
| Component | Technology |
| ---------------- | ----------------------------------- |
| OCR | DocTR (Mindee) |
| Image Processing | OpenCV, Pillow |
| ML/NLP | PyTorch 2.x, Transformers |
| Model | LayoutLMv3 (token class.) |
| Web Interface | Streamlit |
| Data Format | JSON |
| CI/CD | GitHub Actions → HuggingFace Spaces |
| Database         | PostgreSQL, SQLModel                |
| Containerization | Docker & Docker Compose             |
## 📚 What I Learned
- OCR challenges (confusable characters, confidence-based filtering)
- Layout-aware NER with LayoutLMv3 (text + bbox + pixels)
- Data normalization (bbox to 0–1000 scale)
- End-to-end pipelines (UI + CLI + JSON output)
- When regex is enough vs when ML is needed
- Evaluation (seqeval F1 for NER)
## 🤝 Contributing
Contributions welcome! Areas needing improvement:
- New patterns for regex extractor
- Better preprocessing for OCR
- New datasets and training configs
- Tests and CI
## 📝 License
MIT License - See LICENSE file for details
## 👨‍💻 Author
**Soumyajit Ghosh** - 3rd Year BTech Student
- Exploring AI/ML and practical applications
- [LinkedIn](https://www.linkedin.com/in/soumyajit-ghosh-tech) | [GitHub](https://github.com/GSoumyajit2005) | [Portfolio](https://soumyajitghosh.vercel.app)
---
**Note**: "This is a learning project demonstrating an end-to-end ML pipeline. Not recommended for production use without further validation, retraining on diverse datasets, and security hardening."