GSoumyajit2005 committed on
Commit 42e1c04 · 1 Parent(s): 566dc81

feat: LayoutLMv3 integration, Streamlit UI toggle, README refresh, .gitignore

Files changed (6)
  1. .gitignore +21 -1
  2. README.md +158 -112
  3. app.py +24 -6
  4. requirements.txt +0 -0
  5. src/ml_extraction.py +176 -0
  6. src/pipeline.py +45 -20
.gitignore CHANGED
@@ -17,10 +17,18 @@ credentials.json
 *.swp
 *.swo
 
+# Notebooks / caches / logs
+.ipynb_checkpoints/
+.pytest_cache/
+*.log
+logs/
+.cache/
+
 # OS
 .DS_Store
 Thumbs.db
 ehthumbs.db
+*.code-workspace
 Desktop.ini
 
 # Streamlit temp folder
@@ -47,4 +55,16 @@ data/processed/*
 !data/processed/.gitkeep
 
 !requirements.txt
-!README.md
+!README.md
+
+datasets/
+checkpoints/
+lightning_logs/
+wandb/
+mlruns/
+
+# Ignore all files in the models directory
+models/*
+!models/.gitkeep
+!models/README.md
README.md CHANGED
README.md CHANGED
@@ -1,37 +1,45 @@
 # 📄 Smart Invoice Processor
 
-An end-to-end invoice processing system that automatically extracts structured data from scanned invoices and receipts using OCR and pattern recognition.
 
 ![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)
 ![Streamlit](https://img.shields.io/badge/Streamlit-1.51+-red.svg)
 ![Tesseract](https://img.shields.io/badge/Tesseract-5.0+-green.svg)
 
 ## 🎯 Features
 
-- ✅ **Automatic Text Extraction** - OCR using Tesseract
-- ✅ **Structured Data Output** - JSON format with all key fields
-- ✅ **OCR Error Correction** - Fixes common character recognition mistakes
-- ✅ **Confidence Scoring** - Reports extraction reliability
-- ✅ **Format Detection** - Identifies invoice template type
-- ✅ **Batch Processing** - Handle multiple invoices at once
-- ✅ **Web Interface** - User-friendly drag-and-drop UI
-- ✅ **Validation** - Automatic data consistency checks
 
 ## 📊 Demo
 
 ### Web Interface
 ![Homepage](docs/screenshots/homepage.png)
-*Clean, user-friendly interface for invoice upload*
 
-### Successful Extraction (100% Confidence)
 ![Success Result](docs/screenshots/success_result.png)
-*All fields extracted correctly from supported format*
 
-### Format Detection
 ![Format Detection](docs/screenshots/format_detection.png)
-*System identifies invoice type and explains confidence score*
 
-### Extracted Data
 ```json
 {
   "receipt_number": "PEGIV-1030765",
@@ -40,17 +48,32 @@ An end-to-end invoice processing system that automatically extracts structured d
     "name": "THE PEAK QUARRY WORKS",
     "email": null
   },
-  "items": [
-    {
-      "description": "SR",
-      "quantity": 111,
-      "unit_price": 1193.0,
-      "total": 193.0
-    }
-  ],
   "total_amount": 193.0,
   "extraction_confidence": 100,
-  "validation_passed": false
 }
 ```
 
@@ -59,12 +82,13 @@ An end-to-end invoice processing system that automatically extracts structured d
 ### Prerequisites
 - Python 3.10+
 - Tesseract OCR
 
 ### Installation
 
 1. Clone the repository
 ```bash
-git clone https://github.com/yourusername/invoice-processor-ml
 cd invoice-processor-ml
 ```
 
@@ -78,7 +102,12 @@ pip install -r requirements.txt
 - **Mac**: `brew install tesseract`
 - **Linux**: `sudo apt install tesseract-ocr`
 
-4. Run the web app
 ```bash
 streamlit run app.py
 ```
@@ -92,7 +121,11 @@ The easiest way to use the processor is via the web interface.
 ```bash
 streamlit run app.py
 ```
-Then, open your browser to the provided URL, upload an invoice image, and click "Extract Data".
 
 ### Command-Line Interface (CLI)
 
@@ -103,12 +136,9 @@ You can also process invoices directly from the command line.
 This command processes the provided sample invoice and prints the results to the console.
 
 ```bash
-python src/pipeline.py data/samples/sample_invoice.jpg
-```
-To save the output to a JSON file in the `outputs/` directory:
-
-```bash
-python src/pipeline.py data/samples/sample_invoice.jpg --save
 ```
 
 #### 2. Batch Processing a Folder
@@ -117,10 +147,10 @@ The CLI can process an entire folder of images at once.
 
 First, place your own invoice images (e.g., `my_invoice1.jpg`, `my_invoice2.png`) into the `data/raw/` folder.
 
-Then, run the following command. It will process all images in `data/raw/` and save a corresponding `.json` file for each in the `outputs/` directory.
 
 ```bash
-python src/pipeline.py data/raw --save
 ```
 
 ### Python API
@@ -131,47 +161,45 @@ You can integrate the pipeline directly into your own Python scripts.
 from src.pipeline import process_invoice
 import json
 
-# Define the path to your image
-image_path = 'data/samples/sample_invoice.jpg'
-
-# The function handles everything: loading, OCR, and extraction
-result_data = process_invoice(image_path)
-
-# Pretty-print the final structured JSON
-print(json.dumps(result_data, indent=2))
 ```
 
 ## 🏗️ Architecture
 
 ```
-┌─────────────┐
-│Upload Image │
-└──────┬──────┘
-       │
-       ▼
-┌──────────────┐
-│ OCR Engine   │ ← Tesseract
-└──────┬───────┘
-       │
-       ▼
-┌──────────────────┐
-│ Error Correction │ ← Fix J→1, O→0
-└──────┬───────────┘
-       │
-       ▼
-┌──────────────────┐
-│ Pattern Matching │ ← Regex extraction
-└──────┬───────────┘
-       │
-       ▼
-┌──────────────────┐
-│ Validation       │ ← Logic checks
-└──────┬───────────┘
-       │
-       ▼
-┌──────────────┐
-│ JSON Output  │
-└──────────────┘
 ```
 
 ## 📁 Project Structure
@@ -183,58 +211,74 @@ invoice-processor-ml/
 │   ├── raw/        # Input invoice images for processing
 │   └── processed/  # (Reserved for future use)
 ├── docs/
-│   └── screenshots/           # Screenshots for the README demo
-├── outputs/                   # Default folder for saved JSON results
 ├── src/
-│   ├── preprocessing.py       # Image preprocessing functions (grayscale, denoise)
-│   ├── ocr.py                 # Tesseract OCR integration
-│   ├── extraction.py          # Regex-based information extraction logic
-│   └── pipeline.py            # Main orchestrator for the pipeline and CLI
 ├── tests/
-│   ├── test_preprocessing.py  # Tests for the preprocessing module
-│   ├── test_ocr.py            # Tests for the OCR module
-│   └── test_pipeline.py       # End-to-end pipeline tests
 ├── app.py                     # Streamlit web interface
 ├── requirements.txt           # Python dependencies
 └── README.md                  # You are here!
 ```
 
-## 🎯 Extraction Accuracy
-
-| Invoice Format | Accuracy | Status |
-|----------------|----------|--------|
-| **Template A** (Retail Receipts) | 95-100% | ✅ Fully Supported |
-| **Template B** (Professional) | 10-20% | ⚠️ Limited Support |
-| Other formats | Variable | ❌ Not Optimized |
 
 ## 📈 Performance
 
-- **Processing Speed**: ~0.3-0.5 seconds per invoice
-- **OCR Accuracy**: 94%+ character accuracy on clear images
-- **Field Extraction**: 100% on supported formats
 
 ## ⚠️ Known Limitations
 
-1. **Format Dependency**: Currently optimized for retail receipt format (Template A)
-2. **Image Quality**: Requires clear, well-lit images for best results
-3. **Pattern-Based**: Uses regex patterns, not ML (limited flexibility)
-4. **Language**: English only
 
 ## 🔮 Future Enhancements
 
-- [ ] Add ML-based extraction (LayoutLM) for multi-format support
-- [ ] Support for handwritten invoices
-- [ ] Multi-language OCR
-- [ ] Table detection for complex line items
-- [ ] PDF support
-- [ ] Cloud deployment (AWS/GCP)
-- [ ] API endpoints (FastAPI)
 
 ## 🛠️ Tech Stack
 
@@ -242,25 +286,27 @@ invoice-processor-ml/
 |-----------|------------|
 | OCR | Tesseract 5.0+ |
 | Image Processing | OpenCV, Pillow |
-| Pattern Matching | Python Regex |
 | Web Interface | Streamlit |
 | Data Format | JSON |
 
 ## 📚 What I Learned
 
-- **OCR challenges**: Character confusion (1/I/l/J), image quality dependency
-- **Real-world ML**: Handling graceful degradation for unsupported formats
-- **Pipeline design**: Building robust multi-stage processing systems
-- **Validation importance**: Can't trust ML outputs without verification
-- **Trade-offs**: Rule-based vs. ML-based approaches
 
 ## 🤝 Contributing
 
 Contributions welcome! Areas needing improvement:
-- Additional invoice format patterns
-- Better image preprocessing
-- ML model integration
-- Test coverage
 
 ## 📝 License
 
@@ -270,8 +316,8 @@ MIT License - See LICENSE file for details
 
 **Soumyajit Ghosh** - 3rd Year BTech Student
 - Exploring AI/ML and practical applications
-- [LinkedIn](https://www.linkedin.com/in/soumyajit-ghosh-49a5b02b2?utm_source=share&utm_campaign) | [GitHub](https://github.com/GSoumyajit2005) | [Portfolio](#)
 
 ---
 
-**Note**: This is a learning project demonstrating end-to-end ML pipeline development. Not recommended for production use without additional validation and security measures.
 # 📄 Smart Invoice Processor
 
+End-to-end invoice/receipt processing with OCR plus rule-based extraction and a fine-tuned LayoutLMv3 model. Upload an image or run the CLI to get clean, structured JSON (vendor, date, totals, address, etc.).
 
 ![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)
 ![Streamlit](https://img.shields.io/badge/Streamlit-1.51+-red.svg)
 ![Tesseract](https://img.shields.io/badge/Tesseract-5.0+-green.svg)
+![Transformers](https://img.shields.io/badge/Transformers-4.x-purple.svg)
+![PyTorch](https://img.shields.io/badge/PyTorch-2.x-orange.svg)
+
+---
 
 ## 🎯 Features
 
+- ✅ OCR using Tesseract (configurable, fast, multi-platform)
+- ✅ Rule-based extraction (regex baselines)
+- ✅ ML-based extraction (LayoutLMv3 fine-tuned on SROIE) for robust field detection
+- ✅ Clean JSON output (date, total, vendor, address, receipt number*)
+- ✅ Confidence scoring and simple validation (e.g., the total is checked against other amounts)
+- ✅ Streamlit web UI with a method toggle (ML vs. Regex)
+- ✅ CLI for single/batch processing with saving to JSON
+- ✅ Tests for preprocessing/OCR/pipeline
+
+> *SROIE does not include invoice/receipt number labels, so the ML model won't output one unless you add labeled data. The rule-based extractor can still provide it when formats allow.
+
+---
 
 ## 📊 Demo
 
 ### Web Interface
 ![Homepage](docs/screenshots/homepage.png)
+*Clean upload-and-extract flow with a method selector (ML vs. Regex).*
 
+### Successful Extraction (ML-based)
 ![Success Result](docs/screenshots/success_result.png)
+*Fields extracted with LayoutLMv3.*
 
+### Format Detection (simulated)
 ![Format Detection](docs/screenshots/format_detection.png)
+*The UI shows simple format hints and a confidence score.*
 
+### Example JSON (Rule-based)
 ```json
 {
   "receipt_number": "PEGIV-1030765",
     "name": "THE PEAK QUARRY WORKS",
     "email": null
   },
+  "items": [],
   "total_amount": 193.0,
   "extraction_confidence": 100,
+  "validation_passed": true,
+  "vendor": "OJC MARKETING SDN BHD",
+  "address": "NO JALAN BAYU 4, BANDAR SERI ALAM, 81750 MASAI, JOHOR"
+}
+```
+
+### Example JSON (ML-based)
+```json
+{
+  "receipt_number": null,
+  "date": "15/01/2019",
+  "bill_to": null,
+  "items": [],
+  "total_amount": 193.0,
+  "vendor": "OJC MARKETING SDN BHD",
+  "address": "NO JALAN BAYU 4, BANDAR SERI ALAM, 81750 MASAI, JOHOR",
+  "raw_text": "…",
+  "raw_ocr_words": ["…"],
+  "raw_predictions": {
+    "DATE": {"text": "15/01/2019", "bbox": [[…]]},
+    "TOTAL": {"text": "193.00", "bbox": [[…]]},
+    "COMPANY": {"text": "OJC MARKETING SDN BHD", "bbox": [[…]]},
+    "ADDRESS": {"text": "…", "bbox": [[…]]}
+  }
 }
 ```
 
 ### Prerequisites
 - Python 3.10+
 - Tesseract OCR
+- (Optional) CUDA-capable GPU for training/inference speed
 
 ### Installation
 
 1. Clone the repository
 ```bash
+git clone https://github.com/GSoumyajit2005/invoice-processor-ml
 cd invoice-processor-ml
 ```
 
 - **Mac**: `brew install tesseract`
 - **Linux**: `sudo apt install tesseract-ocr`
 
+4. (Optional, Windows) Set the Tesseract path in `src/ocr.py` if needed:
+```python
+pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
+```
+
+5. Run the web app
 ```bash
 streamlit run app.py
 ```
 
 ```bash
 streamlit run app.py
 ```
+- Upload an invoice image (PNG/JPG).
+- Choose the extraction method in the sidebar:
+  - ML-Based (LayoutLMv3)
+  - Rule-Based (Regex)
+- View the JSON and download the results.
 
 ### Command-Line Interface (CLI)
 
 This command processes the provided sample invoice and prints the results to the console.
 
 ```bash
+python src/pipeline.py data/samples/sample_invoice.jpg --save --method ml
+# or
+python src/pipeline.py data/samples/sample_invoice.jpg --save --method rules
 ```
 
 #### 2. Batch Processing a Folder
 
 First, place your own invoice images (e.g., `my_invoice1.jpg`, `my_invoice2.png`) into the `data/raw/` folder.
 
+Then, run the following command. It will process all images in `data/raw/`. Saved files are written to `outputs/{stem}_{method}.json`.
 
 ```bash
+python src/pipeline.py data/raw --save --method ml
 ```
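The `outputs/{stem}_{method}.json` naming convention above can be expressed with `pathlib`; a minimal standalone sketch (`output_path_for` is a hypothetical helper, not a function in the repo):

```python
from pathlib import Path

def output_path_for(image_path: str, method: str, output_dir: str = 'outputs') -> Path:
    # Build outputs/{stem}_{method}.json, e.g. outputs/my_invoice1_ml.json
    return Path(output_dir) / f"{Path(image_path).stem}_{method}.json"

print(output_path_for('data/raw/my_invoice1.jpg', 'ml').name)  # → my_invoice1_ml.json
```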
 
 ### Python API
 
 from src.pipeline import process_invoice
 import json
 
+result = process_invoice('data/samples/sample_invoice.jpg', method='ml')
+print(json.dumps(result, indent=2))
 ```
 
 ## 🏗️ Architecture
 
 ```
+        ┌────────────────┐
+        │  Upload Image  │
+        └───────┬────────┘
+                │
+                ▼
+        ┌────────────────────┐
+        │   Preprocessing    │  (OpenCV grayscale/denoise)
+        └────────┬───────────┘
+                 │
+                 ▼
+        ┌───────────────┐
+        │OCR (Tesseract)│
+        └───────┬───────┘
+                │
+        ┌───────┴──────────────────┐
+        │                          │
+        ▼                          ▼
+┌──────────────────┐    ┌────────────────────────┐
+│  Rule-based IE   │    │   ML-based IE (NER)    │
+│  (regex, heur.)  │    │ LayoutLMv3 token-class │
+└────────┬─────────┘    └───────────┬────────────┘
+         │                          │
+         └────────────┬─────────────┘
+                      ▼
+           ┌──────────────────┐
+           │   Post-process   │
+           │ validate, scores │
+           └────────┬─────────┘
+                    │
+                    ▼
+           ┌──────────────────┐
+           │   JSON Output    │
+           └──────────────────┘
 ```
 
 ## 📁 Project Structure
 
 │   ├── raw/                      # Input invoice images for processing
 │   └── processed/                # (Reserved for future use)
+│
+├── data/samples/
+│   └── sample_invoice.jpg        # Public sample for quick testing
+│
 ├── docs/
+│   └── screenshots/              # UI screenshots for the README demo
+│
+├── models/
+│   └── layoutlmv3-sroie-best/    # Fine-tuned model (created after training)
+│
+├── outputs/                      # Default folder for saved JSON results
 ├── src/
+│   ├── preprocessing.py          # Image preprocessing functions (grayscale, denoise)
+│   ├── ocr.py                    # Tesseract OCR integration
+│   ├── extraction.py             # Regex-based information extraction logic
+│   ├── ml_extraction.py          # ML-based extraction (LayoutLMv3)
+│   └── pipeline.py               # Main orchestrator for the pipeline and CLI
 ├── tests/
+│   ├── test_preprocessing.py     # Tests for the preprocessing module
+│   ├── test_ocr.py               # Tests for the OCR module
+│   └── test_pipeline.py          # End-to-end pipeline tests
 ├── app.py                        # Streamlit web interface
 ├── requirements.txt              # Python dependencies
 └── README.md                     # You are here!
 ```
 
 
245
+ ## 🧠 Model & Training
246
+
247
+ - **Model**: `microsoft/layoutlmv3-base` (125M params)
248
+ - **Task**: Token Classification (NER) with 9 labels: `O, B/I-COMPANY, B/I-ADDRESS, B/I-DATE, B/I-TOTAL`
249
+ - **Dataset**: SROIE (ICDAR 2019, English retail receipts)
250
+ - **Training**: RTX 3050 6GB, PyTorch 2.x, Transformers 4.x
251
+ - **Result**: Best F1 ≈ 0.922 on validation (epoch 5 saved)
252
 
253
+ - Training scripts(local):
254
+ - `train_layoutlm.py` (data prep, training loop with validation + model save)
255
+ - Model saved to: `models/layoutlmv3-sroie-best/`
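The 9-label BIO scheme above expands as follows (a sketch; the authoritative `id2label` mapping lives in the saved model's `config.json`, and the exact id order there may differ):

```python
# BIO tagging: one B-/I- pair per SROIE entity type, plus 'O' for everything else
entity_types = ["COMPANY", "ADDRESS", "DATE", "TOTAL"]
labels = ["O"] + [f"{prefix}-{etype}" for etype in entity_types for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}

print(len(labels))  # → 9
```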
 
 
 
 ## 📈 Performance
 
+- **OCR accuracy (clear images)**: high with Tesseract
+- **Rule-based extraction**: strong on simple retail receipts
+- **ML-based extraction (SROIE-style)**:
+  - COMPANY / ADDRESS / DATE / TOTAL: high F1 on simple receipts
+  - Complex business invoices: partial extraction unless further fine-tuned
 
265
  ## ⚠️ Known Limitations
266
 
267
+ 1. **Layout Sensitivity**: The ML model was fine‑tuned only on SROIE (retail receipts). Professional multi-column invoices may underperform until you fine‑tune on more diverse datasets.
268
+ 2. **Invoice Number (ML)**: SROIE lacks invoice number labels; the ML model won’t output it unless you add labeled data. The rule-based method can still recover it on many formats.
269
+ 3. **Line Items/Tables**: Not trained for table extraction yet. Rule-based supports simple totals; table extraction comes later.
270
+ 4. **OCR Variability**: Tesseract outputs can vary; preprocessing and thresholds can impact ML results.
271
 
272
  ## 🔮 Future Enhancements
273
 
274
+ - [ ] Add and fine‑tune on mychen76/invoices-and-receipts_ocr_v1 (English) for broader invoice formats
275
+ - [ ] (Optional) Add FATURA (table-focused) for line-item extraction
276
+ - [ ] Sliding-window chunking for >512 token documents (to avoid truncation)
277
+ - [ ] Table detection (Camelot/Tabula/DeepDeSRT) for line items
278
+ - [ ] PDF support (pdf2image) for multipage invoices
279
+ - [ ] FastAPI backend + Docker
280
+ - [ ] Multilingual OCR (PaddleOCR) and multilingual fine‑tuning
281
+ - [ ] Confidence calibration and better validation rules
282
 
283
  ## 🛠️ Tech Stack
284
 
 
286
  |-----------|------------|
287
  | OCR | Tesseract 5.0+ |
288
  | Image Processing | OpenCV, Pillow |
289
+ | ML/NLP | PyTorch 2.x, Transformers |
290
+ | Model | LayoutLMv3 (token class.) |
291
  | Web Interface | Streamlit |
292
  | Data Format | JSON |
293
 
 ## 📚 What I Learned
 
+- OCR challenges (confusable characters, confidence-based filtering)
+- Layout-aware NER with LayoutLMv3 (text + bounding boxes + pixels)
+- Data normalization (bounding boxes to the 0-1000 scale)
+- End-to-end pipelines (UI + CLI + JSON output)
+- When regex is enough vs. when ML is needed
+- Evaluation (seqeval F1 for NER)
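The bounding-box normalization mentioned above (pixel coordinates rescaled to the 0-1000 grid LayoutLMv3 expects) is simple per-box arithmetic; a minimal sketch:

```python
def normalize_box(box, width, height):
    # [x0, y0, x1, y1] in pixels → integers on a 0-1000 grid
    return [
        int(1000 * (box[0] / width)),
        int(1000 * (box[1] / height)),
        int(1000 * (box[2] / width)),
        int(1000 * (box[3] / height)),
    ]

print(normalize_box([100, 50, 300, 80], 1000, 800))  # → [100, 62, 300, 100]
```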
 
 ## 🤝 Contributing
 
 Contributions welcome! Areas needing improvement:
+- New patterns for the regex extractor
+- Better preprocessing for OCR
+- New datasets and training configs
+- Tests and CI
 
 ## 📝 License
 
 **Soumyajit Ghosh** - 3rd Year BTech Student
 - Exploring AI/ML and practical applications
+- [LinkedIn](https://www.linkedin.com/in/soumyajit-ghosh-49a5b02b2?utm_source=share&utm_campaign) | [GitHub](https://github.com/GSoumyajit2005) | Portfolio (coming soon)
 
 ---
 
+**Note**: This is a learning project demonstrating an end-to-end ML pipeline. Not recommended for production use without further validation, retraining on diverse datasets, and security hardening.
app.py CHANGED
@@ -114,6 +114,13 @@ with st.sidebar:
     st.session_state.processed_count = 0
 st.metric("Invoices Processed Today", st.session_state.processed_count)
 
 # Main content
 tab1, tab2, tab3 = st.tabs(["📤 Upload & Process", "📚 Sample Invoices", "ℹ️ How It Works"])
 
@@ -150,7 +157,12 @@ with tab1:
 
 # Step 1: Call YOUR full pipeline function
 st.write("✅ Calling `process_invoice`...")
-extracted_data = process_invoice(temp_path)
 
 # Step 2: Simulate format detection using the extracted data
 st.write("✅ Simulating format detection...")
@@ -206,12 +218,18 @@ with tab1:
 st.warning("⚠️ Validation Failed: Total amount could not be verified against other numbers.")
 
 # Key metrics display
 res_col1, res_col2, res_col3 = st.columns(3)
-res_col1.metric("Receipt Number", data.get('receipt_number') or "N/A")
-res_col2.metric("Date", data.get('date') or "N/A")
-res_col3.metric("Total Amount", f"${data.get('total_amount'):.2f}" if data.get('total_amount') is not None else "N/A")
-
-st.metric("Customer Name", data.get('bill_to', {}).get('name') if data.get('bill_to') else "N/A")
 
 # Line items table
 if data.get('items'):
     st.session_state.processed_count = 0
 st.metric("Invoices Processed Today", st.session_state.processed_count)
 
+st.header("⚙️ Configuration")
+extraction_method = st.selectbox(
+    "Choose Extraction Method:",
+    ('ML-Based (LayoutLMv3)', 'Rule-Based (Regex)'),
+    help="ML-Based is more robust but may miss fields not in its training data. Rule-Based is faster but more fragile."
+)
+
 # Main content
 tab1, tab2, tab3 = st.tabs(["📤 Upload & Process", "📚 Sample Invoices", "ℹ️ How It Works"])
 
 # Step 1: Call YOUR full pipeline function
 st.write("✅ Calling `process_invoice`...")
+# Map the user-friendly name from the dropdown to the actual method parameter
+method = 'ml' if extraction_method == 'ML-Based (LayoutLMv3)' else 'rules'
+st.write(f"⚙️ Using **{method.upper()}** extraction method...")
+
+# Call the pipeline with the selected method
+extracted_data = process_invoice(temp_path, method=method)
 
 # Step 2: Simulate format detection using the extracted data
 st.write("✅ Simulating format detection...")
 
 st.warning("⚠️ Validation Failed: Total amount could not be verified against other numbers.")
 
 # Key metrics display
+st.metric("🏢 Vendor", data.get('vendor') or "N/A")
+
 res_col1, res_col2, res_col3 = st.columns(3)
+res_col1.metric("📄 Receipt Number", data.get('receipt_number') or "N/A")
+res_col2.metric("📅 Date", data.get('date') or "N/A")
+res_col3.metric("💵 Total Amount", f"${data.get('total_amount'):.2f}" if data.get('total_amount') is not None else "N/A")
+
+# Use an expander for longer text fields like the address
+with st.expander("Show More Details"):
+    st.markdown(f"**👤 Bill To:** {data.get('bill_to', {}).get('name') if data.get('bill_to') else 'N/A'}")
+    st.markdown(f"**📍 Vendor Address:** {data.get('address') or 'N/A'}")
 
 # Line items table
 if data.get('items'):
requirements.txt CHANGED
Binary files a/requirements.txt and b/requirements.txt differ
 
src/ml_extraction.py ADDED
@@ -0,0 +1,176 @@
+# src/ml_extraction.py
+
+import torch
+from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
+from PIL import Image
+import pytesseract
+from typing import List, Dict, Any
+import re
+
+# --- CONFIGURATION ---
+# The local path where we expect to find/save the model
+LOCAL_MODEL_PATH = "./models/layoutlmv3-sroie-best"
+# The Hugging Face Hub ID for the model to download if not found locally
+HUB_MODEL_ID = "GSoumyajit2005/layoutlmv3-sroie-invoice-extraction"
+
+# --- Function to load the model ---
+def load_model_and_processor(model_path, hub_id):
+    """
+    Tries to load the model from a local path. If it fails,
+    it downloads it from the Hugging Face Hub.
+    """
+    try:
+        # Try loading from the local path first
+        print(f"Attempting to load model from local path: {model_path}...")
+        processor = LayoutLMv3Processor.from_pretrained(model_path)
+        model = LayoutLMv3ForTokenClassification.from_pretrained(model_path)
+        print("✅ Model loaded successfully from local path.")
+    except OSError:
+        # If that fails, download from the Hub
+        print(f"Model not found locally. Downloading from Hugging Face Hub: {hub_id}...")
+        from huggingface_hub import snapshot_download
+        # Download the model files and save them to the local path
+        snapshot_download(repo_id=hub_id, local_dir=model_path, local_dir_use_symlinks=False)
+        # Now load from the local path again
+        processor = LayoutLMv3Processor.from_pretrained(model_path)
+        model = LayoutLMv3ForTokenClassification.from_pretrained(model_path)
+        print("✅ Model downloaded and loaded successfully from the Hub.")
+
+    return model, processor
+
+# --- Load the model and processor only ONCE when the module is imported ---
+# Wrap the load so a download/load failure leaves the module importable
+# (otherwise the `else` branch below could never be reached).
+try:
+    MODEL, PROCESSOR = load_model_and_processor(LOCAL_MODEL_PATH, HUB_MODEL_ID)
+except Exception:
+    MODEL, PROCESSOR = None, None
+
+if MODEL and PROCESSOR:
+    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    MODEL.to(DEVICE)
+    MODEL.eval()
+    print(f"ML Model is ready on device: {DEVICE}")
+else:
+    DEVICE = None
+    print("❌ Could not load ML model.")
+
+# --- Helper Function to group entities ---
+def _process_predictions(words: List[str], unnormalized_boxes: List[List[int]], encoding, predictions: List[int], id2label: Dict[int, str]) -> Dict[str, Any]:
+    word_ids = encoding.word_ids(batch_index=0)
+
+    # Keep the first predicted label for each word (first sub-token wins)
+    word_level_preds = {}
+    for idx, word_id in enumerate(word_ids):
+        if word_id is not None:
+            label_id = predictions[idx]
+            if label_id != -100:
+                if word_id not in word_level_preds:
+                    word_level_preds[word_id] = id2label[label_id]
+
+    # Merge consecutive B-/I- tags of the same entity type into one entity
+    entities = {}
+    for word_idx, label in word_level_preds.items():
+        if label == 'O':
+            continue
+
+        entity_type = label[2:]
+        word = words[word_idx]
+
+        if label.startswith('B-'):
+            entities[entity_type] = {"text": word, "bbox": [unnormalized_boxes[word_idx]]}
+        elif label.startswith('I-') and entity_type in entities:
+            if word_idx > 0 and word_level_preds.get(word_idx - 1) in (f'B-{entity_type}', f'I-{entity_type}'):
+                entities[entity_type]['text'] += " " + word
+                entities[entity_type]['bbox'].append(unnormalized_boxes[word_idx])
+            else:
+                entities[entity_type] = {"text": word, "bbox": [unnormalized_boxes[word_idx]]}
+
+    # Clean up the final text field
+    for entity in entities.values():
+        entity['text'] = entity['text'].strip()
+
+    return entities
+
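A standalone sketch of the word-level BIO grouping `_process_predictions` performs (simplified: it takes per-word labels directly and merges text only, skipping boxes and sub-token alignment):

```python
from typing import Dict, List

def group_bio(words: List[str], labels: List[str]) -> Dict[str, str]:
    # Merge consecutive B-/I- labels of the same type into one entity string
    entities: Dict[str, str] = {}
    prev = 'O'
    for word, label in zip(words, labels):
        if label == 'O':
            prev = label
            continue
        entity_type = label[2:]
        # Continue the entity only if the previous word carried the same type
        if label.startswith('I-') and prev in (f'B-{entity_type}', f'I-{entity_type}'):
            entities[entity_type] += " " + word
        else:
            entities[entity_type] = word
        prev = label
    return entities

print(group_bio(
    ["OJC", "MARKETING", "SDN", "BHD", "TOTAL:", "193.00"],
    ["B-COMPANY", "I-COMPANY", "I-COMPANY", "I-COMPANY", "O", "B-TOTAL"],
))  # → {'COMPANY': 'OJC MARKETING SDN BHD', 'TOTAL': '193.00'}
```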
+# --- Main Function to be called from the pipeline ---
+def extract_ml_based(image_path: str) -> Dict[str, Any]:
+    """
+    Performs end-to-end ML-based extraction on a single image.
+
+    Args:
+        image_path: The path to the invoice image.
+
+    Returns:
+        A dictionary containing the extracted entities.
+    """
+    if not MODEL or not PROCESSOR:
+        raise RuntimeError("ML model is not loaded. Cannot perform extraction.")
+
+    # 1. Load Image
+    image = Image.open(image_path).convert("RGB")
+    width, height = image.size
+
+    # 2. Perform OCR
+    ocr_data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
+    n_boxes = len(ocr_data['level'])
+    words = []
+    unnormalized_boxes = []
+    for i in range(n_boxes):
+        if int(ocr_data['conf'][i]) > 60 and ocr_data['text'][i].strip() != '':
+            word = ocr_data['text'][i]
+            (x, y, w, h) = (ocr_data['left'][i], ocr_data['top'][i], ocr_data['width'][i], ocr_data['height'][i])
+            words.append(word)
+            unnormalized_boxes.append([x, y, x + w, y + h])
+
+    # 3. Normalize Boxes to the 0-1000 scale LayoutLMv3 expects
+    normalized_boxes = []
+    for box in unnormalized_boxes:
+        normalized_boxes.append([
+            int(1000 * (box[0] / width)),
+            int(1000 * (box[1] / height)),
+            int(1000 * (box[2] / width)),
+            int(1000 * (box[3] / height)),
+        ])
+
+
127
+ # 4. Process with LayoutLMv3 Processor
128
+ encoding = PROCESSOR(
129
+ image,
130
+ text=words,
131
+ boxes=normalized_boxes,
132
+ truncation=True,
133
+ max_length=512,
134
+ return_tensors="pt"
135
+ ).to(DEVICE)
136
+
137
+ # 5. Run Inference
138
+ with torch.no_grad():
139
+ outputs = MODEL(**encoding)
140
+
141
+ predictions = outputs.logits.argmax(-1).squeeze().tolist()
142
+
143
+ # 6. Post-process to get final entities
144
+ extracted_entities = _process_predictions(words, unnormalized_boxes, encoding, predictions, MODEL.config.id2label)
145
+
146
+ # 7. Format the output to be consistent with your rule-based output
147
+ # Format the output to be consistent with the desired UI structure
148
+ # Format the output to be a superset of all possible fields
149
+ final_output = {
150
+ # --- Standard UI Fields ---
151
+ "receipt_number": None, # SROIE doesn't train for this. Your regex model will provide it.
152
+ "date": extracted_entities.get("DATE", {}).get("text"),
153
+ "bill_to": None, # SROIE doesn't train for this. Your regex model will provide it.
154
+ "items": [], # SROIE doesn't train for line items.
155
+ "total_amount": None,
156
+
157
+ # --- Additional Fields from ML Model ---
158
+ "vendor": extracted_entities.get("COMPANY", {}).get("text"), # The ML model finds 'COMPANY'
159
+ "address": extracted_entities.get("ADDRESS", {}).get("text"),
160
+
161
+ # --- Debugging Info ---
162
+ "raw_text": " ".join(words),
163
+ "raw_ocr_words": words,
164
+ "raw_predictions": extracted_entities
165
+ }
166
+
167
+ # Safely extract and convert total
168
+ total_text = extracted_entities.get("TOTAL", {}).get("text")
169
+ if total_text:
170
+ try:
171
+ cleaned_total = re.sub(r'[^\d.]', '', total_text)
172
+ final_output["total_amount"] = float(cleaned_total)
173
+ except (ValueError, TypeError):
174
+ final_output["total_amount"] = None
175
+
176
+ return final_output
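The total-parsing step at the end of `extract_ml_based` can be tested in isolation (`parse_total` is a hypothetical standalone version of that logic):

```python
import re

def parse_total(total_text):
    # Strip currency symbols/commas, then convert; None if nothing numeric remains
    cleaned = re.sub(r'[^\d.]', '', total_text or '')
    try:
        return float(cleaned)
    except ValueError:
        return None

print(parse_total("RM 1,193.00"))  # → 1193.0
print(parse_total("TOTAL"))       # → None
```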
src/pipeline.py CHANGED
@@ -11,37 +11,61 @@ import json
 from preprocessing import load_image, convert_to_grayscale, remove_noise
 from ocr import extract_text
 from extraction import structure_output
 
-
-def process_invoice(image_path: str, save_results: bool = False, output_dir: str = 'outputs') -> Dict[str, Any]:
     """
-    Process an invoice image and extract structured information
     """
     if not Path(image_path).exists():
         raise FileNotFoundError(f"Image not found at path: {image_path}")
 
-    image = load_image(image_path)
 
-    try:
-        gray_image = convert_to_grayscale(image)
-        preprocessed_image = remove_noise(gray_image, kernel_size=3)
-    except Exception as e:
-        raise ValueError(f"Error during preprocessing: {e}")
-
-    text = extract_text(preprocessed_image, config='--psm 6')
-    structured_data = structure_output(text)
 
     if save_results:
         output_path = Path(output_dir)
         output_path.mkdir(parents=True, exist_ok=True)
-        json_path = output_path / (Path(image_path).stem + '.json')
         try:
-            with open(json_path, 'w', encoding='utf-8') as file:
-                json.dump(structured_data, file, indent=2, ensure_ascii=False)
-        except TypeError as e:
-            raise ValueError(f"Data not JSON-serializable: {e}")
-        except OSError as e:
-            raise IOError(f"Error saving results to {json_path}:\n {e}")
 
     return structured_data
 
@@ -89,6 +113,7 @@ Examples:
 parser.add_argument('path', help='Path to an invoice image or a folder of images')
 parser.add_argument('--save', action='store_true', help='Save results to JSON files')
 parser.add_argument('--output', default='outputs', help='Output directory for JSON files')
 
 args = parser.parse_args()
 
@@ -99,7 +124,7 @@ Examples:
 elif Path(args.path).is_file():
     # Corrected: Use args.path
     print(f"🔄 Processing: {args.path}")
-    result = process_invoice(args.path, save_results=args.save, output_dir=args.output)
 
     print("\n📊 Extracted Data:")
     print("=" * 60)
11
  from preprocessing import load_image, convert_to_grayscale, remove_noise
12
  from ocr import extract_text
13
  from extraction import structure_output
14
+ from ml_extraction import extract_ml_based
15
 
16
+ def process_invoice(image_path: str,
17
+ method: str = 'ml', # <-- New parameter: 'ml' or 'rules'
18
+ save_results: bool = False,
19
+ output_dir: str = 'outputs') -> Dict[str, Any]:
20
  """
21
+ Process an invoice image using either rule-based or ML-based extraction.
22
+
23
+ Args:
24
+ image_path: Path to the invoice image.
25
+ method: The extraction method to use ('ml' or 'rules'). Default is 'ml'.
26
+ save_results: Whether to save JSON results to a file.
27
+ output_dir: Directory to save results.
28
+
29
+ Returns:
30
+ A dictionary with the extracted invoice data.
31
  """
32
  if not Path(image_path).exists():
33
  raise FileNotFoundError(f"Image not found at path: {image_path}")
34
 
35
+ print(f"Processing with '{method}' method...")
36
 
37
+ if method == 'ml':
38
+ # --- ML-Based Extraction ---
39
+ try:
40
+ # The ml_extraction function handles everything internally
41
+ structured_data = extract_ml_based(image_path)
42
+ except Exception as e:
43
+ raise ValueError(f"Error during ML-based extraction: {e}")
44
+
45
+ elif method == 'rules':
46
+ # --- Rule-Based Extraction (Your original logic) ---
47
+ try:
48
+ image = load_image(image_path)
49
+ gray_image = convert_to_grayscale(image)
50
+ preprocessed_image = remove_noise(gray_image, kernel_size=3)
51
+ text = extract_text(preprocessed_image, config='--psm 6')
52
+ structured_data = structure_output(text) # Calls your old extraction.py
53
+ except Exception as e:
54
+ raise ValueError(f"Error during rule-based extraction: {e}")
55
+
56
+ else:
57
+ raise ValueError(f"Unknown extraction method: '{method}'. Choose 'ml' or 'rules'.")
58
 
59
+ # --- Saving Logic (remains the same) ---
60
  if save_results:
61
  output_path = Path(output_dir)
62
  output_path.mkdir(parents=True, exist_ok=True)
63
+ json_path = output_path / (Path(image_path).stem + f"_{method}.json") # Add method to filename
64
  try:
65
+ with open(json_path, 'w', encoding='utf-8') as f:
66
+ json.dump(structured_data, f, indent=2, ensure_ascii=False)
67
+ except Exception as e:
68
+ raise IOError(f"Error saving results to {json_path}: {e}")
 
 
69
 
70
  return structured_data
71
 
 
113
  parser.add_argument('path', help='Path to an invoice image or a folder of images')
114
  parser.add_argument('--save', action='store_true', help='Save results to JSON files')
115
  parser.add_argument('--output', default='outputs', help='Output directory for JSON files')
116
+ parser.add_argument('--method', default='ml', choices=['ml', 'rules'], help="Extraction method: 'ml' or 'rules'")
117
 
118
  args = parser.parse_args()
119
 
 
124
  elif Path(args.path).is_file():
125
  # Corrected: Use args.path
126
  print(f"🔄 Processing: {args.path}")
127
+ result = process_invoice(args.path, method=args.method, save_results=args.save, output_dir=args.output)
128
 
129
  print("\n📊 Extracted Data:")
130
  print("=" * 60)
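The `method` toggle in `process_invoice` is a plain string dispatch, and the new filename convention suffixes the method so ML and rule-based runs on the same image don't overwrite each other. A stripped-down sketch of both patterns (`fake` extractors and the `dispatch` name are placeholders, not the repo's real functions):

```python
from pathlib import Path


def dispatch(image_path, method='ml', extractors=None):
    """Pick an extractor by name, mirroring process_invoice's toggle."""
    extractors = extractors or {}
    if method not in extractors:
        raise ValueError(
            f"Unknown extraction method: '{method}'. Choose one of {sorted(extractors)}."
        )
    data = extractors[method](image_path)
    # Results are written as <stem>_<method>.json so runs don't collide
    json_name = Path(image_path).stem + f"_{method}.json"
    return data, json_name
```

From the CLI this corresponds to, for example, `python src/pipeline.py data/invoice.png --method rules --save`, which would produce `outputs/invoice_rules.json`.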