Commit 2a944a5
Parent: 097a95c

feat: PDF preview, database integration, and improved error handling
- Add PDF preview support using pdf2image
- Enable bounding box overlay visualization for PDFs
- Implement database persistence with SQLModel (Invoice, LineItem)
- Add InvoiceRepository with save and duplicate detection
- Improve DB status messages (show 'unavailable' once at startup)
- Show 'Demo Mode' toast only once per session
- Fix torch.load and transformers deprecation warnings
- Add conda environment.yml for reproducible setup
- Update README with conda installation instructions
- README.md +78 -29
- app.py +88 -39
- docker-compose.yml +1 -1
- environment.yml +31 -0
- requirements.txt +14 -11
- src/api.py +2 -4
- src/database.py +48 -16
- src/extraction.py +67 -13
- src/ml_extraction.py +5 -3
- src/models.py +34 -29
- src/pipeline.py +40 -6
- src/repository.py +66 -20
README.md
CHANGED

````diff
@@ -46,6 +46,8 @@ A production-grade Hybrid Invoice Extraction System that combines the semantic u
 - **Defensive Data Handling:** Implemented coordinate clamping to prevent model crashes from negative OCR bounding boxes.
 - **GPU-Accelerated OCR:** DocTR (Mindee) with automatic CUDA acceleration for faster inference in production.
 - **Clean JSON Output:** Normalized schema handling nested entities, line items, and validation flags.
+- **Defensive Persistence:** Optional PostgreSQL integration that automatically saves extracted data when credentials are present, but gracefully degrades (skips saving) in serverless/demo environments like Hugging Face Spaces.
+- **Duplicate Prevention:** Implemented *Semantic Hashing* (Vendor + Date + Total + ID) to automatically detect and prevent duplicate invoice entries.
 
 ### 💻 Usability
 
@@ -91,6 +93,14 @@ The system outputs a clean JSON with the following fields:
 - `extraction_confidence`: The confidence of the extraction (0-100).
 - `validation_passed`: Whether the validation passed (true/false).
 
+### 5. Defensive Database Architecture
+
+To support both local development (with full persistence) and lightweight cloud demos (without databases), the system uses a **"Soft Fail" Persistence Layer**:
+
+1. **Connection Check:** On startup, the system checks for PostgreSQL credentials. If missing, the database engine is disabled.
+2. **Repository Guard:** All CRUD operations check for an active session. If the database is disabled, save operations are skipped silently without crashing the pipeline.
+3. **Semantic Hashing:** Before saving, a content-based hash is generated to ensure idempotency.
+
 ---
 
 ## 📊 Demo
 
@@ -156,37 +166,35 @@ _UI shows simple format hints and confidence._
 ### Prerequisites
 
 - Python 3.10+
-
+- Conda / Miniforge (recommended)
+- NVIDIA GPU with CUDA (strongly recommended for usable performance)
+
+⚠️ CPU-only execution is supported but significantly slower
+(5–10s per invoice) and intended only for testing.
 
-### Installation
+### Installation (Conda – Recommended)
 
-1. Clone the repository
+1. Clone the repository:
 
 ```bash
 git clone https://github.com/GSoumyajit2005/invoice-processor-ml
 cd invoice-processor-ml
 ```
 
-2. Create and
-
-- **Linux / macOS**:
-
-```bash
-python3 -m venv venv
-source venv/bin/activate
-```
-
-- **Windows**:
+2. Create and activate the Conda environment:
 
 ```bash
-
-
+conda env create -f environment.yml
+conda activate invoice-ml
 ```
 
-3.
+3. Verify CUDA availability (recommended):
 
 ```bash
-
+python - <<EOF
+import torch
+print(torch.cuda.is_available())
+EOF
 ```
 
 4. Run the web app
 
@@ -195,6 +203,9 @@ pip install -r requirements.txt
 streamlit run app.py
 ```
 
+> Note: `requirements.txt` is consumed internally by `environment.yml`.
+> Do not install it manually with pip.
+
 ### Training the Model (Optional)
 
 To retrain the model from scratch using the provided scripts:
 
@@ -205,6 +216,27 @@ python scripts/train_combined.py
 
 (Note: Requires SROIE dataset in data/sroie)
 
+### API Usage (Optional)
+
+To run the API server:
+
+```bash
+python src/api.py
+```
+
+The API provides endpoints for processing invoices and extracting information.
+
+### Running with Database (Optional)
+
+To enable data persistence, run the included Docker Compose file to spin up PostgreSQL:
+
+```bash
+docker-compose up -d
+```
+
+The application will automatically detect the database and start saving invoices.
+
 ## 💻 Usage
 
 ### Web Interface (Recommended)
 
@@ -290,10 +322,16 @@ print(json.dumps(result, indent=2))
 │   Post-process   │
 │ validate, scores │
 └────────┬─────────┘
-
-
-
-
+         │
+   ┌─────┴──────────────┐
+   │                    │
+   ▼                    ▼
+┌──────────────────┐  ┌────────────────────┐
+│   JSON Output    │  │  DB (PostgreSQL)   │
+└──────────────────┘  │  (Optional Save)   │
+                      └────────────────────┘
+
 ```
 
 ## 📁 Project Structure
 
@@ -326,14 +364,14 @@ invoice-processor-ml/
 ├── src/
 │   ├── api.py            # FastAPI REST endpoint for API access
 │   ├── data_loader.py    # Unified data loader for training
-│   ├── database.py       #
+│   ├── database.py       # Database connection with environment-aware 'soft fail' check
 │   ├── extraction.py     # Regex-based information extraction logic
 │   ├── ml_extraction.py  # ML-based extraction (LayoutLMv3 + DocTR)
-│   ├── models.py         # SQLModel tables
+│   ├── models.py         # SQLModel tables (Invoice, LineItem) with schema validation
 │   ├── pdf_utils.py      # PDF text extraction and image conversion
 │   ├── pipeline.py       # Main orchestrator for the pipeline and CLI
 │   ├── preprocessing.py  # Image preprocessing functions (grayscale, denoise)
-│   ├── repository.py     # CRUD operations
+│   ├── repository.py     # CRUD operations with session safety handling
 │   ├── schema.py         # Pydantic models for API response validation
 │   ├── sroie_loader.py   # SROIE dataset loading logic
 │   └── utils.py          # Utility functions (semantic hashing, etc.)
 
@@ -346,6 +384,10 @@ invoice-processor-ml/
 │
 ├── app.py                # Streamlit web interface
 ├── requirements.txt      # Python dependencies
+├── environment.yml       # Conda environment configuration
+├── docker-compose.yml    # Docker Compose configuration for PostgreSQL
+├── Dockerfile            # Dockerfile for building the application container
+├── .gitignore            # Git ignore file
 └── README.md             # You are Here!
 ```
 
@@ -364,16 +406,21 @@ invoice-processor-ml/
 ## 📈 Performance
 
 - **OCR Precision**: State-of-the-art hierarchical detection using **DocTR (ResNet-50)**. Outperforms Tesseract on complex/noisy layouts.
-- **ML-based Extraction**:
-  - **Accuracy**: ~83% F1 Score on SROIE +
-  - **Speed**:
+- **ML-based Extraction**:
+  - **Accuracy**: ~83% F1 Score on SROIE + custom invoices
+  - **Speed**:
+    - **GPU (recommended)**: <1s per invoice
+    - **CPU (fallback)**: ~5–7s per invoice
+
+⚠️ CPU-only execution is supported for testing and experimentation but results
+in significantly higher latency due to the heavy OCR and layout-aware models.
 
 ## ⚠️ Known Limitations
 
 1. **Layout Sensitivity**: The ML model was fine‑tuned on SROIE (retail receipts) and mychen76/invoices-and-receipts_ocr_v1 (English). Professional multi-column invoices may underperform until you fine‑tune on more diverse datasets.
 2. **Invoice Number**: SROIE dataset lacks invoice number labels. The system solves this by using the Hybrid Fallback Engine, which successfully extracts invoice numbers using Regex whenever the ML model output is empty.
 3. **Line Items/Tables**: Not trained for table extraction yet. Rule-based supports simple totals; table extraction comes later.
-4. **Inference Latency**:
+4. **Inference Latency**: CPU execution is significantly slower due to heavy OCR and layout-aware models.
 
 ## 🔮 Future Enhancements
 
@@ -386,7 +433,7 @@ invoice-processor-ml/
 - [x] CI/CD pipeline (GitHub Actions → HuggingFace Spaces auto-deploy)
 - [ ] Multilingual OCR (PaddleOCR) and multilingual fine‑tuning
 - [ ] Confidence calibration and better validation rules
-- [
+- [x] Database persistence layer (PostgreSQL with SQLModel & Redundancy checks)
 
 ## 🛠️ Tech Stack
 
@@ -400,6 +447,8 @@ invoice-processor-ml/
 | Data Format | JSON |
 | CI/CD | GitHub Actions → HuggingFace Spaces |
 | Containerization | Docker |
+| Database | PostgreSQL, SQLModel |
+| Containerization | Docker & Docker Compose |
 
 ## 📚 What I Learned
````
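The *Semantic Hashing* scheme described in the README changes above (Vendor + Date + Total + ID) can be sketched as follows. This is an illustrative implementation, not the actual code in `src/utils.py`; the normalization choices (lowercasing, two-decimal totals) are assumptions:

```python
import hashlib

def semantic_hash(vendor: str, date: str, total: float, invoice_id: str) -> str:
    """Content-based hash so the same invoice always maps to the same key."""
    # Normalize fields so trivial formatting differences hash identically
    key = "|".join([
        (vendor or "").strip().lower(),
        (date or "").strip(),
        f"{float(total or 0):.2f}",
        (invoice_id or "").strip().lower(),
    ])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

h1 = semantic_hash("ACME Corp", "22/03/2018", 105.5, "INV-001")
h2 = semantic_hash("  acme corp", "22/03/2018", 105.50, "inv-001")
print(h1 == h2)  # True: formatting differences collapse to one hash
```

Because the hash is derived from content rather than the upload, re-processing the same invoice file (or a rescan of it) produces the same key, which is what makes the duplicate check idempotent.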
app.py
CHANGED

````diff
@@ -7,12 +7,21 @@ from PIL import Image, ImageDraw
 import pandas as pd
 import sys
 
+# PDF to image conversion
+try:
+    from pdf2image import convert_from_bytes
+    PDF_SUPPORT = True
+except ImportError:
+    PDF_SUPPORT = False
+
 # --------------------------------------------------
 # Pipeline import (PURE DATA ONLY)
 # --------------------------------------------------
-
-from
+from src.pipeline import process_invoice
+from src.database import init_db
+
+# Initialize database
+init_db()
 
 # --------------------------------------------------
 # Mock format detection (UI-level, safe)
@@ -119,17 +128,22 @@ with tab1:
 
     if uploaded_file:
         st.caption(f"File: {uploaded_file.name}")
 
+        # Handle PDF preview
         if uploaded_file.type == "application/pdf":
-
+            if PDF_SUPPORT:
+                pdf_bytes = uploaded_file.read()
+                uploaded_file.seek(0)  # Reset for later processing
+                pages = convert_from_bytes(pdf_bytes, first_page=1, last_page=1)
+                if pages:
+                    pdf_preview_image = pages[0]
+                    st.session_state.pdf_preview = pdf_preview_image
+                    st.image(pdf_preview_image, width=250, caption="PDF Preview (Page 1)")
+            else:
+                st.warning("PDF preview requires pdf2image. Install with: `pip install pdf2image`")
         else:
             image = Image.open(uploaded_file)
-            st.image(
-                image,
-                width=250,
-                caption="Uploaded Invoice"
-            )
+            st.image(image, width=250, caption="Uploaded Invoice")
 
 # -----------------------------
@@ -149,12 +163,39 @@ with tab1:
                 f.write(uploaded_file.getbuffer())
 
             method = "ml" if "ML" in extraction_method else "rules"
+
+            # CALL PIPELINE
             result = process_invoice(str(temp_path), method=method)
 
-            #
+            # --- SMART STATUS NOTIFICATIONS ---
+            db_status = result.get('_db_status', 'disabled')
+
+            if db_status == 'saved':
+                st.success("✅ Extraction & Storage Complete")
+                st.toast("Invoice saved to Database!", icon="💾")
+
+            elif db_status == 'duplicate':
+                st.success("✅ Extraction Complete")
+                st.toast("Duplicate invoice (already in database)", icon="⚠️")
+
+            elif db_status == 'disabled':
+                st.success("✅ Extraction Complete")
+                # Only show "Demo Mode" toast once per session
+                if not st.session_state.get('_db_warning_shown', False):
+                    st.toast("Database disabled (Demo Mode)", icon="ℹ️")
+                    st.session_state['_db_warning_shown'] = True
+
+            else:
+                st.success("✅ Extraction Complete")
+
+            # Hard guard — prevents DeltaGenerator bugs
             if not isinstance(result, dict):
                 st.error("Pipeline returned invalid data.")
                 st.stop()
+
+            # Remove the metadata field so it doesn't show up in the JSON view
+            if '_db_status' in result:
+                del result['_db_status']
 
             st.session_state.data = result
             st.session_state.format_info = detect_invoice_format(
@@ -162,35 +203,43 @@ with tab1:
             )
             st.session_state.processed_count += 1
 
-            st.success("Extraction Complete")
-
             # --- AI Detection Overlay Visualization ---
             raw_predictions = result.get("raw_predictions")
-            if raw_predictions
-                #
-                uploaded_file.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+            if raw_predictions:
+                # Get the base image for annotation
+                if uploaded_file.type == "application/pdf":
+                    # Use the converted PDF preview image
+                    if "pdf_preview" in st.session_state:
+                        overlay_image = st.session_state.pdf_preview.copy().convert("RGB")
+                    else:
+                        overlay_image = None
+                else:
+                    # Reload the original image for annotation
+                    uploaded_file.seek(0)
+                    overlay_image = Image.open(uploaded_file).convert("RGB")
+
+                if overlay_image:
+                    draw = ImageDraw.Draw(overlay_image)
+
+                    # Draw red rectangles around each detected entity's bounding boxes
+                    for entity_name, entity_data in raw_predictions.items():
+                        bboxes = entity_data.get("bbox", [])
+                        for box in bboxes:
+                            # bbox format: [x, y, width, height]
+                            x, y, w, h = box
+                            draw.rectangle(
+                                [x, y, x + w, y + h],
+                                outline="red",
+                                width=2
+                            )
+
+                    overlay_image.thumbnail((800, 800))
+
+                    st.image(
+                        overlay_image,
+                        caption="AI Detection Overlay",
+                        width="content"
+                    )
 
         except Exception as e:
             st.error(f"Pipeline error: {e}")
@@ -276,7 +325,7 @@ with tab2:
         st.image(
             Image.open(samples[0]),
             caption=samples[0].name,
-
+            width=250
         )
     else:
         st.info("No sample invoices found.")
````
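The coordinate clamping mentioned in the README's "Defensive Data Handling" bullet is not part of this diff. A minimal sketch of the idea, using the same `[x, y, width, height]` box format the overlay code assumes (function name hypothetical):

```python
def clamp_bbox(box, img_w, img_h):
    """Clamp an [x, y, w, h] box to image bounds so negative or
    oversized OCR coordinates cannot crash downstream models."""
    x, y, w, h = box
    x = max(0, min(int(x), img_w - 1))
    y = max(0, min(int(y), img_h - 1))
    w = max(1, min(int(w), img_w - x))  # keep at least 1px inside the image
    h = max(1, min(int(h), img_h - y))
    return [x, y, w, h]

print(clamp_bbox([-5, 10, 3000, 40], 800, 600))  # [0, 10, 800, 40]
```

Running every OCR box through a guard like this before model input or drawing means a stray negative coordinate degrades to an edge-aligned box instead of an exception.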
docker-compose.yml
CHANGED

```diff
@@ -11,7 +11,7 @@ services:
       POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-password}
       POSTGRES_DB: ${POSTGRES_DB:-invoices_db}
     ports:
-      - "
+      - "5433:5432"
     volumes:
       - postgres_data:/var/lib/postgresql/data
     healthcheck:
```
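Since `src/database.py` builds its connection string from `POSTGRES_*` variables, a local run against this container needs them to match the compose settings; note the host port is 5433, not the default 5432. A sketch (the `postgres` user name is an assumption based on the official image's default when `POSTGRES_USER` is unset):

```python
import os

# Hypothetical settings matching the compose file above; set these before
# importing src.database (e.g. via a .env file read by python-dotenv).
os.environ.update({
    "POSTGRES_USER": "postgres",      # assumption: official image default
    "POSTGRES_PASSWORD": "password",
    "POSTGRES_DB": "invoices_db",
    "POSTGRES_HOST": "localhost",
    "POSTGRES_PORT": "5433",          # host side of the "5433:5432" mapping
})

url = (f"postgresql://{os.environ['POSTGRES_USER']}"
       f":***@{os.environ['POSTGRES_HOST']}"
       f":{os.environ['POSTGRES_PORT']}/{os.environ['POSTGRES_DB']}")
print(url)  # postgresql://postgres:***@localhost:5433/invoices_db
```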
environment.yml
ADDED

```yaml
name: invoice-ml

channels:
  - pytorch
  - nvidia
  - conda-forge

dependencies:
  # ----- Python -----
  - python=3.10
  - pip

  # ----- CUDA-enabled PyTorch -----
  - pytorch
  - torchvision
  - torchaudio
  - pytorch-cuda=11.8

  # ----- Core numeric / system -----
  - numpy
  - certifi
  - openssl
  - ca-certificates

  # ----- Computer Vision / PDF -----
  - poppler
  - ghostscript

  # ----- App-level Python deps -----
  - pip:
      - -r requirements.txt
```
requirements.txt
CHANGED

```diff
@@ -1,23 +1,26 @@
 # ----- Streamlit -----
 streamlit>=1.28.0
 
 # ----- OCR -----
-python-doctr
-opencv-python>=4.8.0
+python-doctr>=0.8.0
+opencv-python-headless>=4.8.0
 Pillow>=10.0.0
 
-# -----
-numpy>=1.24.0
-pandas>=2.0.0
-
-# ----- Machine Learning -----
-torch>=2.0.0
-torchvision>=0.15.0
+# ----- NLP / Transformers -----
 transformers>=4.30.0
 datasets>=2.14.0
 huggingface-hub>=0.17.0
 seqeval>=1.2.2
 
+# ----- Utilities -----
+python-dotenv>=1.0.0
+httpx>=0.28.0
+tenacity>=8.0.0
+validators>=0.22.0
+langdetect>=1.0.9
+RapidFuzz>=3.0.0
+python-dateutil>=2.9.0
+
 # ----- Data Validation -----
 pydantic>=2.12.0
```
src/api.py
CHANGED

```diff
@@ -8,10 +8,8 @@ from pathlib import Path
 import uuid
 import sys
 
-
-
-from pipeline import process_invoice
-from schema import InvoiceData
+from src.pipeline import process_invoice
+from src.schema import InvoiceData
 
 app = FastAPI(
     title="Invoice Extraction API",
```
src/database.py
CHANGED

```diff
@@ -1,33 +1,65 @@
 # src/database.py
 
 from sqlmodel import SQLModel, create_engine, Session
-from
+from sqlalchemy import text
+from typing import Generator, Optional
 import os
+from dotenv import load_dotenv
 
+load_dotenv()
 
-# 1. Get credentials from environment variables (Os.getenv)
-# 2. Construct the DATABASE_URL string: postgresql://user:pass@host:port/db
-# 3. Create the SQLModel engine
-# 4. Implement the init_db and get_session functions
+# 1. Get credentials (with defaults to avoid immediate crashes if vars are missing)
+POSTGRES_USER = os.getenv("POSTGRES_USER")
+POSTGRES_PASSWORD = os.getenv("POSTGRES_PASSWORD")
+POSTGRES_DB = os.getenv("POSTGRES_DB")
+POSTGRES_HOST = os.getenv("POSTGRES_HOST", "localhost")
+POSTGRES_PORT = os.getenv("POSTGRES_PORT", "5432")
 
-#
+# 2. Construct DATABASE_URL conditionally
+DATABASE_URL = None
+engine = None
+DB_CONNECTED = False  # Track actual connection status
 
-#
+if POSTGRES_USER and POSTGRES_PASSWORD and POSTGRES_DB:
+    DATABASE_URL = f"postgresql://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{POSTGRES_HOST}:{POSTGRES_PORT}/{POSTGRES_DB}"
+
+    try:
+        # 3. Create the engine only if we have credentials
+        engine = create_engine(DATABASE_URL, echo=False)
+
+        # 4. Test actual connection (once at startup)
+        with engine.connect() as conn:
+            conn.execute(text("SELECT 1"))
+        DB_CONNECTED = True
+        print("✅ Database connection verified.")
+    except Exception as e:
+        print(f"⚠️ Database unavailable: {e}")
+        DB_CONNECTED = False
+else:
+    print("⚠️ Database credentials missing. Database features will be disabled.")
 
 def init_db():
     """
     Idempotent DB initialization.
-
+    Only runs if engine is successfully configured AND connected.
     """
-
-
+    if engine and DB_CONNECTED:
+        try:
+            SQLModel.metadata.create_all(engine)
+            print("✅ Database tables created/verified.")
+        except Exception as e:
+            print(f"❌ Error initializing database: {e}")
+    # Silent skip when DB is not connected - message already shown at startup
 
-def get_session() -> Generator[Session, None, None]:
+def get_session() -> Generator[Optional[Session], None, None]:
     """
     Dependency for yielding a database session.
-
+    Yields None if database is not configured.
     """
-
-
+    if engine:
+        with Session(engine) as session:
+            yield session
+    else:
+        # Yield None so code depending on this doesn't crash immediately,
+        # but can check 'if session is None'.
+        yield None
```
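`src/repository.py` is listed in this commit but its diff is not shown here. Based on the README's "Repository Guard" description, its save path presumably looks something like this sketch (class and method names are assumptions, and the real code uses SQLModel queries rather than the placeholder below):

```python
from typing import Optional

class InvoiceRepository:
    """Sketch: every operation guards on an active session ('soft fail')."""

    def __init__(self, session: Optional[object]):
        self.session = session  # None when the database is disabled

    def save(self, invoice: dict) -> str:
        if self.session is None:
            return "disabled"        # skip silently; pipeline keeps running
        if self._is_duplicate(invoice):
            return "duplicate"       # semantic-hash match already stored
        # self.session.add(Invoice(**invoice)); self.session.commit()
        return "saved"

    def _is_duplicate(self, invoice: dict) -> bool:
        # Placeholder for the semantic-hash lookup described in the README
        return False

print(InvoiceRepository(session=None).save({"vendor": "ACME"}))  # disabled
```

Returning a status string rather than raising lines up with the `_db_status` values (`saved`, `duplicate`, `disabled`) that `app.py` inspects for its toast notifications.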
src/extraction.py
CHANGED
|
@@ -7,29 +7,83 @@ from difflib import SequenceMatcher
|
|
| 7 |
|
| 8 |
def extract_dates(text: str) -> List[str]:
|
| 9 |
"""
|
| 10 |
-
Robust date extraction that handles
|
| 11 |
-
|
|
|
|
|
|
|
|
|
|
| 12 |
"""
|
| 13 |
if not text: return []
|
| 14 |
|
| 15 |
-
#
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 19 |
|
| 20 |
valid_dates = []
|
| 21 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 22 |
try:
|
| 23 |
-
# Try to parse it to check if it's a real date
|
| 24 |
-
 def extract_dates(text: str) -> List[str]:
     """
+    Robust date extraction that handles:
+    - Numeric formats: DD/MM/YYYY, DD-MM-YYYY, DD.MM.YYYY
+    - Text month formats: 22 Mar 18, March 22, 2018, 22-Mar-2018
+    - OCR noise like pipes (|) instead of slashes
+    Validates using datetime to ensure semantic correctness.
     """
     if not text: return []
 
+    # Month name mappings
+    MONTH_MAP = {
+        'jan': 1, 'january': 1,
+        'feb': 2, 'february': 2,
+        'mar': 3, 'march': 3,
+        'apr': 4, 'april': 4,
+        'may': 5,
+        'jun': 6, 'june': 6,
+        'jul': 7, 'july': 7,
+        'aug': 8, 'august': 8,
+        'sep': 9, 'sept': 9, 'september': 9,
+        'oct': 10, 'october': 10,
+        'nov': 11, 'november': 11,
+        'dec': 12, 'december': 12
+    }
 
     valid_dates = []
+
+    # Pattern 1: Numeric dates - DD/MM/YYYY, DD-MM-YYYY, DD.MM.YYYY, DD MM YYYY
+    # Also handles OCR noise like pipes (|) instead of slashes
+    numeric_pattern = r'\b(\d{1,2})[\s/|.-](\d{1,2})[\s/|.-](\d{2,4})\b'
+    for d, m, y in re.findall(numeric_pattern, text):
+        try:
+            year = int(y)
+            if year < 100:
+                year = 2000 + year if year < 50 else 1900 + year
+            dt = datetime(year, int(m), int(d))
+            valid_dates.append(dt.strftime("%d/%m/%Y"))
+        except ValueError:
+            continue
+
+    # Pattern 2: DD Mon YY/YYYY (e.g., "22 Mar 18", "22-Mar-2018", "22 March 2018")
+    text_month_pattern1 = r'\b(\d{1,2})[\s/.-]?([A-Za-z]{3,9})[\s/.-]?(\d{2,4})\b'
+    for d, m, y in re.findall(text_month_pattern1, text, re.IGNORECASE):
+        month_num = MONTH_MAP.get(m.lower())
+        if month_num:
+            try:
+                year = int(y)
+                if year < 100:
+                    year = 2000 + year if year < 50 else 1900 + year
+                dt = datetime(year, month_num, int(d))
+                valid_dates.append(dt.strftime("%d/%m/%Y"))
+            except ValueError:
+                continue
+
+    # Pattern 3: Mon DD, YYYY (e.g., "March 22, 2018", "Mar 22 2018")
+    text_month_pattern2 = r'\b([A-Za-z]{3,9})[\s.-]?(\d{1,2})[,\s.-]+(\d{2,4})\b'
+    for m, d, y in re.findall(text_month_pattern2, text, re.IGNORECASE):
+        month_num = MONTH_MAP.get(m.lower())
+        if month_num:
+            try:
+                year = int(y)
+                if year < 100:
+                    year = 2000 + year if year < 50 else 1900 + year
+                dt = datetime(year, month_num, int(d))
+                valid_dates.append(dt.strftime("%d/%m/%Y"))
+            except ValueError:
+                continue
+
+    # Pattern 4: YYYY-MM-DD (ISO format)
+    iso_pattern = r'\b(\d{4})[-/](\d{1,2})[-/](\d{1,2})\b'
+    for y, m, d in re.findall(iso_pattern, text):
         try:
-            # This filters out "99/99/2000" or random phone numbers like 12 34 5678
-            # Assuming Day-Month-Year format which is common in SROIE/International
-            # For US format, you might swap d and m
             dt = datetime(int(y), int(m), int(d))
             valid_dates.append(dt.strftime("%d/%m/%Y"))
         except ValueError:
             continue
 
-    return list(dict.fromkeys(valid_dates))
+    return list(dict.fromkeys(valid_dates))  # Deduplicate while preserving order
 
 def extract_amounts(text: str) -> List[float]:
     if not text: return []
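The rewritten `extract_dates` above leans on two ideas: a two-digit-year pivot (00-49 maps to the 2000s, 50-99 to the 1900s) and `datetime` as the semantic validator that discards impossible matches. A minimal, self-contained sketch of Pattern 1 only (the function name `parse_numeric_dates` is illustrative, not part of the module):

```python
import re
from datetime import datetime

def parse_numeric_dates(text: str) -> list[str]:
    """Minimal sketch of Pattern 1: numeric DD/MM/YYYY with OCR noise tolerance."""
    # Separator class includes '|' because OCR often misreads '/' as a pipe
    pattern = r'\b(\d{1,2})[\s/|.-](\d{1,2})[\s/|.-](\d{2,4})\b'
    found = []
    for d, m, y in re.findall(pattern, text):
        year = int(y)
        if year < 100:
            # Two-digit year pivot: 00-49 -> 2000s, 50-99 -> 1900s
            year = 2000 + year if year < 50 else 1900 + year
        try:
            dt = datetime(year, int(m), int(d))  # rejects e.g. 99/99/2000
        except ValueError:
            continue
        found.append(dt.strftime("%d/%m/%Y"))
    return list(dict.fromkeys(found))  # deduplicate while preserving order

print(parse_numeric_dates("Paid 22/03/18, again 22|03|2018, junk 99/99/2000"))
# -> ['22/03/2018']
```

Note that the pipe-noise variant and the two-digit form normalize to the same string, so the ordered dedupe collapses them.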
src/ml_extraction.py
CHANGED

@@ -8,7 +8,7 @@ from PIL import Image
 from typing import List, Dict, Any, Tuple
 import re
 import numpy as np
-from extraction import extract_invoice_number, extract_total, extract_address
+from src.extraction import extract_invoice_number, extract_total, extract_address
 from doctr.io import DocumentFile
 from doctr.models import ocr_predictor
 
@@ -219,10 +219,12 @@ def extract_ml_based(image_path: str) -> Dict[str, Any]:
     encoding = PROCESSOR(
         image, text=words, boxes=normalized_boxes,
         truncation=True, max_length=512, return_tensors="pt"
-        )
+    )
+    # Move tensors to device for inference, but keep original encoding for word_ids()
+    model_inputs = {k: v.to(DEVICE) for k, v in encoding.items()}
 
     with torch.no_grad():
-        outputs = MODEL(**encoding)
+        outputs = MODEL(**model_inputs)
 
     predictions = outputs.logits.argmax(-1).squeeze().tolist()
     extracted_entities = _process_predictions(words, unnormalized_boxes, encoding, predictions, MODEL.config.id2label)
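The key change in `extract_ml_based` is moving only the tensors to the device while keeping the original `encoding` object intact, since `_process_predictions` still needs its `word_ids()` helper, which a plain dict of tensors would not have. A framework-free sketch of that pattern, using stand-in classes instead of real `torch.Tensor`/`BatchEncoding` objects (all names here are illustrative):

```python
# Stand-ins so the pattern runs without torch/transformers installed.
class FakeTensor:
    def __init__(self, name):
        self.name, self.device = name, "cpu"

    def to(self, device):
        # Like torch.Tensor.to(), returns a copy on the target device
        moved = FakeTensor(self.name)
        moved.device = device
        return moved

class FakeEncoding(dict):
    def word_ids(self):
        # Tokenizer helper that a plain dict of tensors lacks
        return [0, 1, 1, 2]

encoding = FakeEncoding(input_ids=FakeTensor("input_ids"), bbox=FakeTensor("bbox"))

# The pattern from the diff: move the values, keep the original object intact.
model_inputs = {k: v.to("cuda") for k, v in encoding.items()}

print([v.device for v in model_inputs.values()])  # copies moved to the device
print(encoding.word_ids())                        # original helper still usable
```

Had the code done `encoding = encoding.to(DEVICE)` and later wrapped the result in a plain dict, `word_ids()` would have been lost; keeping both objects side by side avoids that.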
src/models.py
CHANGED

@@ -2,49 +2,54 @@
 
 from typing import List, Optional
 from datetime import date as DateType
+from datetime import datetime
 from decimal import Decimal
 from sqlmodel import SQLModel, Field, Relationship
 
-
-# SQLModel classes should mirror the Pydantic models in src/schema.py
-# but with database-specific configurations (primary keys, foreign keys).
+SQLModel.metadata.clear()
 
 class Invoice(SQLModel, table=True):
     __tablename__ = "invoices"
+    __table_args__ = {"extend_existing": True}
 
-    #
-    #
-    # - semantic_hash (str, unique, indexed) -> Critical for deduplication
-    #
-    # - validation_errors (str) -> Store as JSON string since we don't need to query inside it yet
-    # - created_at (DateType) -> Default to today
-    #
+    # Primary Key
+    id: Optional[int] = Field(default=None, primary_key=True)
+
+    # Data Fields
+    receipt_number: Optional[str] = Field(default=None, index=True)
+    date: Optional[DateType] = Field(default=None)
+    total_amount: Optional[Decimal] = Field(default=None, max_digits=10, decimal_places=2)
+    vendor: Optional[str] = Field(default=None)
+    address: Optional[str] = Field(default=None)
+
+    # Critical for Deduplication
+    semantic_hash: str = Field(unique=True, index=True)
+
+    # Metadata Fields
+    validation_status: str = Field(default="unknown")
+    # Store validation_errors as a JSON string because SQLModel/SQLite doesn't always support arrays out of the box
+    validation_errors: Optional[str] = Field(default="[]")
+    created_at: DateType = Field(default_factory=datetime.now)
+
+    # Relationship to LineItem (One-to-Many)
+    items: List["LineItem"] = Relationship(back_populates="invoice")
 
 
 class LineItem(SQLModel, table=True):
     __tablename__ = "line_items"
+    __table_args__ = {"extend_existing": True}
 
-    #
-    #
-    #
-    pass
+    # Primary Key
+    id: Optional[int] = Field(default=None, primary_key=True)
+
+    # Foreign Key
+    invoice_id: Optional[int] = Field(default=None, foreign_key="invoices.id")
+
+    # Data Fields
+    description: str
+    quantity: int = Field(default=1)
+    unit_price: Optional[Decimal] = Field(default=None, max_digits=10, decimal_places=2)
+    total: Optional[Decimal] = Field(default=None, max_digits=10, decimal_places=2)
+
+    # Relationship back to Invoice
+    invoice: Optional[Invoice] = Relationship(back_populates="items")
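Since `validation_errors` is persisted as a JSON string, every writer must `json.dumps` the list and every reader must `json.loads` it back; the `"[]"` default keeps the read path branch-free. A quick stdlib round-trip (error strings here are illustrative, not the schema's actual codes):

```python
import json

errors = ["total_mismatch", "missing_date"]

# Write path: list -> JSON string, what actually lands in the column
stored = json.dumps(errors)
assert isinstance(stored, str)

# Read path: JSON string -> list, what callers must do after loading a row
assert json.loads(stored) == errors

# The column default "[]" decodes to an empty list, so readers never see None
assert json.loads("[]") == []
print(stored)
```

The trade-off named in the diff comment is that the column cannot be queried by individual error, which is acceptable as long as the list is only ever read back whole.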
src/pipeline.py
CHANGED

@@ -12,12 +12,14 @@ from pydantic import ValidationError
 import cv2
 
 # --- IMPORTS ---
-from preprocessing import load_image, convert_to_grayscale, remove_noise
-from extraction import structure_output
-from ml_extraction import extract_ml_based
-from schema import InvoiceData
-from pdf_utils import extract_text_from_pdf, convert_pdf_to_images
-from utils import generate_semantic_hash
+from src.preprocessing import load_image, convert_to_grayscale, remove_noise
+from src.extraction import structure_output
+from src.ml_extraction import extract_ml_based
+from src.schema import InvoiceData
+from src.pdf_utils import extract_text_from_pdf, convert_pdf_to_images
+from src.utils import generate_semantic_hash
+from src.repository import InvoiceRepository
+from src.database import DB_CONNECTED
 
 def process_invoice(image_path: str,
                     method: str = 'ml',

@@ -136,6 +138,38 @@ def process_invoice(image_path: str,
     # We calculate the hash based on the final (or raw) data.
     # This gives us a unique fingerprint for this specific business transaction.
     final_data['semantic_hash'] = generate_semantic_hash(final_data)
+
+    # --- DATABASE SAVE (The Integration) ---
+    if not DB_CONNECTED:
+        # Database not available - skip save entirely (message shown once at startup)
+        final_data['_db_status'] = 'disabled'
+    else:
+        final_data['_db_status'] = 'disabled'  # Default assumption
+        try:
+            print("💾 Attempting to save to Database...")
+            repo = InvoiceRepository()
+
+            if repo.session:
+                saved_record = repo.save_invoice(final_data)
+                if saved_record:
+                    print(f"   ✅ Successfully saved Invoice #{saved_record.id}")
+                    final_data['_db_status'] = 'saved'
+                else:
+                    # Check if it's a duplicate by looking up the hash
+                    existing = repo.get_by_hash(final_data.get('semantic_hash', ''))
+                    if existing:
+                        print("   ⚠️ Duplicate invoice detected (already in database)")
+                        final_data['_db_status'] = 'duplicate'
+                    else:
+                        print("   ⚠️ Save failed (unknown error)")
+                        final_data['_db_status'] = 'error'
+            else:
+                print("   ⚠️ Skipped DB Save (Database disabled)")
+                final_data['_db_status'] = 'disabled'
+
+        except Exception as e:
+            print(f"   ⚠️ Database Error (Ignored): {e}")
+            final_data['_db_status'] = 'error'
 
     # --- SAVING STEP ---
     if save_results:
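The save branch above resolves `_db_status` to one of four values: `saved`, `duplicate`, `error`, or `disabled`. Stripped of the printing and the repository calls, the decision logic can be sketched as a pure function (the name `resolve_db_status` is hypothetical, not in the codebase):

```python
def resolve_db_status(db_connected: bool, have_session: bool,
                      saved: bool, is_duplicate: bool) -> str:
    """Mirror of the branch structure in process_invoice's database-save step."""
    if not db_connected or not have_session:
        # DB unreachable at startup, or the repository could not open a session
        return "disabled"
    if saved:
        return "saved"
    # save_invoice returned None: distinguish a rejected duplicate from a real failure
    return "duplicate" if is_duplicate else "error"

print(resolve_db_status(True, True, saved=False, is_duplicate=True))
# -> duplicate
```

Writing it this way makes the invariant explicit: the UI-facing status is always set, even when the save path raises, which is what lets the app report DB state without ever blocking extraction.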
src/repository.py
CHANGED

@@ -3,39 +3,85 @@
 from sqlmodel import Session, select
 from typing import Dict, Any, Optional
 import json
+from datetime import date
 
 from src.models import Invoice, LineItem
-from src.database import get_session, engine
+from src.database import get_session, engine, DB_CONNECTED
 
 class InvoiceRepository:
-    def __init__(self, session: Session = None):
+    def __init__(self, session: Optional[Session] = None):
         """
         Initialize with an optional session.
-
+        If no session is provided, try to get a new one from the engine.
+        Only creates session if database is actually connected.
         """
-
+        if session:
+            self.session = session
+        elif engine and DB_CONNECTED:
+            self.session = Session(engine)
+        else:
+            self.session = None
 
-    def save_invoice(self, invoice_data: Dict[str, Any]) -> Invoice:
+    def save_invoice(self, invoice_data: Dict[str, Any]) -> Optional[Invoice]:
         """
         Saves an invoice and its line items to the database.
-
-        Steps to implement:
-        1. Manage Session: If self.session is None, create a new one using 'engine'.
-        2. Clean Data: Separate 'items' list from the main invoice properties.
-        3. Create Invoice: Instantiate the Invoice SQLModel.
-        4. Deserialize Complex Types: e.g. 'validation_errors' list -> JSON string.
-        5. Process Items: Iterate 'items', create LineItem models, check keys match, and append to invoice.items.
-        6. Commit: Add to session, commit, and refresh.
-        7. Error Handling: Wrap in try/except to rollback on failure.
+        Returns the saved Invoice object or None if DB is disabled/failed.
         """
-
-
+        if not self.session:
+            print("⚠️ DB Session missing. Skipping save.")
+            return None
+
+        try:
+            # 1. Prepare Data
+            data = invoice_data.copy()
+
+            # Serialize complex types (validation_errors)
+            if 'validation_errors' in data and isinstance(data['validation_errors'], list):
+                data['validation_errors'] = json.dumps(data['validation_errors'])
+
+            # Extract items to process separately
+            items_data = data.pop('items', [])
+
+            # 2. Create Invoice Record
+            invoice = Invoice(**data)
+
+            # 3. Process Items
+            for item in items_data:
+                # Ensure item is a dict (if it's a Pydantic model, convert it)
+                if hasattr(item, 'model_dump'):
+                    item_dict = item.model_dump()
+                elif isinstance(item, dict):
+                    item_dict = item
+                else:
+                    continue
+
+                line_item = LineItem(**item_dict)
+                invoice.items.append(line_item)
+
+            # 4. Commit
+            self.session.add(invoice)
+            self.session.commit()
+            self.session.refresh(invoice)
+
+            print(f"✅ Invoice {invoice.id} saved to DB.")
+            return invoice
+
+        except Exception as e:
+            print(f"❌ Error saving invoice to DB: {e}")
+            self.session.rollback()
+            return None
 
     def get_by_hash(self, semantic_hash: str) -> Optional[Invoice]:
         """
         Check if invoice already exists using the semantic hash.
         """
-
-
-
+        if not self.session:
+            return None
+
+        try:
+            statement = select(Invoice).where(Invoice.semantic_hash == semantic_hash)
+            results = self.session.exec(statement)
+            return results.first()
+        except Exception as e:
+            print(f"❌ Error checking hash: {e}")
+            return None
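`save_invoice` returns `None` both on a real failure and when the unique index on `semantic_hash` rejects a duplicate insert, which is why the pipeline follows up with `get_by_hash` to tell the two cases apart. That contract can be sketched with an in-memory stand-in (the class `InMemoryRepo` is illustrative; the real repository uses a SQLModel `Session` and a real unique constraint):

```python
class InMemoryRepo:
    """Dict-backed stand-in for the invoices table, keyed by semantic_hash."""

    def __init__(self):
        self._by_hash = {}

    def save_invoice(self, data):
        h = data["semantic_hash"]
        if h in self._by_hash:
            # Mirrors the unique-index violation -> rollback -> return None path
            return None
        self._by_hash[h] = data
        return data

    def get_by_hash(self, h):
        # Mirrors select(Invoice).where(Invoice.semantic_hash == h)
        return self._by_hash.get(h)

repo = InMemoryRepo()
invoice = {"semantic_hash": "abc123", "vendor": "ACME"}
assert repo.save_invoice(invoice) is not None    # first save succeeds
assert repo.save_invoice(invoice) is None        # duplicate is rejected
# The caller disambiguates: None + a hash hit means "duplicate", not "error"
assert repo.get_by_hash("abc123")["vendor"] == "ACME"
print("duplicate correctly detected")
```

Returning a sentinel and letting the caller probe the hash keeps `save_invoice` free of status strings, at the cost of a second query on the duplicate path.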