GSoumyajit2005/invoice-processor-ml

GSoumyajit2005 committed on Jan 30

Commit 90dbe20 (1 parent: 8f86a3c)

feat: added bulk processing, html reporting, and geometric table extraction

Files changed (5)

README.md +5 -4
app.py +150 -101
src/ml_extraction.py +8 -2
src/report_generator.py +298 -0
src/table_extraction.py +144 -0

README.md CHANGED

@@ -46,12 +46,14 @@ A production-grade Hybrid Invoice Extraction System that combines the semantic u
 - **Defensive Data Handling:** Implemented coordinate clamping to prevent model crashes from negative OCR bounding boxes.
 - **GPU-Accelerated OCR:** DocTR (Mindee) with automatic CUDA acceleration for faster inference in production.
 - **Clean JSON Output:** Normalized schema handling nested entities, line items, and validation flags.
-- **Defensive Persistence:** Optional PostgreSQL integration that automatically saves extracted data when credentials are present, but gracefully degrades (skips saving) in serverless/demo environments like Hugging Face Spaces.
-- **Duplicate Prevention:** Implemented *Semantic Hashing* (Vendor + Date + Total + ID) to automatically detect and prevent duplicate invoice entries.
 ### 💻 Usability
 - **Streamlit Web UI:** Interactive dashboard for real-time inference, visualization, and side-by-side comparison (ML vs. Regex).
 - **CLI & Batch Processing:** Process single files or entire directories via command line with JSON export.
 - **Auto-Validation:** Heuristic checks to validate that the extracted "Total Amount" matches the sum of line items.
@@ -236,7 +238,6 @@ docker-compose up -d
 The application will automatically detect the database and start saving invoices.
 ## 💻 Usage
 ### Web Interface (Recommended)
@@ -428,7 +429,7 @@ in significantly higher latency due to the heavy OCR and layout-aware models.
 - [ ] (Optional) Add FATURA (table-focused) for line-item extraction
 - [ ] Sliding-window chunking for >512 token documents (to avoid truncation)
 - [ ] Table detection (Camelot/Tabula/DeepDeSRT) for line items
-- [ ] PDF support (pdf2image) for multipage invoices
 - [x] FastAPI backend + Docker
 - [x] CI/CD pipeline (GitHub Actions → HuggingFace Spaces auto-deploy)
 - [ ] Multilingual OCR (PaddleOCR) and multilingual fine‑tuning

 - **Defensive Data Handling:** Implemented coordinate clamping to prevent model crashes from negative OCR bounding boxes.
 - **GPU-Accelerated OCR:** DocTR (Mindee) with automatic CUDA acceleration for faster inference in production.
 - **Clean JSON Output:** Normalized schema handling nested entities, line items, and validation flags.
+- **Defensive Persistence:** Optional PostgreSQL integration (local Docker or cloud Supabase) that automatically saves extracted data when credentials are present, but gracefully degrades (skips saving) in serverless/demo environments.
+- **Async Database Saves:** Background thread processing ensures fast UI response (~5-7s) while database operations happen asynchronously.
+- **Duplicate Prevention:** Implemented _Semantic Hashing_ (Vendor + Date + Total + ID) to automatically detect and prevent duplicate invoice entries.
 ### 💻 Usability
 - **Streamlit Web UI:** Interactive dashboard for real-time inference, visualization, and side-by-side comparison (ML vs. Regex).
+- **PDF Preview & Overlay:** Visual preview of uploaded PDFs with an overlay of ML-detected bounding boxes for transparency.
 - **CLI & Batch Processing:** Process single files or entire directories via command line with JSON export.
 - **Auto-Validation:** Heuristic checks to validate that the extracted "Total Amount" matches the sum of line items.
 The application will automatically detect the database and start saving invoices.
 ## 💻 Usage
 ### Web Interface (Recommended)
 - [ ] (Optional) Add FATURA (table-focused) for line-item extraction
 - [ ] Sliding-window chunking for >512 token documents (to avoid truncation)
 - [ ] Table detection (Camelot/Tabula/DeepDeSRT) for line items
+- [x] PDF support (pdf2image) for multipage invoices
 - [x] FastAPI backend + Docker
 - [x] CI/CD pipeline (GitHub Actions → HuggingFace Spaces auto-deploy)
 - [ ] Multilingual OCR (PaddleOCR) and multilingual fine‑tuning
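
The "Duplicate Prevention" bullet in this README diff describes semantic hashing over Vendor + Date + Total + ID. A minimal sketch of how such a fingerprint could be computed follows. The helper name `semantic_hash`, the normalization steps, and the choice of SHA-256 are assumptions for illustration and are not taken from the repository; the field names mirror the keys used in `src/report_generator.py`.

```python
import hashlib
from typing import Any, Dict


def semantic_hash(invoice: Dict[str, Any]) -> str:
    """Hypothetical helper: stable fingerprint from vendor, date, total and ID."""
    # Normalize case and whitespace so cosmetic OCR differences do not change the hash.
    vendor = str(invoice.get("vendor") or "").strip().lower()
    date = str(invoice.get("date") or "").strip()
    try:
        # Render the total with two decimals so "12.5" and 12.50 hash identically.
        total = f"{float(invoice.get('total_amount') or 0):.2f}"
    except (TypeError, ValueError):
        total = str(invoice.get("total_amount")).strip()
    invoice_id = str(invoice.get("receipt_number") or "").strip().upper()
    key = "|".join([vendor, date, total, invoice_id])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Two extractions of the same invoice that differ only in casing or number formatting would then map to the same digest, which a database layer could store under a unique constraint.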

app.py CHANGED

@@ -6,6 +6,7 @@ from pathlib import Path
 from PIL import Image, ImageDraw
 import pandas as pd
 import sys
 # PDF to image conversion
 try:
@@ -126,19 +127,75 @@ with tab1:
     with col_left:
         st.subheader("1. Upload Invoice")
-        uploaded_file = st.file_uploader(
-            "Upload JPG, PNG, or PDF",
-            type=["jpg", "jpeg", "png", "pdf"]
         )
-        if uploaded_file:
-            st.caption(f"File: {uploaded_file.name}")
             # Handle PDF preview
-            if uploaded_file.type == "application/pdf":
                 if PDF_SUPPORT:
-                    pdf_bytes = uploaded_file.read()
-                    uploaded_file.seek(0)  # Reset for later processing
                     pages = convert_from_bytes(pdf_bytes, first_page=1, last_page=1)
                     if pages:
                         pdf_preview_image = pages[0]
@@ -147,7 +204,8 @@ with tab1:
                 else:
                     st.warning("PDF preview requires pdf2image. Install with: `pip install pdf2image`")
             else:
-                image = Image.open(uploaded_file)
                 st.image(image, width=250, caption="Uploaded Invoice")
@@ -157,101 +215,84 @@ with tab1:
     with col_right:
         st.subheader("2. Extraction Results")
-        if uploaded_file and st.button("✨ Extract Data", type="primary"):
-            with st.spinner("Running invoice extraction pipeline..."):
-                try:
-                    temp_dir = Path("temp")
-                    temp_dir.mkdir(exist_ok=True)
-                    temp_path = temp_dir / uploaded_file.name
-                    with open(temp_path, "wb") as f:
-                        f.write(uploaded_file.getbuffer())
-                    method = "ml" if "ML" in extraction_method else "rules"
-                    # CALL PIPELINE
-                    result = process_invoice(str(temp_path), method=method)
-                    # --- SMART STATUS NOTIFICATIONS ---
-                    db_status = result.get('_db_status', 'disabled')
-                    if db_status == 'saved':
-                        st.success("✅ Extraction & Storage Complete")
-                        st.toast("Invoice saved to Database!", icon="💾")
-                    elif db_status == 'queued':
-                        st.success("✅ Extraction Complete")
-                        st.toast("Saving to database...", icon="💾")
-                    elif db_status == 'duplicate':
-                        st.success("✅ Extraction Complete")
-                        st.toast("Duplicate invoice (already in database)", icon="⚠️")
-                    elif db_status == 'disabled':
-                        st.success("✅ Extraction Complete")
-                        # Only show "Demo Mode" toast once per session
-                        if not st.session_state.get('_db_warning_shown', False):
-                            st.toast("Database disabled (Demo Mode)", icon="ℹ️")
-                            st.session_state['_db_warning_shown'] = True
-                    else:
-                        st.success("✅ Extraction Complete")
-                    # Hard guard — prevents DeltaGenerator bugs
-                    if not isinstance(result, dict):
-                        st.error("Pipeline returned invalid data.")
-                        st.stop()
-                    # Remove the metadata field so it doesn't show up in the JSON view
-                    if '_db_status' in result:
-                        del result['_db_status']
-                    st.session_state.data = result
-                    st.session_state.format_info = detect_invoice_format(
-                        result.get("raw_text", "")
-                    )
-                    st.session_state.processed_count += 1
-                    # --- AI Detection Overlay Visualization ---
-                    raw_predictions = result.get("raw_predictions")
-                    if raw_predictions:
-                        # Get the base image for annotation
-                        if uploaded_file.type == "application/pdf":
-                            # Use the converted PDF preview image
-                            if "pdf_preview" in st.session_state:
-                                overlay_image = st.session_state.pdf_preview.copy().convert("RGB")
-                            else:
-                                overlay_image = None
                         else:
-                            # Reload the original image for annotation
-                            uploaded_file.seek(0)
-                            overlay_image = Image.open(uploaded_file).convert("RGB")
-                        if overlay_image:
-                            draw = ImageDraw.Draw(overlay_image)
-                            # Draw red rectangles around each detected entity's bounding boxes
-                            for entity_name, entity_data in raw_predictions.items():
-                                bboxes = entity_data.get("bbox", [])
-                                for box in bboxes:
-                                    # bbox format: [x, y, width, height]
-                                    x, y, w, h = box
-                                    draw.rectangle(
-                                        [x, y, x + w, y + h],
-                                        outline="red",
-                                        width=2
-                                    )
-                            overlay_image.thumbnail((800, 800))
-                            st.image(
-                                overlay_image,
-                                caption="AI Detection Overlay",
-                                width="content"
-                            )
-                except Exception as e:
-                    st.error(f"Pipeline error: {e}")
         # -----------------------------
         # Render Results
@@ -290,7 +331,7 @@ with tab1:
             st.subheader("🛒 Line Items")
             items = data.get("items", [])
             if items:
-                st.dataframe(pd.DataFrame(items), use_container_width=True)
             else:
                 st.info("No line items extracted.")
@@ -317,6 +358,14 @@ with tab1:
                 mime="application/json"
             )
             with st.expander("📝 Raw OCR Text"):
                 st.text(data.get("raw_text", "No OCR text available"))

 from PIL import Image, ImageDraw
 import pandas as pd
 import sys
+from src.report_generator import generate_bulk_html_report
 # PDF to image conversion
 try:
     with col_left:
         st.subheader("1. Upload Invoice")
+        # 1. Allow Multiple Files
+        uploaded_files = st.file_uploader(
+            "Upload Invoices (Bulk Supported)",
+            type=["jpg", "jpeg", "png", "pdf"],
+            accept_multiple_files=True
         )
+        if "bulk_results" not in st.session_state:
+            st.session_state.bulk_results = None
+        if uploaded_files and st.button("✨ Process All Files", type="primary"):
+            all_results = []
+            progress_bar = st.progress(0)
+            status_text = st.empty()
+            with st.spinner(f"Processing {len(uploaded_files)} documents..."):
+                temp_dir = Path("temp")
+                temp_dir.mkdir(exist_ok=True)
+                for i, uploaded_file in enumerate(uploaded_files):
+                    status_text.text(f"Processing file {i+1}/{len(uploaded_files)}: {uploaded_file.name}")
+                    # Save temp file
+                    temp_path = temp_dir / uploaded_file.name
+                    with open(temp_path, "wb") as f:
+                        f.write(uploaded_file.getbuffer())
+                    # Run Pipeline
+                    try:
+                        # Use 'ml' method as per the requirement
+                        result = process_invoice(str(temp_path), method='ml')
+                        all_results.append(result)
+                    except Exception as e:
+                        st.error(f"Error processing {uploaded_file.name}: {e}")
+                    # Update Progress
+                    progress_bar.progress((i + 1) / len(uploaded_files))
+            st.success("✅ Bulk Processing Complete!")
+            st.session_state.bulk_results = all_results
+        if st.session_state.bulk_results:
+            # Generate Report
+            html_report = generate_bulk_html_report(st.session_state.bulk_results)
+            # Download Button for the HTML
+            st.download_button(
+                label="📥 Download Bulk HTML Report",
+                data=html_report,
+                file_name="bulk_invoice_report.html",
+                mime="text/html"
+            )
+            # Display Summary Table in UI
+            st.subheader("Summary")
+            df = pd.DataFrame(st.session_state.bulk_results)
+            if not df.empty:
+                # Select clean columns for display
+                cols = [c for c in ["vendor", "date", "total_amount", "validation_status"] if c in df.columns]
+                st.dataframe(df[cols], width='stretch')
+        # Preview first file (if any files selected)
+        if uploaded_files:
+            first_file = uploaded_files[0]
+            st.caption(f"Preview: {first_file.name}" + (f" (+{len(uploaded_files)-1} more)" if len(uploaded_files) > 1 else ""))
             # Handle PDF preview
+            if first_file.type == "application/pdf":
                 if PDF_SUPPORT:
+                    pdf_bytes = first_file.read()
+                    first_file.seek(0)  # Reset for later processing
                     pages = convert_from_bytes(pdf_bytes, first_page=1, last_page=1)
                     if pages:
                         pdf_preview_image = pages[0]
                 else:
                     st.warning("PDF preview requires pdf2image. Install with: `pip install pdf2image`")
             else:
+                image = Image.open(first_file)
+                first_file.seek(0)  # Reset for later processing
                 st.image(image, width=250, caption="Uploaded Invoice")
     with col_right:
         st.subheader("2. Extraction Results")
+        # Single-file extraction (original functionality)
+        # Works when exactly 1 file is uploaded
+        if uploaded_files and len(uploaded_files) == 1:
+            single_file = uploaded_files[0]
+            if st.button("✨ Extract Data", type="primary"):
+                with st.spinner("Running invoice extraction pipeline..."):
+                    try:
+                        temp_dir = Path("temp")
+                        temp_dir.mkdir(exist_ok=True)
+                        temp_path = temp_dir / single_file.name
+                        with open(temp_path, "wb") as f:
+                            f.write(single_file.getbuffer())
+                        method = "ml" if "ML" in extraction_method else "rules"
+                        # CALL PIPELINE
+                        result = process_invoice(str(temp_path), method=method)
+                        # --- SMART STATUS NOTIFICATIONS ---
+                        db_status = result.get('_db_status', 'disabled')
+                        if db_status == 'saved':
+                            st.success("✅ Extraction & Storage Complete")
+                            st.toast("Invoice saved to Database!", icon="💾")
+                        elif db_status == 'queued':
+                            st.success("✅ Extraction Complete")
+                            st.toast("Saving to database...", icon="💾")
+                        elif db_status == 'duplicate':
+                            st.success("✅ Extraction Complete")
+                            st.toast("Duplicate invoice (already in database)", icon="⚠️")
+                        elif db_status == 'disabled':
+                            st.success("✅ Extraction Complete")
+                            if not st.session_state.get('_db_warning_shown', False):
+                                st.toast("Database disabled (Demo Mode)", icon="ℹ️")
+                                st.session_state['_db_warning_shown'] = True
                         else:
+                            st.success("✅ Extraction Complete")
+                        # Hard guard
+                        if not isinstance(result, dict):
+                            st.error("Pipeline returned invalid data.")
+                            st.stop()
+                        if '_db_status' in result:
+                            del result['_db_status']
+                        st.session_state.data = result
+                        st.session_state.format_info = detect_invoice_format(
+                            result.get("raw_text", "")
+                        )
+                        st.session_state.processed_count += 1
+                        # --- AI Detection Overlay Visualization ---
+                        raw_predictions = result.get("raw_predictions")
+                        if raw_predictions:
+                            if single_file.type == "application/pdf":
+                                if "pdf_preview" in st.session_state:
+                                    overlay_image = st.session_state.pdf_preview.copy().convert("RGB")
+                                else:
+                                    overlay_image = None
+                            else:
+                                single_file.seek(0)
+                                overlay_image = Image.open(single_file).convert("RGB")
+                            if overlay_image:
+                                draw = ImageDraw.Draw(overlay_image)
+                                for entity_name, entity_data in raw_predictions.items():
+                                    bboxes = entity_data.get("bbox", [])
+                                    for box in bboxes:
+                                        x, y, w, h = box
+                                        draw.rectangle([x, y, x + w, y + h], outline="red", width=2)
+                                overlay_image.thumbnail((800, 800))
+                                st.image(overlay_image, caption="AI Detection Overlay", width="content")
+                    except Exception as e:
+                        st.error(f"Pipeline error: {e}")
         # -----------------------------
         # Render Results
             st.subheader("🛒 Line Items")
             items = data.get("items", [])
             if items:
+                st.dataframe(pd.DataFrame(items), width='stretch')
             else:
                 st.info("No line items extracted.")
                 mime="application/json"
             )
+            html_report = generate_bulk_html_report([data])
+            st.download_button(
+                "📥 Download HTML Report",
+                html_report,
+                file_name=f"invoice_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.html",
+                mime="text/html"
+            )
             with st.expander("📝 Raw OCR Text"):
                 st.text(data.get("raw_text", "No OCR text available"))
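
The overlay loop in this diff unpacks each box as `[x, y, width, height]` and passes corner coordinates to `draw.rectangle`. A small sketch of that conversion as a standalone function is shown below. The name `xywh_to_corners` and the clamping to image bounds are assumptions: `app.py` itself does not clamp, and the clamping here only echoes the "coordinate clamping" idea mentioned in the README.

```python
from typing import Sequence, Tuple


def xywh_to_corners(
    box: Sequence[float], image_size: Tuple[int, int]
) -> Tuple[int, int, int, int]:
    """Hypothetical helper: [x, y, w, h] -> (x0, y0, x1, y1), clamped to the image."""
    x, y, w, h = box
    img_w, img_h = image_size
    # Clamp each corner into [0, width] / [0, height] so negative or
    # overshooting OCR coordinates cannot fall outside the canvas.
    x0 = min(max(int(x), 0), img_w)
    y0 = min(max(int(y), 0), img_h)
    x1 = min(max(int(x + w), 0), img_w)
    y1 = min(max(int(y + h), 0), img_h)
    return x0, y0, x1, y1
```

The returned tuple is in the corner form that Pillow's `ImageDraw.rectangle` accepts, so it could replace the inline `[x, y, x + w, y + h]` expression.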

src/ml_extraction.py CHANGED

@@ -9,6 +9,7 @@ from typing import List, Dict, Any, Tuple
 import re
 import numpy as np
 from src.extraction import extract_invoice_number, extract_total, extract_address
 from doctr.io import DocumentFile
 from doctr.models import ocr_predictor
@@ -155,7 +156,6 @@ def _process_predictions(words, unnormalized_boxes, encoding, predictions, id2la
     return entities
 def extract_ml_based(image_path: str) -> Dict[str, Any]:
     if not MODEL or not PROCESSOR:
         raise RuntimeError("ML model is not loaded.")
@@ -176,7 +176,6 @@ def extract_ml_based(image_path: str) -> Dict[str, Any]:
     # Reconstructs lines so regex can work line-by-line
     lines = []
     current_line = []
     if len(unnormalized_boxes) > 0:
         # Initialize with first word's Y and Height
         current_y = unnormalized_boxes[0][1]
@@ -330,4 +329,11 @@ def extract_ml_based(image_path: str) -> Dict[str, Any]:
                 "bbox": [found_box]
             }
     return final_output

 import re
 import numpy as np
 from src.extraction import extract_invoice_number, extract_total, extract_address
+from src.table_extraction import extract_table_items
 from doctr.io import DocumentFile
 from doctr.models import ocr_predictor
     return entities
 def extract_ml_based(image_path: str) -> Dict[str, Any]:
     if not MODEL or not PROCESSOR:
         raise RuntimeError("ML model is not loaded.")
     # Reconstructs lines so regex can work line-by-line
     lines = []
     current_line = []
     if len(unnormalized_boxes) > 0:
         # Initialize with first word's Y and Height
         current_y = unnormalized_boxes[0][1]
                 "bbox": [found_box]
             }
+    # --- TABLE EXTRACTION (Geometric Heuristic) ---
+    # Use the geometric fallback to extract line items from table region
+    if words and unnormalized_boxes:
+        extracted_items = extract_table_items(words, unnormalized_boxes)
+        if extracted_items:
+            final_output["items"] = extracted_items
     return final_output

src/report_generator.py ADDED

	@@ -0,0 +1,298 @@

+# src/report_generator.py
+import os
+from datetime import datetime
+def generate_bulk_html_report(results: list, output_path: str = "bulk_report.html"):
+    """
+    Creates a single HTML report summarizing multiple invoices.
+    """
+    # Calculate summary stats
+    total_invoices = len(results)
+    total_value = sum(float(r.get('total_amount') or 0) for r in results)
+    passed_count = sum(1 for r in results if r.get('validation_status') == 'passed')
+    rows_html = ""
+    for idx, res in enumerate(results, 1):
+        # Create a mini-table for the items in this invoice
+        items_list = ""
+        for item in res.get("items", []):
+            total_val = item.get('total', 0)
+            try:
+                total_val = float(total_val)
+                items_list += f"<li>{item.get('description', 'Item')} <span class='item-price'>${total_val:.2f}</span></li>"
+            except (TypeError, ValueError):
+                items_list += f"<li>{item.get('description', 'Item')}</li>"
+        if not items_list:
+            items_list = "<li class='no-items'>No items detected</li>"
+        # Format total amount
+        total_amt = res.get('total_amount')
+        try:
+            total_display = f"${float(total_amt):,.2f}" if total_amt else "N/A"
+        except (TypeError, ValueError):
+            total_display = str(total_amt) if total_amt else "N/A"
+        status = res.get('validation_status') or 'unknown'
+        rows_html += f"""
+        <tr class="invoice-row">
+            <td class="row-num">{idx}</td>
+            <td class="vendor-cell">{res.get('vendor') or 'Unknown Vendor'}</td>
+            <td>{res.get('date') or 'N/A'}</td>
+            <td>{res.get('receipt_number') or 'N/A'}</td>
+            <td class="total-cell">{total_display}</td>
+            <td><ul class="item-list">{items_list}</ul></td>
+            <td><span class="badge badge-{status}">{status.title()}</span></td>
+        </tr>
+        """
+    html_content = f"""<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>Bulk Invoice Report - {datetime.now().strftime('%Y-%m-%d')}</title>
+    <style>
+        * {{ box-sizing: border-box; margin: 0; padding: 0; }}
+        body {{
+            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, sans-serif;
+            background: linear-gradient(135deg, #f5f7fa 0%, #e4e8ec 100%);
+            min-height: 100vh;
+            padding: 40px 20px;
+            color: #333;
+        }}
+        .container {{
+            max-width: 1400px;
+            margin: 0 auto;
+        }}
+        /* Header */
+        .report-header {{
+            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+            color: white;
+            padding: 30px 40px;
+            border-radius: 16px;
+            margin-bottom: 30px;
+            box-shadow: 0 10px 40px rgba(102, 126, 234, 0.3);
+        }}
+        .report-header h1 {{
+            font-size: 2rem;
+            font-weight: 700;
+            margin-bottom: 8px;
+        }}
+        .report-header .subtitle {{
+            opacity: 0.9;
+            font-size: 0.95rem;
+        }}
+        /* Stats Cards */
+        .stats-grid {{
+            display: grid;
+            grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
+            gap: 20px;
+            margin-bottom: 30px;
+        }}
+        .stat-card {{
+            background: white;
+            padding: 24px;
+            border-radius: 12px;
+            box-shadow: 0 4px 15px rgba(0,0,0,0.08);
+            text-align: center;
+        }}
+        .stat-card .stat-value {{
+            font-size: 2rem;
+            font-weight: 700;
+            color: #667eea;
+        }}
+        .stat-card .stat-label {{
+            font-size: 0.85rem;
+            color: #666;
+            text-transform: uppercase;
+            letter-spacing: 0.5px;
+            margin-top: 4px;
+        }}
+        /* Table */
+        .table-wrapper {{
+            background: white;
+            border-radius: 16px;
+            overflow: hidden;
+            box-shadow: 0 4px 20px rgba(0,0,0,0.1);
+        }}
+        table {{
+            width: 100%;
+            border-collapse: collapse;
+        }}
+        thead th {{
+            background: #2d3748;
+            color: white;
+            padding: 16px 12px;
+            text-align: left;
+            font-weight: 600;
+            font-size: 0.85rem;
+            text-transform: uppercase;
+            letter-spacing: 0.5px;
+        }}
+        tbody td {{
+            padding: 16px 12px;
+            border-bottom: 1px solid #e2e8f0;
+            vertical-align: top;
+        }}
+        tbody tr:nth-child(even) {{
+            background: #f8fafc;
+        }}
+        tbody tr:hover {{
+            background: #edf2f7;
+        }}
+        .row-num {{
+            color: #a0aec0;
+            font-weight: 600;
+            width: 50px;
+        }}
+        .vendor-cell {{
+            font-weight: 600;
+            color: #2d3748;
+        }}
+        .total-cell {{
+            font-weight: 700;
+            color: #38a169;
+            font-size: 1.05rem;
+        }}
+        /* Item List */
+        .item-list {{
+            list-style: none;
+            padding: 0;
+            margin: 0;
+            font-size: 0.85rem;
+        }}
+        .item-list li {{
+            padding: 4px 0;
+            color: #4a5568;
+            border-bottom: 1px dashed #e2e8f0;
+        }}
+        .item-list li:last-child {{
+            border-bottom: none;
+        }}
+        .item-list .item-price {{
+            float: right;
+            color: #667eea;
+            font-weight: 600;
+        }}
+        .item-list .no-items {{
+            color: #a0aec0;
+            font-style: italic;
+        }}
+        /* Badges */
+        .badge {{
+            display: inline-block;
+            padding: 6px 12px;
+            border-radius: 20px;
+            font-size: 0.75rem;
+            font-weight: 600;
+            text-transform: uppercase;
+            letter-spacing: 0.5px;
+        }}
+        .badge-passed {{
+            background: linear-gradient(135deg, #48bb78, #38a169);
+            color: white;
+        }}
+        .badge-failed {{
+            background: linear-gradient(135deg, #fc8181, #e53e3e);
+            color: white;
+        }}
+        .badge-unknown {{
+            background: #e2e8f0;
+            color: #4a5568;
+        }}
+        /* Footer */
+        .report-footer {{
+            text-align: center;
+            margin-top: 40px;
+            color: #718096;
+            font-size: 0.85rem;
+        }}
+        @media print {{
+            body {{ background: white; padding: 0; }}
+            .report-header {{ box-shadow: none; }}
+            .table-wrapper {{ box-shadow: none; }}
+        }}
+    </style>
+</head>
+<body>
+    <div class="container">
+        <header class="report-header">
+            <h1>🧾 Bulk Invoice Extraction Report</h1>
+            <p class="subtitle">Generated on {datetime.now().strftime('%B %d, %Y at %I:%M %p')}</p>
+        </header>
+        <div class="stats-grid">
+            <div class="stat-card">
+                <div class="stat-value">{total_invoices}</div>
+                <div class="stat-label">Total Invoices</div>
+            </div>
+            <div class="stat-card">
+                <div class="stat-value">${total_value:,.2f}</div>
+                <div class="stat-label">Total Value</div>
+            </div>
+            <div class="stat-card">
+                <div class="stat-value">{passed_count}/{total_invoices}</div>
+                <div class="stat-label">Validation Passed</div>
+            </div>
+        </div>
+        <div class="table-wrapper">
+            <table>
+                <thead>
+                    <tr>
+                        <th>#</th>
+                        <th>Vendor</th>
+                        <th>Date</th>
+                        <th>Invoice #</th>
+                        <th>Total</th>
+                        <th>Line Items</th>
+                        <th>Status</th>
+                    </tr>
+                </thead>
+                <tbody>
+                    {rows_html}
+                </tbody>
+            </table>
+        </div>
+        <footer class="report-footer">
+            <p>Generated by Smart Invoice Processor • Powered by LayoutLMv3 + DocTR</p>
+        </footer>
+    </div>
+</body>
+</html>"""
+    return html_content
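
`generate_bulk_html_report` interpolates OCR-derived strings (vendor, date, item descriptions) directly into the HTML. Since that text comes from uploaded documents, one option is to escape it first. The sketch below uses the standard library's `html.escape`; the helper name `safe_cell` is hypothetical and not part of this commit.

```python
from html import escape
from typing import Any


def safe_cell(value: Any, default: str = "N/A") -> str:
    """Hypothetical helper: HTML-escape a field, falling back to a default when empty."""
    if value is None or value == "":
        return default
    # quote=True also escapes single and double quotes, which matters if the
    # value is ever placed inside an HTML attribute.
    return escape(str(value), quote=True)
```

A call such as `safe_cell(res.get('vendor'), 'Unknown Vendor')` could then stand in for the `res.get('vendor') or 'Unknown Vendor'` expression inside the row template.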

src/table_extraction.py ADDED

	@@ -0,0 +1,144 @@

+# src/table_extraction.py
+from typing import List, Dict, Any
+import re
+# Common phrases that indicate NON-item text (should be filtered out)
+EXCLUDE_PHRASES = [
+    "thank you", "thank", "goods sold", "not returnable", "returnable",
+    "shopping at", "visit again", "customer copy", "merchant copy",
+    "powered by", "terms and conditions", "t&c apply", "cashier",
+    "counter", "sdn bhd", "bhd", "pte ltd", "pvt ltd", "llc", "inc",
+    "gst summary", "tax summary", "payment", "change", "cash",
+    "credit card", "debit card", "subtotal", "sub total", "grand total",
+    "total includes", "includes gst", "tax invoice", "invoice"
+]
+def extract_table_items(words: List[str], boxes: List[List[int]]) -> List[Dict[str, Any]]:
+    """
+    Geometric Heuristic to extract table rows.
+    Logic:
+    1. Find 'Header' Y-position (words like 'Description', 'Item', 'Qty').
+    2. Find 'Footer' Y-position (where 'Total' usually sits).
+    3. Filter all words strictly BETWEEN Header and Footer.
+    4. Group remaining words into 'Rows' based on similar Y-coordinates.
+    """
+    if not words or not boxes:
+        return []
+    # 1. Identify Anchor Points
+    header_y = 0
+    footer_y = float('inf')
+    header_keywords = ["description", "item", "particulars", "qty", "quantity", "price", "amount", "rate", "uom", "unit"]
+    footer_keywords = ["total", "subtotal", "tax", "grand total", "payment", "cash", "change", "gst summary", "tax summary"]
+    # Scan for Header (Top boundary)
+    for i, word in enumerate(words):
+        if word.lower() in header_keywords:
+            y_bottom = boxes[i][1] + boxes[i][3]
+            if y_bottom > header_y:
+                header_y = y_bottom
+    # Scan for Footer (Bottom boundary)
+    for i, word in enumerate(words):
+        if word.lower() in footer_keywords:
+            y_top = boxes[i][1]
+            if y_top < footer_y and y_top > header_y:
+                footer_y = y_top
+    # If no header found, assume top 25% is header
+    if header_y == 0 and boxes:
+        max_y = max(b[1] for b in boxes)
+        header_y = max_y * 0.25
+    # If no footer found, assume bottom 25% is footer
+    if footer_y == float('inf') and boxes:
+        max_y = max(b[1] for b in boxes)
+        footer_y = max_y * 0.75
+    # 2. Filter Content (The "Sandwich" Meat)
+    table_words = []
+    for i, word in enumerate(words):
+        bx, by, bw, bh = boxes[i]
+        if by > header_y and (by + bh) < footer_y:
+            table_words.append({"text": word, "box": boxes[i]})
+    # 3. Group by Rows (Y-clustering)
+    rows = []
+    if not table_words:
+        return []
+    table_words.sort(key=lambda x: x["box"][1])
+    current_row = [table_words[0]]
+    current_y = table_words[0]["box"][1]
+    for item in table_words[1:]:
+        y = item["box"][1]
+        if abs(y - current_y) < 15:
+            current_row.append(item)
+        else:
+            current_row.sort(key=lambda x: x["box"][0])
+            rows.append(current_row)
+            current_row = [item]
+            current_y = y
+    if current_row:
+        current_row.sort(key=lambda x: x["box"][0])
+        rows.append(current_row)
+    # 4. Convert Rows to Structured Dicts with FILTERING
+    structured_items = []
+    for row in rows:
+        full_text = " ".join([w["text"] for w in row])
+        full_text_lower = full_text.lower()
+        # Skip rows that match exclude phrases
+        if any(phrase in full_text_lower for phrase in EXCLUDE_PHRASES):
+            continue
+        # Skip very short rows (likely noise)
+        if len(full_text.strip()) < 3:
+            continue
+        # Find all numbers (potential prices)
+        # Match patterns like: 0.90, 12.50, 1234.56, 1,234.56
+        numbers = re.findall(r'\d+(?:,\d{3})*(?:\.\d+)?', full_text)
+        item_obj = {
+            "description": full_text,
+            "quantity": 1,
+            "unit_price": 0.0,
+            "total": 0.0
+        }
+        if numbers:
+            try:
+                # Clean and convert last number as price
+                val = float(numbers[-1].replace(',', ''))
+                # Skip non-positive prices (not a valid line item)
+                if val <= 0:
+                    continue
+                item_obj["total"] = val
+                item_obj["unit_price"] = val
+                # Remove the price from description
+                item_obj["description"] = full_text.replace(numbers[-1], "").strip()
+                # Skip if description is now empty or too short
+                if len(item_obj["description"].strip()) < 2:
+                    continue
+            except ValueError:
+                continue
+        else:
+            # No numbers found = not a valid line item
+            continue
+        structured_items.append(item_obj)
+    return structured_items
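
Step 3 of `extract_table_items` (grouping words into rows by similar Y-coordinates) can be tried in isolation on toy data. The sketch below restates that step as a standalone function; the name `group_rows`, the `y_tolerance` parameter, and the sample words are made up for illustration, while the 15-pixel default and the `[x, y, w, h]` box layout follow the diff above.

```python
from typing import Any, Dict, List


def group_rows(table_words: List[Dict[str, Any]], y_tolerance: int = 15) -> List[List[str]]:
    """Cluster words into rows by top-Y, then order each row left to right."""
    if not table_words:
        return []
    ordered = sorted(table_words, key=lambda w: w["box"][1])
    rows = []
    current = [ordered[0]]
    current_y = ordered[0]["box"][1]  # Y of the first word anchors the row
    for word in ordered[1:]:
        y = word["box"][1]
        if abs(y - current_y) < y_tolerance:
            current.append(word)
        else:
            rows.append(current)
            current = [word]
            current_y = y
    rows.append(current)
    # Sort each row by X so the text reads in visual order.
    return [[w["text"] for w in sorted(row, key=lambda w: w["box"][0])] for row in rows]


# Toy OCR output: two receipt lines, words deliberately out of order.
sample_words = [
    {"text": "2.50", "box": [300, 104, 40, 12]},
    {"text": "Coffee", "box": [20, 100, 60, 12]},
    {"text": "Bagel", "box": [20, 140, 50, 12]},
    {"text": "1.75", "box": [300, 143, 40, 12]},
]
print(group_rows(sample_words))  # [['Coffee', '2.50'], ['Bagel', '1.75']]
```

Because each row is anchored to the Y of its first word, a fixed pixel tolerance may merge or split rows on scans with very different resolutions; scaling the tolerance by the median word height would be one possible refinement.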