Spaces: GSoumyajit2005/invoice-processor-ml

GSoumyajit2005 committed on Dec 13, 2025

Commit aa4f954 (1 parent: 31af52a)

feat: Add .dockerignore, enhance UI to display receipt number and robustly handle bill-to, and update README with an additional dataset.

Files changed (4)

.dockerignore ADDED

+# Git
+.git
+.gitignore
+# Python
+__pycache__
+*.py[cod]
+*.pyo
+*.pyd
+.Python
+*.so
+.eggs
+*.egg-info
+.mypy_cache
+.pytest_cache
+# Virtual environments
+venv
+.venv
+env
+# IDE
+.vscode
+.idea
+*.swp
+*.swo
+# Data and outputs (large files)
+data/
+outputs/
+temp/
+# Tests (not needed in production)
+tests/
+# Documentation
+docs/
+*.md
+!README.md
+# Jupyter notebooks
+*.ipynb
+.ipynb_checkpoints
+# Docker
+Dockerfile
+docker-compose*.yml
+.dockerignore
+# Misc
+.env
+.env.*
+*.log
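The `*.md` / `!README.md` pair above relies on ordering: in a `.dockerignore`, the last pattern that matches a file decides whether it is excluded. The sketch below illustrates that rule for top-level file names only. It uses Python's `fnmatch` as a rough stand-in for Docker's matcher, and `is_ignored` is a hypothetical helper written for this illustration, not part of Docker or of this repository.

```python
from fnmatch import fnmatchcase
from typing import List


def is_ignored(name: str, patterns: List[str]) -> bool:
    """Approximate .dockerignore evaluation for a top-level file name.

    Assumption: fnmatch-style globbing is close enough for simple
    patterns such as "*.md"; Docker's real matcher differs in details.
    """
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(name, glob):
            # The last matching pattern wins; "!" re-includes the file.
            ignored = not negated
    return ignored


patterns = ["*.md", "!README.md"]
print(is_ignored("CHANGELOG.md", patterns))  # excluded from the build context
print(is_ignored("README.md", patterns))     # re-included by the "!" rule
```

Swapping the two patterns would exclude `README.md` again, because `*.md` would then be the last match.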

README.md CHANGED

@@ -278,7 +278,7 @@ invoice-processor-ml/
 │   └── pipeline.py             # Main orchestrator for the pipeline and CLI
 │
 │
-├── tests/ # <-- ADD THIS FOLDER
+├── tests/
 │ ├── test_preprocessing.py       # Tests for the preprocessing module
 │ ├── test_ocr.py                 # Tests for the OCR module
 │ └── test_pipeline.py            # End-to-end pipeline tests
@@ -292,7 +292,7 @@ invoice-processor-ml/
 - **Model**: `microsoft/layoutlmv3-base` (125M params)
 - **Task**:  Token Classification (NER) with 9 labels: `O, B/I-COMPANY, B/I-ADDRESS, B/I-DATE, B/I-TOTAL`
-- **Dataset**: SROIE (ICDAR 2019, English retail receipts)
+- **Dataset**: SROIE (ICDAR 2019, English retail receipts), mychen76/invoices-and-receipts_ocr_v1 (English)
 - **Training**: RTX 3050 6GB, PyTorch 2.x, Transformers 4.x
 - **Result**: Best F1 ≈ 0.922 on validation (epoch 5 saved)
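The README's "9 labels" figure follows from the BIO scheme: one `O` label plus a `B-` and an `I-` tag for each of the four entity types. A minimal sketch of that expansion is shown here. The label order and the `build_bio_labels` helper are assumptions made for illustration; the repository's actual label-to-id mapping may differ.

```python
from typing import Dict, List

# Entity types named in the README's task description.
ENTITY_TYPES = ["COMPANY", "ADDRESS", "DATE", "TOTAL"]


def build_bio_labels(entity_types: List[str]) -> List[str]:
    """Expand entity types into BIO tags: "O" first, then B-/I- pairs."""
    labels = ["O"]
    for entity in entity_types:
        labels.extend([f"B-{entity}", f"I-{entity}"])
    return labels


LABELS = build_bio_labels(ENTITY_TYPES)
# Hypothetical id mappings of the kind a token-classification model config uses.
label2id: Dict[str, int] = {label: i for i, label in enumerate(LABELS)}
id2label: Dict[int, str] = {i: label for label, i in label2id.items()}

print(len(LABELS))  # 1 + 2 * 4 = 9
```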

app.py CHANGED

@@ -228,7 +228,19 @@ with tab1:
             # Use an expander for longer text fields like address
             with st.expander("Show More Details"):
-                st.markdown(f"**👤 Bill To:** {data.get('bill_to', {}).get('name') if data.get('bill_to') else 'N/A'}")
+                # Handle receipt_number
+                st.markdown(f"**🧾 Receipt Number:** {data.get('receipt_number') or 'N/A'}")
+                # Handle bill_to (can be string from ML or dict from rules)
+                bill_to = data.get('bill_to')
+                if isinstance(bill_to, dict):
+                    bill_to_display = bill_to.get('name') or 'N/A'
+                elif isinstance(bill_to, str):
+                    bill_to_display = bill_to
+                else:
+                    bill_to_display = 'N/A'
+                st.markdown(f"**👤 Bill To:** {bill_to_display}")
                 st.markdown(f"**📍 Vendor Address:** {data.get('address') or 'N/A'}")
             # Line items table
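The `bill_to` branching in this hunk could also live in a small helper, which makes it testable outside Streamlit. The sketch below is one possible shape, not the committed code. `format_bill_to` is a hypothetical name, and treating empty or whitespace-only strings as `'N/A'` is an added assumption: the committed string branch displays the value as-is.

```python
from typing import Any


def format_bill_to(bill_to: Any) -> str:
    """Return a display string for bill_to, which may be a dict, a str, or None."""
    if isinstance(bill_to, dict):
        # Rule-based extraction is described as returning a dict with a 'name' key.
        name = bill_to.get("name")
        if isinstance(name, str) and name.strip():
            return name.strip()
        return "N/A"
    if isinstance(bill_to, str) and bill_to.strip():
        # ML extraction is described as returning a plain string.
        return bill_to.strip()
    # None, empty strings, and unexpected types all fall back to 'N/A'.
    return "N/A"


print(format_bill_to({"name": "Acme Corp"}))
print(format_bill_to("Acme Corp"))
print(format_bill_to(None))
```

With a helper like this, the UI line would reduce to a single `st.markdown` call on the helper's return value.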

src/extraction.py CHANGED

@@ -138,7 +138,6 @@ def extract_bill_to(text: str) -> Optional[Dict[str, str]]:
     return None
 def extract_line_items(text: str) -> List[Dict[str, Any]]:
-    # (Keeping your existing logic simple for now)
     return []
 def structure_output(text: str) -> Dict[str, Any]: