Commit e81f779 · 1 Parent(s): 26b5b24
docs: update README for DocTR migration and project structure
README.md
CHANGED
|
@@ -15,7 +15,7 @@ A production-grade Hybrid Invoice Extraction System that combines the semantic u
|
|
| 15 |
|
| 16 |

|
| 17 |

|
| 18 |
-

|
| 20 |

|
| 21 |
|
|
@@ -44,7 +44,7 @@ A production-grade Hybrid Invoice Extraction System that combines the semantic u
|
|
| 44 |
### 🛡️ Robustness & Engineering
|
| 45 |
|
| 46 |
- **Defensive Data Handling:** Implemented coordinate clamping to prevent model crashes from negative OCR bounding boxes.
|
| 47 |
-
- **
|
| 48 |
- **Clean JSON Output:** Normalized schema handling nested entities, line items, and validation flags.
|
| 49 |
|
| 50 |
### 💻 Usability
|
|
@@ -69,7 +69,7 @@ Standard ML models often fail on specific fields like "Invoice Number" if the la
|
|
| 69 |
|
| 70 |
### 2. Robustness & Error Handling
|
| 71 |
|
| 72 |
-
- **OCR Noise:**
|
| 73 |
- **Coordinate Normalization:** A custom `clamp()` function ensures all bounding boxes stay strictly within [0, 1000] to prevent Transformer index errors.
|
| 74 |
|
| 75 |
### 3. Dual-Engine Architecture
|
|
@@ -190,15 +190,7 @@ python -m venv venv
|
|
| 190 |
pip install -r requirements.txt
|
| 191 |
```
|
| 192 |
|
| 193 |
-
4.
|
| 194 |
-
|
| 195 |
-
- **Windows**: Download from [UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki)
|
| 196 |
-
- **Mac**: `brew install tesseract`
|
| 197 |
-
- **Linux**: `sudo apt install tesseract-ocr`
|
| 198 |
-
|
| 199 |
-
5. Tesseract Configuration (Auto-Detected) The system automatically detects Tesseract on both Windows (Registry/Standard Paths) and Linux (`/usr/bin/tesseract`). No manual configuration is required in `src/ocr.py`.
|
| 200 |
-
|
| 201 |
-
6. Run the web app
|
| 202 |
|
| 203 |
```bash
|
| 204 |
streamlit run app.py
|
|
@@ -282,7 +274,7 @@ print(json.dumps(result, indent=2))
|
|
| 282 |
│
|
| 283 |
▼
|
| 284 |
┌───────────────┐
|
| 285 |
-
│ OCR │ (
|
| 286 |
└───────┬───────┘
|
| 287 |
│
|
| 288 |
┌──────────────┴──────────────┐
|
|
@@ -321,31 +313,38 @@ invoice-processor-ml/
|
|
| 321 |
│ └── screenshots/ # UI Screenshots for the README demo
|
| 322 |
│
|
| 323 |
├── models/
|
| 324 |
-
│ └── layoutlmv3-
|
| 325 |
│
|
| 326 |
├── outputs/ # Default folder for saved JSON results
|
| 327 |
│
|
| 328 |
├── scripts/ # Training and analysis scripts
|
| 329 |
-
│ ├── train_combined.py # Main training loop (SROIE + Custom Data)
|
| 330 |
│ ├── eval_new_dataset.py # Evaluation scripts
|
| 331 |
-
│
|
| 332 |
│
|
| 333 |
├── src/
|
| 334 |
-
│ ├──
|
| 335 |
│ ├── data_loader.py # Unified data loader for training
|
| 336 |
-
│ ├── preprocessing.py # Image preprocessing functions (grayscale, denoise)
|
| 337 |
-
│ ├── ocr.py # Tesseract OCR integration
|
| 338 |
-
│ ├── extraction.py # Regex-based information extraction logic
|
| 339 |
-
│ ├── ml_extraction.py # ML-based extraction (LayoutLMv3)
|
| 340 |
-
│ ├── pipeline.py # Main orchestrator for the pipeline and CLI
|
| 341 |
│ ├── database.py # PostgreSQL connection (scaffolded)
|
| 342 |
│ ├── models.py # SQLModel tables for persistence (scaffolded)
|
| 343 |
-
│
|
| 344 |
│
|
| 345 |
├── tests/
|
| 346 |
-
│ ├──
|
|
|
|
| 347 |
│ ├── test_ocr.py # Tests for the OCR module
|
| 348 |
-
│
|
|
|
|
| 349 |
│
|
| 350 |
├── app.py # Streamlit web interface
|
| 351 |
├── requirements.txt # Python dependencies
|
|
@@ -358,26 +357,25 @@ invoice-processor-ml/
|
|
| 358 |
- **Task**: Token Classification (NER) with 9 labels: `O, B/I-COMPANY, B/I-ADDRESS, B/I-DATE, B/I-TOTAL`
|
| 359 |
- **Dataset**: SROIE (ICDAR 2019, English retail receipts), mychen76/invoices-and-receipts_ocr_v1 (English)
|
| 360 |
- **Training**: RTX 3050 6GB, PyTorch 2.x, Transformers 4.x
|
| 361 |
-
- **Result**:
|
| 362 |
|
| 363 |
- Training scripts (local):
|
| 364 |
- `scripts/train_combined.py` (data prep, training loop with validation + model save)
|
| 365 |
-
- Model saved to: `models/layoutlmv3-
|
| 366 |
|
| 367 |
## 📈 Performance
|
| 368 |
|
| 369 |
-
- **OCR
|
| 370 |
-
- **
|
| 371 |
-
- **
|
| 372 |
-
-
|
| 373 |
-
- Complex business invoices: Partial extraction unless further fine-tuned
|
| 374 |
|
| 375 |
## ⚠️ Known Limitations
|
| 376 |
|
| 377 |
1. **Layout Sensitivity**: The ML model was fine‑tuned on SROIE (retail receipts) and mychen76/invoices-and-receipts_ocr_v1 (English). Professional multi-column invoices may underperform until you fine‑tune on more diverse datasets.
|
| 378 |
2. **Invoice Number**: SROIE dataset lacks invoice number labels. The system solves this by using the Hybrid Fallback Engine, which successfully extracts invoice numbers using Regex whenever the ML model output is empty.
|
| 379 |
3. **Line Items/Tables**: Not trained for table extraction yet. Rule-based supports simple totals; table extraction comes later.
|
| 380 |
-
4. **
|
| 381 |
|
| 382 |
## 🔮 Future Enhancements
|
| 383 |
|
|
@@ -396,7 +394,7 @@ invoice-processor-ml/
|
|
| 396 |
|
| 397 |
| Component | Technology |
|
| 398 |
| ---------------- | ----------------------------------- |
|
| 399 |
-
| OCR |
|
| 400 |
| Image Processing | OpenCV, Pillow |
|
| 401 |
| ML/NLP | PyTorch 2.x, Transformers |
|
| 402 |
| Model | LayoutLMv3 (token class.) |
|
|
|
|
| 15 |
|
| 16 |

|
| 17 |

|
| 18 |
+

|
| 19 |

|
| 20 |

|
| 21 |
|
|
|
|
| 44 |
### 🛡️ Robustness & Engineering
|
| 45 |
|
| 46 |
- **Defensive Data Handling:** Implemented coordinate clamping to prevent model crashes from negative OCR bounding boxes.
|
| 47 |
+
- **GPU-Accelerated OCR:** DocTR (Mindee) with automatic CUDA acceleration for faster inference in production.
|
| 48 |
- **Clean JSON Output:** Normalized schema handling nested entities, line items, and validation flags.
|
| 49 |
|
| 50 |
### 💻 Usability
|
|
|
|
| 69 |
|
| 70 |
### 2. Robustness & Error Handling
|
| 71 |
|
| 72 |
+
- **OCR Noise:** Uses DocTR's deep learning-based text recognition for improved accuracy over traditional OCR.
|
| 73 |
- **Coordinate Normalization:** A custom `clamp()` function ensures all bounding boxes stay strictly within [0, 1000] to prevent Transformer index errors.
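
The coordinate clamping described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the README only names a custom `clamp()` function, so the signature shown here and the `normalize_box` helper are assumptions made for the example.

```python
from typing import List, Sequence


def clamp(value: float, low: int = 0, high: int = 1000) -> int:
    """Force a single coordinate into the closed range [low, high]."""
    return int(max(low, min(high, value)))


def normalize_box(box: Sequence[float], width: float, height: float) -> List[int]:
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 grid, then clamp it.

    Hypothetical helper: OCR engines can report slightly negative or
    out-of-page coordinates, which would otherwise fall outside the range
    the layout model's position embeddings can index.
    """
    x0, y0, x1, y1 = box
    scaled = [
        1000 * x0 / width,
        1000 * y0 / height,
        1000 * x1 / width,
        1000 * y1 / height,
    ]
    return [clamp(v) for v in scaled]


if __name__ == "__main__":
    # A box that starts left of the page and ends past its right edge.
    print(normalize_box((-5, 10, 120, 50), width=100, height=100))
```

Clamping after scaling keeps the check in one place, so any box that reaches the model is already inside [0, 1000] regardless of which OCR engine produced it.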
|
| 74 |
|
| 75 |
### 3. Dual-Engine Architecture
|
|
|
|
| 190 |
pip install -r requirements.txt
|
| 191 |
```
|
| 192 |
|
| 193 |
+
4. Run the web app
|
| 194 |
|
| 195 |
```bash
|
| 196 |
streamlit run app.py
|
|
|
|
| 274 |
│
|
| 275 |
▼
|
| 276 |
┌───────────────┐
|
| 277 |
+
│ OCR │ (DocTR)
|
| 278 |
└───────┬───────┘
|
| 279 |
│
|
| 280 |
┌──────────────┴──────────────┐
|
|
|
|
| 313 |
│ └── screenshots/ # UI Screenshots for the README demo
|
| 314 |
│
|
| 315 |
├── models/
|
| 316 |
+
│ └── layoutlmv3-doctr-trained/ # Fine-tuned model (trained with DocTR OCR)
|
| 317 |
│
|
| 318 |
├── outputs/ # Default folder for saved JSON results
|
| 319 |
│
|
| 320 |
├── scripts/ # Training and analysis scripts
|
|
|
|
| 321 |
│ ├── eval_new_dataset.py # Evaluation scripts
|
| 322 |
+
│ ├── explore_new_dataset.py # Dataset exploration tools
|
| 323 |
+
│ ├── prepare_doctr_data.py # DocTR data alignment for training
|
| 324 |
+
│ ├── train_combined.py # Main training loop (SROIE + Custom Data)
|
| 325 |
+
│ └── train_layoutlm.py # LayoutLMv3 fine-tuning script
|
| 326 |
│
|
| 327 |
├── src/
|
| 328 |
+
│ ├── api.py # FastAPI REST endpoint for API access
|
| 329 |
│ ├── data_loader.py # Unified data loader for training
|
| 330 |
│ ├── database.py # PostgreSQL connection (scaffolded)
|
| 331 |
+
│ ├── extraction.py # Regex-based information extraction logic
|
| 332 |
+
│ ├── ml_extraction.py # ML-based extraction (LayoutLMv3 + DocTR)
|
| 333 |
│ ├── models.py # SQLModel tables for persistence (scaffolded)
|
| 334 |
+
│ ├── pdf_utils.py # PDF text extraction and image conversion
|
| 335 |
+
│ ├── pipeline.py # Main orchestrator for the pipeline and CLI
|
| 336 |
+
│ ├── preprocessing.py # Image preprocessing functions (grayscale, denoise)
|
| 337 |
+
│ ├── repository.py # CRUD operations for invoices (scaffolded)
|
| 338 |
+
│ ├── schema.py # Pydantic models for API response validation
|
| 339 |
+
│ ├── sroie_loader.py # SROIE dataset loading logic
|
| 340 |
+
│ └── utils.py # Utility functions (semantic hashing, etc.)
|
| 341 |
│
|
| 342 |
├── tests/
|
| 343 |
+
│ ├── test_extraction.py # Tests for regex extraction module
|
| 344 |
+
│ ├── test_full_pipeline.py # Full end-to-end integration tests
|
| 345 |
│ ├── test_ocr.py # Tests for the OCR module
|
| 346 |
+
│ ├── test_pipeline.py # Pipeline process tests
|
| 347 |
+
│ └── test_preprocessing.py # Tests for the preprocessing module
|
| 348 |
│
|
| 349 |
├── app.py # Streamlit web interface
|
| 350 |
├── requirements.txt # Python dependencies
|
|
|
|
| 357 |
- **Task**: Token Classification (NER) with 9 labels: `O, B/I-COMPANY, B/I-ADDRESS, B/I-DATE, B/I-TOTAL`
|
| 358 |
- **Dataset**: SROIE (ICDAR 2019, English retail receipts), mychen76/invoices-and-receipts_ocr_v1 (English)
|
| 359 |
- **Training**: RTX 3050 6GB, PyTorch 2.x, Transformers 4.x
|
| 360 |
+
- **Result**: F1 Score ≈ 0.83 (Real-world performance on DocTR-aligned validation set)
|
| 361 |
|
| 362 |
- Training scripts (local):
|
| 363 |
- `scripts/train_combined.py` (data prep, training loop with validation + model save)
|
| 364 |
+
- Model saved to: `models/layoutlmv3-doctr-trained/`
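
The nine labels listed under **Task** follow from BIO tagging over four entity types (one `O` label plus a `B-` and `I-` label per entity). A small sketch of how such a label map could be built is shown below; the ordering and the `label2id` / `id2label` names are assumptions for illustration and may not match the training scripts.

```python
# Entity types taken from the README's label list.
ENTITY_TYPES = ["COMPANY", "ADDRESS", "DATE", "TOTAL"]

# "O" marks tokens outside any entity; each entity gets a Begin and Inside tag.
# The ordering here is an assumption, not necessarily the project's own.
LABELS = ["O"] + [
    f"{prefix}-{entity}" for entity in ENTITY_TYPES for prefix in ("B", "I")
]

label2id = {label: idx for idx, label in enumerate(LABELS)}
id2label = {idx: label for label, idx in label2id.items()}

if __name__ == "__main__":
    print(len(LABELS), LABELS)
```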
|
| 365 |
|
| 366 |
## 📈 Performance
|
| 367 |
|
| 368 |
+
- **OCR Precision**: State-of-the-art hierarchical detection using **DocTR (ResNet-50)**. Outperforms Tesseract on complex/noisy layouts.
|
| 369 |
+
- **ML-based Extraction**:
|
| 370 |
+
- **Accuracy**: ~83% F1 Score on SROIE + Custom Dataset.
|
| 371 |
+
- **Speed**: ~5-7s per invoice (CPU) / <1s (GPU). Prioritizes high precision over raw speed.
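
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below shows the arithmetic only; the counts are made-up illustrative numbers chosen to land on 0.83, not the project's evaluation data, and the function name is hypothetical.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from true positives, false positives and false negatives.

    Equivalent to 2 * P * R / (P + R) with P = tp / (tp + fp) and
    R = tp / (tp + fn), written in the form that avoids dividing twice.
    """
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


if __name__ == "__main__":
    # Illustrative counts only: 83 correct entities, 17 spurious, 17 missed.
    print(round(f1_from_counts(tp=83, fp=17, fn=17), 2))
```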
|
|
|
|
| 372 |
|
| 373 |
## ⚠️ Known Limitations
|
| 374 |
|
| 375 |
1. **Layout Sensitivity**: The ML model was fine‑tuned on SROIE (retail receipts) and mychen76/invoices-and-receipts_ocr_v1 (English). Professional multi-column invoices may underperform until you fine‑tune on more diverse datasets.
|
| 376 |
2. **Invoice Number**: SROIE dataset lacks invoice number labels. The system solves this by using the Hybrid Fallback Engine, which successfully extracts invoice numbers using Regex whenever the ML model output is empty.
|
| 377 |
3. **Line Items/Tables**: Not trained for table extraction yet. Rule-based supports simple totals; table extraction comes later.
|
| 378 |
+
4. **Inference Latency**: The heavy **DocTR (ResNet-50)** backbone favors accuracy but increases inference time (~5-7s per invoice on CPU) compared with lightweight OCR engines.
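
The regex fallback mentioned in limitation 2 can be illustrated with a short sketch. Everything here is an assumption made for the example: the pattern, the `invoice_number` key, and the function names are hypothetical and are not taken from `src/extraction.py` or `src/pipeline.py`.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern: "Invoice No", "Invoice Number" or "Invoice #",
# an optional separator, then an alphanumeric identifier.
INVOICE_NO_PATTERN = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-/]*)",
    re.IGNORECASE,
)


def regex_invoice_number(ocr_text: str) -> Optional[str]:
    """Return the first invoice-number-like match in the OCR text, if any."""
    match = INVOICE_NO_PATTERN.search(ocr_text)
    return match.group(1) if match else None


def merge_with_fallback(
    ml_fields: Dict[str, Optional[str]], ocr_text: str
) -> Dict[str, Optional[str]]:
    """Keep the ML output, filling invoice_number from regex only when empty."""
    result = dict(ml_fields)
    if not result.get("invoice_number"):
        result["invoice_number"] = regex_invoice_number(ocr_text)
    return result


if __name__ == "__main__":
    text = "ACME Ltd\nInvoice No: INV-2024-001\nTotal: 45.00"
    print(merge_with_fallback({"invoice_number": "", "total": "45.00"}, text))
```

The ML result always wins when it is non-empty, so the regex path only covers fields the model was never trained to label.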
|
| 379 |
|
| 380 |
## 🔮 Future Enhancements
|
| 381 |
|
|
|
|
| 394 |
|
| 395 |
| Component | Technology |
|
| 396 |
| ---------------- | ----------------------------------- |
|
| 397 |
+
| OCR | DocTR (Mindee) |
|
| 398 |
| Image Processing | OpenCV, Pillow |
|
| 399 |
| ML/NLP | PyTorch 2.x, Transformers |
|
| 400 |
| Model | LayoutLMv3 (token class.) |
|