GSoumyajit2005 committed on
Commit e81f779 · 1 Parent(s): 26b5b24

docs: update README for DocTR migration and project structure

Files changed (1)
  1. README.md +32 -34
README.md CHANGED
@@ -15,7 +15,7 @@ A production-grade Hybrid Invoice Extraction System that combines the semantic u
 
 ![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)
 ![Streamlit](https://img.shields.io/badge/Streamlit-1.51+-red.svg)
- ![Tesseract](https://img.shields.io/badge/Tesseract-5.0+-green.svg)
+ ![DocTR](https://img.shields.io/badge/DocTR-0.9+-green.svg)
 ![Transformers](https://img.shields.io/badge/Transformers-4.x-purple.svg)
 ![PyTorch](https://img.shields.io/badge/PyTorch-2.x-orange.svg)
 
@@ -44,7 +44,7 @@ A production-grade Hybrid Invoice Extraction System that combines the semantic u
 ### 🛡️ Robustness & Engineering
 
 - **Defensive Data Handling:** Implemented coordinate clamping to prevent model crashes from negative OCR bounding boxes.
- - **Cross-Platform OCR:** Dynamic Tesseract path discovery that works out-of-the-box on Windows (Local) and Linux (Docker/Production).
+ - **GPU-Accelerated OCR:** DocTR (Mindee) with automatic CUDA acceleration for faster inference in production.
 - **Clean JSON Output:** Normalized schema handling nested entities, line items, and validation flags.
 
 ### 💻 Usability
@@ -69,7 +69,7 @@ Standard ML models often fail on specific fields like "Invoice Number" if the la
 
 ### 2. Robustness & Error Handling
 
- - **OCR Noise:** Handles common Tesseract errors (e.g., reading "1nvoice" as "Invoice").
+ - **OCR Noise:** Uses DocTR's deep learning-based text recognition for improved accuracy over traditional OCR.
 - **Coordinate Normalization:** A custom `clamp()` function ensures all bounding boxes stay strictly within [0, 1000] to prevent Transformer index errors.
 
 ### 3. Dual-Engine Architecture
@@ -190,15 +190,7 @@ python -m venv venv
 pip install -r requirements.txt
 ```
 
- 4. Install Tesseract OCR
-
- - **Windows**: Download from [UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki)
- - **Mac**: `brew install tesseract`
- - **Linux**: `sudo apt install tesseract-ocr`
-
- 5. Tesseract Configuration (Auto-Detected) The system automatically detects Tesseract on both Windows (Registry/Standard Paths) and Linux (`/usr/bin/tesseract`). No manual configuration is required in `src/ocr.py`.
-
- 6. Run the web app
+ 4. Run the web app
 
 ```bash
 streamlit run app.py
@@ -282,7 +274,7 @@ print(json.dumps(result, indent=2))
 
 ┌───────────────┐
- │ OCR │ (Tesseract)
+ │ OCR │ (DocTR)
 └───────┬───────┘
 
 ┌──────────────┴──────────────┐
@@ -321,31 +313,38 @@ invoice-processor-ml/
 │ └── screenshots/ # UI Screenshots for the README demo
 
 ├── models/
- │ └── layoutlmv3-sroie-generalized/ # Fine-tuned model (created after training)
+ │ └── layoutlmv3-doctr-trained/ # Fine-tuned model (trained with DocTR OCR)
 
 ├── outputs/ # Default folder for saved JSON results
 
 ├── scripts/ # Training and analysis scripts
- │ ├── train_combined.py # Main training loop (SROIE + Custom Data)
 │ ├── eval_new_dataset.py # Evaluation scripts
- │ └── explore_new_dataset.py # Dataset exploration tools
+ │ ├── explore_new_dataset.py # Dataset exploration tools
+ │ ├── prepare_doctr_data.py # DocTR data alignment for training
+ │ ├── train_combined.py # Main training loop (SROIE + Custom Data)
+ │ └── train_layoutlm.py # LayoutLMv3 fine-tuning script
 
 ├── src/
- │ ├── sroie_loader.py # SROIE dataset loading logic
+ │ ├── api.py # FastAPI REST endpoint for API access
 │ ├── data_loader.py # Unified data loader for training
- │ ├── preprocessing.py # Image preprocessing functions (grayscale, denoise)
- │ ├── ocr.py # Tesseract OCR integration
- │ ├── extraction.py # Regex-based information extraction logic
- │ ├── ml_extraction.py # ML-based extraction (LayoutLMv3)
- │ ├── pipeline.py # Main orchestrator for the pipeline and CLI
 │ ├── database.py # PostgreSQL connection (scaffolded)
+ │ ├── extraction.py # Regex-based information extraction logic
+ │ ├── ml_extraction.py # ML-based extraction (LayoutLMv3 + DocTR)
 │ ├── models.py # SQLModel tables for persistence (scaffolded)
- │ └── repository.py # CRUD operations for invoices (scaffolded)
+ │ ├── pdf_utils.py # PDF text extraction and image conversion
+ │ ├── pipeline.py # Main orchestrator for the pipeline and CLI
+ │ ├── preprocessing.py # Image preprocessing functions (grayscale, denoise)
+ │ ├── repository.py # CRUD operations for invoices (scaffolded)
+ │ ├── schema.py # Pydantic models for API response validation
+ │ ├── sroie_loader.py # SROIE dataset loading logic
+ │ └── utils.py # Utility functions (semantic hashing, etc.)
 
 ├── tests/
- │ ├── test_preprocessing.py # Tests for the preprocessing module
+ │ ├── test_extraction.py # Tests for regex extraction module
+ │ ├── test_full_pipeline.py # Full end-to-end integration tests
 │ ├── test_ocr.py # Tests for the OCR module
- │ └── test_pipeline.py # End-to-end pipeline tests
+ │ ├── test_pipeline.py # Pipeline process tests
+ │ └── test_preprocessing.py # Tests for the preprocessing module
 
 ├── app.py # Streamlit web interface
 ├── requirements.txt # Python dependencies
@@ -358,26 +357,25 @@ invoice-processor-ml/
 - **Task**: Token Classification (NER) with 9 labels: `O, B/I-COMPANY, B/I-ADDRESS, B/I-DATE, B/I-TOTAL`
 - **Dataset**: SROIE (ICDAR 2019, English retail receipts), mychen76/invoices-and-receipts_ocr_v1 (English)
 - **Training**: RTX 3050 6GB, PyTorch 2.x, Transformers 4.x
- - **Result**: Best F1 ≈ 0.922 on validation (epoch 5 saved)
+ - **Result**: F1 Score ≈ 0.83 (real-world performance on the DocTR-aligned validation set)
 
 - Training scripts (local):
 - `scripts/train_combined.py` (data prep, training loop with validation + model save)
- - Model saved to: `models/layoutlmv3-sroie-generalized/`
+ - Model saved to: `models/layoutlmv3-doctr-trained/`
 
 ## 📈 Performance
 
- - **OCR accuracy (clear images)**: High with Tesseract
- - **Rule-based extraction**: Strong on simple retail receipts
- - **ML-based extraction (SROIE-style)**:
- - COMPANY / ADDRESS / DATE / TOTAL: High F1 on simple receipts
- - Complex business invoices: Partial extraction unless further fine-tuned
+ - **OCR Precision**: State-of-the-art hierarchical detection using **DocTR (ResNet-50)**. Outperforms Tesseract on complex/noisy layouts.
+ - **ML-based Extraction**:
+ - **Accuracy**: ~83% F1 Score on SROIE + Custom Dataset.
+ - **Speed**: ~5-7s per invoice (CPU) / <1s (GPU). Prioritizes high precision over raw speed.
 
 ## ⚠️ Known Limitations
 
 1. **Layout Sensitivity**: The ML model was fine‑tuned on SROIE (retail receipts) and mychen76/invoices-and-receipts_ocr_v1 (English). Professional multi-column invoices may underperform until you fine‑tune on more diverse datasets.
 2. **Invoice Number**: SROIE dataset lacks invoice number labels. The system solves this by using the Hybrid Fallback Engine, which successfully extracts invoice numbers using Regex whenever the ML model output is empty.
 3. **Line Items/Tables**: Not trained for table extraction yet. Rule-based supports simple totals; table extraction comes later.
- 4. **OCR Variability**: Tesseract outputs can vary; preprocessing and thresholds can impact ML results.
+ 4. **Inference Latency**: Using the heavy **DocTR (ResNet-50)** backbone ensures maximum accuracy but results in higher inference time (~5-7s on CPU) compared to lightweight engines.
 
 ## 🔮 Future Enhancements
 
@@ -396,7 +394,7 @@ invoice-processor-ml/
 
 | Component | Technology |
 | ---------------- | ----------------------------------- |
- | OCR | Tesseract 5.0+ |
+ | OCR | DocTR (Mindee) |
 | Image Processing | OpenCV, Pillow |
 | ML/NLP | PyTorch 2.x, Transformers |
 | Model | LayoutLMv3 (token class.) |
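
The `clamp()` coordinate normalization the README describes can be sketched as follows. This is a minimal illustration, not the repository's actual code: the README confirms a `clamp()` function and the [0, 1000] range, but the `normalize_box` helper and box format shown here are assumptions.

```python
def clamp(value: int, low: int = 0, high: int = 1000) -> int:
    """Clamp a coordinate into the [low, high] grid LayoutLMv3 expects."""
    return max(low, min(high, value))


def normalize_box(box, width, height):
    """Scale an absolute OCR box (x0, y0, x1, y1) to the 0-1000 grid,
    clamping so negative or out-of-page coordinates cannot crash the model."""
    x0, y0, x1, y1 = box
    return [
        clamp(int(1000 * x0 / width)),
        clamp(int(1000 * y0 / height)),
        clamp(int(1000 * x1 / width)),
        clamp(int(1000 * y1 / height)),
    ]
```

Clamping matters because OCR engines occasionally emit slightly negative or out-of-page coordinates, which would otherwise raise an index error inside the Transformer's 2D position embedding.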
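
The Hybrid Fallback Engine mentioned under Known Limitations (regex fills in the invoice number whenever the ML output is empty) could look roughly like this. The pattern and function names here are hypothetical; the repository's actual regex lives in `src/extraction.py` and may differ.

```python
import re

# Hypothetical pattern: matches "Invoice No: INV-2024-001", "INVOICE #1234", etc.
INVOICE_NO_RE = re.compile(
    r"(?:invoice|inv)\s*(?:no\.?|number|#)?\s*[:#]?\s*([A-Z0-9][A-Z0-9/-]+)",
    re.IGNORECASE,
)


def fallback_invoice_number(ml_result: dict, ocr_text: str) -> dict:
    """Only fill the field when the ML engine returned nothing for it,
    matching the hybrid design: ML first, regex as the safety net."""
    if not ml_result.get("invoice_number"):
        match = INVOICE_NO_RE.search(ocr_text)
        if match:
            ml_result["invoice_number"] = match.group(1)
    return ml_result
```

The key design point is the guard clause: regex never overrides a non-empty ML prediction, so the deep model keeps priority on layouts it was trained for.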
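
DocTR exports word geometry as coordinates relative to the page (values in [0, 1]), while LayoutLMv3 expects integer boxes on a 0-1000 grid, so the bridge between the two engines is a small scaling step. A sketch under that assumption (the nested pages/blocks/lines/words layout follows DocTR's documented `result.export()` JSON; the function name is made up):

```python
def words_from_doctr_export(export: dict) -> list[tuple[str, list[int]]]:
    """Flatten a DocTR export dict into (word, 0-1000 box) pairs.

    DocTR geometry is ((x_min, y_min), (x_max, y_max)) relative to the
    page, so scaling to LayoutLMv3's grid is a multiply-and-round.
    """
    words = []
    for page in export["pages"]:
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    (x0, y0), (x1, y1) = word["geometry"]
                    box = [int(round(1000 * v)) for v in (x0, y0, x1, y1)]
                    words.append((word["value"], box))
    return words
```

Using `round` rather than truncation avoids off-by-one drift from floating-point products like `0.3 * 1000`.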