Commit · 1144bea
1 Parent(s): 4768ab6
docs: Update README to reflect new project structure
README.md CHANGED
@@ -28,6 +28,7 @@ A production-grade Hybrid Invoice Extraction System that combines the semantic u
28   - **Auto-Validation:** Heuristic checks to validate that the extracted "Total Amount" matches the sum of line items.
29
30   > Note on Invoice Numbers: The SROIE dataset used for training does not include "Invoice Number" labels. To solve this, the system uses a Hybrid Fallback Mechanism: if the ML model (LayoutLMv3) returns null for the Invoice Number, the system automatically triggers a targeted Regex extraction to ensure this critical field is captured.
31   ---
32
33   ## 🛠️ Technical Deep Dive (Why this architecture?)
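The hybrid fallback described in the note on invoice numbers above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the function names, the `invoice_number` key, and the regex pattern are all hypothetical, and real invoices would likely need more label variants.

```python
import re
from typing import Optional

# Hypothetical pattern: a label such as "Invoice No", "Invoice #" or "Inv No",
# followed by an identifier that must contain at least one digit, so that
# ordinary words after "INVOICE" (for example "Date") are not captured.
INVOICE_NUMBER_PATTERN = re.compile(
    r"\b(?:invoice|inv)\s*(?:no|number|num|#)?\s*[.:#]?\s*"
    r"((?=[A-Z0-9\-/]*\d)[A-Z0-9][A-Z0-9\-/]{2,})",
    re.IGNORECASE,
)


def extract_invoice_number_regex(ocr_text: str) -> Optional[str]:
    """Return the first invoice-number-like token in the OCR text, or None."""
    match = INVOICE_NUMBER_PATTERN.search(ocr_text)
    return match.group(1) if match else None


def apply_invoice_number_fallback(ml_fields: dict, ocr_text: str) -> dict:
    """Fill `invoice_number` from regex only when the ML result left it empty."""
    result = dict(ml_fields)  # copy so the caller's dict is not mutated
    if result.get("invoice_number") is None:
        result["invoice_number"] = extract_invoice_number_regex(ocr_text)
    return result
```

The ML prediction always wins when it is present; the regex runs only on a null result, which keeps the fallback from overriding a field the model did extract.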
@@ -60,6 +61,7 @@ The system outputs a clean JSON with the following fields:
60   - `total_amount`: The total amount (extracted by LayoutLMv3 or Regex).
61   - `extraction_confidence`: The confidence of the extraction (0-100).
62   - `validation_passed`: Whether the validation passed (true/false).
63   ---
64
65   ## 📊 Demo
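To make the three fields above concrete, here is a minimal sketch of how such a record might be assembled, with `validation_passed` derived from the "total matches the sum of line items" heuristic mentioned in the README. The helper names, the list of line-item amounts, and the 0.01 tolerance are assumptions for illustration only; the project's real validation rules are not shown in this diff.

```python
from decimal import Decimal
from typing import Iterable, Optional


def totals_match(
    total_amount: Optional[float],
    line_item_amounts: Iterable[float],
    tolerance: Decimal = Decimal("0.01"),
) -> bool:
    """Check that the extracted total equals the sum of line items within a tolerance."""
    items = [Decimal(str(amount)) for amount in line_item_amounts]
    if total_amount is None or not items:
        return False  # nothing to compare against, so validation cannot pass
    # Decimal avoids binary floating-point drift such as 0.1 + 0.2 != 0.3.
    return abs(Decimal(str(total_amount)) - sum(items)) <= tolerance


def build_result(total_amount: Optional[float], line_item_amounts: Iterable[float], confidence: float) -> dict:
    """Assemble an output record with the three fields listed above."""
    return {
        "total_amount": total_amount,
        # Clamp to the documented 0-100 range.
        "extraction_confidence": max(0.0, min(100.0, confidence)),
        "validation_passed": totals_match(total_amount, line_item_amounts),
    }
```

For example, `build_result(30.0, [10.0, 20.0], 87.5)` yields a record whose `validation_passed` is true, while a total that differs from the line-item sum by more than the tolerance yields false.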
@@ -93,6 +95,7 @@ The system outputs a clean JSON with the following fields:
93   "address": "NO JALAN BAYU 4, BANDAR SERI ALAM, 81750 MASAI, JOHOR"
94   }
95   ```
96   ### Example JSON (ML-based)
97   ```json
98   {
@@ -141,6 +144,7 @@ source venv/bin/activate
141   python -m venv venv
142   .\venv\Scripts\activate
143   ```
144   3. Install dependencies
145   ```bash
146   pip install -r requirements.txt
@@ -158,6 +162,14 @@ pip install -r requirements.txt
158   streamlit run app.py
159   ```
160
161   ## 💻 Usage
162
163   ### Web Interface (Recommended)
@@ -257,31 +269,35 @@ invoice-processor-ml/
257   │ ├── raw/ # Input invoice images for processing
258   │ └── processed/ # (Reserved for future use)
259   │
260 - │
261   ├── data/samples/
262   │ └── sample_invoice.jpg # Public sample for quick testing
263   │
264   ├── docs/
265 - │
266 - │
267   │
268   ├── models/
269 - │ └── layoutlmv3-sroie-
270   │
271   ├── outputs/ # Default folder for saved JSON results
272   │
273   ├── src/
274   │ ├── preprocessing.py # Image preprocessing functions (grayscale, denoise)
275   │ ├── ocr.py # Tesseract OCR integration
276   │ ├── extraction.py # Regex-based information extraction logic
277   │ ├── ml_extraction.py # ML-based extraction (LayoutLMv3)
278   │ └── pipeline.py # Main orchestrator for the pipeline and CLI
279   │
280 -
281 - ├──
282 - │
283 - │
284 - │ └── test_pipeline.py # End-to-end pipeline tests
285   │
286   ├── app.py # Streamlit web interface
287   ├── requirements.txt # Python dependencies
@@ -297,8 +313,8 @@ invoice-processor-ml/
297   - **Result**: Best F1 ≈ 0.922 on validation (epoch 5 saved)
298
299   - Training scripts (local):
300 - - `
301 - - Model saved to: `models/layoutlmv3-sroie-
302
303   ## 📈 Performance
304
@@ -317,12 +333,12 @@ invoice-processor-ml/
317
318   ## 🔮 Future Enhancements
319
320 - - [
321   - [ ] (Optional) Add FATURA (table-focused) for line-item extraction
322   - [ ] Sliding-window chunking for >512 token documents (to avoid truncation)
323   - [ ] Table detection (Camelot/Tabula/DeepDeSRT) for line items
324   - [ ] PDF support (pdf2image) for multipage invoices
325 - - [
326   - [ ] Multilingual OCR (PaddleOCR) and multilingual fine‑tuning
327   - [ ] Confidence calibration and better validation rules
328
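The sliding-window item in the checklist above is still open, so the following is only a sketch of the general technique, not something the project implements. It splits a token sequence into overlapping windows of at most 512 tokens so that nothing past the limit is truncated. The function name, the 128-token overlap, and the use of a plain list are assumptions; a real version would also have to slice the matching bounding boxes and merge predictions from overlapping windows.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(tokens: Sequence[T], max_len: int = 512, overlap: int = 128) -> List[List[T]]:
    """Split `tokens` into windows of at most `max_len`, each sharing `overlap` tokens with the previous one."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if not tokens:
        return []

    step = max_len - overlap  # how far the window start advances each time
    windows: List[List[T]] = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += step
    return windows
```

The overlap gives fields that straddle a window boundary a chance to appear whole in at least one window, at the cost of running the model on some tokens twice.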
@@ -362,7 +378,7 @@ MIT License - See LICENSE file for details
362
363   **Soumyajit Ghosh** - 3rd Year BTech Student
364   - Exploring AI/ML and practical applications
365 - - [LinkedIn](https://www.linkedin.com/in/soumyajit-ghosh-
366
367   ---
368

28   - **Auto-Validation:** Heuristic checks to validate that the extracted "Total Amount" matches the sum of line items.
29
30   > Note on Invoice Numbers: The SROIE dataset used for training does not include "Invoice Number" labels. To solve this, the system uses a Hybrid Fallback Mechanism: if the ML model (LayoutLMv3) returns null for the Invoice Number, the system automatically triggers a targeted Regex extraction to ensure this critical field is captured.
31 +
32   ---
33
34   ## 🛠️ Technical Deep Dive (Why this architecture?)

61   - `total_amount`: The total amount (extracted by LayoutLMv3 or Regex).
62   - `extraction_confidence`: The confidence of the extraction (0-100).
63   - `validation_passed`: Whether the validation passed (true/false).
64 +
65   ---
66
67   ## 📊 Demo

95   "address": "NO JALAN BAYU 4, BANDAR SERI ALAM, 81750 MASAI, JOHOR"
96   }
97   ```
98 +
99   ### Example JSON (ML-based)
100   ```json
101   {

144   python -m venv venv
145   .\venv\Scripts\activate
146   ```
147 +
148   3. Install dependencies
149   ```bash
150   pip install -r requirements.txt

162   streamlit run app.py
163   ```
164
165 + ### Training the Model (Optional)
166 + To retrain the model from scratch using the provided scripts:
167 +
168 + ```bash
169 + python scripts/train_combined.py
170 + ```
171 + (Note: Requires SROIE dataset in data/sroie)
172 +
173   ## 💻 Usage
174
175   ### Web Interface (Recommended)
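The training note added in this commit says the SROIE dataset must be present in `data/sroie` before `scripts/train_combined.py` is run. A small pre-flight check along these lines could fail fast with a clear message instead of partway through a training run. Only the two paths come from the README; the `missing_training_inputs` helper is hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import List

# Paths taken from the README; everything else here is an illustrative assumption.
REQUIRED_PATHS = ("scripts/train_combined.py", "data/sroie")


def missing_training_inputs(project_root: str = ".") -> List[str]:
    """Return the required paths (relative to the project root) that do not exist."""
    root = Path(project_root)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_training_inputs()
    if missing:
        raise SystemExit("Cannot start training, missing: " + ", ".join(missing))
    print("Training inputs found; run: python scripts/train_combined.py")
```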
269   │ ├── raw/ # Input invoice images for processing
270   │ └── processed/ # (Reserved for future use)
271   │
272   ├── data/samples/
273   │ └── sample_invoice.jpg # Public sample for quick testing
274   │
275   ├── docs/
276 + │ └── screenshots/ # UI Screenshots for the README demo
277   │
278   ├── models/
279 + │ └── layoutlmv3-sroie-generalized/ # Fine-tuned model (created after training)
280   │
281   ├── outputs/ # Default folder for saved JSON results
282   │
283 + ├── scripts/ # Training and analysis scripts
284 + │ ├── train_combined.py # Main training loop (SROIE + Custom Data)
285 + │ ├── eval_new_dataset.py # Evaluation scripts
286 + │ └── explore_new_dataset.py # Dataset exploration tools
287 + │
288   ├── src/
289 + │ ├── sroie_loader.py # SROIE dataset loading logic
290 + │ ├── data_loader.py # Unified data loader for training
291   │ ├── preprocessing.py # Image preprocessing functions (grayscale, denoise)
292   │ ├── ocr.py # Tesseract OCR integration
293   │ ├── extraction.py # Regex-based information extraction logic
294   │ ├── ml_extraction.py # ML-based extraction (LayoutLMv3)
295   │ └── pipeline.py # Main orchestrator for the pipeline and CLI
296   │
297 + ├── tests/
298 + │ ├── test_preprocessing.py # Tests for the preprocessing module
299 + │ ├── test_ocr.py # Tests for the OCR module
300 + │ └── test_pipeline.py # End-to-end pipeline tests
301   │
302   ├── app.py # Streamlit web interface
303   ├── requirements.txt # Python dependencies

313   - **Result**: Best F1 ≈ 0.922 on validation (epoch 5 saved)
314
315   - Training scripts (local):
316 + - `scripts/train_combined.py` (data prep, training loop with validation + model save)
317 + - Model saved to: `models/layoutlmv3-sroie-generalized/`
318
319   ## 📈 Performance
320

333
334   ## 🔮 Future Enhancements
335
336 + - [x] Add and fine‑tune on mychen76/invoices-and-receipts_ocr_v1 (English) for broader invoice formats
337   - [ ] (Optional) Add FATURA (table-focused) for line-item extraction
338   - [ ] Sliding-window chunking for >512 token documents (to avoid truncation)
339   - [ ] Table detection (Camelot/Tabula/DeepDeSRT) for line items
340   - [ ] PDF support (pdf2image) for multipage invoices
341 + - [x] FastAPI backend + Docker
342   - [ ] Multilingual OCR (PaddleOCR) and multilingual fine‑tuning
343   - [ ] Confidence calibration and better validation rules
344

378
379   **Soumyajit Ghosh** - 3rd Year BTech Student
380   - Exploring AI/ML and practical applications
381 + - [LinkedIn](https://www.linkedin.com/in/soumyajit-ghosh-tech) | [GitHub](https://github.com/GSoumyajit2005) | [Portfolio](#) (Coming Soon)
382
383   ---
384