GSoumyajit2005 committed on
Commit
1144bea
·
1 Parent(s): 4768ab6

docs: Update README to reflect new project structure

Files changed (1)
  1. README.md +30 -14
README.md CHANGED
@@ -28,6 +28,7 @@ A production-grade Hybrid Invoice Extraction System that combines the semantic u
28
  - **Auto-Validation:** Heuristic checks to validate that the extracted "Total Amount" matches the sum of line items.
29
 
30
  > Note on Invoice Numbers: The SROIE dataset used for training does not include "Invoice Number" labels. To solve this, the system uses a Hybrid Fallback Mechanism: if the ML model (LayoutLMv3) returns null for the Invoice Number, the system automatically triggers a targeted Regex extraction to ensure this critical field is captured.
 
31
  ---
32
 
33
  ## 🛠️ Technical Deep Dive (Why this architecture?)
@@ -60,6 +61,7 @@ The system outputs a clean JSON with the following fields:
60
  - `total_amount`: The total amount (extracted by LayoutLMv3 or Regex).
61
  - `extraction_confidence`: The confidence of the extraction (0-100).
62
  - `validation_passed`: Whether the validation passed (true/false).
 
63
  ---
64
 
65
  ## 📊 Demo
@@ -93,6 +95,7 @@ The system outputs a clean JSON with the following fields:
93
  "address": "NO JALAN BAYU 4, BANDAR SERI ALAM, 81750 MASAI, JOHOR"
94
  }
95
  ```
 
96
  ### Example JSON (ML-based)
97
  ```json
98
  {
@@ -141,6 +144,7 @@ source venv/bin/activate
141
  python -m venv venv
142
  .\venv\Scripts\activate
143
  ```
 
144
  3. Install dependencies
145
  ```bash
146
  pip install -r requirements.txt
@@ -158,6 +162,14 @@ pip install -r requirements.txt
158
  streamlit run app.py
159
  ```
160
 
161
  ## 💻 Usage
162
 
163
  ### Web Interface (Recommended)
@@ -257,31 +269,35 @@ invoice-processor-ml/
257
  │ ├── raw/ # Input invoice images for processing
258
  │ └── processed/ # (Reserved for future use)
259
 
260
-
261
  ├── data/samples/
262
  │ └── sample_invoice.jpg # Public sample for quick testing
263
 
264
  ├── docs/
265
- └── screenshots/ # UI Screenshots for the README demo
266
-
267
 
268
  ├── models/
269
- │ └── layoutlmv3-sroie-best/ # Fine-tuned model (created after training)
270
 
271
  ├── outputs/ # Default folder for saved JSON results
272
 
273
  ├── src/
274
  │ ├── preprocessing.py # Image preprocessing functions (grayscale, denoise)
275
  │ ├── ocr.py # Tesseract OCR integration
276
  │ ├── extraction.py # Regex-based information extraction logic
277
  │ ├── ml_extraction.py # ML-based extraction (LayoutLMv3)
278
  │ └── pipeline.py # Main orchestrator for the pipeline and CLI
279
 
280
-
281
- ├── tests/
282
- ├── test_preprocessing.py # Tests for the preprocessing module
283
- ── test_ocr.py # Tests for the OCR module
284
- │ └── test_pipeline.py # End-to-end pipeline tests
285
 
286
  ├── app.py # Streamlit web interface
287
  ├── requirements.txt # Python dependencies
@@ -297,8 +313,8 @@ invoice-processor-ml/
297
  - **Result**: Best F1 ≈ 0.922 on validation (epoch 5 saved)
298
 
299
  - Training scripts (local):
300
- - `train_layoutlm.py` (data prep, training loop with validation + model save)
301
- - Model saved to: `models/layoutlmv3-sroie-best/`
302
 
303
  ## 📈 Performance
304
 
@@ -317,12 +333,12 @@ invoice-processor-ml/
317
 
318
  ## 🔮 Future Enhancements
319
 
320
- - [ ] Add and fine‑tune on mychen76/invoices-and-receipts_ocr_v1 (English) for broader invoice formats
321
  - [ ] (Optional) Add FATURA (table-focused) for line-item extraction
322
  - [ ] Sliding-window chunking for >512 token documents (to avoid truncation)
323
  - [ ] Table detection (Camelot/Tabula/DeepDeSRT) for line items
324
  - [ ] PDF support (pdf2image) for multipage invoices
325
- - [ ] FastAPI backend + Docker
326
  - [ ] Multilingual OCR (PaddleOCR) and multilingual fine‑tuning
327
  - [ ] Confidence calibration and better validation rules
328
 
@@ -362,7 +378,7 @@ MIT License - See LICENSE file for details
362
 
363
  **Soumyajit Ghosh** - 3rd Year BTech Student
364
  - Exploring AI/ML and practical applications
365
- - [LinkedIn](https://www.linkedin.com/in/soumyajit-ghosh-49a5b02b2?utm_source=share&utm_campaign) | [GitHub](https://github.com/GSoumyajit2005) | [Portfolio](#)(Coming Soon)
366
 
367
  ---
368
 
 
28
  - **Auto-Validation:** Heuristic checks to validate that the extracted "Total Amount" matches the sum of line items.
29
 
30
  > Note on Invoice Numbers: The SROIE dataset used for training does not include "Invoice Number" labels. To solve this, the system uses a Hybrid Fallback Mechanism: if the ML model (LayoutLMv3) returns null for the Invoice Number, the system automatically triggers a targeted Regex extraction to ensure this critical field is captured.
31
+
32
  ---
33
 
34
  ## 🛠️ Technical Deep Dive (Why this architecture?)
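
Reviewer note on the Hybrid Fallback Mechanism described in this hunk: the README states that a targeted Regex extraction runs when LayoutLMv3 returns null for the Invoice Number, but the diff does not show the code. The following is only a minimal sketch of that idea. The pattern and the names `INVOICE_NO_PATTERN`, `regex_invoice_number`, and `resolve_invoice_number` are hypothetical and are not taken from `src/extraction.py` or `src/pipeline.py`.

```python
import re
from typing import Optional

# Hypothetical pattern: the README does not show the project's actual regex.
# It looks for "invoice" followed by "number", "no", "no." or "#", then
# captures an ID that contains at least one digit.
INVOICE_NO_PATTERN = re.compile(
    r"\binvoice\s*(?:number|no\.?|#)\s*[:\-]?\s*"
    r"([A-Z0-9][A-Z0-9/-]*\d[A-Z0-9/-]*)",
    re.IGNORECASE,
)


def regex_invoice_number(ocr_text: str) -> Optional[str]:
    """Return the first invoice-number-like token in the OCR text, if any."""
    match = INVOICE_NO_PATTERN.search(ocr_text)
    return match.group(1) if match else None


def resolve_invoice_number(ml_value: Optional[str], ocr_text: str) -> Optional[str]:
    """Prefer the ML prediction; fall back to regex only when it is empty."""
    if ml_value:
        return ml_value
    return regex_invoice_number(ocr_text)
```

Requiring a digit in the captured token is a design choice of this sketch: it keeps the fallback from returning the tail of a label word (for example the rest of "Number") when no real ID follows.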
 
61
  - `total_amount`: The total amount (extracted by LayoutLMv3 or Regex).
62
  - `extraction_confidence`: The confidence of the extraction (0-100).
63
  - `validation_passed`: Whether the validation passed (true/false).
64
+
65
  ---
66
 
67
  ## 📊 Demo
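
Reviewer note on `validation_passed`: the README says the Auto-Validation heuristic checks that the extracted "Total Amount" matches the sum of line items, without showing how. One possible reading is sketched below. The function name `totals_match`, the 0.01 tolerance, and the choice to fail when no line items were extracted are all assumptions made for illustration, not the project's documented behavior.

```python
from decimal import Decimal
from typing import Iterable, Union

Number = Union[str, int, float, Decimal]


def totals_match(
    total_amount: Number,
    line_item_amounts: Iterable[Number],
    tolerance: Decimal = Decimal("0.01"),  # assumed rounding tolerance
) -> bool:
    """Return True when the line items sum to the total within the tolerance."""
    # Convert through str() so floats such as 5.3 become exact decimals.
    items = [Decimal(str(amount)) for amount in line_item_amounts]
    if not items:
        # Assumption: with no line items there is nothing to validate against.
        return False
    difference = abs(Decimal(str(total_amount)) - sum(items, Decimal("0")))
    return difference <= tolerance
```

A result like this could populate the `validation_passed` field; how the project actually combines it with `extraction_confidence` is not shown in this commit.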
 
95
  "address": "NO JALAN BAYU 4, BANDAR SERI ALAM, 81750 MASAI, JOHOR"
96
  }
97
  ```
98
+
99
  ### Example JSON (ML-based)
100
  ```json
101
  {
 
144
  python -m venv venv
145
  .\venv\Scripts\activate
146
  ```
147
+
148
  3. Install dependencies
149
  ```bash
150
  pip install -r requirements.txt
 
162
  streamlit run app.py
163
  ```
164
 
165
+ ### Training the Model (Optional)
166
+ To retrain the model from scratch using the provided scripts:
167
+
168
+ ```bash
169
+ python scripts/train_combined.py
170
+ ```
171
+ (Note: Requires SROIE dataset in data/sroie)
172
+
173
  ## 💻 Usage
174
 
175
  ### Web Interface (Recommended)
 
269
  │ ├── raw/ # Input invoice images for processing
270
  │ └── processed/ # (Reserved for future use)
271
 
 
272
  ├── data/samples/
273
  │ └── sample_invoice.jpg # Public sample for quick testing
274
 
275
  ├── docs/
276
+ └── screenshots/ # UI Screenshots for the README demo
 
277
 
278
  ├── models/
279
+ │ └── layoutlmv3-sroie-generalized/ # Fine-tuned model (created after training)
280
 
281
  ├── outputs/ # Default folder for saved JSON results
282
 
283
+ ├── scripts/ # Training and analysis scripts
284
+ │ ├── train_combined.py # Main training loop (SROIE + Custom Data)
285
+ │ ├── eval_new_dataset.py # Evaluation scripts
286
+ │ └── explore_new_dataset.py # Dataset exploration tools
287
+
288
  ├── src/
289
+ │ ├── sroie_loader.py # SROIE dataset loading logic
290
+ │ ├── data_loader.py # Unified data loader for training
291
  │ ├── preprocessing.py # Image preprocessing functions (grayscale, denoise)
292
  │ ├── ocr.py # Tesseract OCR integration
293
  │ ├── extraction.py # Regex-based information extraction logic
294
  │ ├── ml_extraction.py # ML-based extraction (LayoutLMv3)
295
  │ └── pipeline.py # Main orchestrator for the pipeline and CLI
296
 
297
+ ├── tests/
298
+ ├── test_preprocessing.py # Tests for the preprocessing module
299
+ ├── test_ocr.py # Tests for the OCR module
300
+ ── test_pipeline.py # End-to-end pipeline tests
 
301
 
302
  ├── app.py # Streamlit web interface
303
  ├── requirements.txt # Python dependencies
 
313
  - **Result**: Best F1 ≈ 0.922 on validation (epoch 5 saved)
314
 
315
  - Training scripts (local):
316
+ - `scripts/train_combined.py` (data prep, training loop with validation + model save)
317
+ - Model saved to: `models/layoutlmv3-sroie-generalized/`
318
 
319
  ## 📈 Performance
320
 
 
333
 
334
  ## 🔮 Future Enhancements
335
 
336
+ - [x] Add and fine‑tune on mychen76/invoices-and-receipts_ocr_v1 (English) for broader invoice formats
337
  - [ ] (Optional) Add FATURA (table-focused) for line-item extraction
338
  - [ ] Sliding-window chunking for >512 token documents (to avoid truncation)
339
  - [ ] Table detection (Camelot/Tabula/DeepDeSRT) for line items
340
  - [ ] PDF support (pdf2image) for multipage invoices
341
+ - [x] FastAPI backend + Docker
342
  - [ ] Multilingual OCR (PaddleOCR) and multilingual fine‑tuning
343
  - [ ] Confidence calibration and better validation rules
344
 
 
378
 
379
  **Soumyajit Ghosh** - 3rd Year BTech Student
380
  - Exploring AI/ML and practical applications
381
+ - [LinkedIn](https://www.linkedin.com/in/soumyajit-ghosh-tech) | [GitHub](https://github.com/GSoumyajit2005) | [Portfolio](#) (Coming Soon)
382
 
383
  ---
384