abhinavdread/msme-document-presence-xgboost

+---
+license: mit
+library_name: xgboost
+tags:
+  - text-classification
+  - document-analysis
+  - ocr
+  - legal-tech
+  - msme
+  - binary-classification
+  - tabular-text
+pipeline_tag: text-classification
+model-index:
+  - name: MSME Document Presence Detection Model
+    results:
+      - task:
+          type: text-classification
+        dataset:
+          name: MSME Document Presence Dataset
+          type: custom
+        metrics:
+          - name: Precision (avg)
+            type: precision
+            value: 0.992
+          - name: Recall (avg)
+            type: recall
+            value: 0.987
+          - name: F1 Score (avg)
+            type: f1
+            value: 0.990
+          - name: ROC-AUC (avg)
+            type: roc_auc
+            value: 0.999
+---
+# MSME Document Presence Detection Model
+## Overview
+This repository contains production-grade XGBoost models that detect whether mandatory documents are present in MSME (Micro, Small and Medium Enterprises) arbitration cases, based on OCR-extracted text.
+The system performs a binary present/missing classification for each of the following document types:
+- Invoice
+- Purchase Order
+- Delivery Proof
+- GST Certificate
+- Contract
+Each document is modeled independently as a separate classifier.
+---
+## Model Architecture
+- Algorithm: XGBoost (Gradient Boosted Trees)
+- Feature Extraction: TF-IDF over unigrams and bigrams
+- Maximum TF-IDF features per model: 3,000
+- One independent model per document type
+- Stratified 80/20 train-test split
+- Hard negative augmentation in the training data
+- Simulation of severe OCR corruption in the training data
+---
+## Training Data
+The models were trained on a synthetic dataset built from 5,000 LLM-generated structured OCR samples, expanded with the following augmentations:
+- OCR distortion simulation
+- Keyword masking
+- Partial truncation
+- Cross-document contamination
+- Line shuffling
+- Hard negative construction
+- Class imbalance simulation
+The final training dataset contains approximately 10,000 samples.
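To make a few of the augmentations above concrete, here is a minimal sketch of OCR distortion, keyword masking, partial truncation, and line shuffling. The function names, the character confusion table, and the sample text are hypothetical; the actual augmentation code and parameters used for this dataset are not documented in the card.

```python
import random
import re

# Hypothetical table of visually similar characters that OCR engines often confuse.
OCR_CONFUSIONS = {"O": "0", "0": "O", "l": "1", "1": "l", "S": "5", "5": "S", "B": "8"}


def distort_ocr(text, rate, rng):
    """Swap each confusable character for its look-alike with probability `rate`."""
    return "".join(
        OCR_CONFUSIONS[ch] if ch in OCR_CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )


def mask_keywords(text, keywords):
    """Delete the given keywords (case-insensitive) so the model cannot rely on them."""
    pattern = "|".join(re.escape(word) for word in keywords)
    return re.sub(pattern, "", text, flags=re.IGNORECASE)


def truncate(text, keep_fraction):
    """Keep only the leading fraction of the text, as if the scan were cut off."""
    return text[: int(len(text) * keep_fraction)]


def shuffle_lines(text, rng):
    """Randomly reorder lines to mimic a broken OCR reading order."""
    lines = text.splitlines()
    rng.shuffle(lines)
    return "\n".join(lines)


rng = random.Random(0)  # seeded so the augmentations are reproducible
sample = "TAX INVOICE\nInvoice No: 1051\nTotal: 500"
masked = mask_keywords(sample, ["invoice"])
shuffled = shuffle_lines(sample, rng)
noisy = distort_ocr(sample, 0.3, rng)
```

Keyword masking of this kind can also serve as a source of hard examples: a masked invoice still looks like an invoice structurally, which discourages a classifier from keying on a single word.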
+---
+## Performance
+Average performance across all document classifiers:
+- Precision: 0.992
+- Recall: 0.987
+- F1 Score: 0.990
+- ROC-AUC: 0.999
+- False Negative Rate: < 2%
+Performance was evaluated on the held-out portion of a stratified 80/20 train-test split.
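The sketch below shows one way the per-classifier metrics and their average could be computed from held-out labels and predictions. The function names and the toy labels are illustrative assumptions, not the evaluation code behind the reported numbers. Because the false negative rate equals 1 minus recall, an average recall of 0.987 corresponds to an average false negative rate of about 1.3%, which is consistent with the "< 2%" figure above.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and false negative rate for one binary classifier."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fnr = fn / (tp + fn) if tp + fn else 0.0  # equals 1 - recall when positives exist
    return {"precision": precision, "recall": recall, "f1": f1, "fnr": fnr}


def average_metrics(per_model):
    """Unweighted mean of each metric across the per-document classifiers."""
    return {
        key: sum(m[key] for m in per_model) / len(per_model)
        for key in per_model[0]
    }


# Toy held-out labels and predictions for two hypothetical classifiers.
invoice = binary_metrics([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1])
contract = binary_metrics([1, 1, 0, 0], [1, 1, 0, 0])
averaged = average_metrics([invoice, contract])
```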
+---
+## Inference
+Each model expects OCR-extracted raw text for a specific document type.
+Output per document:
+- Binary prediction (0 = Missing, 1 = Present)
+- Probability score
+- Optional SHAP-based explainability (external implementation)
+A completeness score (as a percentage) can be computed as:
+`completeness = (documents_present / required_documents) × 100`
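A minimal sketch of the completeness calculation follows, assuming each classifier's probability has been collected into a dictionary keyed by document type. The key names, the 0.5 decision threshold, and the `completeness_score` helper are assumptions made for this example.

```python
REQUIRED_DOCUMENTS = [
    "invoice",
    "purchase_order",
    "delivery_proof",
    "gst_certificate",
    "contract",
]


def completeness_score(probabilities, required=REQUIRED_DOCUMENTS, threshold=0.5):
    """Return (score in percent, list of missing documents).

    A document counts as present when its probability is at or above
    `threshold`; a document with no probability at all counts as missing.
    """
    if not required:
        raise ValueError("at least one required document is needed")
    missing = [doc for doc in required if probabilities.get(doc, 0.0) < threshold]
    present_count = len(required) - len(missing)
    return 100.0 * present_count / len(required), missing


# Hypothetical probabilities produced by the five classifiers for one case.
case = {
    "invoice": 0.98,
    "purchase_order": 0.91,
    "delivery_proof": 0.12,
    "gst_certificate": 0.87,
    "contract": 0.40,
}
score, missing = completeness_score(case)

# Treating the contract as optional: pass only the documents that are required.
without_contract = [doc for doc in REQUIRED_DOCUMENTS if doc != "contract"]
score_optional, missing_optional = completeness_score(case, required=without_contract)
```

Passing the list of required documents as a parameter keeps case-specific rules, such as whether a contract is mandatory, outside the models themselves.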
+---
+## Intended Use
+This model is suitable for:
+- MSME arbitration automation
+- Legal document validation pipelines
+- OCR post-processing systems
+- Document completeness scoring engines
+- Hybrid rule-based and ML legal systems
+---
+## Limitations
+- Trained on synthetic and augmented OCR data rather than real scanned documents
+- Real-world scanned PDFs may introduce unseen distortions
+- Extreme low-quality scans may reduce recall
+- Contract optionality logic must be implemented externally
+- Not intended for semantic contract analysis
+---
+## Ethical Considerations
+The models were trained exclusively on synthetic data.
+No real personal, financial, or legal records were used.
+---
+## Future Work
+- Fine-tuning on real arbitration case documents
+- Probability calibration
+- Threshold optimization per document type
+- Model drift monitoring
+- Ensemble integration of rule-based and ML components
+- ONNX export for optimized inference
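As one possible approach to per-document threshold optimization, the sketch below picks the highest probability threshold that still meets a target recall on a validation set. The `pick_threshold` helper, the recall targets, and the toy scores are hypothetical; the card does not prescribe a specific selection rule.

```python
def pick_threshold(y_true, scores, min_recall):
    """Return the highest threshold whose validation recall is at least `min_recall`.

    A sample is predicted present when its score is at or above the threshold.
    """
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("validation set contains no positive samples")
    if not 0.0 < min_recall <= 1.0:
        raise ValueError("min_recall must be in (0, 1]")
    # Try candidate thresholds from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        if tp / positives >= min_recall:
            return threshold
    # Not reached: the lowest score always yields recall 1.0.
    raise RuntimeError("no threshold satisfied the recall target")


# Toy validation labels and scores for one hypothetical document classifier.
y_val = [1, 1, 1, 1, 0, 0, 0]
s_val = [0.95, 0.80, 0.62, 0.30, 0.55, 0.20, 0.10]
lenient_target = pick_threshold(y_val, s_val, min_recall=0.75)
strict_target = pick_threshold(y_val, s_val, min_recall=1.0)
```

Choosing the threshold by recall reflects the emphasis on a low false negative rate; a precision-based or cost-based rule would be an equally valid design.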
+---
+## License
+This project is released under the MIT License.