MSME Document Presence Detection Model
Overview
This repository contains production-grade XGBoost models designed to detect the presence of mandatory documents in MSME arbitration cases based on OCR-extracted text.
The system performs binary classification for the following documents:
- Invoice
- Purchase Order
- Delivery Proof
- GST Certificate
- Contract
Each document is modeled independently as a separate classifier.
Model Architecture
- Algorithm: XGBoost (Gradient Boosted Trees)
- Feature Extraction: TF-IDF over unigrams and bigrams (n-gram range 1–2)
- Max Features per model: 3000
- Independent model per document type
- Stratified train-test split
- Hard negative augmentation included
- Severe OCR corruption simulation included
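The architecture above could be assembled roughly as follows. This is a minimal sketch, not the repository's training code. It assumes scikit-learn, xgboost, and NumPy are installed. Only the n-gram range, the 3000-feature cap, the stratified split, and the one-model-per-document layout come from this README; the snake_case document keys, the helpers `build_classifier` and `train_all`, the 80/20 split ratio inside `train_all`, and every value in `XGB_PARAMS` are illustrative assumptions.

```python
# Hypothetical keys for the five document types listed in the Overview.
DOCUMENT_TYPES = ["invoice", "purchase_order", "delivery_proof", "gst_certificate", "contract"]

# Settings stated in this README.
TFIDF_PARAMS = {"ngram_range": (1, 2), "max_features": 3000}

# Assumed values for illustration only; the README does not state them.
XGB_PARAMS = {"n_estimators": 200, "max_depth": 6, "learning_rate": 0.1, "eval_metric": "logloss"}


def build_classifier():
    """Return an untrained TF-IDF + XGBoost pipeline for one document type."""
    # Imported lazily so the configuration above can be inspected
    # without the heavier dependencies installed.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline
    from xgboost import XGBClassifier

    return Pipeline([
        ("tfidf", TfidfVectorizer(**TFIDF_PARAMS)),
        ("xgb", XGBClassifier(**XGB_PARAMS)),
    ])


def train_all(texts, labels_by_type, seed=42):
    """Train one independent binary classifier per document type.

    texts: list of OCR strings.
    labels_by_type: {doc_type: sequence of 0/1 labels aligned with texts}.
    Returns {doc_type: (fitted_pipeline, test_texts, test_labels)}.
    """
    import numpy as np
    from sklearn.model_selection import train_test_split

    models = {}
    for doc_type in DOCUMENT_TYPES:
        y = np.asarray(labels_by_type[doc_type])
        # Stratify so both classes keep their proportions in train and test.
        x_train, x_test, y_train, y_test = train_test_split(
            texts, y, test_size=0.2, stratify=y, random_state=seed
        )
        model = build_classifier()
        model.fit(x_train, y_train)
        models[doc_type] = (model, x_test, y_test)
    return models
```

Keeping each document type in its own pipeline means every classifier learns its own vocabulary and can later be given its own decision threshold.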
Training Data
The models were trained on a synthetic dataset built from the following base samples and augmentation techniques:
- 5,000 LLM-generated structured OCR samples
- OCR distortion simulation
- Keyword masking
- Partial truncation
- Cross-document contamination
- Line shuffling
- Hard negative construction
- Class imbalance simulation
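Several of the augmentations listed above can be illustrated with a few lines of standard-library Python. This is a hedged sketch of what such transformations might look like, not the generation code used for this dataset: the character confusion table, the function names, the mask token, and all default parameters are assumptions.

```python
import random

# Hypothetical table of look-alike characters that OCR engines often confuse.
OCR_CONFUSIONS = {"0": "O", "O": "0", "1": "l", "l": "1", "5": "S", "S": "5", "8": "B", "B": "8"}


def distort_ocr(text, rate, rng):
    """Swap look-alike characters with probability `rate` per eligible character."""
    return "".join(
        OCR_CONFUSIONS[ch] if ch in OCR_CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )


def mask_keywords(text, keywords, rng, mask="###"):
    """Replace every occurrence of one randomly chosen keyword found in the text."""
    present = [kw for kw in keywords if kw in text]
    if not present:
        return text
    return text.replace(rng.choice(present), mask)


def truncate(text, keep_fraction):
    """Keep only the leading fraction of lines, simulating a partially scanned page."""
    lines = text.splitlines()
    keep = max(1, int(len(lines) * keep_fraction))
    return "\n".join(lines[:keep])


def shuffle_lines(text, rng):
    """Randomly reorder lines, simulating a broken OCR reading order."""
    lines = text.splitlines()
    rng.shuffle(lines)
    return "\n".join(lines)


def contaminate(text, other_text, rng, n_lines=2):
    """Splice a few lines from another document into this one at random positions."""
    lines = text.splitlines()
    donor = other_text.splitlines()
    for line in rng.sample(donor, min(n_lines, len(donor))):
        lines.insert(rng.randrange(len(lines) + 1), line)
    return "\n".join(lines)
```

Passing a seeded `random.Random` instance as `rng` keeps each augmentation reproducible. Masking the obvious keyword (for example "Invoice") pushes a classifier to rely on secondary cues, and contamination produces the kind of hard negatives mentioned above.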
Final training dataset size: approximately 10,000 samples.
Performance
Average self-reported performance across the five document classifiers:
- Precision: 0.992
- Recall: 0.987
- F1 Score: 0.990
- ROC-AUC: 0.999
- False Negative Rate: < 2%
Performance was evaluated on the held-out 20% of a stratified 80/20 train-test split.
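For reference, the threshold-based metrics above can be derived from confusion counts as in the sketch below. The function names and the dictionary layout are assumptions, and ROC-AUC is left out because it needs probability scores rather than binary predictions. Since the false negative rate equals 1 minus recall, an average recall of 0.987 corresponds to an average false negative rate of about 1.3%, consistent with the "< 2%" figure.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and false negative rate for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fnr = fn / (tp + fn) if tp + fn else 0.0  # equals 1 - recall when positives exist
    return {"precision": precision, "recall": recall, "f1": f1, "fnr": fnr}


def macro_average(per_model):
    """Average each metric over the per-document classifiers.

    per_model: {doc_type: metrics dict as returned by binary_metrics}.
    """
    names = next(iter(per_model.values())).keys()
    return {name: sum(m[name] for m in per_model.values()) / len(per_model) for name in names}
```

Averaging per-classifier F1 scores is not the same as computing F1 from the averaged precision and recall, so small differences between the two are expected.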
Inference
Each model takes raw OCR-extracted text as input and predicts whether its own document type is present.
Output per document:
- Binary prediction (0 = Missing, 1 = Present)
- Probability score
- Optional SHAP-based explainability (external implementation)
A completeness score (as a percentage) can be computed from the binary predictions as:
completeness = (documents_present / required_documents) × 100
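A small sketch of how the per-document outputs might be combined into this score follows. The 0.5 decision threshold, the function names, and the snake_case document keys are assumptions. The set of required documents is passed in explicitly so that case-specific rules, such as whether a contract is required, stay outside the models.

```python
def to_prediction(probability, threshold=0.5):
    """Convert a probability score to a binary prediction (0 = Missing, 1 = Present)."""
    return 1 if probability >= threshold else 0


def completeness(predictions, required_documents):
    """Percentage of required documents predicted present.

    predictions: {doc_type: 0 or 1}.
    required_documents: iterable of doc_type names required for this case.
    """
    required = list(required_documents)
    if not required:
        raise ValueError("required_documents must not be empty")
    # A required document with no prediction counts as missing.
    present = sum(predictions.get(doc, 0) for doc in required)
    return 100 * present / len(required)


# Example with made-up probability scores for a single case.
probabilities = {
    "invoice": 0.97,
    "purchase_order": 0.91,
    "delivery_proof": 0.12,
    "gst_certificate": 0.88,
    "contract": 0.40,
}
predictions = {doc: to_prediction(p) for doc, p in probabilities.items()}

# Contract treated as optional for this case: 3 of 4 required documents are present.
score = completeness(predictions, ["invoice", "purchase_order", "delivery_proof", "gst_certificate"])
print(score)  # 75.0
```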
Intended Use
These models are suitable for:
- MSME arbitration automation
- Legal document validation pipelines
- OCR post-processing systems
- Document completeness scoring engines
- Hybrid rule + ML legal systems
Limitations
- Trained only on synthetic and augmented OCR data, with no real case documents
- Real-world scanned PDFs may introduce unseen distortions
- Extremely low-quality scans may reduce recall
- Contract optionality logic must be implemented externally
- Not intended for semantic contract analysis
Ethical Considerations
The models were trained exclusively on synthetic data. No real personal, financial, or legal records were used.
Future Work
- Fine-tuning on real arbitration case documents
- Probability calibration
- Threshold optimization per document type
- Model drift monitoring
- Ensemble rule + ML integration
- ONNX export for optimized inference
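As one illustration of per-document threshold optimization, the sketch below picks the highest threshold that still meets a target recall on a validation set. It is a possible approach under stated assumptions, not a method the repository implements: the function name `pick_threshold` and the 0.99 default target are hypothetical.

```python
def pick_threshold(y_true, scores, target_recall=0.99):
    """Return the highest score threshold whose recall meets target_recall.

    y_true: 0/1 labels; scores: predicted probabilities aligned with y_true.
    A sample is predicted present when its score is >= the threshold.
    """
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("at least one positive sample is required")
    # Try candidate thresholds from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        if tp / positives >= target_recall:
            return threshold
    raise ValueError("target_recall cannot be met; it must be at most 1.0")
```

Choosing the highest qualifying threshold keeps false positives as low as possible while holding the false negative rate under the chosen limit. Running it separately for each document type would give each classifier its own threshold.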
License
This project is released under the MIT License.