MSME Document Presence Detection Model

Overview

This repository contains production-grade XGBoost models designed to detect the presence of mandatory documents in MSME arbitration cases based on OCR-extracted text.

The system performs binary classification for the following documents:

Invoice
Purchase Order
Delivery Proof
GST Certificate
Contract

Each document type is handled by its own independent binary classifier.

Model Architecture

Algorithm: XGBoost (Gradient Boosted Trees)
Feature Extraction: TF-IDF (1–2 n-grams)
Max Features per model: 3000
Independent model per document type
Stratified train-test split
Hard negative augmentation included
Severe OCR corruption simulation included

Training Data

The models were trained on a synthetic and augmented dataset built from:

5,000 LLM-generated structured OCR samples
OCR distortion simulation
Keyword masking
Partial truncation
Cross-document contamination
Line shuffling
Hard negative construction
Class imbalance simulation

Final training dataset size: approximately 10,000 samples.
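
The augmentation code itself is not part of this card, so the following is only a minimal sketch of how four of the listed techniques (OCR distortion, keyword masking, partial truncation, line shuffling) might be implemented. The character confusion table, the function names, and the mask token are all hypothetical.

```python
import random
import re

# Hypothetical character swaps of the kind OCR engines commonly produce.
OCR_CONFUSIONS = {"0": "O", "1": "l", "5": "S", "o": "0", "l": "1", "e": "c"}


def corrupt_characters(text: str, rate: float, rng: random.Random) -> str:
    """Replace confusable characters with probability `rate` each."""
    return "".join(
        OCR_CONFUSIONS[ch] if ch in OCR_CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )


def mask_keywords(text: str, keywords, mask: str = "####") -> str:
    """Hide telltale keywords so a model cannot rely on them alone."""
    for keyword in keywords:
        text = re.sub(re.escape(keyword), mask, text, flags=re.IGNORECASE)
    return text


def truncate(text: str, keep_fraction: float) -> str:
    """Keep only the leading fraction of the text, as in a cut-off scan."""
    return text[: int(len(text) * keep_fraction)]


def shuffle_lines(text: str, rng: random.Random) -> str:
    """Randomly reorder lines to simulate layout-detection errors."""
    lines = text.splitlines()
    rng.shuffle(lines)
    return "\n".join(lines)


rng = random.Random(0)
sample = "Tax Invoice No 100\nBuyer: Acme Traders\nTotal: 5000"
augmented = shuffle_lines(corrupt_characters(sample, 0.3, rng), rng)
```

Passing a seeded `random.Random` instance keeps each augmentation reproducible, which helps when regenerating a dataset.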

Performance

Average performance across all document classifiers:

Precision: 0.992
Recall: 0.987
F1 Score: 0.990
ROC-AUC: 0.999
False Negative Rate: < 2%

Performance was evaluated on the held-out 20% of a stratified 80/20 train-test split.
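
For reference, the metrics above follow the standard confusion-matrix definitions. The sketch below shows how precision, recall, F1, and false negative rate relate to one another; the `binary_metrics` helper and the counts passed to it are made up for illustration and are not the evaluation data behind the reported figures.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary classification metrics from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # The false negative rate is the complement of recall: present documents
    # that the classifier reported as missing.
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fnr": fnr}


# Illustrative counts only.
metrics = binary_metrics(tp=90, fp=10, fn=10, tn=90)
```

Because the false negative rate equals 1 minus recall, an average recall of 0.987 corresponds to an average false negative rate of about 1.3%, consistent with the "< 2%" figure.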

Inference

Each model takes raw OCR-extracted text as input and predicts whether its specific document type is present.

Output per document:

Binary prediction (0 = Missing, 1 = Present)
Probability score
Optional SHAP-based explainability (external implementation)
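
One way to package the per-document output is sketched below. The `DocumentResult` type, the `to_result` helper, and the default threshold of 0.5 are assumptions made for this example; this card does not state which decision threshold the models use.

```python
from dataclasses import dataclass


@dataclass
class DocumentResult:
    document: str       # e.g. "invoice"
    probability: float  # model's probability that the document is present
    present: int        # 1 = Present, 0 = Missing


def to_result(document: str, probability: float, threshold: float = 0.5) -> DocumentResult:
    """Turn a probability score into the binary prediction described above."""
    return DocumentResult(
        document=document,
        probability=probability,
        present=int(probability >= threshold),
    )


invoice = to_result("invoice", 0.93)
contract = to_result("contract", 0.12)
```

Keeping the threshold as a parameter leaves room for a different cut-off per document type.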

A completeness score for a case can be computed from the per-document predictions as:

completeness = (documents_present / required_documents) × 100
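
A minimal sketch of this formula follows. The `completeness` function and the dictionary keys are hypothetical names. The `required` argument is how a caller could express externally decided rules, such as whether a contract is required for a given case.

```python
def completeness(predictions: dict, required: list) -> float:
    """Percentage of required documents predicted as present (1)."""
    if not required:
        raise ValueError("at least one required document is needed")
    present = sum(1 for name in required if predictions.get(name, 0) == 1)
    return 100.0 * present / len(required)


# Binary predictions from the five classifiers (1 = Present, 0 = Missing).
predictions = {
    "invoice": 1,
    "purchase_order": 1,
    "delivery_proof": 0,
    "gst_certificate": 1,
    "contract": 0,
}

all_required = list(predictions)
score_all = completeness(predictions, all_required)

# Same case, with the contract treated as optional by the caller.
without_contract = [name for name in all_required if name != "contract"]
score_optional = completeness(predictions, without_contract)
```

With all five documents required the score is 60%; treating the contract as optional raises it to 75%.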

Intended Use

These models are suitable for:

MSME arbitration automation
Legal document validation pipelines
OCR post-processing systems
Document completeness scoring engines
Hybrid rule + ML legal systems

Limitations

Trained primarily on synthetic and augmented OCR data
Real-world scanned PDFs may introduce unseen distortions
Extreme low-quality scans may reduce recall
Contract optionality logic must be implemented externally
Not intended for semantic contract analysis

Ethical Considerations

The models were trained exclusively on synthetic data. No real personal, financial, or legal records were used.

Future Work

Fine-tuning on real arbitration case documents
Probability calibration
Threshold optimization per document type
Model drift monitoring
Ensemble rule + ML integration
ONNX export for optimized inference

License

This project is released under the MIT License.


Evaluation results

Self-reported averages on the MSME Document Presence Dataset:

Precision: 0.992
Recall: 0.987
F1 Score: 0.990
ROC-AUC: 0.999