abhinavdread committed on
Commit a688a84 · verified · 1 Parent(s): d6499ac

Upload README.md

Files changed (1)
  1. README.md +155 -0
README.md ADDED
@@ -0,0 +1,155 @@
---
license: mit
library_name: xgboost
tags:
- text-classification
- document-analysis
- ocr
- legal-tech
- msme
- binary-classification
- tabular-text
pipeline_tag: text-classification
model-index:
- name: MSME Document Presence Detection Model
  results:
  - task:
      type: text-classification
    dataset:
      name: MSME Document Presence Dataset
      type: custom
    metrics:
    - name: Precision (avg)
      type: precision
      value: 0.992
    - name: Recall (avg)
      type: recall
      value: 0.987
    - name: F1 Score (avg)
      type: f1
      value: 0.990
    - name: ROC-AUC (avg)
      type: roc_auc
      value: 0.999
---

# MSME Document Presence Detection Model

## Overview

This repository contains production-grade XGBoost models that detect whether mandatory documents are present in MSME arbitration cases, based on OCR-extracted text.

The system performs binary classification (present or missing) for each of the following documents:

- Invoice
- Purchase Order
- Delivery Proof
- GST Certificate
- Contract

Each document type is modeled independently with its own classifier.

---

## Model Architecture

- Algorithm: XGBoost (gradient-boosted trees)
- Feature extraction: TF-IDF over unigrams and bigrams
- Maximum TF-IDF features per model: 3,000
- One independent model per document type
- Stratified train-test split
- Hard negative augmentation included
- Severe OCR corruption simulation included

---

## Training Data

The models were trained on a synthetic dataset built from 5,000 LLM-generated structured OCR samples, expanded with the following augmentations:

- OCR distortion simulation
- Keyword masking
- Partial truncation
- Cross-document contamination
- Line shuffling
- Hard negative construction
- Class imbalance simulation

Final training dataset size: approximately 10,000 samples.
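
To make the augmentation steps concrete, here is a small sketch of how a few of them might look. All function names, the `OCR_CONFUSIONS` table, and the default rates are assumptions for illustration; the actual augmentation code is not part of this repository, and label handling for hard negatives is not shown.

```python
import random
import re

# Hypothetical table of visually similar characters (an assumption).
OCR_CONFUSIONS = {"0": "O", "O": "0", "1": "l", "l": "1",
                  "5": "S", "S": "5", "8": "B", "B": "8"}


def distort_ocr(text, rate, rng):
    """Corrupt roughly `rate` of the non-space characters."""
    out = []
    for ch in text:
        if ch.isspace() or rng.random() >= rate:
            out.append(ch)                    # keep this character
        elif ch in OCR_CONFUSIONS:
            out.append(OCR_CONFUSIONS[ch])    # swap for a look-alike
        # any other selected character is dropped
    return "".join(out)


def mask_keywords(text, keywords, mask="***"):
    """Replace telltale keywords (case-insensitive) with a mask."""
    for keyword in keywords:
        text = re.sub(re.escape(keyword), mask, text, flags=re.IGNORECASE)
    return text


def truncate(text, keep_fraction):
    """Keep only the leading fraction of the text."""
    return text[: int(len(text) * keep_fraction)]


def shuffle_lines(text, rng):
    """Randomly reorder lines, as a bad layout pass might."""
    lines = text.splitlines()
    rng.shuffle(lines)
    return "\n".join(lines)


def contaminate(text, other_text, rng, max_lines=2):
    """Append a few lines from a different document type."""
    other_lines = other_text.splitlines()
    picked = rng.sample(other_lines, min(max_lines, len(other_lines)))
    return "\n".join(text.splitlines() + picked)
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps the augmented dataset reproducible.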

---

## Performance

Average performance across all document classifiers:

- Precision: 0.992
- Recall: 0.987
- F1 Score: 0.990
- ROC-AUC: 0.999
- False Negative Rate: < 2%

Performance was evaluated on the test portion of a stratified 80/20 train-test split.
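
For reference, the threshold-based metrics above follow the standard definitions, and the false negative rate is simply `1 - recall` (a recall of 0.987 corresponds to an FNR of about 1.3%). The sketch below shows one way to compute them per classifier and average them across document types; the function names are hypothetical, and ROC-AUC is left out because it needs probability scores rather than hard predictions.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and FNR for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fnr = fn / (tp + fn) if tp + fn else 0.0   # equals 1 - recall
    return {"precision": precision, "recall": recall,
            "f1": f1, "fnr": fnr}


def macro_average(metrics_by_doc):
    """Average each metric across the per-document classifiers."""
    names = next(iter(metrics_by_doc.values())).keys()
    count = len(metrics_by_doc)
    return {
        name: sum(m[name] for m in metrics_by_doc.values()) / count
        for name in names
    }
```

Note that the average of per-classifier F1 scores is not necessarily equal to the F1 computed from the averaged precision and recall.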

---

## Inference

Each model takes raw OCR-extracted text as input and predicts whether its document type is present.

Output per document:

- Binary prediction (0 = Missing, 1 = Present)
- Probability score
- Optional SHAP-based explainability (external implementation)

A completeness score can be computed as:

completeness = (documents_present / required_documents) × 100
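
As a worked example of the formula, the sketch below computes the score from a dictionary of per-document binary predictions. The identifiers `REQUIRED_DOCUMENTS` and `completeness_score`, and the lowercase document keys, are assumptions chosen for illustration.

```python
# Hypothetical keys for the five document types listed in this README.
REQUIRED_DOCUMENTS = (
    "invoice",
    "purchase_order",
    "delivery_proof",
    "gst_certificate",
    "contract",
)


def completeness_score(predictions, required_documents=REQUIRED_DOCUMENTS):
    """Percentage of required documents predicted as present.

    `predictions` maps a document type to 0 (missing) or 1 (present).
    A required document with no prediction counts as missing.
    """
    if not required_documents:
        raise ValueError("required_documents must not be empty")
    present = sum(
        1 for doc in required_documents if predictions.get(doc, 0) == 1
    )
    return 100.0 * present / len(required_documents)


predictions = {
    "invoice": 1,
    "purchase_order": 1,
    "delivery_proof": 0,
    "gst_certificate": 1,
    "contract": 1,
}
score = completeness_score(predictions)  # 4 of 5 present
```

Passing a shorter `required_documents` tuple is one way a caller could exclude a document that is not mandatory for a given case.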

---

## Intended Use

These models are suitable for:

- MSME arbitration automation
- Legal document validation pipelines
- OCR post-processing systems
- Document completeness scoring engines
- Hybrid rule + ML legal systems

---

## Limitations

- Trained only on synthetic and augmented OCR data
- Real-world scanned PDFs may introduce unseen distortions
- Extremely low-quality scans may reduce recall
- Contract optionality logic must be implemented externally
- Not intended for semantic contract analysis

---

## Ethical Considerations

The models were trained exclusively on synthetic data.
No real personal, financial, or legal records were used.

---

## Future Work

- Fine-tuning on real arbitration case documents
- Probability calibration
- Threshold optimization per document type
- Model drift monitoring
- Ensemble rule + ML integration
- ONNX export for optimized inference
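
As an illustration of what per-document threshold optimization could look like, the sketch below picks the highest decision threshold that still meets a minimum recall on a validation set. This is not implemented in the released models: the function name `pick_threshold`, the recall-constrained criterion, and the default `min_recall` are all assumptions.

```python
def pick_threshold(y_true, probs, min_recall=0.98):
    """Return the largest threshold whose recall is >= min_recall.

    A sample is predicted present when its probability is >= threshold.
    Falls back to 0.0 (predict everything present) if no candidate
    threshold reaches the requested recall.
    """
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("need at least one positive sample")

    best = 0.0
    # Recall can only fall as the threshold rises, so scan upward
    # and stop at the first candidate that misses the target.
    for threshold in sorted(set(probs)):
        true_pos = sum(
            1 for t, p in zip(y_true, probs) if t == 1 and p >= threshold
        )
        if true_pos / positives < min_recall:
            break
        best = threshold
    return best
```

Running this once per document type on held-out data would yield one threshold per classifier. A recall constraint is used here because missed documents (false negatives) are the error this README tracks explicitly.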

---

## License

This project is released under the MIT License.