abhinavdread committed on
Commit a6d5e1b · verified · 1 Parent(s): 83ecaea

Upload README.md


updated README.md with HF YAML structure

Files changed (1): README.md (+219 −0)
README.md ADDED
---
library_name: sklearn
tags:
- xgboost
- text-classification
- msme
- legal
- finance
- document-scoring
- dispute-resolution
pipeline_tag: text-classification
language:
- en
widget:
- text: "TAX INVOICE | Inv No: 9988 | Total: 50,000 INR | GSTIN: 27AAAAA0000A1Z5"
  example_title: "Valid Tax Invoice"
- text: "Please send me the invoice by tomorrow evening."
  example_title: "Email Request (Negative)"
---

# MSME Document Completeness Scorer

## Model Overview

This repository hosts an ensemble of **5 independent binary XGBoost classifiers** that automate the **Document Completeness Scoring** step in Indian MSME (Micro, Small, and Medium Enterprises) dispute resolution workflows.

Each classifier is a serialized `scikit-learn` pipeline (`TfidfVectorizer` → `XGBClassifier`) that detects the presence or absence of one specific mandatory document type in raw OCR-extracted text. The models are designed to be robust to common real-world challenges, including OCR noise, scanned-document artifacts, and adversarial near-miss inputs such as proforma invoices or draft documents, which structurally resemble valid legal documents but are legally insufficient for dispute filings.

---

## Included Models

| Model File | Target Document Type | Precision (Missing) | Recall (Present) |
| :--- | :--- | :---: | :---: |
| `invoice_model.pkl` | Tax Invoice | ~99% | ~90% |
| `po_model.pkl` | Purchase Order | ~99% | ~87% |
| `delivery_model.pkl` | Delivery Challan / Proof of Delivery | ~99% | ~90% |
| `gst_model.pkl` | GST Registration Certificate | ~99% | ~90% |
| `contract_model.pkl` | Supply Agreement / Contract | ~99% | ~90% |

> Models were trained on the `msme-dispute-document-corpus`, a synthetic OCR dataset of 8,000+ samples generated with Gemini 2.5 Flash.

---

## Intended Use Cases

This model suite is intended for:

- **Dispute Resolution Platforms** — Automatically flagging missing evidence documents in arbitration or legal case files before human review.
- **MSME Samadhaan Portals** — Programmatically filtering incomplete applications to reduce officer workload.
- **Legal Tech Pipelines** — Converting unstructured text dumps from scanned case files into structured document-presence classifications.

### Out-of-Scope Use

These models are not intended for general-purpose document classification, non-Indian business contexts, or languages other than English.

---

## Getting Started

### Installation

```bash
pip install scikit-learn xgboost pandas joblib huggingface_hub
```

### Loading the Models

```python
import joblib
from huggingface_hub import hf_hub_download

# Replace with your actual Hugging Face repository ID
REPO_ID = "your-username/msme-document-completeness-scorer"

MODEL_FILES = {
    "invoice": "invoice_model.pkl",
    "po": "po_model.pkl",
    "delivery": "delivery_model.pkl",
    "gst": "gst_model.pkl",
    "contract": "contract_model.pkl",
}

models = {}
for doc_type, filename in MODEL_FILES.items():
    model_path = hf_hub_download(repo_id=REPO_ID, filename=filename)
    models[doc_type] = joblib.load(model_path)
    print(f"Loaded: {filename}")
```

### Running Inference

```python
def predict_document_status(text: str, doc_type: str) -> tuple[str, float]:
    """
    Predicts whether a given document type is present in the provided OCR text.

    Args:
        text: Raw text string extracted from a scanned document via OCR.
        doc_type: Document classifier key. One of: 'invoice', 'po',
            'delivery', 'gst', 'contract'.

    Returns:
        status: 'Present' if the document type is detected, else 'Missing'.
        confidence: Probability score (0.0 to 1.0) from the classifier.
    """
    model = models.get(doc_type)
    if not model:
        raise ValueError(f"No model loaded for doc_type='{doc_type}'.")

    # predict_proba returns [[prob_class_0 (Missing), prob_class_1 (Present)]]
    confidence = model.predict_proba([text])[0][1]

    # Production threshold: >= 0.85 confidence required to classify as Present
    status = "Present" if confidence >= 0.85 else "Missing"
    return status, confidence


# Example
sample_text = """
TAX INVOICE
Inv No: INV-2024-001
Date: 12/12/2024
Total: 50,000 INR
GSTIN: 29ABCDE1234F1Z5
"""

status, confidence = predict_document_status(sample_text, "invoice")
print(f"Status : {status}")
print(f"Confidence : {confidence:.4f}")
```

**Expected output:**

```
Status : Present
Confidence : 0.9731
```

### Scoring a Full Case File

To check completeness across all five mandatory document types at once:

```python
def score_case_file(documents: dict[str, str]) -> dict:
    """
    Args:
        documents: A dict mapping doc_type keys to their OCR-extracted text.
            Example: {"invoice": "...", "po": "...", "gst": "..."}

    Returns:
        A results dict with status and confidence per document type,
        plus a top-level 'is_complete' boolean flag.
    """
    results = {}
    for doc_type, text in documents.items():
        status, confidence = predict_document_status(text, doc_type)
        results[doc_type] = {"status": status, "confidence": round(confidence, 4)}

    results["is_complete"] = all(
        v["status"] == "Present" for v in results.values() if isinstance(v, dict)
    )
    return results
```

---

## Technical Details

### Architecture

Each model is a two-stage `scikit-learn` pipeline:

1. **TF-IDF Vectorizer** — Converts raw OCR text into a sparse term-frequency matrix. Configured with sublinear TF scaling and character n-gram ranges tuned for OCR noise tolerance.
2. **XGBoost Classifier** — Gradient-boosted tree classifier trained on the resulting feature vectors with binary cross-entropy loss.

### Classification Threshold

The default decision threshold is **0.85** (rather than the standard 0.50). This was selected to minimize false positives on the *Missing* class — the models require high confidence before declaring a document present. This conservative threshold is appropriate for legal and compliance workflows, where falsely accepting an incomplete filing carries greater risk than requesting resubmission.

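The effect of the raised cutoff is easiest to see on a borderline score (the 0.70 probability below is illustrative, not real model output):

```python
def label(prob_present: float, threshold: float) -> str:
    # A document counts as Present only at or above the confidence threshold.
    return "Present" if prob_present >= threshold else "Missing"

# A borderline score that a default 0.50 cutoff would accept:
print(label(0.70, threshold=0.50))  # Present
# The conservative 0.85 production cutoff rejects it instead:
print(label(0.70, threshold=0.85))  # Missing
```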
### Training Data

| Property | Details |
| :--- | :--- |
| Dataset | `msme-dispute-document-corpus` (synthetic) |
| Generation Method | Gemini 2.5 Flash with structured OCR simulation |
| Total Samples | 8,000+ labeled examples across all 5 document classes |
| Noise Augmentation | OCR character substitutions, broken line breaks, skewed formatting |
| Adversarial Samples | Proforma invoices, draft purchase orders, unsigned contracts |

---

## Limitations

**Synthetic Training Distribution.** All training data is synthetically generated. While OCR noise augmentation is applied, model behavior on extremely degraded scans (e.g., below 150 DPI, severe skew, or handwritten annotations) is not guaranteed and should be validated on representative production samples before deployment.

**Language and Locale.** The models are optimized exclusively for English-language documents using Indian business conventions — INR currency formatting, GSTIN identifiers, and India-specific terminology such as "Challan". Performance on documents from other jurisdictions or in regional languages is untested.

**OCR Dependency.** These models process text only. PDF, image, or scanned document inputs must be pre-processed through an external OCR engine before inference. Prediction quality is directly bounded by the quality of the OCR output.

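Because prediction quality is bounded by OCR quality, a light normalization pass before inference can help. The function below is a sketch, not part of the released pipeline: it replaces non-printable control characters left by OCR engines and collapses broken line breaks into single spaces:

```python
import re

def normalize_ocr_text(raw: str) -> str:
    # Map control characters (form feeds, stray bytes) to spaces.
    cleaned = "".join(ch if ch.isprintable() else " " for ch in raw)
    # Collapse runs of whitespace left by broken line breaks and skew.
    cleaned = re.sub(r"\s+", " ", cleaned)
    return cleaned.strip()

noisy = "TAX\x0cINVOICE\n  Inv No:\n9988\tTotal:   50,000 INR"
print(normalize_ocr_text(noisy))
# TAX INVOICE Inv No: 9988 Total: 50,000 INR
```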
### Compatible OCR Engines

| Engine | Type | Notes |
| :--- | :--- | :--- |
| [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) | Open-source | Good baseline; benefits from image pre-processing |
| [Azure AI Document Intelligence](https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence) | Managed API | Strong performance on structured forms and tables |
| [Google Cloud Vision API](https://cloud.google.com/vision) | Managed API | Reliable across varied scan quality |

---

## Citation

If you use this model in research or production, please cite this repository and acknowledge the synthetic training corpus.

---

## License

See `LICENSE` for full terms.