abhinavdread committed on
Commit a6d5e1b · verified · 1 Parent(s): 83ecaea

Upload README.md


updated README.md with HF YAML structure

Files changed (1): README.md (+219 −0)
README.md ADDED
---
library_name: sklearn
tags:
- xgboost
- text-classification
- msme
- legal
- finance
- document-scoring
- dispute-resolution
pipeline_tag: text-classification
language:
- en
widget:
- text: "TAX INVOICE | Inv No: 9988 | Total: 50,000 INR | GSTIN: 27AAAAA0000A1Z5"
  example_title: "Valid Tax Invoice"
- text: "Please send me the invoice by tomorrow evening."
  example_title: "Email Request (Negative)"
---

# MSME Document Completeness Scorer

## Model Overview

This repository hosts an ensemble of **5 independent binary XGBoost classifiers** that automate the **Document Completeness Scoring** step in Indian MSME (Micro, Small, and Medium Enterprises) dispute resolution workflows.

Each classifier is a serialized `scikit-learn` pipeline (`TfidfVectorizer` → `XGBClassifier`) that detects the presence or absence of one specific mandatory document type in raw OCR-extracted text. The models are designed to be robust to common real-world challenges, including OCR noise, scanned-document artifacts, and adversarial near-miss inputs such as proforma invoices or draft documents, which structurally resemble valid legal documents but are legally insufficient for dispute filings.

---

## Included Models

| Model File | Target Document Type | Precision (Missing) | Recall (Present) |
| :--- | :--- | :---: | :---: |
| `invoice_model.pkl` | Tax Invoice | ~99% | ~90% |
| `po_model.pkl` | Purchase Order | ~99% | ~87% |
| `delivery_model.pkl` | Delivery Challan / Proof of Delivery | ~99% | ~90% |
| `gst_model.pkl` | GST Registration Certificate | ~99% | ~90% |
| `contract_model.pkl` | Supply Agreement / Contract | ~99% | ~90% |

> Models were trained on the `msme-dispute-document-corpus`, a synthetic OCR dataset of 8,000+ samples generated with Gemini 2.5 Flash.

---

## Intended Use Cases

This model suite is intended for:

- **Dispute Resolution Platforms** — Automatically flagging missing evidence documents in arbitration or legal case files before human review.
- **MSME Samadhaan Portals** — Programmatically filtering incomplete applications to reduce officer workload.
- **Legal Tech Pipelines** — Converting unstructured text dumps from scanned case files into structured document-presence classifications.

### Out-of-Scope Use

These models are not intended for general-purpose document classification, non-Indian business contexts, or languages other than English.

---

## Getting Started

### Installation

```bash
pip install scikit-learn xgboost pandas joblib huggingface_hub
```

### Loading the Models

```python
import joblib
from huggingface_hub import hf_hub_download

# Replace with your actual Hugging Face repository ID
REPO_ID = "your-username/msme-document-completeness-scorer"

MODEL_FILES = {
    "invoice": "invoice_model.pkl",
    "po": "po_model.pkl",
    "delivery": "delivery_model.pkl",
    "gst": "gst_model.pkl",
    "contract": "contract_model.pkl",
}

models = {}
for doc_type, filename in MODEL_FILES.items():
    model_path = hf_hub_download(repo_id=REPO_ID, filename=filename)
    models[doc_type] = joblib.load(model_path)
    print(f"Loaded: {filename}")
```

### Running Inference

```python
def predict_document_status(text: str, doc_type: str) -> tuple[str, float]:
    """
    Predicts whether a given document type is present in the provided OCR text.

    Args:
        text: Raw text string extracted from a scanned document via OCR.
        doc_type: Document classifier key. One of: 'invoice', 'po',
            'delivery', 'gst', 'contract'.

    Returns:
        status: 'Present' if the document type is detected, else 'Missing'.
        confidence: Probability score (0.0 to 1.0) from the classifier.
    """
    model = models.get(doc_type)
    if not model:
        raise ValueError(f"No model loaded for doc_type='{doc_type}'.")

    # predict_proba returns [[prob_class_0 (Missing), prob_class_1 (Present)]]
    confidence = model.predict_proba([text])[0][1]

    # Production threshold: >= 0.85 confidence required to classify as Present
    status = "Present" if confidence >= 0.85 else "Missing"
    return status, confidence


# Example
sample_text = """
TAX INVOICE
Inv No: INV-2024-001
Date: 12/12/2024
Total: 50,000 INR
GSTIN: 29ABCDE1234F1Z5
"""

status, confidence = predict_document_status(sample_text, "invoice")
print(f"Status : {status}")
print(f"Confidence : {confidence:.4f}")
```

**Expected output:**

```
Status : Present
Confidence : 0.9731
```

### Scoring a Full Case File

To check completeness across all five mandatory document types at once:

```python
def score_case_file(documents: dict[str, str]) -> dict:
    """
    Args:
        documents: A dict mapping doc_type keys to their OCR-extracted text.
            Example: {"invoice": "...", "po": "...", "gst": "..."}

    Returns:
        A results dict with status and confidence per document type,
        plus a top-level 'is_complete' boolean flag.
    """
    results = {}
    for doc_type, text in documents.items():
        status, confidence = predict_document_status(text, doc_type)
        results[doc_type] = {"status": status, "confidence": round(confidence, 4)}

    results["is_complete"] = all(
        v["status"] == "Present" for v in results.values() if isinstance(v, dict)
    )
    return results
```

---

## Technical Details

### Architecture

Each model is a two-stage `scikit-learn` pipeline:

1. **TF-IDF Vectorizer** — Converts raw OCR text into a sparse term-frequency matrix. Configured with sublinear TF scaling and character n-gram ranges tuned for OCR noise tolerance.
2. **XGBoost Classifier** — Gradient-boosted tree classifier trained on the resulting feature vectors with binary cross-entropy loss.

### Classification Threshold

The default decision threshold is **0.85** (rather than the standard 0.50). This was selected to minimize false positives on the *Missing* class — the models require high confidence before declaring a document present. This conservative threshold is appropriate for legal and compliance workflows, where falsely accepting an incomplete filing carries greater risk than requesting resubmission.

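The effect of the raised cutoff is easiest to see on a borderline score (the 0.70 probability below is illustrative, not real model output):

```python
def label(prob_present: float, threshold: float) -> str:
    # A document counts as Present only at or above the confidence threshold.
    return "Present" if prob_present >= threshold else "Missing"

# A borderline score that a default 0.50 cutoff would accept:
print(label(0.70, threshold=0.50))  # Present
# The conservative 0.85 production cutoff rejects it instead:
print(label(0.70, threshold=0.85))  # Missing
```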
### Training Data

| Property | Details |
| :--- | :--- |
| Dataset | `msme-dispute-document-corpus` (synthetic) |
| Generation Method | Gemini 2.5 Flash with structured OCR simulation |
| Total Samples | 8,000+ labeled examples across all 5 document classes |
| Noise Augmentation | OCR character substitutions, broken line breaks, skewed formatting |
| Adversarial Samples | Proforma invoices, draft purchase orders, unsigned contracts |

---

## Limitations

**Synthetic Training Distribution.** All training data is synthetically generated. While OCR noise augmentation is applied, model behavior on extremely degraded scans (e.g., below 150 DPI, severe skew, or handwritten annotations) is not guaranteed and should be validated on representative production samples before deployment.

**Language and Locale.** The models are optimized exclusively for English-language documents using Indian business conventions — INR currency formatting, GSTIN identifiers, and India-specific terminology such as "Challan". Performance on documents from other jurisdictions or in regional languages is untested.

**OCR Dependency.** These models process text only. PDF, image, or scanned document inputs must be pre-processed through an external OCR engine before inference. Prediction quality is directly bounded by the quality of the OCR output.

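Because prediction quality is bounded by OCR quality, a light normalization pass before inference can help. The function below is a sketch, not part of the released pipeline: it replaces non-printable control characters left by OCR engines and collapses broken line breaks into single spaces:

```python
import re

def normalize_ocr_text(raw: str) -> str:
    # Map control characters (form feeds, stray bytes) to spaces.
    cleaned = "".join(ch if ch.isprintable() else " " for ch in raw)
    # Collapse runs of whitespace left by broken line breaks and skew.
    cleaned = re.sub(r"\s+", " ", cleaned)
    return cleaned.strip()

noisy = "TAX\x0cINVOICE\n  Inv No:\n9988\tTotal:   50,000 INR"
print(normalize_ocr_text(noisy))
# TAX INVOICE Inv No: 9988 Total: 50,000 INR
```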
### Compatible OCR Engines

| Engine | Type | Notes |
| :--- | :--- | :--- |
| [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) | Open-source | Good baseline; benefits from image pre-processing |
| [Azure AI Document Intelligence](https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence) | Managed API | Strong performance on structured forms and tables |
| [Google Cloud Vision API](https://cloud.google.com/vision) | Managed API | Reliable across varied scan quality |

---

## Citation

If you use this model in research or production, please cite this repository and acknowledge the synthetic training corpus.

---

## License

See `LICENSE` for full terms.