MSME Document Presence Detection Model

Overview

This repository contains production-grade XGBoost models designed to detect the presence of mandatory documents in MSME arbitration cases based on OCR-extracted text.

The system performs binary classification for the following documents:

  • Invoice
  • Purchase Order
  • Delivery Proof
  • GST Certificate
  • Contract

Each document type is modeled independently by its own binary classifier.


Model Architecture

  • Algorithm: XGBoost (Gradient Boosted Trees)
  • Feature Extraction: TF-IDF over unigrams and bigrams (n-gram range 1–2)
  • Max Features per model: 3000
  • Independent model per document type
  • Stratified train-test split
  • Hard negative augmentation included
  • Severe OCR corruption simulation included
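To make the feature definition concrete, the following is a minimal standard-library sketch of how a unigram-plus-bigram vocabulary capped at 3000 terms could be built. It is an illustration only: the whitespace tokenizer, the function names `ngram_terms` and `build_vocabulary`, and the two sample strings are assumptions, and the source does not state which library or tokenizer the repository uses.

```python
from collections import Counter


def ngram_terms(text, ngram_range=(1, 2)):
    """Return word n-grams of `text` for every n in the inclusive range."""
    tokens = text.lower().split()  # assumed tokenizer: lowercase + whitespace split
    lo, hi = ngram_range
    terms = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return terms


def build_vocabulary(corpus, max_features=3000):
    """Keep at most `max_features` terms, ranked by frequency across the corpus."""
    counts = Counter()
    for doc in corpus:
        counts.update(ngram_terms(doc))
    return {term: idx for idx, (term, _) in enumerate(counts.most_common(max_features))}


# Hypothetical OCR snippets, not taken from the training data.
corpus = [
    "tax invoice no 1042 total amount due",
    "purchase order no 77 delivery date",
]
vocab = build_vocabulary(corpus, max_features=3000)
```

Each vocabulary term becomes one TF-IDF column, so a cap of 3000 bounds the input width of every per-document classifier.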

Training Data

The models were trained on a synthetic, augmented dataset. The base set is 5,000 LLM-generated structured OCR samples, expanded with the following augmentations:

  • OCR distortion simulation
  • Keyword masking
  • Partial truncation
  • Cross-document contamination
  • Line shuffling
  • Hard negative construction
  • Class imbalance simulation

Final training dataset size: approximately 10,000 samples.
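Several of the augmentations listed above can be sketched with the standard library alone. This is a hedged illustration, not the repository's pipeline: the confusion table `OCR_CONFUSIONS`, the function names, the mask token, and all parameter values are assumptions chosen for the example.

```python
import random
import re

# Hypothetical table of characters that OCR engines commonly confuse.
OCR_CONFUSIONS = {"0": "O", "O": "0", "1": "l", "l": "1", "5": "S", "S": "5", "8": "B"}


def distort_ocr(text, rate, rng):
    """Swap each confusable character with probability `rate`."""
    return "".join(
        OCR_CONFUSIONS[ch] if ch in OCR_CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )


def mask_keywords(text, keywords, mask="###"):
    """Replace whole-word keyword matches (case-insensitive) with `mask`."""
    pattern = r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.sub(pattern, mask, text, flags=re.IGNORECASE)


def truncate(text, keep_fraction):
    """Keep only the leading `keep_fraction` of the characters."""
    return text[: int(len(text) * keep_fraction)]


def shuffle_lines(text, rng):
    """Return the same lines in a random order."""
    lines = text.splitlines()
    rng.shuffle(lines)
    return "\n".join(lines)


def contaminate(text, other_text, n_lines, rng):
    """Append up to `n_lines` random lines taken from another document's text."""
    other_lines = other_text.splitlines()
    picked = rng.sample(other_lines, min(n_lines, len(other_lines)))
    return "\n".join([text] + picked)


rng = random.Random(42)  # fixed seed so the augmentation is reproducible
sample = "TAX INVOICE\nInvoice No: 1058\nTotal: 5000"
augmented = shuffle_lines(distort_ocr(sample, rate=0.3, rng=rng), rng)
```

Masking the keyword "invoice" in a sample that is still labeled as containing an invoice is one way to keep a classifier from relying on a single token.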


Performance

Average performance across all document classifiers:

  • Precision: 0.992
  • Recall: 0.987
  • F1 Score: 0.990
  • ROC-AUC: 0.999
  • False Negative Rate: < 2%

Performance was evaluated on the test portion of a stratified 80/20 train-test split.
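For reference, the threshold-based metrics above follow from confusion-matrix counts as sketched below. The false negative rate is the complement of recall, so the two figures move together. The counts in the example are toy values chosen for easy arithmetic, not results from this repository, and the function names are assumptions.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and false negative rate from counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # equals 1 - recall
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_negative_rate": fnr,
    }


def macro_average(per_model_metrics):
    """Average each metric across the independent per-document classifiers."""
    keys = per_model_metrics[0].keys()
    n = len(per_model_metrics)
    return {k: sum(m[k] for m in per_model_metrics) / n for k in keys}


# Toy counts for two hypothetical classifiers.
invoice = classification_metrics(tp=90, fp=10, fn=10, tn=90)
contract = classification_metrics(tp=80, fp=20, fn=20, tn=80)
average = macro_average([invoice, contract])
```

Because the averages are taken per metric across classifiers, the averaged F1 does not have to equal the F1 computed from the averaged precision and recall.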


Inference

Each model takes raw OCR-extracted text as input and predicts whether its document type is present.

Output per document:

  • Binary prediction (0 = Missing, 1 = Present)
  • Probability score
  • Optional SHAP-based explainability (external implementation)

A completeness score can be computed as:

completeness = (documents_present / required_documents) × 100
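The formula can be sketched as follows, assuming each classifier has already produced a probability score. The 0.5 threshold, the dictionary keys, the function names, and the sample probabilities are all assumptions for illustration. The second call shows one way externally implemented contract-optionality logic could plug in: by leaving the contract out of the required set.

```python
DOCUMENT_TYPES = (
    "invoice",
    "purchase_order",
    "delivery_proof",
    "gst_certificate",
    "contract",
)


def predict_presence(probabilities, threshold=0.5):
    """Map each probability to a binary label (0 = Missing, 1 = Present)."""
    return {doc: int(p >= threshold) for doc, p in probabilities.items()}


def completeness_score(predictions, required_documents):
    """Percentage of required documents predicted as present."""
    required = list(required_documents)
    if not required:
        raise ValueError("required_documents must not be empty")
    present = sum(predictions.get(doc, 0) for doc in required)
    return 100.0 * present / len(required)


# Hypothetical per-document probability scores for one case.
probabilities = {
    "invoice": 0.98,
    "purchase_order": 0.91,
    "delivery_proof": 0.12,
    "gst_certificate": 0.87,
    "contract": 0.40,
}
predictions = predict_presence(probabilities)

# All five documents required: 3 of 5 present.
score_all = completeness_score(predictions, DOCUMENT_TYPES)

# Contract treated as optional: 3 of 4 present.
required_without_contract = [d for d in DOCUMENT_TYPES if d != "contract"]
score_without_contract = completeness_score(predictions, required_without_contract)
```

With these sample values, `score_all` is 60.0 and `score_without_contract` is 75.0.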


Intended Use

These models are suitable for:

  • MSME arbitration automation
  • Legal document validation pipelines
  • OCR post-processing systems
  • Document completeness scoring engines
  • Hybrid rule + ML legal systems

Limitations

  • Trained only on synthetic and augmented OCR data, with no real scanned documents
  • Real-world scanned PDFs may introduce unseen distortions
  • Extreme low-quality scans may reduce recall
  • Contract optionality logic must be implemented externally
  • Not intended for semantic contract analysis

Ethical Considerations

The models were trained exclusively on synthetic data. No real personal, financial, or legal records were used.


Future Work

  • Fine-tuning on real arbitration case documents
  • Probability calibration
  • Threshold optimization per document type
  • Model drift monitoring
  • Ensemble rule + ML integration
  • ONNX export for optimized inference

License

This project is released under the MIT License.
