vidyasagar786 committed
Commit ea5f538 · verified · 1 Parent(s): 907307c

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +95 -0
README.md ADDED
@@ -0,0 +1,95 @@
+ ---
+ license: mit
+ tags:
+ - document-classification
+ - xgboost
+ - tfidf
+ - sklearn
+ - text-classification
+ datasets:
+ - uditamin/rvl-cdip-small
+ language:
+ - en
+ ---
+
+ # 📄 Document Classifier: XGBoost + TF-IDF
+
+ A lightweight **document classification model** trained on the
+ [RVL-CDIP Small](https://huggingface.co/datasets/uditamin/rvl-cdip-small) dataset.
+
+ It classifies the OCR text of scanned documents into categories using
+ **TF-IDF** features (word and character n-grams) combined with
+ handcrafted numeric heuristic features, all fed into an **XGBoost** classifier.
+
+ ---
+
+ ## 🏗️ Model Architecture
+
+ | Component | Details |
+ |---|---|
+ | Classifier | XGBoost (`XGBClassifier`) |
+ | Text features | TF-IDF word n-grams (1–2), char n-grams (3–5) |
+ | Numeric features | `char_count`, `digit_count`, `uppercase_count`, `currency_count`, `line_count` |
+ | Scaler | `StandardScaler` (on numeric features) |
+ | Training rounds | Up to 400 estimators, early stopping (30 rounds) |
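
To make the "Text features" row more concrete, here is a minimal, stdlib-only sketch of which n-grams the two ranges cover. The helper names `word_ngrams` and `char_ngrams` are hypothetical, whitespace tokenisation and lowercasing are simplifying assumptions, and the vectorizers stored in the bundle may tokenise or pad differently. TF-IDF weighting itself is not shown.

```python
def word_ngrams(text: str, n_min: int = 1, n_max: int = 2) -> list[str]:
    """List word n-grams for n in [n_min, n_max] (assumes whitespace tokens)."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def char_ngrams(text: str, n_min: int = 3, n_max: int = 5) -> list[str]:
    """List character n-grams for n in [n_min, n_max] over the raw string."""
    text = text.lower()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


if __name__ == "__main__":
    print(word_ngrams("Invoice total due"))
    print(char_ngrams("total"))
```

Character n-grams tend to be more tolerant of OCR misspellings than whole words, which is one plausible reason for combining both views.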
+
+ ---
+
+ ## 📦 Files
+
+ | File | Description |
+ |---|---|
+ | `document_classifier_xgb.pkl` | Serialised model bundle (joblib): contains model, vectorizers, and scaler |
+ | `predict_document.py` | Ready-to-use inference script |
+ | `train_model.py` | Full training script |
+ | `training_curve.png` | Train vs validation log-loss curve |
+ | `feature_importance.png` | Top-20 feature importances |
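
The validation curve in `training_curve.png` is presumably the signal that early stopping monitored. As a rough illustration of the 30-round rule listed under Model Architecture, the sketch below replays a list of per-round validation losses and reports where training would halt. The function `early_stop_round` and the loss values are made up for illustration; the actual run relied on XGBoost's own early-stopping logic, not on this code.

```python
def early_stop_round(val_losses, patience=30):
    """Replay validation losses and return (best_round, last_round), 0-based.

    Training halts once `patience` rounds pass without a new best loss.
    Assumes `val_losses` is non-empty.
    """
    best_round, best_loss = 0, float("inf")
    for rnd, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = rnd, loss
        elif rnd - best_round >= patience:
            return best_round, rnd  # stopped early
    return best_round, len(val_losses) - 1  # ran through every round


if __name__ == "__main__":
    # Toy curve with a small patience so the effect is visible
    losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6]
    print(early_stop_round(losses, patience=3))
```

In the toy curve the last value would have been a new best, but it is never reached: that is the trade-off a patience setting makes.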
+
+ ---
+
+ ## 🚀 Quick Start
+
+ ```python
+ import joblib
+ from scipy.sparse import csr_matrix, hstack
+
+ # Load the model bundle (model + TF-IDF vectorizers + scaler)
+ bundle = joblib.load("document_classifier_xgb.pkl")
+ model = bundle["model"]
+ word_vectorizer = bundle["word_vectorizer"]
+ char_vectorizer = bundle["char_vectorizer"]
+ scaler = bundle["scaler"]
+
+
+ def predict(text: str) -> int:
+     """Return the predicted integer class label for one document's text."""
+     word_feat = word_vectorizer.transform([text])
+     char_feat = char_vectorizer.transform([text])
+     # Numeric heuristic features, scaled with the bundled StandardScaler
+     num_feat = scaler.transform([[
+         len(text),                          # char_count
+         sum(c.isdigit() for c in text),     # digit_count
+         sum(c.isupper() for c in text),     # uppercase_count
+         text.count("$") + text.count("£"),  # currency_count
+         text.count("\n"),                   # line_count
+     ]])
+     # Stack the sparse TF-IDF blocks and the numeric block column-wise
+     features = hstack([word_feat, char_feat, csr_matrix(num_feat)], format="csr")
+     return int(model.predict(features)[0])
+
+
+ label = predict("Invoice No. 12345 Total: $499.99 Date: 01/01/2024")
+ print("Predicted label:", label)
+ ```
+
+ ---
+
+ ## 📊 Training Details
+
+ - **Dataset**: RVL-CDIP Small (train / val / test split)
+ - **Objective**: `multi:softprob` (multi-class log loss)
+ - **Hardware**: CPU
+ - **Framework**: XGBoost 2.x, scikit-learn, joblib
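
For readers unfamiliar with `multi:softprob`: the objective produces one probability per class for each document, and the multi-class log loss is the average of the negative log-probability assigned to the true class. The stdlib-only sketch below shows both steps on made-up scores. The function names and the three-class example are illustrative assumptions and are not taken from `train_model.py`.

```python
import math


def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multiclass_log_loss(probs, labels, eps=1e-15):
    """Mean of -log(p[true class]) over all samples."""
    losses = [-math.log(max(p[y], eps)) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


# Two hypothetical documents scored against three classes
probs = [softmax([2.0, 0.5, -1.0]), softmax([0.0, 0.0, 0.0])]
# The predicted label is the class with the highest probability
predicted = [max(range(len(p)), key=p.__getitem__) for p in probs]
loss = multiclass_log_loss(probs, [0, 2])

if __name__ == "__main__":
    print(predicted, round(loss, 4))
```

A uniform prediction over three classes costs `log(3)` for that sample, which gives a useful baseline when reading the log-loss curve.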
+
+ ---
+
+ ## 📝 License
+
+ MIT