---
license: mit
tags:
- tabular-classification
- gradient-boosting
- stacking
- ensemble
- lightgbm
- xgboost
- catboost
- optuna
- openml
datasets:
- adult
metrics:
- roc_auc
- accuracy
language:
- en
---

# Stacked GBM Ensemble for Income Classification (OpenML Task 7592)

Weighted ensemble of LightGBM, XGBoost, and CatBoost trained on the Adult Income dataset (UCI / OpenML task 7592). Hyperparameters optimised with Optuna (105 trials, TPE sampler). Evaluated under the standard 10-fold stratified CV protocol defined by OpenML.

**The ensemble outperforms the best recorded run on the OpenML leaderboard** (AdaBoost, 2017) by 0.0031 AUC-ROC and 0.0020 accuracy.

| Model | AUC-ROC | Accuracy |
|---|---|---|
| This ensemble | **0.9315** | **0.8760** |
| OpenML best (AdaBoost, 2017) | 0.9284 | 0.8740 |
| LightGBM alone | 0.9301 | n/a |
| XGBoost alone | 0.9302 | n/a |
| CatBoost alone | 0.9310 | n/a |

---

## Method

**Features (28 total).** Six raw numeric features are augmented with log-transformed capital variables, binary flags, age/hours bins, and two interaction terms (`education-num × age`, `education-num × hours-per-week`). Categorical columns are encoded with `OrdinalEncoder` for LightGBM/XGBoost; CatBoost receives them natively.
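
The feature groups listed above could be derived roughly as follows. This is a minimal sketch, not the code in `train.py`: the column names follow the public Adult dataset, while the output column names, the bin edges, and the meaning of the binary flags (non-zero capital gain or loss) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Append illustrative engineered columns to a copy of the Adult frame."""
    out = df.copy()

    # Log-transform the heavily right-skewed capital variables.
    out["log_capital_gain"] = np.log1p(out["capital-gain"])
    out["log_capital_loss"] = np.log1p(out["capital-loss"])

    # Binary flags (assumed meaning: any non-zero capital gain or loss).
    out["has_capital_gain"] = (out["capital-gain"] > 0).astype(int)
    out["has_capital_loss"] = (out["capital-loss"] > 0).astype(int)

    # Coarse bins; these edges are hypothetical, not the ones in train.py.
    out["age_bin"] = pd.cut(out["age"], bins=[0, 25, 35, 45, 55, 65, 120], labels=False)
    out["hours_bin"] = pd.cut(out["hours-per-week"], bins=[0, 20, 40, 60, 100], labels=False)

    # The two interaction terms named in the model card.
    out["edu_x_age"] = out["education-num"] * out["age"]
    out["edu_x_hours"] = out["education-num"] * out["hours-per-week"]
    return out


# Three made-up rows, only to show the shape of the output.
sample = pd.DataFrame({
    "age": [39, 50, 23],
    "education-num": [13, 9, 10],
    "capital-gain": [2174, 0, 0],
    "capital-loss": [0, 0, 1902],
    "hours-per-week": [40, 13, 60],
})
features = add_engineered_features(sample)
print(features.filter(like="_").head())
```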

**Ensemble.** Out-of-fold (OOF) predictions from the three base learners are combined as a fixed-weight average (LGB 0.1 / XGB 0.3 / CB 0.6). The decision threshold (0.512) was tuned on the OOF predictions.
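
A minimal sketch of the blending and threshold step, assuming the threshold is chosen to maximise accuracy over a fixed grid (the card does not state the tuning criterion). The helper names `blend` and `tune_threshold`, the grid range, and the synthetic predictions are all hypothetical; in practice the inputs would be the real OOF probabilities from each base learner.

```python
import numpy as np

WEIGHTS = {"lgb": 0.1, "xgb": 0.3, "cb": 0.6}  # fixed weights from the model card


def blend(oof: dict) -> np.ndarray:
    """Weighted average of per-model positive-class probabilities."""
    return sum(w * np.asarray(oof[name], dtype=float) for name, w in WEIGHTS.items())


def tune_threshold(y_true, proba, grid=None):
    """Return (threshold, accuracy) for the grid point with the best accuracy."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    grid = np.linspace(0.30, 0.70, 401) if grid is None else np.asarray(grid)
    accuracies = [np.mean((proba >= t).astype(int) == y_true) for t in grid]
    best = int(np.argmax(accuracies))
    return float(grid[best]), float(accuracies[best])


# Synthetic stand-in for OOF predictions (not real model output).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
oof = {name: np.clip(0.3 + 0.4 * y + rng.normal(0, 0.2, size=y.size), 0, 1)
       for name in WEIGHTS}

blended = blend(oof)
threshold, accuracy = tune_threshold(y, blended)
print(f"threshold={threshold:.3f}  accuracy={accuracy:.4f}")
```

Tuning the threshold on OOF predictions rather than on in-sample predictions keeps the chosen cut-off from being fitted to data the models have already seen.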

**Tuning.** Optuna TPE, 3-fold inner CV: 40 trials for LightGBM, 40 for XGBoost, 25 for CatBoost.

### Optimised hyperparameters

```python
LGB  = {"n_estimators": 1118, "learning_rate": 0.0115, "num_leaves": 90,  "max_depth": 6}
XGB  = {"n_estimators":  941, "learning_rate": 0.0488, "max_depth": 6,    "gamma": 0.518}
CB   = {"iterations":    778, "learning_rate": 0.0938, "depth": 4,        "l2_leaf_reg": 0.057}
```

---

## Cross-validation results (10-fold)

```
Fold  1  0.9270
Fold  2  0.9299
Fold  3  0.9319
Fold  4  0.9295
Fold  5  0.9293
Fold  6  0.9351
Fold  7  0.9368
Fold  8  0.9300
Fold  9  0.9342
Fold 10  0.9295
──────────────
Mean  0.9313 ± 0.0029
```
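
The summary line can be reproduced from the per-fold scores with the standard library alone. The sketch below assumes the fold values are AUC-ROC scores; it shows that the reported spread matches the population standard deviation (the `ddof=0` convention that NumPy uses by default) rather than the sample standard deviation.

```python
from statistics import mean, pstdev, stdev

# Per-fold scores copied from the listing above.
fold_auc = [0.9270, 0.9299, 0.9319, 0.9295, 0.9293,
            0.9351, 0.9368, 0.9300, 0.9342, 0.9295]

print(f"mean           = {mean(fold_auc):.4f}")    # 0.9313
print(f"population std = {pstdev(fold_auc):.4f}")  # 0.0029
print(f"sample std     = {stdev(fold_auc):.4f}")   # 0.0031
```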

---

## Usage

```python
import joblib
import catboost as cb
from huggingface_hub import hf_hub_download

REPO = "AurelPx/BoostingEnsemble-Income-Classification"

# Download and load the three base learners and the fitted encoder.
lgb_model = joblib.load(hf_hub_download(REPO, "lgb_model.pkl"))
xgb_model = joblib.load(hf_hub_download(REPO, "xgb_model.pkl"))
encoder   = joblib.load(hf_hub_download(REPO, "ordinal_encoder.pkl"))
cb_model  = cb.CatBoostClassifier()
cb_model.load_model(hf_hub_download(REPO, "cb_model.cbm"))

# X_enc (28 features) and X_cb_df (21 cols, native categoricals) are not
# defined here; build them as in train.py before running the lines below.
proba = (
    0.1 * lgb_model.predict_proba(X_enc)[:, 1]
    + 0.3 * xgb_model.predict_proba(X_enc)[:, 1]
    + 0.6 * cb_model.predict_proba(X_cb_df)[:, 1]
)
labels = (proba >= 0.512).astype(int)  # 1 → income >50K
```

Full preprocessing pipeline in `train.py`.

---

## Repository contents

| File | Description |
|---|---|
| `lgb_model.pkl` | LightGBM classifier (full dataset) |
| `xgb_model.pkl` | XGBoost classifier (full dataset) |
| `cb_model.cbm` | CatBoost classifier (native format) |
| `ordinal_encoder.pkl` | Fitted sklearn OrdinalEncoder |
| `train.py` | Reproducible training script |
| `metadata.json` | Results and hyperparameters |

---

## Citation

```bibtex
@misc{aurelPx2026incomeclassifier,
  author = {AurelPx},
  title  = {Stacked GBM Ensemble for Income Classification (OpenML Task 7592)},
  year   = {2026},
  url    = {https://huggingface.co/AurelPx/BoostingEnsemble-Income-Classification}
}
```