---
license: mit
tags:
- tabular-classification
- gradient-boosting
- stacking
- ensemble
- lightgbm
- xgboost
- catboost
- optuna
- openml
datasets:
- adult
metrics:
- roc_auc
- accuracy
language:
- en
---
# Stacked GBM Ensemble for Income Classification (OpenML Task 7592)
Weighted ensemble of LightGBM, XGBoost, and CatBoost trained on the Adult Income dataset (UCI / OpenML task 7592). Hyperparameters were optimised with Optuna (105 trials in total, TPE sampler). Evaluation follows the standard 10-fold stratified CV protocol defined by OpenML.

**Results outperform the best recorded run on the OpenML leaderboard** (AdaBoost, 2017) by 0.0031 AUC-ROC and 0.0020 accuracy.

| Model | AUC-ROC | Accuracy |
|---|---|---|
| This ensemble | **0.9315** | **0.8760** |
| OpenML best (AdaBoost, 2017) | 0.9284 | 0.8740 |
| LightGBM alone | 0.9301 | n/a |
| XGBoost alone | 0.9302 | n/a |
| CatBoost alone | 0.9310 | n/a |
---
## Method
**Features (28 total).** Six raw numeric features augmented with log-transformed capital variables, binary flags, age/hours bins, and two interaction terms (`education-num × age`, `education-num × hours-per-week`). Categorical columns encoded with `OrdinalEncoder` for LightGBM/XGBoost; CatBoost receives them natively.
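
The derived numeric features described above might be built along these lines. This is a minimal sketch, not the code in `train.py`: the input column names follow the usual OpenML Adult naming, while the output column names and the bin edges are hypothetical choices made for illustration, and the sketch does not reproduce the exact 28-feature set.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add illustrative derived features to a frame of raw Adult columns."""
    out = df.copy()
    # Log-transformed capital variables; log1p keeps the many zeros at 0.
    out["capital-gain-log"] = np.log1p(out["capital-gain"])
    out["capital-loss-log"] = np.log1p(out["capital-loss"])
    # Binary flags (hypothetical names).
    out["has-capital-gain"] = (out["capital-gain"] > 0).astype(int)
    out["has-capital-loss"] = (out["capital-loss"] > 0).astype(int)
    # Age / hours bins; the edges are assumptions, not the trained ones.
    out["age-bin"] = pd.cut(out["age"], bins=[0, 25, 35, 45, 55, 65, 120], labels=False)
    out["hours-bin"] = pd.cut(out["hours-per-week"], bins=[0, 20, 40, 60, 100], labels=False)
    # The two interaction terms named in the card.
    out["edu-x-age"] = out["education-num"] * out["age"]
    out["edu-x-hours"] = out["education-num"] * out["hours-per-week"]
    return out


# Two toy rows, for illustration only.
raw = pd.DataFrame({
    "age": [39, 50],
    "education-num": [13, 9],
    "capital-gain": [2174, 0],
    "capital-loss": [0, 0],
    "hours-per-week": [40, 13],
})
features = add_features(raw)
```
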
**Ensemble.** Out-of-fold predictions from the three base learners are combined with fixed weights (LGB 0.1 / XGB 0.3 / CB 0.6). Decision threshold tuned on OOF predictions (0.512).
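
The blending and threshold step can be sketched as follows. The weights are the ones stated above; the threshold search criterion (accuracy over a fixed grid) and the toy predictions are assumptions for illustration, since the card does not say which objective produced 0.512.

```python
import numpy as np

# Fixed blend weights from the model card (LGB / XGB / CB).
WEIGHTS = np.array([0.1, 0.3, 0.6])


def blend(p_lgb, p_xgb, p_cb, weights=WEIGHTS):
    """Weighted average of the three positive-class probability vectors."""
    stacked = np.vstack([p_lgb, p_xgb, p_cb])  # shape (3, n_samples)
    return weights @ stacked


def tune_threshold(y_true, proba, grid=None):
    """Return the first cut-off on `grid` with the highest accuracy (assumed criterion)."""
    if grid is None:
        grid = np.linspace(0.30, 0.70, 401)
    y_true = np.asarray(y_true)
    accuracies = [np.mean((proba >= t).astype(int) == y_true) for t in grid]
    return float(grid[int(np.argmax(accuracies))])


# Toy out-of-fold predictions (illustrative values, not from the trained models).
y = np.array([0, 0, 1, 1, 0, 1])
p_lgb = np.array([0.20, 0.40, 0.70, 0.55, 0.10, 0.90])
p_xgb = np.array([0.25, 0.45, 0.65, 0.60, 0.15, 0.85])
p_cb = np.array([0.15, 0.48, 0.75, 0.52, 0.05, 0.95])

proba = blend(p_lgb, p_xgb, p_cb)
threshold = tune_threshold(y, proba)
labels = (proba >= threshold).astype(int)
```

Because the threshold is chosen on the same out-of-fold predictions that are later scored, the reported accuracy may be slightly optimistic; a nested split would give a cleaner estimate.
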
**Tuning.** Optuna TPE, 3-fold inner CV: 40 trials for LightGBM, 40 for XGBoost, 25 for CatBoost.
### Optimised hyperparameters
```python
LGB = {"n_estimators": 1118, "learning_rate": 0.0115, "num_leaves": 90, "max_depth": 6}
XGB = {"n_estimators": 941, "learning_rate": 0.0488, "max_depth": 6, "gamma": 0.518}
CB = {"iterations": 778, "learning_rate": 0.0938, "depth": 4, "l2_leaf_reg": 0.057}
```
---
## Cross-validation results (10-fold)
```
Fold 1 0.9270
Fold 2 0.9299
Fold 3 0.9319
Fold 4 0.9295
Fold 5 0.9293
Fold 6 0.9351
Fold 7 0.9368
Fold 8 0.9300
Fold 9 0.9342
Fold 10 0.9295
──────────────
Mean 0.9313 ± 0.0029
```
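
The summary line can be re-derived from the per-fold scores with the standard library. In this small check, the population standard deviation (`pstdev`) matches the reported spread, which suggests, though the card does not state it, that the figure was computed with zero degrees-of-freedom correction.

```python
from statistics import mean, pstdev

# Per-fold AUC-ROC values copied from the listing above.
fold_auc = [0.9270, 0.9299, 0.9319, 0.9295, 0.9293,
            0.9351, 0.9368, 0.9300, 0.9342, 0.9295]

cv_mean = mean(fold_auc)
cv_std = pstdev(fold_auc)  # population standard deviation (ddof = 0)

print(f"Mean {cv_mean:.4f} ± {cv_std:.4f}")  # Mean 0.9313 ± 0.0029
```
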
---
## Usage
```python
import catboost as cb
import joblib
from huggingface_hub import hf_hub_download

REPO_ID = "AurelPx/BoostingEnsemble-Income-Classification"

# Download (and cache) the artefacts from the Hub, then load them.
lgb_model = joblib.load(hf_hub_download(REPO_ID, "lgb_model.pkl"))
xgb_model = joblib.load(hf_hub_download(REPO_ID, "xgb_model.pkl"))
encoder = joblib.load(hf_hub_download(REPO_ID, "ordinal_encoder.pkl"))
cb_model = cb.CatBoostClassifier()
cb_model.load_model(hf_hub_download(REPO_ID, "cb_model.cbm"))

# Build X_enc (28 features) and X_cb_df (21 cols, native categoricals); see train.py
proba = (
    0.1 * lgb_model.predict_proba(X_enc)[:, 1]
    + 0.3 * xgb_model.predict_proba(X_enc)[:, 1]
    + 0.6 * cb_model.predict_proba(X_cb_df)[:, 1]
)
labels = (proba >= 0.512).astype(int)  # 1 = income >50K
```

Full preprocessing pipeline in `train.py`.

---
## Repository contents
| File | Description |
|---|---|
| `lgb_model.pkl` | LightGBM classifier (full dataset) |
| `xgb_model.pkl` | XGBoost classifier (full dataset) |
| `cb_model.cbm` | CatBoost classifier (native format) |
| `ordinal_encoder.pkl` | Fitted sklearn OrdinalEncoder |
| `train.py` | Reproducible training script |
| `metadata.json` | Results and hyperparameters |
---
## Citation
```bibtex
@misc{aurelPx2026incomeclassifier,
author = {AurelPx},
title = {Stacked GBM Ensemble for Income Classification (OpenML Task 7592)},
year = {2026},
url = {https://huggingface.co/AurelPx/BoostingEnsemble-Income-Classification}
}
```