Stacked GBM Ensemble for Income Classification (OpenML Task 7592)
Weighted ensemble of LightGBM, XGBoost, and CatBoost trained on the Adult Income dataset (UCI / OpenML task 7592). Hyperparameters optimised with Optuna (105 trials, TPE sampler). Evaluated under the standard 10-fold stratified CV protocol defined by OpenML.
The ensemble edges out the best recorded run on the OpenML leaderboard (AdaBoost, 2017) by 0.0031 AUC-ROC and 0.0020 accuracy.
| Model | AUC-ROC | Accuracy |
|---|---|---|
| This ensemble | 0.9315 | 0.8760 |
| OpenML best (AdaBoost, 2017) | 0.9284 | 0.8740 |
| LightGBM alone | 0.9301 | - |
| XGBoost alone | 0.9302 | - |
| CatBoost alone | 0.9310 | - |
Method
Features (28 total). Six raw numeric features augmented with log-transformed capital variables, binary flags, age/hours bins, and two interaction terms (education-num × age, education-num × hours-per-week). Categorical columns are encoded with OrdinalEncoder for LightGBM/XGBoost; CatBoost receives them natively.
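As a rough illustration of the feature families listed above, the sketch below derives a few of them for a single record in plain Python. The raw column names follow the Adult dataset. The derived names (`capital_gain_log`, `has_capital_gain`, `age_bin`, `edu_x_age`, and so on) and the bin edges are assumptions made for illustration only; the actual definitions live in train.py.

```python
import math
from bisect import bisect_right

# Hypothetical bin edges; the real ones are defined in train.py.
AGE_EDGES = [25, 35, 45, 55, 65]
HOURS_EDGES = [30, 40, 50]


def engineer(row: dict) -> dict:
    """Return a copy of `row` with a few illustrative derived features."""
    out = dict(row)
    # log1p compresses the heavy right tail of the capital variables and maps 0 to 0.
    out["capital_gain_log"] = math.log1p(row["capital-gain"])
    out["capital_loss_log"] = math.log1p(row["capital-loss"])
    # Binary flags: does the record report any capital gain or loss?
    out["has_capital_gain"] = int(row["capital-gain"] > 0)
    out["has_capital_loss"] = int(row["capital-loss"] > 0)
    # Ordinal bin index: the number of edges less than or equal to the value.
    out["age_bin"] = bisect_right(AGE_EDGES, row["age"])
    out["hours_bin"] = bisect_right(HOURS_EDGES, row["hours-per-week"])
    # The two interaction terms named in the text.
    out["edu_x_age"] = row["education-num"] * row["age"]
    out["edu_x_hours"] = row["education-num"] * row["hours-per-week"]
    return out


example = engineer({
    "age": 39, "education-num": 13, "hours-per-week": 40,
    "capital-gain": 2174, "capital-loss": 0,
})
```

In a real pipeline the same transformations would normally be applied column-wise to a DataFrame rather than row by row.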
Ensemble. Out-of-fold predictions from the three base learners are combined with fixed weights (LGB 0.1 / XGB 0.3 / CB 0.6). Decision threshold tuned on OOF predictions (0.512).
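The blending and thresholding steps can be sketched in plain Python as follows. The weights are the ones stated above. The toy probabilities, the helper names, and the use of accuracy as the criterion for the threshold search are assumptions, since the text does not say which metric the 0.512 threshold was tuned against.

```python
WEIGHTS = {"lgb": 0.1, "xgb": 0.3, "cb": 0.6}  # fixed weights from the text


def blend(preds: dict) -> list:
    """Weighted average of per-model positive-class probabilities."""
    n = len(next(iter(preds.values())))
    return [sum(WEIGHTS[m] * preds[m][i] for m in WEIGHTS) for i in range(n)]


def accuracy(y_true, proba, threshold):
    """Fraction of samples where (proba >= threshold) matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, proba))
    return hits / len(y_true)


def tune_threshold(y_true, proba, grid):
    """Return the grid value with the highest accuracy (first one on ties)."""
    return max(grid, key=lambda t: accuracy(y_true, proba, t))


# Toy out-of-fold probabilities for four samples (made-up numbers).
oof = {
    "lgb": [0.20, 0.70, 0.40, 0.90],
    "xgb": [0.10, 0.60, 0.50, 0.80],
    "cb":  [0.30, 0.41, 0.60, 0.95],
}
y = [0, 0, 1, 1]

proba = blend(oof)
grid = [i / 100 for i in range(30, 71)]  # candidate thresholds 0.30 .. 0.70
best = tune_threshold(y, proba, grid)
labels = [int(p >= best) for p in proba]
```

Because the threshold is chosen on out-of-fold predictions, no sample is scored by a model that saw it during training, though the threshold itself is still fitted to the same data it is evaluated on.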
Tuning. Optuna TPE, 3-fold inner CV: 40 trials for LightGBM, 40 for XGBoost, 25 for CatBoost.
Optimised hyperparameters
LGB = {"n_estimators": 1118, "learning_rate": 0.0115, "num_leaves": 90, "max_depth": 6}
XGB = {"n_estimators": 941, "learning_rate": 0.0488, "max_depth": 6, "gamma": 0.518}
CB = {"iterations": 778, "learning_rate": 0.0938, "depth": 4, "l2_leaf_reg": 0.057}
Cross-validation results (10-fold, AUC-ROC)
Fold 1 0.9270
Fold 2 0.9299
Fold 3 0.9319
Fold 4 0.9295
Fold 5 0.9293
Fold 6 0.9351
Fold 7 0.9368
Fold 8 0.9300
Fold 9 0.9342
Fold 10 0.9295
--------------
Mean 0.9313 ± 0.0029
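The summary line can be reproduced from the ten fold scores listed above. This is a minimal check using only the standard library; it suggests that the reported spread is the population standard deviation (dividing by n), since the sample standard deviation of the same scores rounds to 0.0031.

```python
from statistics import fmean, pstdev, stdev

fold_auc = [0.9270, 0.9299, 0.9319, 0.9295, 0.9293,
            0.9351, 0.9368, 0.9300, 0.9342, 0.9295]

mean_auc = fmean(fold_auc)
std_pop = pstdev(fold_auc)     # divides by n
std_sample = stdev(fold_auc)   # divides by n - 1

print(f"Mean {mean_auc:.4f} ± {std_pop:.4f}")  # Mean 0.9313 ± 0.0029
```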
Usage
import joblib
import catboost as cb
from huggingface_hub import hf_hub_download

REPO_ID = "AurelPx/BoostingEnsemble-Income-Classification"

# LightGBM and XGBoost are pickled, so lightgbm and xgboost must be installed to load them.
lgb_model = joblib.load(hf_hub_download(REPO_ID, "lgb_model.pkl"))
xgb_model = joblib.load(hf_hub_download(REPO_ID, "xgb_model.pkl"))
encoder = joblib.load(hf_hub_download(REPO_ID, "ordinal_encoder.pkl"))

# CatBoost is stored in its native format.
cb_model = cb.CatBoostClassifier()
cb_model.load_model(hf_hub_download(REPO_ID, "cb_model.cbm"))

# Build X_enc (28 features) and X_cb_df (21 columns, native categoricals); see train.py.
# Weighted average of the positive-class probabilities (LGB 0.1 / XGB 0.3 / CB 0.6).
proba = (
    0.1 * lgb_model.predict_proba(X_enc)[:, 1]
    + 0.3 * xgb_model.predict_proba(X_enc)[:, 1]
    + 0.6 * cb_model.predict_proba(X_cb_df)[:, 1]
)
labels = (proba >= 0.512).astype(int)  # 1 = income >50K
Full preprocessing pipeline in train.py.
Repository contents
| File | Description |
|---|---|
| lgb_model.pkl | LightGBM classifier (full dataset) |
| xgb_model.pkl | XGBoost classifier (full dataset) |
| cb_model.cbm | CatBoost classifier (native format) |
| ordinal_encoder.pkl | Fitted sklearn OrdinalEncoder |
| train.py | Reproducible training script |
| metadata.json | Results and hyperparameters |
Citation
@misc{aurelPx2026incomeclassifier,
author = {AurelPx},
title = {Stacked GBM Ensemble for Income Classification (OpenML Task 7592)},
year = {2026},
url = {https://huggingface.co/AurelPx/BoostingEnsemble-Income-Classification}
}