metadata
license: mit
tags:
  - tabular-classification
  - gradient-boosting
  - stacking
  - ensemble
  - lightgbm
  - xgboost
  - catboost
  - optuna
  - openml
datasets:
  - adult
metrics:
  - roc_auc
  - accuracy
language:
  - en

Stacked GBM Ensemble for Income Classification (OpenML Task 7592)

Weighted ensemble of LightGBM, XGBoost, and CatBoost trained on the Adult Income dataset (UCI / OpenML task 7592). Hyperparameters optimised with Optuna (105 trials, TPE sampler). Evaluated under the standard 10-fold stratified CV protocol defined by OpenML.

The ensemble outperforms the best recorded run on the OpenML leaderboard (AdaBoost, 2017) by 0.0031 AUC-ROC and 0.0020 accuracy.

Model                          AUC-ROC   Accuracy
This ensemble                  0.9315    0.8760
OpenML best (AdaBoost, 2017)   0.9284    0.8740
LightGBM alone                 0.9301    n/a
XGBoost alone                  0.9302    n/a
CatBoost alone                 0.9310    n/a

Method

Features (28 total). Six raw numeric features augmented with log-transformed capital variables, binary flags, age/hours bins, and two interaction terms (education-num × age, education-num × hours-per-week). Categorical columns encoded with OrdinalEncoder for LightGBM/XGBoost; CatBoost receives them natively.
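
A minimal sketch of the derived features described above, assuming the standard Adult column names. Bin edges, flag definitions, and output column names are hypothetical; the authoritative definitions are in train.py. The sketch works on a dict of NumPy arrays to stay dependency-light.

```python
import numpy as np


def add_engineered_features(cols):
    """Return a copy of `cols` with derived features added.

    `cols` maps Adult column names to 1-D NumPy arrays. Bin edges and
    flag definitions below are assumptions, not values from train.py.
    """
    out = dict(cols)
    age = np.asarray(cols["age"], dtype=float)
    hours = np.asarray(cols["hours-per-week"], dtype=float)
    edu = np.asarray(cols["education-num"], dtype=float)
    gain = np.asarray(cols["capital-gain"], dtype=float)
    loss = np.asarray(cols["capital-loss"], dtype=float)

    # log1p compresses the heavy right tail and maps 0 to 0
    out["log-capital-gain"] = np.log1p(gain)
    out["log-capital-loss"] = np.log1p(loss)

    # Binary flags: most rows have zero capital gain/loss
    out["has-capital-gain"] = (gain > 0).astype(int)
    out["has-capital-loss"] = (loss > 0).astype(int)

    # Ordinal bin index per row (hypothetical edges)
    out["age-bin"] = np.digitize(age, bins=[25, 35, 45, 55, 65])
    out["hours-bin"] = np.digitize(hours, bins=[20, 35, 40, 50])

    # The two interaction terms named in the text
    out["edu-x-age"] = edu * age
    out["edu-x-hours"] = edu * hours
    return out


example = add_engineered_features({
    "age": np.array([22, 40, 70]),
    "hours-per-week": np.array([15, 40, 60]),
    "education-num": np.array([9, 13, 16]),
    "capital-gain": np.array([0, 5000, 0]),
    "capital-loss": np.array([0, 0, 1902]),
})
```

Tree ensembles can approximate such transforms with splits, so the bins and interactions mainly shorten the paths the trees need to learn.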

Ensemble. Out-of-fold predictions from the three base learners are combined with fixed weights (LGB 0.1 / XGB 0.3 / CB 0.6). Decision threshold tuned on OOF predictions (0.512).
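
A minimal sketch of the blend and threshold search, using the weights stated above. The selection criterion (accuracy) and the search grid are assumptions; the card does not say which metric the 0.512 threshold was tuned for. The function names are illustrative.

```python
import numpy as np

WEIGHTS = {"lgb": 0.1, "xgb": 0.3, "cb": 0.6}  # weights from the card


def blend(probas, weights=WEIGHTS):
    """Weighted average of per-model positive-class probabilities."""
    return sum(weights[name] * np.asarray(p, dtype=float)
               for name, p in probas.items())


def tune_threshold(y_true, proba, grid=None):
    """Return (threshold, accuracy) for the grid value with the highest
    accuracy; the first such value wins on ties. Assumed criterion."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    if grid is None:
        grid = np.linspace(0.30, 0.70, 401)  # hypothetical grid, step 0.001
    accs = np.array([np.mean((proba >= t).astype(int) == y_true) for t in grid])
    best = int(np.argmax(accs))
    return float(grid[best]), float(accs[best])


# Toy out-of-fold predictions, for illustration only
oof = {"lgb": [0.0, 1.0], "xgb": [1.0, 0.0], "cb": [0.5, 0.5]}
blended = blend(oof)
threshold, accuracy = tune_threshold([0, 0, 1, 1], [0.10, 0.45, 0.55, 0.90])
```

Tuning the threshold on out-of-fold rather than in-sample predictions keeps the chosen cut-off from being fitted to rows the base learners have already seen.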

Tuning. Optuna TPE, 3-fold inner CV: 40 trials for LightGBM, 40 for XGBoost, 25 for CatBoost.

Optimised hyperparameters

LGB  = {"n_estimators": 1118, "learning_rate": 0.0115, "num_leaves": 90,  "max_depth": 6}
XGB  = {"n_estimators":  941, "learning_rate": 0.0488, "max_depth": 6,    "gamma": 0.518}
CB   = {"iterations":    778, "learning_rate": 0.0938, "depth": 4,        "l2_leaf_reg": 0.057}

Cross-validation results (10-fold)

Fold  1  0.9270
Fold  2  0.9299
Fold  3  0.9319
Fold  4  0.9295
Fold  5  0.9293
Fold  6  0.9351
Fold  7  0.9368
Fold  8  0.9300
Fold  9  0.9342
Fold 10  0.9295
──────────────
Mean  0.9313 ± 0.0029
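
The summary line can be reproduced from the per-fold values above. The reported spread matches the population standard deviation (ddof = 0); the sample standard deviation of the same ten values is slightly larger. Variable names are illustrative.

```python
from statistics import mean, pstdev, stdev

# Per-fold AUC-ROC values copied from the table above
fold_auc = [0.9270, 0.9299, 0.9319, 0.9295, 0.9293,
            0.9351, 0.9368, 0.9300, 0.9342, 0.9295]

cv_mean = mean(fold_auc)
cv_std = pstdev(fold_auc)        # population std (divides by n)
cv_std_sample = stdev(fold_auc)  # sample std (divides by n - 1)

print(f"{cv_mean:.4f} ± {cv_std:.4f}")  # prints 0.9313 ± 0.0029
```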

Usage

import joblib
import catboost as cb
from huggingface_hub import hf_hub_download

REPO_ID = "AurelPx/IncomeSlayer-9000"

# Download each artifact (cached locally after the first call) and load it
lgb_model = joblib.load(hf_hub_download(REPO_ID, "lgb_model.pkl"))
xgb_model = joblib.load(hf_hub_download(REPO_ID, "xgb_model.pkl"))
encoder   = joblib.load(hf_hub_download(REPO_ID, "ordinal_encoder.pkl"))
cb_model  = cb.CatBoostClassifier()
cb_model.load_model(hf_hub_download(REPO_ID, "cb_model.cbm"))

# Build X_enc (28 features, categoricals passed through `encoder`) and
# X_cb_df (21 cols, native categoricals); see train.py
proba = (
    0.1 * lgb_model.predict_proba(X_enc)[:, 1]
    + 0.3 * xgb_model.predict_proba(X_enc)[:, 1]
    + 0.6 * cb_model.predict_proba(X_cb_df)[:, 1]
)
labels = (proba >= 0.512).astype(int)  # 1 -> >50K

Full preprocessing pipeline in train.py.


Repository contents

File                  Description
lgb_model.pkl         LightGBM classifier (full dataset)
xgb_model.pkl         XGBoost classifier (full dataset)
cb_model.cbm          CatBoost classifier (native format)
ordinal_encoder.pkl   Fitted sklearn OrdinalEncoder
train.py              Reproducible training script
metadata.json         Results and hyperparameters

Citation

@misc{aurelPx2026incomeclassifier,
  author = {AurelPx},
  title  = {Stacked GBM Ensemble for Income Classification (OpenML Task 7592)},
  year   = {2026},
  url    = {https://huggingface.co/AurelPx/IncomeSlayer-9000}
}