---
license: mit
tags:
- tabular-classification
- gradient-boosting
- stacking
- ensemble
- lightgbm
- xgboost
- catboost
- optuna
- openml
datasets:
- adult
metrics:
- roc_auc
- accuracy
language:
- en
---

# Stacked GBM Ensemble for Income Classification (OpenML Task 7592)

Weighted ensemble of LightGBM, XGBoost, and CatBoost trained on the Adult Income dataset (UCI / OpenML task 7592). Hyperparameters optimised with Optuna (105 trials, TPE sampler). Evaluated under the standard 10-fold stratified CV protocol defined by OpenML.

**Results outperform the best recorded run on the OpenML leaderboard** (AdaBoost, 2017).

| Model | AUC-ROC | Accuracy |
|---|---|---|
| This ensemble | **0.9315** | **0.8760** |
| OpenML best (AdaBoost, 2017) | 0.9284 | 0.8740 |
| LightGBM alone | 0.9301 | n/a |
| XGBoost alone | 0.9302 | n/a |
| CatBoost alone | 0.9310 | n/a |

---

## Method

**Features (28 total).** Six raw numeric features augmented with log-transformed capital variables, binary flags, age/hours bins, and two interaction terms (`education-num × age`, `education-num × hours-per-week`). Categorical columns encoded with `OrdinalEncoder` for LightGBM/XGBoost; CatBoost receives them natively.

**Ensemble.** Out-of-fold predictions from the three base learners are combined with fixed weights (LGB 0.1 / XGB 0.3 / CB 0.6). Decision threshold tuned on OOF predictions (0.512).

**Tuning.** Optuna TPE, 3-fold inner CV: 40 trials for LightGBM, 40 for XGBoost, 25 for CatBoost.
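
As a rough illustration of the **Ensemble** step, the sketch below blends three probability vectors with the reported weights and then picks a decision threshold by grid search. The selection criterion (accuracy), the search grid, and the helper names `blend`, `tune_threshold`, and `fake_oof` are assumptions made for this example; the card does not state them, and `train.py` remains the reference. Synthetic arrays stand in for the real out-of-fold predictions.

```python
import numpy as np

# Fixed blend weights reported in this card (LGB / XGB / CB).
WEIGHTS = np.array([0.1, 0.3, 0.6])


def blend(p_lgb, p_xgb, p_cb, weights=WEIGHTS):
    """Weighted average of positive-class probabilities from the three models."""
    stacked = np.vstack([p_lgb, p_xgb, p_cb])  # shape (3, n_samples)
    return weights @ stacked                   # shape (n_samples,)


def tune_threshold(y_true, proba, grid=None):
    """Return (threshold, accuracy) for the grid value that maximises accuracy.

    Accuracy as the criterion and the grid range are assumptions.
    """
    if grid is None:
        grid = np.linspace(0.30, 0.70, 401)  # hypothetical range, step 0.001
    y_true = np.asarray(y_true)
    accuracies = [np.mean((proba >= t).astype(int) == y_true) for t in grid]
    best = int(np.argmax(accuracies))
    return float(grid[best]), float(accuracies[best])


# Synthetic stand-ins for out-of-fold predictions (not real model output).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)


def fake_oof():
    noisy = 0.25 + 0.5 * y + rng.normal(0.0, 0.2, size=y.size)
    return np.clip(noisy, 0.0, 1.0)


proba = blend(fake_oof(), fake_oof(), fake_oof())
threshold, accuracy = tune_threshold(y, proba)
labels = (proba >= threshold).astype(int)
print(f"threshold={threshold:.3f}  accuracy={accuracy:.4f}")
```

Because the threshold is chosen on the same out-of-fold predictions it is scored on, a setup like this would typically fix the threshold once and reuse it at inference time, as the usage snippet in this card does with 0.512.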

### Optimised hyperparameters

```python
LGB = {"n_estimators": 1118, "learning_rate": 0.0115, "num_leaves": 90, "max_depth": 6}
XGB = {"n_estimators": 941, "learning_rate": 0.0488, "max_depth": 6, "gamma": 0.518}
CB = {"iterations": 778, "learning_rate": 0.0938, "depth": 4, "l2_leaf_reg": 0.057}
```

---

## Cross-validation results (10-fold)

```
Fold  1   0.9270
Fold  2   0.9299
Fold  3   0.9319
Fold  4   0.9295
Fold  5   0.9293
Fold  6   0.9351
Fold  7   0.9368
Fold  8   0.9300
Fold  9   0.9342
Fold 10   0.9295
──────────────
Mean      0.9313 ± 0.0029
```

---

## Usage

```python
import catboost as cb
import joblib
from huggingface_hub import hf_hub_download

REPO_ID = "AurelPx/BoostingEnsemble-Income-Classification"

# Pickled models and encoder (unpickling needs lightgbm, xgboost, and scikit-learn installed).
lgb_model = joblib.load(hf_hub_download(REPO_ID, "lgb_model.pkl"))
xgb_model = joblib.load(hf_hub_download(REPO_ID, "xgb_model.pkl"))
encoder = joblib.load(hf_hub_download(REPO_ID, "ordinal_encoder.pkl"))

# CatBoost model is stored in its native format.
cb_model = cb.CatBoostClassifier()
cb_model.load_model(hf_hub_download(REPO_ID, "cb_model.cbm"))

# Build X_enc (28 features) and X_cb_df (21 cols, native categoricals); see train.py
proba = (
    0.1 * lgb_model.predict_proba(X_enc)[:, 1]
    + 0.3 * xgb_model.predict_proba(X_enc)[:, 1]
    + 0.6 * cb_model.predict_proba(X_cb_df)[:, 1]
)
labels = (proba >= 0.512).astype(int)  # 1 → >50K
```

Full preprocessing pipeline in `train.py`.

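
The usage snippet expects the feature matrices to be built already. As a hedged illustration of the kinds of derived features listed under **Method** (log-transformed capital variables, binary flags, age/hours bins, and the two interaction terms), a minimal pandas sketch might look like the following. The raw column names `capital-gain` and `capital-loss` are assumed to follow the usual Adult dataset naming; the new column names, the bin edges, and the function `add_engineered_features` are hypothetical. It does not reproduce the exact 28-feature set, for which `train.py` is the reference.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative feature engineering; not the exact pipeline from train.py."""
    out = df.copy()

    # Log-transform the heavy-tailed capital variables (log1p maps 0 to 0).
    out["capital-gain-log"] = np.log1p(out["capital-gain"])
    out["capital-loss-log"] = np.log1p(out["capital-loss"])

    # Binary flags (hypothetical choice): any capital gain / loss at all.
    out["has-capital-gain"] = (out["capital-gain"] > 0).astype(int)
    out["has-capital-loss"] = (out["capital-loss"] > 0).astype(int)

    # Age and hours bins; the edges here are assumptions for illustration.
    out["age-bin"] = pd.cut(
        out["age"], bins=[0, 25, 35, 45, 55, 65, 120], labels=False
    )
    out["hours-bin"] = pd.cut(
        out["hours-per-week"], bins=[0, 20, 40, 60, 100], labels=False
    )

    # The two interaction terms named in the Method section.
    out["edu-x-age"] = out["education-num"] * out["age"]
    out["edu-x-hours"] = out["education-num"] * out["hours-per-week"]
    return out


# Two made-up rows with the numeric columns the sketch relies on.
sample = pd.DataFrame(
    {
        "age": [39, 50],
        "education-num": [13, 9],
        "capital-gain": [2174, 0],
        "capital-loss": [0, 0],
        "hours-per-week": [40, 13],
    }
)
engineered = add_engineered_features(sample)
print(engineered.columns.tolist())
```

At inference time the same transformations would need to be applied in the same order as during training, with the saved `ordinal_encoder.pkl` handling the categorical columns for LightGBM and XGBoost.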

---

## Repository contents

| File | Description |
|---|---|
| `lgb_model.pkl` | LightGBM classifier (full dataset) |
| `xgb_model.pkl` | XGBoost classifier (full dataset) |
| `cb_model.cbm` | CatBoost classifier (native format) |
| `ordinal_encoder.pkl` | Fitted sklearn OrdinalEncoder |
| `train.py` | Reproducible training script |
| `metadata.json` | Results and hyperparameters |

---

## Citation

```bibtex
@misc{aurelPx2026incomeclassifier,
  author = {AurelPx},
  title  = {Stacked GBM Ensemble for Income Classification (OpenML Task 7592)},
  year   = {2026},
  url    = {https://huggingface.co/AurelPx/BoostingEnsemble-Income-Classification}
}
```