| --- |
| license: mit |
| tags: |
| - tabular-classification |
| - gradient-boosting |
| - stacking |
| - ensemble |
| - lightgbm |
| - xgboost |
| - catboost |
| - optuna |
| - openml |
| datasets: |
| - adult |
| metrics: |
| - roc_auc |
| - accuracy |
| language: |
| - en |
| --- |
| |
| # Stacked GBM Ensemble for Income Classification (OpenML Task 7592) |
|
|
| A weighted ensemble of LightGBM, XGBoost, and CatBoost trained on the Adult Income dataset (UCI / OpenML task 7592). Hyperparameters were optimised with Optuna (105 trials in total, TPE sampler), and the ensemble was evaluated under the standard 10-fold stratified CV protocol defined by OpenML. |
|
|
| **The ensemble outperforms the best recorded run on the OpenML leaderboard for this task** (AdaBoost, 2017) on both AUC-ROC and accuracy. |
|
|
| | Model | AUC-ROC | Accuracy | |
| |---|---|---| |
| | This ensemble | **0.9315** | **0.8760** | |
| | OpenML best (AdaBoost, 2017) | 0.9284 | 0.8740 | |
| | LightGBM alone | 0.9301 | n/a | |
| | XGBoost alone | 0.9302 | n/a | |
| | CatBoost alone | 0.9310 | n/a | |
|
|
| --- |
|
|
| ## Method |
|
|
| **Features (28 total).** The six raw numeric features are augmented with log-transformed capital variables, binary flags, age/hours bins, and two interaction terms (`education-num × age`, `education-num × hours-per-week`). Categorical columns are encoded with `OrdinalEncoder` for LightGBM/XGBoost; CatBoost receives them natively. |
|
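The feature groups listed above can be illustrated on a single record. This is a rough sketch in plain Python, not the code from `train.py`: the derived feature names, the bin edges, and the choice of flags are all assumptions made for illustration, and it does not reproduce the full set of 28 features. Only the two interaction terms and the general feature families come from this card.

```python
import math
from bisect import bisect_right

# Hypothetical bin edges; the card does not state the ones actually used.
AGE_EDGES = [25, 35, 45, 55, 65]
HOURS_EDGES = [20, 40, 60]


def engineer(row):
    """Return a copy of one Adult record (a dict) with derived features added."""
    out = dict(row)
    # Log-transform the skewed capital variables; log1p is defined at zero.
    out["capital-gain-log"] = math.log1p(row["capital-gain"])
    out["capital-loss-log"] = math.log1p(row["capital-loss"])
    # Binary flags (assumed definitions).
    out["has-capital-gain"] = int(row["capital-gain"] > 0)
    out["has-capital-loss"] = int(row["capital-loss"] > 0)
    # Ordinal bin index; a value equal to an edge falls into the bin to its right.
    out["age-bin"] = bisect_right(AGE_EDGES, row["age"])
    out["hours-bin"] = bisect_right(HOURS_EDGES, row["hours-per-week"])
    # The two interaction terms named in this card.
    out["edu_x_age"] = row["education-num"] * row["age"]
    out["edu_x_hours"] = row["education-num"] * row["hours-per-week"]
    return out


# Example record using the raw numeric column names of the Adult dataset.
example = {"age": 39, "education-num": 13, "hours-per-week": 40,
           "capital-gain": 2174, "capital-loss": 0}
features = engineer(example)
```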
|
| **Ensemble.** Out-of-fold (OOF) predictions from the three base learners are combined as a weighted average with fixed weights that sum to 1 (LGB 0.1 / XGB 0.3 / CB 0.6). The decision threshold (0.512) is tuned on the same OOF predictions. |
|
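The blending and thresholding step can be sketched as follows. This is a minimal illustration in plain Python rather than the code from `train.py`: the helper names `blend`, `accuracy`, and `tune_threshold`, the search grid, the use of accuracy as the tuning criterion, and the toy predictions are all assumptions. Only the weights 0.1 / 0.3 / 0.6 come from this card. On real data the same operations would normally be vectorised.

```python
# Fixed ensemble weights from this card (LightGBM, XGBoost, CatBoost).
WEIGHTS = (0.1, 0.3, 0.6)


def blend(p_lgb, p_xgb, p_cb, weights=WEIGHTS):
    """Weighted average of the three models' positive-class probabilities."""
    w_lgb, w_xgb, w_cb = weights
    return [w_lgb * a + w_xgb * b + w_cb * c
            for a, b, c in zip(p_lgb, p_xgb, p_cb)]


def accuracy(y_true, proba, threshold):
    """Share of samples whose thresholded prediction matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, proba))
    return hits / len(y_true)


def tune_threshold(y_true, proba, lo=0.30, hi=0.70, steps=400):
    """Grid-search the cut-off that maximises accuracy (assumed criterion)."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    # max() returns the first grid point that reaches the best accuracy.
    return max(grid, key=lambda t: accuracy(y_true, proba, t))


# Toy out-of-fold predictions, for illustration only (not real results).
y = [0, 0, 1, 1, 0, 1]
oof_lgb = [0.20, 0.40, 0.70, 0.55, 0.45, 0.90]
oof_xgb = [0.10, 0.45, 0.65, 0.60, 0.50, 0.85]
oof_cb = [0.15, 0.35, 0.80, 0.58, 0.48, 0.95]

proba = blend(oof_lgb, oof_xgb, oof_cb)
threshold = tune_threshold(y, proba)
labels = [int(p >= threshold) for p in proba]
```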
|
| **Tuning.** Each base learner is tuned with Optuna's TPE sampler under 3-fold inner CV: 40 trials for LightGBM, 40 for XGBoost, and 25 for CatBoost (105 in total). |
|
|
| ### Optimised hyperparameters |
|
|
| ```python |
| LGB = {"n_estimators": 1118, "learning_rate": 0.0115, "num_leaves": 90, "max_depth": 6} |
| XGB = {"n_estimators": 941, "learning_rate": 0.0488, "max_depth": 6, "gamma": 0.518} |
| CB = {"iterations": 778, "learning_rate": 0.0938, "depth": 4, "l2_leaf_reg": 0.057} |
| ``` |
|
|
| --- |
|
|
| ## Cross-validation results (10-fold) |
|
|
| ``` |
| Fold 1 0.9270 |
| Fold 2 0.9299 |
| Fold 3 0.9319 |
| Fold 4 0.9295 |
| Fold 5 0.9293 |
| Fold 6 0.9351 |
| Fold 7 0.9368 |
| Fold 8 0.9300 |
| Fold 9 0.9342 |
| Fold 10 0.9295 |
| ────────────── |
| Mean 0.9313 ± 0.0029 |
| ``` |
|
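The summary line can be reproduced from the per-fold values listed above. The check below uses only the standard library. The reported spread of 0.0029 matches the population standard deviation (dividing by n), whereas the sample standard deviation (dividing by n - 1) rounds to 0.0031. That the population form was used is an inference from the numbers, not something this card states.

```python
from statistics import fmean, pstdev, stdev

# Per-fold AUC-ROC values copied from the table above.
fold_auc = [0.9270, 0.9299, 0.9319, 0.9295, 0.9293,
            0.9351, 0.9368, 0.9300, 0.9342, 0.9295]

mean_auc = fmean(fold_auc)
pop_std = pstdev(fold_auc)    # population form, divides by n
sample_std = stdev(fold_auc)  # sample form, divides by n - 1

print(f"Mean {mean_auc:.4f} ± {pop_std:.4f}")  # Mean 0.9313 ± 0.0029
```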
|
| --- |
|
|
| ## Usage |
|
|
| ```python |
| import catboost as cb |
| import joblib |
| from huggingface_hub import hf_hub_download |
| |
| REPO_ID = "AurelPx/BoostingEnsemble-Income-Classification" |
| |
| # Download each artifact from the Hub (cached locally after the first call) and load it. |
| lgb_model = joblib.load(hf_hub_download(REPO_ID, "lgb_model.pkl")) |
| xgb_model = joblib.load(hf_hub_download(REPO_ID, "xgb_model.pkl")) |
| encoder = joblib.load(hf_hub_download(REPO_ID, "ordinal_encoder.pkl")) |
| cb_model = cb.CatBoostClassifier() |
| cb_model.load_model(hf_hub_download(REPO_ID, "cb_model.cbm")) |
| |
| # Build X_enc (28 features) and X_cb_df (21 cols, native categoricals); see train.py |
| proba = ( |
|     0.1 * lgb_model.predict_proba(X_enc)[:, 1] |
|     + 0.3 * xgb_model.predict_proba(X_enc)[:, 1] |
|     + 0.6 * cb_model.predict_proba(X_cb_df)[:, 1] |
| ) |
| labels = (proba >= 0.512).astype(int)  # 1 = income >50K |
| ``` |
|
|
| The full preprocessing pipeline is in `train.py`. |
|
|
| --- |
|
|
| ## Repository contents |
|
|
| | File | Description | |
| |---|---| |
| | `lgb_model.pkl` | LightGBM classifier (full dataset) | |
| | `xgb_model.pkl` | XGBoost classifier (full dataset) | |
| | `cb_model.cbm` | CatBoost classifier (native format) | |
| | `ordinal_encoder.pkl` | Fitted sklearn OrdinalEncoder | |
| | `train.py` | Reproducible training script | |
| | `metadata.json` | Results and hyperparameters | |
|
|
| --- |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{aurelPx2026incomeclassifier, |
| author = {AurelPx}, |
| title = {Stacked GBM Ensemble for Income Classification (OpenML Task 7592)}, |
| year = {2026}, |
| url = {https://huggingface.co/AurelPx/BoostingEnsemble-Income-Classification} |
| } |
| ``` |
|
|