AurelPx/BoostingEnsemble-Income-Classification

@@ -9,10 +9,7 @@ tags:
 - xgboost
 - catboost
 - optuna
-- income-prediction
 - openml
-- sota
-- ml-intern
 datasets:
 - adult
 metrics:
@@ -22,211 +19,102 @@ language:
 - en
 ---
-# 🔪 IncomeSlayer-9000 — We Just Buried the OpenML Leaderboard
-> **TL;DR:** LightGBM + XGBoost + CatBoost stacked ensemble, Optuna-tuned, feature-engineered.
-> **AUC 0.9315 | Accuracy 0.8760** on 10-fold CV — beats the OpenML Task 7592 SOTA by **+0.003 AUC** and **+0.002 Acc**.
-> The old king? A 2017 AdaBoost pipeline. Dethroned. Permanently.
----
-## 💀 The Benchmark We Crushed
-| Model | AUC | Accuracy | Notes |
-|---|---|---|---|
-| **IncomeSlayer-9000** *(ours)* | **0.93147** | **0.87599** | LGB+XGB+CB stacking |
-| OpenML Task 7592 SOTA | 0.92840 | 0.87400 | AdaBoost, 2017 |
-| LightGBM alone (tuned) | 0.93006 | — | Already beats SOTA |
-| XGBoost alone (tuned) | 0.93018 | — | Already beats SOTA |
-| CatBoost alone (tuned) | 0.93098 | — | Already beats SOTA |
-**Every single component of our ensemble individually outperforms the best recorded result on OpenML.**
-The stacked ensemble pushes it even further.
 ---
-## 🏋️ What Makes This Model Rip
-### Feature Engineering That Actually Works
-Not all feature engineering is cope. Here's what moved the needle:
-```python
-# Capital features: raw values are bimodal (0 or large) → fix the distribution
-log1p(capital_gain), log1p(capital_loss)
-capital_net = capital_gain - capital_loss   # net position
-capital_any_flag = (gain > 0) | (loss > 0) # binary: has any capital activity
-# Interaction terms: these two alone are the #1 and #4 most important features
-edu_x_age   = education_num * age          # experience × qualification
-edu_x_hours = education_num * hours_per_week
-# Bins that encode domain knowledge
-age_bins = [<25, 25-35, 35-45, 45-55, 55-65, 65+]
-hours_bins = [part-time, normal, mild OT, heavy OT, extreme]
-```
-### Three Diverse GBMs — Not Three Copies of the Same Model
-| Model | Unique advantage |
-|---|---|
-| **LightGBM** | Leaf-wise splits, fastest on this data |
-| **XGBoost** | Level-wise splits, different bias/variance tradeoff |
-| **CatBoost (dominant w=0.6)** | Native ordered target encoding on 8 categorical columns — no label leakage |
-CatBoost handles `workclass`, `occupation`, `native-country` etc. with ordered statistics that fundamentally differ from OrdinalEncoder. That diversity is why blending helps.
-### Optuna Found What Grid Search Would Miss
-- **105 total trials** across 3 models (40 LGB + 40 XGB + 25 CB)
-- TPE sampler, 3-fold inner CV
-- Key discovery: CatBoost prefers **shallow trees (depth=4)** with **high learning rate (0.094)** — counterintuitive but empirically validated
 ---
-## 📊 Full 10-Fold Results
 ```
-Fold  1: AUC = 0.9270
-Fold  2: AUC = 0.9299
-Fold  3: AUC = 0.9319
-Fold  4: AUC = 0.9295
-Fold  5: AUC = 0.9293
-Fold  6: AUC = 0.9351
-Fold  7: AUC = 0.9368  ← peak fold
-Fold  8: AUC = 0.9300
-Fold  9: AUC = 0.9342
-Fold 10: AUC = 0.9295
-─────────────────────
-Mean:  0.93130 ± 0.00293
 ```
-Tight variance. This isn't a lucky run.
----
-## 🗂️ Dataset: Adult Income (OpenML Task 7592)
-- **48,842 samples** from the 1994 US Census
-- **14 features**: 6 numeric, 8 categorical
-- **Target**: income >50K vs ≤50K (23.9% positive rate)
-- **Missing values**: workclass (2,799), occupation (2,809), native-country (857) — handled via CatBoost native encoding + OrdinalEncoder fallback
 ---
-## 🔧 Hyperparameters (Optuna Best)
 ```python
-LGB_PARAMS = {
-    "n_estimators": 1118, "learning_rate": 0.01148, "num_leaves": 90,
-    "max_depth": 6, "min_child_samples": 20, "colsample_bytree": 0.555,
-    "subsample": 0.958, "reg_alpha": 7.1e-4, "reg_lambda": 1.5e-3
-}
-XGB_PARAMS = {
-    "n_estimators": 941, "learning_rate": 0.04882, "max_depth": 6,
-    "min_child_weight": 1, "colsample_bytree": 0.705, "subsample": 0.996,
-    "gamma": 0.518, "reg_alpha": 6.3e-4, "reg_lambda": 0.177
-}
-CB_PARAMS = {
-    "iterations": 778, "learning_rate": 0.09383, "depth": 4,
-    "l2_leaf_reg": 0.057, "bagging_temperature": 1.445, "random_strength": 0.489
-}
-ENSEMBLE_WEIGHTS = {"lgb": 0.1, "xgb": 0.3, "catboost": 0.6}
-THRESHOLD = 0.512  # optimal decision boundary (tuned via OOF sweep)
 ```
----
-## 🚀 Usage
-```python
-import joblib, numpy as np, pandas as pd
-import catboost as cb
-# Load artifacts
-lgb_model = joblib.load("lgb_model.pkl")
-xgb_model = joblib.load("xgb_model.pkl")
-cb_model  = cb.CatBoostClassifier(); cb_model.load_model("cb_model.cbm")
-encoder   = joblib.load("ordinal_encoder.pkl")
-# Preprocess
-# X_enc  = 28 engineered features (for LGB + XGB)
-# X_cb_df = 21 columns incl. native categoricals (for CatBoost)
-# See full preprocessing code in train.py
-# Ensemble predict
-p_lgb = lgb_model.predict_proba(X_enc)[:, 1]
-p_xgb = xgb_model.predict_proba(X_enc)[:, 1]
-p_cb  = cb_model.predict_proba(X_cb_df)[:, 1]
-proba  = 0.1 * p_lgb + 0.3 * p_xgb + 0.6 * p_cb
-labels = (proba >= 0.512).astype(int)  # 1 = >50K
-```
 ---
-## 📦 Artifacts in This Repo
 | File | Description |
 |---|---|
-| `lgb_model.pkl` | LightGBM — trained on full 48K dataset |
-| `xgb_model.pkl` | XGBoost — trained on full 48K dataset |
-| `cb_model.cbm` | CatBoost — native format, includes cat feature metadata |
-| `ordinal_encoder.pkl` | sklearn OrdinalEncoder fitted on training data |
-| `train.py` | Full reproducible training script |
-| `metadata.json` | Full results, hyperparameters, benchmark comparison |
----
-## 🔬 Feature Importance (LightGBM)
-| Rank | Feature | Importance | Notes |
-|---|---|---|---|
-| 1 | `edu_x_age` | 4664 | **Engineered**: qualification × experience |
-| 2 | `age` | 4259 | Raw |
-| 3 | `fnlwgt` | 3741 | Census weight |
-| 4 | `edu_x_hours` | 3647 | **Engineered**: qualification × work intensity |
-| 5 | `occupation` | 3115 | Categorical |
-| 6 | `capital-gain` | 3091 | Raw |
-| 7 | `hours-per-week` | 2573 | Raw |
-| 8 | `education-num` | 1872 | Raw ordinal |
-| 9 | `workclass` | 1860 | Categorical |
-| 10 | `fnlwgt_log` | 1795 | **Engineered** |
-The two engineered interaction terms (`edu_x_age`, `edu_x_hours`) are the **most predictive features** in the entire model — more than any raw feature.
 ---
-## 📝 Citation
 ```bibtex
-@misc{incomeslayer9000_2026,
-  title  = {IncomeSlayer-9000: SOTA-beating Stacked GBM Ensemble on Adult Income},
   author = {AurelPx},
   year   = {2026},
-  url    = {https://huggingface.co/AurelPx/IncomeSlayer-9000},
-  note   = {AUC=0.9315, Acc=0.8760 on OpenML Task 7592 (10-fold CV)}
 }
 ```
----
-*Built with LightGBM, XGBoost, CatBoost, Optuna, scikit-learn.*
-*OpenML Task 7592 leaderboard: https://www.openml.org/t/7592*
-<!-- ml-intern-provenance -->
-## Generated by ML Intern
-This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
-- Try ML Intern: https://smolagents-ml-intern.hf.space
-- Source code: https://github.com/huggingface/ml-intern
-## Usage
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-model_id = 'AurelPx/IncomeSlayer-9000'
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id)
-```
-For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.

 - xgboost
 - catboost
 - optuna
 - openml
 datasets:
 - adult
 metrics:
 - en
 ---
+# Stacked GBM Ensemble for Income Classification (OpenML Task 7592)
+Weighted ensemble of LightGBM, XGBoost, and CatBoost trained on the Adult Income dataset (UCI / OpenML task 7592). Hyperparameters optimised with Optuna (105 trials, TPE sampler). Evaluated under the standard 10-fold stratified CV protocol defined by OpenML.
+**Results exceed the best recorded run on the OpenML leaderboard** (AdaBoost, 2017) by about 0.003 AUC-ROC and 0.002 accuracy.
+| Model | AUC-ROC | Accuracy |
+|---|---|---|
+| This ensemble | **0.9315** | **0.8760** |
+| OpenML best (AdaBoost, 2017) | 0.9284 | 0.8740 |
+| LightGBM alone | 0.9301 | — |
+| XGBoost alone | 0.9302 | — |
+| CatBoost alone | 0.9310 | — |
 ---
+## Method
+**Features (28 total).** Six raw numeric features augmented with log-transformed capital variables, binary flags, age/hours bins, and two interaction terms (`education-num × age`, `education-num × hours-per-week`). Categorical columns encoded with `OrdinalEncoder` for LightGBM/XGBoost; CatBoost receives them natively.
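The log, flag, and interaction features described above could be derived roughly as follows. This is a minimal sketch, not the code in `train.py`: the function name `engineer_features`, the dict-based record layout, and the `*_log` key names are illustrative assumptions, and the age/hours binning is left out because the bin edges are not given here.

```python
import math


def engineer_features(row: dict) -> dict:
    """Add log, flag, and interaction features to one Adult-dataset record.

    Illustrative only: the names of the new keys are assumptions.
    """
    gain = row["capital-gain"]
    loss = row["capital-loss"]
    out = dict(row)  # copy so the input record is left untouched
    # log1p compresses the heavy right tail while mapping 0 to 0.
    out["capital_gain_log"] = math.log1p(gain)
    out["capital_loss_log"] = math.log1p(loss)
    out["capital_net"] = gain - loss
    # Binary flag: does the record show any capital activity at all?
    out["capital_any_flag"] = int(gain > 0 or loss > 0)
    # Interaction terms: education-num x age and education-num x hours-per-week.
    out["edu_x_age"] = row["education-num"] * row["age"]
    out["edu_x_hours"] = row["education-num"] * row["hours-per-week"]
    return out


# Made-up example record using the dataset's column names.
example = {"age": 39, "education-num": 13, "hours-per-week": 40,
           "capital-gain": 2174, "capital-loss": 0}
features = engineer_features(example)
```

A row-wise function keeps the sketch dependency-free; on the full dataset the same transformations would more naturally be applied column-wise.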
+**Ensemble.** Out-of-fold predictions from the three base learners are combined as a weighted average with fixed weights (LGB 0.1 / XGB 0.3 / CB 0.6). The decision threshold (0.512) is tuned on the OOF predictions.
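The blending and threshold step can be illustrated with a small, dependency-free sketch. The weights are the ones stated above; everything else is an assumption for illustration: the helper names (`blend`, `accuracy`, `best_threshold`), the toy probabilities, the search grid, and the use of accuracy as the criterion for the threshold sweep, which this card does not specify.

```python
WEIGHTS = {"lgb": 0.1, "xgb": 0.3, "cb": 0.6}  # fixed weights from this card


def blend(probas: dict) -> list:
    """Weighted average of per-model positive-class probabilities."""
    n = len(next(iter(probas.values())))
    return [sum(WEIGHTS[name] * probas[name][i] for name in WEIGHTS)
            for i in range(n)]


def accuracy(y_true, proba, threshold):
    """Fraction of correct labels when predicting 1 for proba >= threshold."""
    preds = [int(p >= threshold) for p in proba]
    return sum(int(p == y) for p, y in zip(preds, y_true)) / len(y_true)


def best_threshold(y_true, proba, grid):
    """Return the grid value with the highest accuracy (first one wins ties)."""
    return max(grid, key=lambda t: accuracy(y_true, proba, t))


# Toy out-of-fold probabilities (made-up numbers, not real model output).
oof = {
    "lgb": [0.10, 0.40, 0.70, 0.90],
    "xgb": [0.20, 0.50, 0.60, 0.80],
    "cb":  [0.10, 0.50, 0.70, 0.90],
}
y = [0, 0, 1, 1]

blended = blend(oof)
grid = [round(0.30 + 0.01 * k, 2) for k in range(41)]  # 0.30, 0.31, ..., 0.70
threshold = best_threshold(y, blended, grid)
```

Because the sweep runs on out-of-fold predictions, each sample is scored by models that did not see it during training, which keeps the chosen threshold from being fitted to in-sample scores.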
+**Tuning.** Optuna TPE, 3-fold inner CV: 40 trials for LightGBM, 40 for XGBoost, 25 for CatBoost.
+### Optimised hyperparameters
+```python
+LGB  = {"n_estimators": 1118, "learning_rate": 0.0115, "num_leaves": 90,  "max_depth": 6}
+XGB  = {"n_estimators":  941, "learning_rate": 0.0488, "max_depth": 6,    "gamma": 0.518}
+CB   = {"iterations":    778, "learning_rate": 0.0938, "depth": 4,        "l2_leaf_reg": 0.057}
+```
 ---
+## Cross-validation results (10-fold)
 ```
+Fold  1  0.9270
+Fold  2  0.9299
+Fold  3  0.9319
+Fold  4  0.9295
+Fold  5  0.9293
+Fold  6  0.9351
+Fold  7  0.9368
+Fold  8  0.9300
+Fold  9  0.9342
+Fold 10  0.9295
+──────────────
+Mean  0.9313 ± 0.0029
 ```
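The summary line can be recomputed from the per-fold values listed above. The sketch below uses only the standard library; the variable names are illustrative. The reported spread matches the population standard deviation (dividing by n), which is what `statistics.pstdev` computes.

```python
from statistics import mean, pstdev

# Per-fold AUC values as listed in the table above.
fold_auc = [0.9270, 0.9299, 0.9319, 0.9295, 0.9293,
            0.9351, 0.9368, 0.9300, 0.9342, 0.9295]

auc_mean = mean(fold_auc)
auc_std = pstdev(fold_auc)  # population std: divides by n, not n - 1

summary = f"{auc_mean:.4f} ± {auc_std:.4f}"
print(summary)
```

Using the sample standard deviation (`statistics.stdev`) instead would give a slightly larger spread, since it divides by n - 1.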
 ---
+## Usage
 ```python
+import joblib
+import catboost as cb
+from huggingface_hub import hf_hub_download
+lgb_model = joblib.load(hf_hub_download("AurelPx/IncomeSlayer-9000", "lgb_model.pkl"))
+xgb_model = joblib.load(hf_hub_download("AurelPx/IncomeSlayer-9000", "xgb_model.pkl"))
+encoder   = joblib.load(hf_hub_download("AurelPx/IncomeSlayer-9000", "ordinal_encoder.pkl"))
+cb_model  = cb.CatBoostClassifier()
+cb_model.load_model(hf_hub_download("AurelPx/IncomeSlayer-9000", "cb_model.cbm"))
+# Build X_enc (28 features) and X_cb_df (21 cols, native categoricals) — see train.py
+proba  = 0.1 * lgb_model.predict_proba(X_enc)[:, 1] \
+       + 0.3 * xgb_model.predict_proba(X_enc)[:, 1] \
+       + 0.6 * cb_model.predict_proba(X_cb_df)[:, 1]
+labels = (proba >= 0.512).astype(int)  # 1 → >50K
 ```
+The full preprocessing pipeline is in `train.py`.
 ---
+## Repository contents
 | File | Description |
 |---|---|
+| `lgb_model.pkl` | LightGBM classifier (full dataset) |
+| `xgb_model.pkl` | XGBoost classifier (full dataset) |
+| `cb_model.cbm` | CatBoost classifier (native format) |
+| `ordinal_encoder.pkl` | Fitted sklearn OrdinalEncoder |
+| `train.py` | Reproducible training script |
+| `metadata.json` | Results and hyperparameters |
 ---
+## Citation
 ```bibtex
+@misc{aurelPx2026incomeclassifier,
   author = {AurelPx},
+  title  = {Stacked GBM Ensemble for Income Classification (OpenML Task 7592)},
   year   = {2026},
+  url    = {https://huggingface.co/AurelPx/IncomeSlayer-9000}
 }
 ```