| --- |
| license: mit |
| tags: |
| - tabular-classification |
| - gradient-boosting |
| - stacking |
| - ensemble |
| - lightgbm |
| - xgboost |
| - catboost |
| - optuna |
| - income-prediction |
| - openml |
| - sota |
| - ml-intern |
| datasets: |
| - adult |
| metrics: |
| - roc_auc |
| - accuracy |
| language: |
| - en |
| --- |
| |
| # πͺ IncomeSlayer-9000 β We Just Buried the OpenML Leaderboard |
|
|
| > **TL;DR:** LightGBM + XGBoost + CatBoost stacked ensemble, Optuna-tuned, feature-engineered. |
| > **AUC 0.9315 | Accuracy 0.8760** on 10-fold CV β beats the OpenML Task 7592 SOTA by **+0.003 AUC** and **+0.002 Acc**. |
| > The old king? A 2017 AdaBoost pipeline. Dethroned. Permanently. |
|
|
| --- |
|
|
| ## π The Benchmark We Crushed |
|
|
| | Model | AUC | Accuracy | Notes | |
| |---|---|---|---| |
| | **IncomeSlayer-9000** *(ours)* | **0.93147** | **0.87599** | LGB+XGB+CB stacking | |
| | OpenML Task 7592 SOTA | 0.92840 | 0.87400 | AdaBoost, 2017 | |
| | LightGBM alone (tuned) | 0.93006 | β | Already beats SOTA | |
| | XGBoost alone (tuned) | 0.93018 | β | Already beats SOTA | |
| | CatBoost alone (tuned) | 0.93098 | β | Already beats SOTA | |
|
|
| **Every single component of our ensemble individually outperforms the best recorded result on OpenML.** |
| The stacked ensemble pushes it even further. |
|
|
| --- |
|
|
| ## ποΈ What Makes This Model Rip |
|
|
| ### Feature Engineering That Actually Works |
| Not all feature engineering is cope. Here's what moved the needle: |
|
|
| ```python |
| # Capital features: raw values are bimodal (0 or large) β fix the distribution |
| log1p(capital_gain), log1p(capital_loss) |
| capital_net = capital_gain - capital_loss # net position |
| capital_any_flag = (gain > 0) | (loss > 0) # binary: has any capital activity |
| |
| # Interaction terms: these two alone are the #1 and #4 most important features |
| edu_x_age = education_num * age # experience Γ qualification |
| edu_x_hours = education_num * hours_per_week |
| |
| # Bins that encode domain knowledge |
| age_bins = [<25, 25-35, 35-45, 45-55, 55-65, 65+] |
| hours_bins = [part-time, normal, mild OT, heavy OT, extreme] |
| ``` |
|
|
| ### Three Diverse GBMs β Not Three Copies of the Same Model |
| | Model | Unique advantage | |
| |---|---| |
| | **LightGBM** | Leaf-wise splits, fastest on this data | |
| | **XGBoost** | Level-wise splits, different bias/variance tradeoff | |
| | **CatBoost (dominant w=0.6)** | Native ordered target encoding on 8 categorical columns β no label leakage | |
|
|
| CatBoost handles `workclass`, `occupation`, `native-country` etc. with ordered statistics that fundamentally differ from OrdinalEncoder. That diversity is why blending helps. |
|
|
| ### Optuna Found What Grid Search Would Miss |
| - **105 total trials** across 3 models (40 LGB + 40 XGB + 25 CB) |
| - TPE sampler, 3-fold inner CV |
| - Key discovery: CatBoost prefers **shallow trees (depth=4)** with **high learning rate (0.094)** β counterintuitive but empirically validated |
|
|
| --- |
|
|
| ## π Full 10-Fold Results |
|
|
| ``` |
| Fold 1: AUC = 0.9270 |
| Fold 2: AUC = 0.9299 |
| Fold 3: AUC = 0.9319 |
| Fold 4: AUC = 0.9295 |
| Fold 5: AUC = 0.9293 |
| Fold 6: AUC = 0.9351 |
| Fold 7: AUC = 0.9368 β peak fold |
| Fold 8: AUC = 0.9300 |
| Fold 9: AUC = 0.9342 |
| Fold 10: AUC = 0.9295 |
| βββββββββββββββββββββ |
| Mean: 0.93130 Β± 0.00293 |
| ``` |
|
|
| Tight variance. This isn't a lucky run. |
|
|
| --- |
|
|
| ## ποΈ Dataset: Adult Income (OpenML Task 7592) |
|
|
| - **48,842 samples** from the 1994 US Census |
| - **14 features**: 6 numeric, 8 categorical |
| - **Target**: income >50K vs β€50K (23.9% positive rate) |
| - **Missing values**: workclass (2,799), occupation (2,809), native-country (857) β handled via CatBoost native encoding + OrdinalEncoder fallback |
|
|
| --- |
|
|
| ## π§ Hyperparameters (Optuna Best) |
|
|
| ```python |
| LGB_PARAMS = { |
| "n_estimators": 1118, "learning_rate": 0.01148, "num_leaves": 90, |
| "max_depth": 6, "min_child_samples": 20, "colsample_bytree": 0.555, |
| "subsample": 0.958, "reg_alpha": 7.1e-4, "reg_lambda": 1.5e-3 |
| } |
| XGB_PARAMS = { |
| "n_estimators": 941, "learning_rate": 0.04882, "max_depth": 6, |
| "min_child_weight": 1, "colsample_bytree": 0.705, "subsample": 0.996, |
| "gamma": 0.518, "reg_alpha": 6.3e-4, "reg_lambda": 0.177 |
| } |
| CB_PARAMS = { |
| "iterations": 778, "learning_rate": 0.09383, "depth": 4, |
| "l2_leaf_reg": 0.057, "bagging_temperature": 1.445, "random_strength": 0.489 |
| } |
| ENSEMBLE_WEIGHTS = {"lgb": 0.1, "xgb": 0.3, "catboost": 0.6} |
| THRESHOLD = 0.512 # optimal decision boundary (tuned via OOF sweep) |
| ``` |
|
|
| --- |
|
|
| ## π Usage |
|
|
| ```python |
| import joblib, numpy as np, pandas as pd |
| import catboost as cb |
| |
| # Load artifacts |
| lgb_model = joblib.load("lgb_model.pkl") |
| xgb_model = joblib.load("xgb_model.pkl") |
| cb_model = cb.CatBoostClassifier(); cb_model.load_model("cb_model.cbm") |
| encoder = joblib.load("ordinal_encoder.pkl") |
| |
| # Preprocess |
| # X_enc = 28 engineered features (for LGB + XGB) |
| # X_cb_df = 21 columns incl. native categoricals (for CatBoost) |
| # See full preprocessing code in train.py |
| |
| # Ensemble predict |
| p_lgb = lgb_model.predict_proba(X_enc)[:, 1] |
| p_xgb = xgb_model.predict_proba(X_enc)[:, 1] |
| p_cb = cb_model.predict_proba(X_cb_df)[:, 1] |
| |
| proba = 0.1 * p_lgb + 0.3 * p_xgb + 0.6 * p_cb |
| labels = (proba >= 0.512).astype(int) # 1 = >50K |
| ``` |
|
|
| --- |
|
|
| ## π¦ Artifacts in This Repo |
|
|
| | File | Description | |
| |---|---| |
| | `lgb_model.pkl` | LightGBM β trained on full 48K dataset | |
| | `xgb_model.pkl` | XGBoost β trained on full 48K dataset | |
| | `cb_model.cbm` | CatBoost β native format, includes cat feature metadata | |
| | `ordinal_encoder.pkl` | sklearn OrdinalEncoder fitted on training data | |
| | `train.py` | Full reproducible training script | |
| | `metadata.json` | Full results, hyperparameters, benchmark comparison | |
|
|
| --- |
|
|
| ## π¬ Feature Importance (LightGBM) |
|
|
| | Rank | Feature | Importance | Notes | |
| |---|---|---|---| |
| | 1 | `edu_x_age` | 4664 | **Engineered**: qualification Γ experience | |
| | 2 | `age` | 4259 | Raw | |
| | 3 | `fnlwgt` | 3741 | Census weight | |
| | 4 | `edu_x_hours` | 3647 | **Engineered**: qualification Γ work intensity | |
| | 5 | `occupation` | 3115 | Categorical | |
| | 6 | `capital-gain` | 3091 | Raw | |
| | 7 | `hours-per-week` | 2573 | Raw | |
| | 8 | `education-num` | 1872 | Raw ordinal | |
| | 9 | `workclass` | 1860 | Categorical | |
| | 10 | `fnlwgt_log` | 1795 | **Engineered** | |
|
|
| The two engineered interaction terms (`edu_x_age`, `edu_x_hours`) are the **most predictive features** in the entire model β more than any raw feature. |
|
|
| --- |
|
|
| ## π Citation |
|
|
| ```bibtex |
| @misc{incomeslayer9000_2026, |
| title = {IncomeSlayer-9000: SOTA-beating Stacked GBM Ensemble on Adult Income}, |
| author = {AurelPx}, |
| year = {2026}, |
| url = {https://huggingface.co/AurelPx/IncomeSlayer-9000}, |
| note = {AUC=0.9315, Acc=0.8760 on OpenML Task 7592 (10-fold CV)} |
| } |
| ``` |
|
|
| --- |
|
|
| *Built with LightGBM, XGBoost, CatBoost, Optuna, scikit-learn.* |
| *OpenML Task 7592 leaderboard: https://www.openml.org/t/7592* |
|
|
| <!-- ml-intern-provenance --> |
| ## Generated by ML Intern |
|
|
| This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub. |
|
|
| - Try ML Intern: https://smolagents-ml-intern.hf.space |
| - Source code: https://github.com/huggingface/ml-intern |
|
|
| ## Usage |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| model_id = 'AurelPx/IncomeSlayer-9000' |
| tokenizer = AutoTokenizer.from_pretrained(model_id) |
| model = AutoModelForCausalLM.from_pretrained(model_id) |
| ``` |
|
|
| For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class. |
|
|