	---
	license: mit
	tags:
	- tabular-classification
	- gradient-boosting
	- stacking
	- ensemble
	- lightgbm
	- xgboost
	- catboost
	- optuna
	- openml
	datasets:
	- adult
	metrics:
	- roc_auc
	- accuracy
	language:
	- en
	---

	# Stacked GBM Ensemble for Income Classification (OpenML Task 7592)

	Weighted ensemble of LightGBM, XGBoost, and CatBoost trained on the Adult Income dataset (UCI / OpenML task 7592). Hyperparameters are optimised with Optuna (105 trials, TPE sampler), and the ensemble is evaluated under the standard 10-fold stratified CV protocol defined by OpenML.

	The ensemble exceeds the best recorded run on the OpenML leaderboard for this task (AdaBoost, 2017) by 0.0031 AUC-ROC and 0.0020 accuracy.

	| Model | AUC-ROC | Accuracy |
	|---|---|---|
	| This ensemble | 0.9315 | 0.8760 |
	| OpenML best (AdaBoost, 2017) | 0.9284 | 0.8740 |
	| LightGBM alone | 0.9301 | n/a |
	| XGBoost alone | 0.9302 | n/a |
	| CatBoost alone | 0.9310 | n/a |

	---

	## Method

	Features (28 total). Six raw numeric features are augmented with log-transformed capital variables, binary flags, age/hours bins, and two interaction terms (`education-num × age`, `education-num × hours-per-week`). Categorical columns are encoded with `OrdinalEncoder` for LightGBM and XGBoost; CatBoost receives them natively.
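As a rough illustration of the feature step described above, here is a minimal pandas sketch. The raw column names follow the OpenML Adult schema; the derived column names, the exact flags, and the bin edges are assumptions chosen for illustration, and the definitions actually used live in `train.py`.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with illustrative engineered columns added."""
    out = df.copy()

    # log1p copes with the many zero entries in the capital columns.
    out["log_capital_gain"] = np.log1p(out["capital-gain"])
    out["log_capital_loss"] = np.log1p(out["capital-loss"])

    # Binary flags (assumed definition: any non-zero capital movement).
    out["has_capital_gain"] = (out["capital-gain"] > 0).astype(int)
    out["has_capital_loss"] = (out["capital-loss"] > 0).astype(int)

    # Coarse bins; the edges below are hypothetical.
    out["age_bin"] = pd.cut(
        out["age"], bins=[0, 25, 35, 45, 55, 65, 120], labels=False
    )
    out["hours_bin"] = pd.cut(
        out["hours-per-week"], bins=[0, 20, 40, 60, 100], labels=False
    )

    # The two interaction terms named in the text.
    out["edu_x_age"] = out["education-num"] * out["age"]
    out["edu_x_hours"] = out["education-num"] * out["hours-per-week"]
    return out
```

The function leaves the raw columns in place, so the same frame can feed both the ordinal-encoded path and CatBoost's native categorical path.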

	Ensemble. Out-of-fold (OOF) predictions from the three base learners are combined by a fixed-weight average (LGB 0.1 / XGB 0.3 / CB 0.6). The decision threshold (0.512) is tuned on the OOF predictions.
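The blending and threshold steps can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the code in `train.py`: the weights come from the text, but the helper names, the search grid, and the use of accuracy as the tuning criterion are hypothetical.

```python
import numpy as np

# Fixed blend weights quoted in the text.
WEIGHTS = {"lgb": 0.1, "xgb": 0.3, "cb": 0.6}


def blend(probas, weights=WEIGHTS):
    """Weighted average of per-model positive-class probabilities."""
    return sum(
        weights[name] * np.asarray(p, dtype=float)
        for name, p in probas.items()
    )


def tune_threshold(y_true, proba, grid=None):
    """Return the grid threshold with the best accuracy (assumed criterion)."""
    if grid is None:
        grid = np.linspace(0.30, 0.70, 401)  # hypothetical search range
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    accuracy = [((proba >= t).astype(int) == y_true).mean() for t in grid]
    return float(grid[int(np.argmax(accuracy))])
```

In use, `blend` would receive the OOF probability vectors keyed by `"lgb"`, `"xgb"`, and `"cb"`, and `tune_threshold` would be run once on the blended OOF vector against the training labels.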

	Tuning. Each base learner is tuned with Optuna's TPE sampler under 3-fold inner CV: 40 trials for LightGBM, 40 for XGBoost, and 25 for CatBoost (105 in total).

	### Optimised hyperparameters

	```python
	LGB = {"n_estimators": 1118, "learning_rate": 0.0115, "num_leaves": 90, "max_depth": 6}
	XGB = {"n_estimators": 941, "learning_rate": 0.0488, "max_depth": 6, "gamma": 0.518}
	CB = {"iterations": 778, "learning_rate": 0.0938, "depth": 4, "l2_leaf_reg": 0.057}
	```

	---

	## Cross-validation results (10-fold)

	```
	Fold 1 0.9270
	Fold 2 0.9299
	Fold 3 0.9319
	Fold 4 0.9295
	Fold 5 0.9293
	Fold 6 0.9351
	Fold 7 0.9368
	Fold 8 0.9300
	Fold 9 0.9342
	Fold 10 0.9295
	──────────────
	Mean 0.9313 ± 0.0029
	```
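The summary line can be rechecked from the per-fold scores with the standard library alone. The sketch below suggests that the reported spread is the population standard deviation (dividing by n, as NumPy's `np.std` does by default); the sample standard deviation comes out slightly larger.

```python
from statistics import fmean, pstdev, stdev

# Per-fold AUC-ROC values copied from the table above.
fold_auc = [0.9270, 0.9299, 0.9319, 0.9295, 0.9293,
            0.9351, 0.9368, 0.9300, 0.9342, 0.9295]

mean_auc = fmean(fold_auc)
pop_std = pstdev(fold_auc)    # divides by n
sample_std = stdev(fold_auc)  # divides by n - 1

print(f"Mean {mean_auc:.4f} ± {pop_std:.4f}")   # Mean 0.9313 ± 0.0029
print(f"Sample std {sample_std:.4f}")           # Sample std 0.0031
```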

	---

	## Usage

	```python
	import joblib
	import catboost as cb
	from huggingface_hub import hf_hub_download

	REPO_ID = "AurelPx/BoostingEnsemble-Income-Classification"

	# Unpickling these models requires lightgbm and xgboost to be installed.
	lgb_model = joblib.load(hf_hub_download(REPO_ID, "lgb_model.pkl"))
	xgb_model = joblib.load(hf_hub_download(REPO_ID, "xgb_model.pkl"))
	encoder = joblib.load(hf_hub_download(REPO_ID, "ordinal_encoder.pkl"))
	cb_model = cb.CatBoostClassifier()
	cb_model.load_model(hf_hub_download(REPO_ID, "cb_model.cbm"))

	# Build X_enc (28 features) and X_cb_df (21 cols, native categoricals); see train.py
	# Blend the positive-class probabilities with the fixed ensemble weights.
	proba = (
	    0.1 * lgb_model.predict_proba(X_enc)[:, 1]
	    + 0.3 * xgb_model.predict_proba(X_enc)[:, 1]
	    + 0.6 * cb_model.predict_proba(X_cb_df)[:, 1]
	)
	labels = (proba >= 0.512).astype(int)  # 1 → >50K
	```

	The full preprocessing pipeline is in `train.py`.

	---

	## Repository contents

	| File | Description |
	|---|---|
	| `lgb_model.pkl` | LightGBM classifier (full dataset) |
	| `xgb_model.pkl` | XGBoost classifier (full dataset) |
	| `cb_model.cbm` | CatBoost classifier (native format) |
	| `ordinal_encoder.pkl` | Fitted sklearn OrdinalEncoder |
	| `train.py` | Reproducible training script |
	| `metadata.json` | Results and hyperparameters |

	---

	## Citation

	```bibtex
	@misc{aurelPx2026incomeclassifier,
	author = {AurelPx},
	title = {Stacked GBM Ensemble for Income Classification (OpenML Task 7592)},
	year = {2026},
	url = {https://huggingface.co/AurelPx/BoostingEnsemble-Income-Classification}
	}
	```