| # HyperOpt-GBT |
|
|
| **HyperOptimized Gradient Boosted Trees**: a scikit-learn-compatible library that combines techniques from XGBoost, LightGBM, CatBoost, and YDF (Yggdrasil Decision Forests) in a single implementation. |
|
|
| ## Key Innovations |
|
|
| | Innovation | Source | Effect | |
| |---|---|---| |
| | **GOSS** (Gradient-based One-Side Sampling) | LightGBM | 2-5× faster training, often *better* accuracy | |
| | **Weighted Quantile Sketch** | XGBoost | +0.15 to +0.19 absolute AUC on skewed distributions | |
| | **Ordered Boosting** | CatBoost | Eliminates prediction shift → unbiased residuals | |
| | **Ordered Target Statistics** | CatBoost | Handles categoricals without target leakage | |
| | **Histogram-based Splits** | LightGBM | O(k) split search over k bins per feature (after an O(n) histogram build) vs O(n log n) for sort-based search | |
| | **Compiled Inference Engines** | YDF | 5-100× faster prediction | |
| | **Oblivious Trees** | CatBoost | Regularization + SIMD-friendly structure | |
| | **Cache-aware Column Blocks** | XGBoost | Cache-friendly memory access | |
|
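To make the histogram-based row of the table concrete, here is a minimal pure-Python sketch of split finding for a single feature. It is an illustration under stated assumptions, not this library's implementation: `best_histogram_split` and its arguments are hypothetical names, and the gain is the usual second-order score with L2 regularization, up to a constant factor.

```python
def best_histogram_split(bin_ids, grads, hess, n_bins, l2_reg=1.0):
    """Return (best_bin, best_gain); rows with bin <= best_bin go left."""
    # One O(n) pass accumulates gradient and hessian sums per bin.
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for b, g, h in zip(bin_ids, grads, hess):
        g_hist[b] += g
        h_hist[b] += h

    g_total, h_total = sum(g_hist), sum(h_hist)

    def score(g, h):
        # Second-order leaf score with L2 regularization (constant factor dropped).
        return g * g / (h + l2_reg)

    parent = score(g_total, h_total)
    best_bin, best_gain = None, 0.0
    g_left = h_left = 0.0
    # The split search itself is O(k): a single scan over the bin boundaries.
    for b in range(n_bins - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        gain = (score(g_left, h_left)
                + score(g_total - g_left, h_total - h_left)
                - parent)
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


# Toy data: negative gradients fall in bins 0-1, positive ones in bins 2-3.
bin_ids = [0, 0, 1, 1, 2, 2, 3, 3]
grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
hess = [1.0] * 8
split_bin, split_gain = best_histogram_split(bin_ids, grads, hess, n_bins=4)
print(split_bin, round(split_gain, 3))
```

On this toy data the best boundary falls after bin 1, which separates the negative gradients from the positive ones.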
|
| ## Quick Start |
|
|
| ```python |
| from hyperopt_gbt import HyperOptGradientBoostedClassifier |
| |
| clf = HyperOptGradientBoostedClassifier( |
|     n_estimators=100, |
|     learning_rate=0.1, |
|     max_depth=6, |
|     use_goss=True,              # LightGBM: gradient-based sampling |
|     binning='quantile_sketch',  # XGBoost: adaptive bin boundaries |
|     n_bins=255, |
| ) |
| |
| # X_train, y_train, X_test: your own array-like data (for example, NumPy arrays) |
| clf.fit(X_train, y_train) |
| proba = clf.predict_proba(X_test) |
| ``` |
|
|
| ## Installation |
|
|
| ```bash |
| # From source |
| pip install -e . |
| |
| # With benchmark dependencies |
| pip install -e ".[benchmark]" |
| |
| # Build Rust backend (optional, for maximum speed) |
| cd rust_gbt && pip install maturin && maturin develop --release |
| ``` |
|
|
| ## Benchmark Results |
|
|
| ### Binary Classification (80K train, 20K test, 50 trees) |
|
|
| | Library | AUC | Train Time | |
| |---|---|---| |
| | **HyperOpt-GBT (GOSS)** | 0.9691 | 2.5s | |
| | XGBoost (hist) | 0.9661 | 1.3s | |
| | LightGBM | 0.9659 | 1.0s | |
| | CatBoost | **0.9756** | 1.5s | |
|
|
| ### GOSS: Faster and More Accurate on This Benchmark |
|
|
| | Data Used | AUC | Speedup | |
| |---|---|---| |
| | 100% (no GOSS) | 0.9659 | 1.0× | |
| | 40% (GOSS) | 0.9717 | 2.4× | |
| | **15% (GOSS)** | **0.9740** | **5.3×** | |
|
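The sampling step behind these numbers can be sketched in a few lines of plain Python. This follows the GOSS description in the LightGBM paper, where `a` and `b` are both fractions of the full dataset; whether `goss_a` and `goss_b` in this library use exactly that convention is an assumption, and `goss_sample` is a hypothetical helper name.

```python
import random


def goss_sample(grads, a=0.2, b=0.1, seed=0):
    """Return (indices, weights) chosen by gradient-based one-side sampling."""
    n = len(grads)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    n_top = int(a * n)
    n_rest = int(b * n)
    top = order[:n_top]  # always keep the rows with the largest gradients
    rng = random.Random(seed)
    rest = rng.sample(order[n_top:], n_rest)  # uniform sample of the remainder
    # Up-weight the sampled small-gradient rows so that their summed
    # contribution approximates that of all the rows they stand in for.
    amplify = (1.0 - a) / b
    indices = top + rest
    weights = [1.0] * n_top + [amplify] * n_rest
    return indices, weights


data_rng = random.Random(42)
grads = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]
idx, w = goss_sample(grads, a=0.2, b=0.1)
print(len(idx), "of", len(grads), "rows kept")
```

With `a=0.2` and `b=0.1`, each tree sees 30% of the rows, and the sampled small-gradient rows carry a weight of `(1 - a) / b` when histograms are built.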
|
| ### Quantile Sketch vs Uniform (Skewed Data) |
|
|
| | Bins | Uniform AUC | Quantile AUC | AUC Gain (absolute) | |
| |---|---|---|---| |
| | 63 | 0.6426 | 0.8306 | **+0.188** | |
| | 255 | 0.6775 | 0.8295 | **+0.152** | |
|
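The gap between the two binning strategies comes from how bin boundaries are placed on skewed features. The sketch below is a simplified illustration: it uses plain, unweighted quantiles, whereas XGBoost's weighted quantile sketch weights rows by their hessian. The function names are hypothetical and not part of this library.

```python
import bisect
import random


def uniform_edges(values, n_bins):
    """Equal-width bin boundaries between the minimum and maximum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def quantile_edges(values, n_bins):
    """Boundaries at evenly spaced ranks, so bins hold similar row counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def bin_counts(values, edges):
    """Count how many values land in each bin defined by the edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return counts


rng = random.Random(0)
skewed = [rng.lognormvariate(0.0, 2.0) for _ in range(10_000)]  # heavy right tail
uniform_counts = bin_counts(skewed, uniform_edges(skewed, 8))
quantile_counts = bin_counts(skewed, quantile_edges(skewed, 8))
print("uniform :", uniform_counts)
print("quantile:", quantile_counts)
```

On heavy-tailed data, equal-width bins put almost every row in the first bin, leaving few usable split points. Quantile bins spread the rows evenly, so each boundary is a meaningful candidate split.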
|
| ## API Reference |
|
|
| ### Classifier |
|
|
| ```python |
| HyperOptGradientBoostedClassifier( |
|     # Core |
|     n_estimators=100,        # Number of boosting rounds |
|     learning_rate=0.1,       # Shrinkage |
|     max_depth=6,             # Maximum tree depth |
| |
|     # Accuracy innovations |
|     ordered_boosting=False,  # CatBoost: unbiased boosting |
|     ordered_ts=True,         # CatBoost: ordered target statistics |
|     oblivious_trees=False,   # CatBoost: balanced trees |
| |
|     # Speed innovations |
|     use_goss=True,           # LightGBM: gradient sampling |
|     goss_a=0.2,              # Keep top 20% by gradient magnitude |
|     goss_b=0.1,              # Sample 10% from rest |
|     n_bins=255,              # Histogram bins |
|     binning='uniform',       # 'uniform' or 'quantile_sketch' |
| |
|     # Regularization |
|     l2_reg=1.0,              # L2 on leaf weights |
|     min_child_weight=1.0,    # Min hessian sum in leaf |
|     subsample=1.0,           # Row subsampling |
|     colsample_bytree=1.0,    # Column subsampling |
| ) |
| ``` |
|
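For readers unfamiliar with the `ordered_ts` option, the idea of ordered target statistics can be sketched as follows: each row's categorical value is encoded using only the targets of rows that come earlier in a random permutation, so a row never sees its own label. This is a simplified illustration of the CatBoost technique, not this library's code; `ordered_target_stats`, `prior`, and `strength` are hypothetical names.

```python
import random


def ordered_target_stats(categories, targets, prior, strength=1.0, seed=0):
    """Encode each row from the targets of rows earlier in a random order."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        # Smoothed mean of earlier targets; the row's own target is excluded,
        # which is what prevents target leakage.
        encoded[i] = (s + strength * prior) / (k + strength)
        # Only now does this row's target become visible to later rows.
        sums[c] = s + targets[i]
        counts[c] = k + 1
    return encoded


cats = ["a", "b", "a", "a", "b", "c"]
y = [1, 0, 1, 0, 0, 1]
prior = sum(y) / len(y)
enc = ordered_target_stats(cats, y, prior)
print([round(e, 3) for e in enc])
```

A category seen for the first time in the permutation, such as `"c"` here, is encoded with the prior alone.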
|
| ### Regressor |
|
|
| ```python |
| HyperOptGradientBoostedRegressor( |
| # Same parameters as classifier |
| ) |
| ``` |
|
|
| ### Inference Engines |
|
|
| ```python |
| from hyperopt_gbt import compile_inference_engine |
| |
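| # model: a fitted HyperOpt-GBT estimator |
| engine = compile_inference_engine(model, engine_type='auto') |
| # Options: 'naive', 'flat', 'simd', 'quickscorer', 'auto' |
| |
| # X_binned: input features, assumed here to be already mapped to histogram bin indices |
| predictions = engine.predict(X_binned) |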
| ``` |
|
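One reason oblivious trees suit fast inference is that every level applies the same (feature, threshold) test, so the leaf index is just a sequence of comparison bits with no per-node branching. The sketch below illustrates that idea in plain Python; it is not the library's engine code, and the function name and data layout are hypothetical.

```python
def oblivious_predict(row, features, thresholds, leaf_values):
    """Evaluate one oblivious tree: one comparison per level builds the leaf index."""
    index = 0
    for f, t in zip(features, thresholds):
        # Append one bit per level: 1 if the row goes right at this level.
        index = (index << 1) | int(row[f] > t)
    return leaf_values[index]


# A depth-2 tree: level 0 tests feature 0 > 0.5, level 1 tests feature 1 > 2.0.
features = [0, 1]
thresholds = [0.5, 2.0]
leaf_values = [10.0, 20.0, 30.0, 40.0]  # 2**depth leaves, indexed by the bit pattern

pred_a = oblivious_predict([0.7, 1.0], features, thresholds, leaf_values)
pred_b = oblivious_predict([0.1, 3.0], features, thresholds, leaf_values)
print(pred_a, pred_b)
```

Because the comparisons do not depend on earlier outcomes, they can be evaluated for many rows at once, which is what makes the structure SIMD-friendly.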
|
| ## Rust Backend |
|
|
| The optional Rust backend provides the fastest training via: |
| - **Rayon** parallelism for histogram building across features |
| - **Flat tree arrays** (`Vec<TreeNode>`) — no pointer chasing |
| - **Zero-copy NumPy interop** via PyO3 |
| - **LTO + native CPU** in release mode |
|
|
| ```python |
| import rust_gbt |
| |
| model = rust_gbt.PyRustGBT() |
| model.fit(X_train, y_train, |
|           n_estimators=50, learning_rate=0.1, max_depth=6, |
|           use_goss=True, goss_a=0.2, goss_b=0.1, |
|           binning="quantile", task="classification") |
| |
| proba = model.predict_proba(X_test) |
| ``` |
|
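The flat tree array layout listed above can be illustrated in Python as a list of node records that refer to their children by integer index, loosely mirroring a `Vec<TreeNode>`. The field names below are assumptions made for illustration; the actual Rust struct may differ.

```python
from typing import NamedTuple


class TreeNode(NamedTuple):
    feature: int      # feature index to test; -1 marks a leaf
    threshold: float  # go left when row[feature] <= threshold
    left: int         # index of the left child in the node list
    right: int        # index of the right child in the node list
    value: float      # prediction stored at a leaf


# One tree stored contiguously; children are addressed by position.
nodes = [
    TreeNode(0, 0.5, 1, 4, 0.0),      # root
    TreeNode(1, 2.0, 2, 3, 0.0),      # internal node
    TreeNode(-1, 0.0, -1, -1, -1.0),  # leaf
    TreeNode(-1, 0.0, -1, -1, 0.5),   # leaf
    TreeNode(-1, 0.0, -1, -1, 2.0),   # leaf
]


def predict_flat(row, nodes):
    """Walk the tree by index lookups instead of following object references."""
    i = 0
    while nodes[i].feature != -1:
        node = nodes[i]
        i = node.left if row[node.feature] <= node.threshold else node.right
    return nodes[i].value


print(predict_flat([0.3, 1.0], nodes), predict_flat([0.9, 0.0], nodes))
```

In a compiled language the same layout keeps all nodes of a tree in one contiguous allocation, so traversal touches nearby memory rather than scattered heap objects.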
|
| ## Run Benchmarks |
|
|
| ```bash |
| python benchmark_quick.py |
| ``` |
|
|
| ## Architecture |
|
|
| See [ARCHITECTURE.md](ARCHITECTURE.md) for the full technical design. |
|
|
| See [RESULTS.md](RESULTS.md) for detailed benchmark results. |
|
|
| See [WHY_HYPEROPT_GBT.md](WHY_HYPEROPT_GBT.md) for the motivation. |
|
|
| ## License |
|
|
| Apache 2.0 |
|
|