# HyperOpt-GBT
**HyperOptimized Gradient Boosted Trees** — a scikit-learn compatible library that combines the best innovations from XGBoost, LightGBM, CatBoost, and YDF into one implementation.
## Key Innovations
| Innovation | Source | Effect |
|---|---|---|
| **GOSS** (Gradient-based One-Side Sampling) | LightGBM | 2-5× faster training, often *better* accuracy |
| **Weighted Quantile Sketch** | XGBoost | +15 to 19 AUC points over uniform binning on skewed distributions |
| **Ordered Boosting** | CatBoost | Eliminates prediction shift → unbiased residuals |
| **Ordered Target Statistics** | CatBoost | Handles categoricals without target leakage |
| **Histogram-based Splits** | LightGBM | O(k) split search over k bins per feature, vs. scanning n sorted values (O(n log n) to sort) |
| **Compiled Inference Engines** | YDF | 5-100× faster prediction |
| **Oblivious Trees** | CatBoost | Regularization + SIMD-friendly structure |
| **Cache-aware Column Blocks** | XGBoost | Cache-friendly memory access |
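To make the "without target leakage" row concrete, here is a minimal sketch of ordered target statistics as described in the CatBoost paper. It is not this library's implementation: the function name, the `prior` and `prior_weight` parameters, and the optional `order` argument are assumptions made for illustration. Each row is encoded using only the targets of rows that come earlier in a permutation, so a row's own label never feeds its encoding.

```python
import numpy as np

def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0,
                         order=None, seed=0):
    """Encode one categorical column using only targets seen earlier in `order`.

    Hypothetical helper for illustration; not part of hyperopt_gbt.
    """
    categories = np.asarray(categories)
    targets = np.asarray(targets, dtype=float)
    if order is None:
        # Random visiting order, as in ordered (permutation-based) statistics.
        order = np.random.default_rng(seed).permutation(len(targets))
    sums, counts = {}, {}
    encoded = np.empty(len(targets))
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded[i] = (s + prior_weight * prior) / (n + prior_weight)
        # Only now does row i's own target become visible to later rows.
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded

enc = ordered_target_stats(["a", "a", "a"], [1, 0, 1], order=[0, 1, 2])
```

The first row of each category falls back to the prior, and later rows see a running, smoothed mean of earlier targets only.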
## Quick Start
```python
from hyperopt_gbt import HyperOptGradientBoostedClassifier

clf = HyperOptGradientBoostedClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    use_goss=True,              # LightGBM: gradient-based sampling
    binning='quantile_sketch',  # XGBoost: adaptive bin boundaries
    n_bins=255,
)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)
```
## Installation
```bash
# From source
pip install -e .
# With benchmark dependencies
pip install -e ".[benchmark]"
# Build Rust backend (optional, for maximum speed)
cd rust_gbt && pip install maturin && maturin develop --release
```
## Benchmark Results
### Binary Classification (80K train, 20K test, 50 trees)
| Library | AUC | Train Time |
|---|---|---|
| **HyperOpt-GBT (GOSS)** | 0.9691 | 2.5s |
| XGBoost (hist) | 0.9661 | 1.3s |
| LightGBM | 0.9659 | 1.0s |
| CatBoost | 0.9756 | 1.5s |
### GOSS: Faster AND More Accurate
| Data Used | AUC | Speedup |
|---|---|---|
| 100% (no GOSS) | 0.9659 | 1.0× |
| 40% (GOSS) | 0.9717 | 2.4× |
| **15% (GOSS)** | **0.9740** | **5.3×** |
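The sampling step behind these numbers can be sketched as follows. This follows the GOSS description in the LightGBM paper and is not taken from this library's source: the function name is hypothetical, and it assumes `a` and `b` are both fractions of the full dataset, with the sampled small-gradient rows re-weighted by `(1 - a) / b`.

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return (row_indices, row_weights) chosen by gradient-based one-side sampling.

    Hypothetical helper for illustration; not part of hyperopt_gbt.
    """
    if a + b > 1.0:
        raise ValueError("a + b must not exceed 1")
    g = np.abs(np.asarray(gradients, dtype=float))
    n = len(g)
    top_n, rand_n = int(a * n), int(b * n)

    order = np.argsort(-g)                 # largest |gradient| first
    top_idx, rest = order[:top_n], order[top_n:]

    # Uniformly sample b * n rows from the small-gradient remainder.
    rng = np.random.default_rng(seed)
    rand_idx = rng.choice(rest, size=rand_n, replace=False)

    idx = np.concatenate([top_idx, rand_idx])
    weights = np.ones(len(idx))
    # Up-weight sampled rows so they stand in for all small-gradient rows.
    weights[top_n:] = (1.0 - a) / b
    return idx, weights

idx, w = goss_sample(np.arange(100.0), a=0.2, b=0.1)
```

With `a=0.2` and `b=0.1`, each tree sees 30% of the rows, and the sampled small-gradient rows carry a weight of 8 when gradients and hessians are accumulated.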
### Quantile Sketch vs Uniform (Skewed Data)
| Bins | Uniform AUC | Quantile AUC | Gain (AUC points) |
|---|---|---|---|
| 63 | 0.6426 | 0.8306 | **+18.8** |
| 255 | 0.6775 | 0.8295 | **+15.2** |
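One way to see why uniform bins struggle on skewed features is to count how many bins actually receive data. The sketch below uses plain, unweighted NumPy quantiles as a stand-in. It is not XGBoost's weighted quantile sketch (which weights points by their hessians and works approximately) and it is not this library's binning code. The helper names and the log-normal test feature are assumptions for illustration.

```python
import numpy as np

def uniform_edges(x, n_bins):
    """Interior bin edges spaced evenly between min and max."""
    return np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]

def quantile_edges(x, n_bins):
    """Interior bin edges placed at evenly spaced quantiles of the data."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.unique(np.quantile(x, qs))   # drop duplicate edges on tied data

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=2.0, size=10_000)   # heavily right-skewed

n_bins = 63
u_bins = np.digitize(x, uniform_edges(x, n_bins))
q_bins = np.digitize(x, quantile_edges(x, n_bins))

u_used = len(np.unique(u_bins))   # bins that contain at least one row
q_used = len(np.unique(q_bins))
print(f"non-empty bins: uniform={u_used}, quantile={q_used}")
```

On data like this, uniform edges put most rows in the first few bins and leave many bins empty, while quantile edges spread rows evenly so every bin can contribute a candidate split.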
## API Reference
### Classifier
```python
HyperOptGradientBoostedClassifier(
    # Core
    n_estimators=100,        # Number of boosting rounds
    learning_rate=0.1,       # Shrinkage
    max_depth=6,             # Maximum tree depth
    # Accuracy innovations
    ordered_boosting=False,  # CatBoost: unbiased boosting
    ordered_ts=True,         # CatBoost: ordered target statistics
    oblivious_trees=False,   # CatBoost: balanced trees
    # Speed innovations
    use_goss=True,           # LightGBM: gradient sampling
    goss_a=0.2,              # Keep top 20% by gradient magnitude
    goss_b=0.1,              # Sample 10% from rest
    n_bins=255,              # Histogram bins
    binning='uniform',       # 'uniform' or 'quantile_sketch'
    # Regularization
    l2_reg=1.0,              # L2 on leaf weights
    min_child_weight=1.0,    # Min hessian sum in leaf
    subsample=1.0,           # Row subsampling
    colsample_bytree=1.0,    # Column subsampling
)
```
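To illustrate how `l2_reg` and `min_child_weight` typically enter a histogram-based split search, here is a sketch of the standard second-order gain from the XGBoost paper, evaluated over one feature's gradient and hessian histograms. It assumes these parameters play the same role in this library. The function name and the histogram inputs are hypothetical.

```python
import numpy as np

def best_split_from_histogram(grad_hist, hess_hist, l2_reg=1.0, min_child_weight=1.0):
    """Scan one feature's histogram; return (best_bin, best_gain).

    A split at bin i sends bins 0..i left and the remaining bins right.
    Hypothetical helper for illustration; not part of hyperopt_gbt.
    """
    grad_hist = np.asarray(grad_hist, dtype=float)
    hess_hist = np.asarray(hess_hist, dtype=float)
    G, H = grad_hist.sum(), hess_hist.sum()

    # Prefix sums give left-child totals for every candidate split in O(k).
    GL = np.cumsum(grad_hist)[:-1]
    HL = np.cumsum(hess_hist)[:-1]
    GR, HR = G - GL, H - HL

    gain = 0.5 * (GL**2 / (HL + l2_reg) + GR**2 / (HR + l2_reg) - G**2 / (H + l2_reg))

    # Reject splits where either child has too little hessian mass.
    valid = (HL >= min_child_weight) & (HR >= min_child_weight)
    gain = np.where(valid, gain, -np.inf)

    best = int(np.argmax(gain))
    return best, float(gain[best])

best_bin, best_gain = best_split_from_histogram([-4, -4, 4, 4], [2, 2, 2, 2])
```

A returned gain of `-inf` means no candidate satisfied `min_child_weight`, so the node would stay a leaf. A larger `l2_reg` shrinks every term and makes splits on small child nodes less attractive.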
### Regressor
```python
HyperOptGradientBoostedRegressor(
    # Same parameters as classifier
)
```
### Inference Engines
```python
from hyperopt_gbt import compile_inference_engine
engine = compile_inference_engine(model, engine_type='auto')
# Options: 'naive', 'flat', 'simd', 'quickscorer', 'auto'
predictions = engine.predict(X_binned)
```
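As a rough picture of what a flat engine might do, the sketch below stores one tree as parallel arrays and walks all rows through it with vectorized NumPy lookups. The array layout, the `-1` leaf marker, and the function name are assumptions for illustration only. The engines returned by `compile_inference_engine` may be organized differently.

```python
import numpy as np

def predict_flat(X_binned, feature, threshold, left, right, value):
    """Evaluate one tree stored as flat arrays on pre-binned rows.

    Node i is a leaf when feature[i] == -1. Hypothetical layout for
    illustration; not part of hyperopt_gbt.
    """
    node = np.zeros(len(X_binned), dtype=np.int64)   # every row starts at the root
    active = feature[node] >= 0
    while active.any():
        rows = np.nonzero(active)[0]
        n = node[rows]
        # Compare each row's bin index for the node's feature to the threshold.
        go_left = X_binned[rows, feature[n]] <= threshold[n]
        node[rows] = np.where(go_left, left[n], right[n])
        active = feature[node] >= 0
    return value[node]

# A depth-2 tree: the root splits on feature 0, its left child on feature 1.
feature   = np.array([0, 1, -1, -1, -1])
threshold = np.array([3, 5, 0, 0, 0])
left      = np.array([1, 3, -1, -1, -1])
right     = np.array([2, 4, -1, -1, -1])
value     = np.array([0.0, 0.0, 1.5, -0.5, 0.7])

X_binned = np.array([[2, 4], [2, 9], [7, 0]])
preds = predict_flat(X_binned, feature, threshold, left, right, value)
```

Because children are addressed by integer index instead of by pointer, traversal reduces to array gathers, which is the property the flat and SIMD-style engines rely on.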
## Rust Backend
The optional Rust backend provides the fastest training via:
- **Rayon** parallelism for histogram building across features
- **Flat tree arrays** (`Vec<TreeNode>`) — no pointer chasing
- **Zero-copy NumPy interop** via PyO3
- **LTO + native CPU** in release mode
```python
import rust_gbt
model = rust_gbt.PyRustGBT()
model.fit(
    X_train, y_train,
    n_estimators=50, learning_rate=0.1, max_depth=6,
    use_goss=True, goss_a=0.2, goss_b=0.1,
    binning="quantile", task="classification",
)
proba = model.predict_proba(X_test)
```
## Run Benchmarks
```bash
python benchmark_quick.py
```
## Architecture
See [ARCHITECTURE.md](ARCHITECTURE.md) for the full technical design.
See [RESULTS.md](RESULTS.md) for detailed benchmark results.
See [WHY_HYPEROPT_GBT.md](WHY_HYPEROPT_GBT.md) for the motivation.
## License
Apache 2.0