# HyperOptimized Gradient Boosted Trees (HyperOpt-GBT)

## A scikit-learn compatible, state-of-the-art gradient boosted tree library combining the best innovations from YDF, CatBoost, XGBoost, and LightGBM

---

## 1. Executive Summary

This library implements a **hyperoptimized gradient boosted tree (GBT) architecture** that synthesizes the best ideas from:

- **Yggdrasil Decision Forests (YDF)**: Modularity, inference engine compilation, model self-evaluation
- **CatBoost**: Ordered boosting (eliminates prediction shift), ordered target statistics for categorical features, oblivious trees
- **XGBoost**: Weighted quantile sketch, cache-aware access, sparsity-aware algorithms, column block structure
- **LightGBM**: Gradient-based One-Side Sampling (GOSS), Exclusive Feature Bundling (EFB), histogram-based splits, leaf-wise tree growth

**Key claim**: By combining these innovations intelligently, HyperOpt-GBT achieves both **higher accuracy** (via unbiased boosting and better categorical handling) and **faster training/inference** (via GOSS, histogram splits, optimized inference engines) than any individual library.

---

## 2. Architecture Overview

```
HyperOpt-GBT
├── Core Module: Modularity (from YDF)
│   ├── Learner-Model abstraction
│   ├── Pluggable splitters (exact, histogram, distributed)
│   ├── Pluggable inference engines
│   └── Meta-learners (hyperparameter tuning)
│
├── Training Optimizations
│   ├── Split Finding
│   │   ├── Histogram-based split finding (LightGBM)
│   │   ├── Weighted quantile sketch (XGBoost)
│   │   └── Exact greedy (fallback for small datasets)
│   ├── Sampling
│   │   ├── GOSS: Gradient-based One-Side Sampling (LightGBM)
│   │   └── Ordered boosting: unbiased residuals (CatBoost)
│   ├── Feature Handling
│   │   ├── Exclusive Feature Bundling (LightGBM)
│   │   ├── Ordered Target Statistics (CatBoost)
│   │   └── Native categorical splitting (CatBoost/YDF)
│   ├── Tree Growth
│   │   ├── Leaf-wise with depth limit (LightGBM)
│   │   └── Level-wise (XGBoost-style, configurable)
│   └── Cache & Memory
│       ├── Column block structure (XGBoost)
│       ├── Cache-aware prefetching (XGBoost)
│       └── Out-of-core computation (XGBoost)
│
├── Inference Optimizations
│   ├── Engine Compilation (YDF)
│   │   ├── SIMD vectorized tree traversal
│   │   ├── Batch prediction optimization
│   │   └── Model-specific compiled engine
│   ├── QuickScorer-style fast scoring
│   └── Hardware-specific backends
│       ├── CPU: AVX2/AVX-512 vectorization
│       └── GPU: CUDA kernels (optional)
│
└── scikit-learn Integration
    ├── sklearn.base.BaseEstimator
    ├── sklearn.base.ClassifierMixin / RegressorMixin
    └── Full pipeline compatibility
```

---

## 3. Detailed Innovations

### 3.1 Ordered Boosting (from CatBoost) - THE ACCURACY BREAKTHROUGH

**Problem**: Standard gradient boosting suffers from **prediction shift**: the same data used to compute gradients was also used to train the previous trees, causing biased residuals.

**Solution**: Ordered boosting maintains multiple supporting models trained on prefixes of a random permutation, so each example's residual is computed by a model that never saw that example.
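For intuition, here is a minimal sketch of computing ordered (unbiased) residuals for one boosting step, using a single permutation and block-wise prefixes. The function, helper choices, and block approximation are illustrative assumptions, not CatBoost's or HyperOpt-GBT's actual internals:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def ordered_residuals(X, y, n_blocks=8, seed=0):
    """Illustrative sketch: each example's residual is computed by a model
    trained only on examples that precede it in a random permutation
    (approximated here with block prefixes). X and y are NumPy arrays."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    blocks = np.array_split(perm, n_blocks)

    residuals = np.zeros(len(y))
    # The first block has no preceding examples; use the usual constant
    # initialization F_0 = mean(y) as its reference prediction.
    residuals[blocks[0]] = y[blocks[0]] - y.mean()

    for k in range(1, n_blocks):
        prefix = np.concatenate(blocks[:k])      # examples "earlier" in the permutation
        model = GradientBoostingRegressor(n_estimators=20, max_depth=3)
        model.fit(X[prefix], y[prefix])          # trained WITHOUT block k
        residuals[blocks[k]] = y[blocks[k]] - model.predict(X[blocks[k]])
    return residuals
```

The contrast with standard (biased) residuals is summarized below.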
```
Standard GBDT (biased):
  residual_i = y_i - F_{t-1}(x_i)        ← F_{t-1} was trained WITH example i

Ordered Boosting (unbiased):
  residual_i = y_i - M_{σ(i)-1}(x_i)     ← M_{σ(i)-1} was trained WITHOUT example i
```

**Implementation**: Instead of n separate models (infeasible), CatBoost uses:

- s+1 random permutations σ₀, σ₁, ..., σₛ
- Supporting models M_{r,j} trained on the first j examples of permutation σᵣ
- Tree structure shared across all models; only the leaf values differ
- Time complexity: O(s × n × log(depth)) per iteration

**HyperOpt-GBT Enhancement**: We combine ordered boosting with GOSS, keeping only the high-gradient instances within the ordered permutation, which makes ordered boosting practical for billion-scale datasets.

### 3.2 Ordered Target Statistics (from CatBoost) - CATEGORICAL FEATURE MASTERY

**Problem**: One-hot encoding explodes dimensionality, and greedy target statistics leak target information.

**Solution**: For categorical feature i, replace the category value xₖⁱ with:

```
TS_k^i = (Σ_{j: σ(j)<σ(k), x_j^i = x_k^i} y_j + a·p) / (Σ_{j: σ(j)<σ(k), x_j^i = x_k^i} 1 + a)
```

where σ is a random permutation, a is the prior weight, and p is the global prior.

**Properties**:

- No target leakage: only examples preceding k in the permutation are used
- A single numeric column per categorical feature (vs. O(#categories) columns for one-hot)
- Exact optimal splits found among TS thresholds

**HyperOpt-GBT Enhancement**: Combine with YDF's categorical-set splits for multi-value categorical features.

### 3.3 Gradient-based One-Side Sampling (GOSS) (from LightGBM) - THE SPEED BREAKTHROUGH

**Problem**: Computing splits over all instances is O(n) per feature per node.

**Insight**: Instances with small gradients are already well fit by the current model and contribute little to split quality.

**Algorithm**:

1. Sort instances by absolute gradient
2. Keep the top a×100% of instances with the largest gradients (the "a" parameter)
3. Randomly sample b×100% of the remaining instances (the "b" parameter)
4. Amplify the sampled small-gradient instances by a factor of (1-a)/b

**Result**: Train on roughly (a+b)×100% of the data with minimal accuracy loss. Typical values: a=0.2, b=0.1 → use 30% of the data.

**HyperOpt-GBT Enhancement**: Apply GOSS within each ordered permutation subset. Since ordered boosting already processes examples in permutation order, GOSS naturally applies at the gradient computation step.

### 3.4 Exclusive Feature Bundling (EFB) (from LightGBM)

**Problem**: High-dimensional sparse features (e.g., one-hot encodings) waste memory and computation.

**Insight**: Mutually exclusive features (features that are rarely non-zero simultaneously) can be bundled into a single feature.

**Algorithm**:

1. Build a conflict graph: features are vertices, edges are weighted by the number of instances where both features are non-zero
2. Greedily color the graph to group features into bundles
3. Merge each bundle by adding per-feature offsets so value ranges do not overlap: bundled value = fᵢ + offsetᵢ for the (at most one) non-zero feature fᵢ

**Result**: Reduce the feature count from D to D' << D with negligible accuracy loss.

**HyperOpt-GBT Enhancement**: Integrate with categorical TS: bundle low-cardinality categorical features after TS conversion.

### 3.5 Histogram-Based Split Finding (from LightGBM)

**Problem**: Exact split finding requires O(n log n) sorting per feature per node.

**Solution**: Pre-discretize continuous features into k bins (typically k=255).

**Algorithm**:

1. Compute feature histograms: instance counts and gradient sums per bin
2. Find the optimal split by scanning the bins (O(k) vs. O(n log n))
3. Reuse histograms across nodes where possible (a sibling's histogram is the parent's minus the other child's)

**Complexity**: O(n × k) preprocessing, then O(k) per feature per node.

**HyperOpt-GBT Enhancement**: Adaptive binning - use the weighted quantile sketch (XGBoost) for bin boundaries instead of uniform bins. This handles outliers better and maintains accuracy.
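Below is a minimal sketch of the histogram split search described above, for a single pre-binned feature under a squared-error loss (constant hessian). The function name and simplifications are illustrative, not HyperOpt-GBT's actual API:

```python
import numpy as np

def best_split_for_feature(bin_idx, grad, n_bins=255, min_leaf=20):
    """Illustrative histogram split search for one pre-binned feature.

    bin_idx : int array, bin index of each instance for this feature
    grad    : float array, per-instance gradients at the current iteration
    With a constant hessian the split gain reduces to
    G_L^2/n_L + G_R^2/n_R - G^2/n (up to constants).
    """
    # 1. Build the histogram: gradient sum and instance count per bin.
    grad_hist = np.bincount(bin_idx, weights=grad, minlength=n_bins)
    cnt_hist = np.bincount(bin_idx, minlength=n_bins)

    g_total, n_total = grad_hist.sum(), cnt_hist.sum()
    parent_score = g_total ** 2 / max(n_total, 1)

    # 2. Scan bin boundaries left to right: O(n_bins) instead of O(n log n).
    best_gain, best_bin = 0.0, None
    g_left, n_left = 0.0, 0
    for b in range(n_bins - 1):
        g_left += grad_hist[b]
        n_left += cnt_hist[b]
        n_right = n_total - n_left
        if n_left < min_leaf or n_right < min_leaf:
            continue
        gain = g_left ** 2 / n_left + (g_total - g_left) ** 2 / n_right - parent_score
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```

The same scan generalizes to second-order losses by accumulating hessian sums alongside gradient sums and scoring candidates with G²/(H + λ) terms.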
### 3.6 Weighted Quantile Sketch (from XGBoost)

**Problem**: Uniform binning loses accuracy for skewed distributions.

**Solution**: Find candidate split points such that each bucket contains roughly equal *total hessian weight* (not equal count). The rank of a candidate value z for feature k is:

```
r_k(z) = (1 / Σ_{(x,h)∈D_k} h) · Σ_{(x,h)∈D_k, x<z} h
```

where D_k is the multiset of (feature value, hessian) pairs for feature k. Candidate split points are chosen so that consecutive candidates differ in rank by less than ε, giving roughly 1/ε hessian-weighted buckets.
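To connect this with the adaptive-binning enhancement in 3.5, here is a simplified in-memory sketch of hessian-weighted bin boundary selection. It computes exact weighted quantiles rather than XGBoost's mergeable approximate sketch, and the function name is illustrative:

```python
import numpy as np

def weighted_quantile_bins(feature, hessian, n_bins=255):
    """Pick bin boundaries so each bin holds roughly equal total hessian weight.

    Simplification: exact weighted quantiles computed in memory, whereas
    XGBoost uses a mergeable, approximate quantile sketch for large data.
    """
    order = np.argsort(feature)
    x, h = feature[order], hessian[order]

    # Cumulative hessian rank r_k(z), normalized to [0, 1].
    rank = np.cumsum(h) / h.sum()

    # Candidate boundaries at evenly spaced hessian ranks (eps = 1/n_bins).
    targets = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    cut_idx = np.searchsorted(rank, targets)
    return np.unique(x[cut_idx])
```

These boundaries can then feed directly into the pre-binning step of the histogram split finder sketched after 3.5.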