---
license: apache-2.0
tags:
- xgboost
- classification
- grant-matching
- win-probability
- nonprofit
- foundation-grants
datasets:
- ArkMaster123/grantpilot-training-data
language:
- en
---
# GrantPilot Win Probability Classifier V2 (Federal + Foundation)
XGBoost classifier for predicting grant funding success. V2 extends coverage from federal-only (NIH/NSF) to include 37,684 private foundations.
See also: V1 (federal-only)
## Performance
| Metric | V1 (Federal Only) | V2 (Combined) | Change |
|---|---|---|---|
| Overall AUC-ROC | 0.837 | 0.997 | +19.1% |
| Federal AUC | 0.837 | 0.913 | +9.1% |
| Brier Score | 0.167 | 0.014 | -91.6% |
| Accuracy | 72.1% | 98.3% | +26.2% |
| Precision | 47.4% | 97.1% | +49.7% |
| Recall | 79.9% | 99.6% | +19.7% |
| F1 Score | 0.595 | 0.983 | +65.2% |
**Federal Regression Check: PASS**
Federal-only AUC improved from 0.837 to 0.913, well above the 0.817 minimum threshold.
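The regression check above can be sketched as follows. This is a hypothetical illustration, not the repo's actual evaluation script: the function name, toy labels, and predictions are invented; only the 0.817 threshold and the metrics (AUC-ROC, Brier score) come from this card.

```python
# Hypothetical sketch: recompute AUC-ROC on the federal-only slice of a
# test set and compare against the V1 floor of 0.817.
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score


def federal_regression_check(y_true, y_prob, is_federal, min_auc=0.817):
    """Return (auc, passed) for the federal-only subset of predictions."""
    mask = np.asarray(is_federal, dtype=bool)
    auc = roc_auc_score(np.asarray(y_true)[mask], np.asarray(y_prob)[mask])
    return auc, bool(auc >= min_auc)


# Toy data: first four rows are federal, last two are foundation grants
y_true = np.array([0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.9, 0.8, 0.2, 0.7, 0.3])
is_federal = np.array([1, 1, 1, 1, 0, 0])

auc, passed = federal_regression_check(y_true, y_prob, is_federal)
brier = brier_score_loss(y_true, y_prob)  # calibration quality, lower is better
print(auc, passed)  # 1.0 True (on this toy data)
```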
## Important Context

The classifier is excellent, but the embedding model feeding it is not (see the grantpilot-embedding-v2 benchmark results). The V2 embedding underperforms OpenAI on retrieval, unlike V1, which beat OpenAI. The classifier compensates because it uses multiple features beyond cosine similarity alone.
## Model Architecture

```
Input Features:
├── cosine_similarity (from grantpilot-embedding-v2)
├── funder_type (categorical: FOUNDATION, FEDERAL)
├── source (categorical: NIH, NSF, FOUNDATIONS)
├── log_amount (grant amount)
├── org_text_length
└── grant_text_length
        ↓
 XGBoost Classifier
        ↓
 Isotonic Calibration
        ↓
 Win Probability (0-100%)
```
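Assembling the six input features might look like the sketch below. The categorical integer encodings and the `build_features` helper are assumptions for illustration; the card does not document the actual preprocessing, only the feature names.

```python
# Hedged sketch of building one feature row in the order listed above.
# The categorical mappings here are assumed, not taken from the repo.
import numpy as np

FUNDER_TYPES = {"FOUNDATION": 0, "FEDERAL": 1}    # assumed encoding
SOURCES = {"NIH": 0, "NSF": 1, "FOUNDATIONS": 2}  # assumed encoding


def build_features(cosine_similarity, funder_type, source,
                   amount_usd, org_text, grant_text):
    """Return a (1, 6) feature array matching the listed input order."""
    return np.array([[
        cosine_similarity,
        FUNDER_TYPES[funder_type],
        SOURCES[source],
        np.log1p(amount_usd),  # log_amount
        len(org_text),         # org_text_length
        len(grant_text),       # grant_text_length
    ]])


features = build_features(0.82, "FEDERAL", "NIH", 250_000,
                          "University research lab mission statement",
                          "R01 grant abstract text")
print(features.shape)  # (1, 6)
```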
## Training Data
| Split | Foundation | NIH | NSF | Total |
|---|---|---|---|---|
| Train | 584,802 | 51,434 | 12,638 | 648,874 |
| Val | 73,240 | 6,445 | 1,599 | 81,284 |
| Test | 73,022 | 6,384 | 1,588 | 80,994 |
Foundation data sourced from IRS 990-PF e-filings via GivingTuesday (680,970 grants, 88% with purpose text).
## Training Details
- Hardware: NVIDIA H100 80GB
- XGBoost: max_depth=6, lr=0.1, n_estimators=200, subsample=0.8
- Calibration: Isotonic regression on validation set
- Batch Size: 256 for embedding feature computation
## Usage

```python
import pickle

import xgboost as xgb
from huggingface_hub import hf_hub_download

# Download model files
model_path = hf_hub_download("ArkMaster123/grantpilot-classifier-v2", "xgboost_model.json")
scaler_path = hf_hub_download("ArkMaster123/grantpilot-classifier-v2", "scaler.pkl")
calibrator_path = hf_hub_download("ArkMaster123/grantpilot-classifier-v2", "isotonic_calibrator.pkl")

# Load
model = xgb.Booster()
model.load_model(model_path)
with open(scaler_path, "rb") as f:
    scaler = pickle.load(f)
with open(calibrator_path, "rb") as f:
    calibrator = pickle.load(f)

# Predict: `features` is an (n, 6) array in the input-feature order above
features_scaled = scaler.transform(features)
dmatrix = xgb.DMatrix(features_scaled)
raw_pred = model.predict(dmatrix)
win_probability = calibrator.predict(raw_pred) * 100
```
## Related Models
| Model | Description |
|---|---|
| grantpilot-embedding-v2 | V2 embedding (required for cosine_similarity feature) |
| grantpilot-embedding | V1 β federal-only, beats OpenAI on retrieval |
| grantpilot-classifier | V1 β federal-only classifier (AUC 0.837) |
| grantpilot-training-data | Training data (V1 at training/, V2 at training_v2/) |