---
license: apache-2.0
tags:
- xgboost
- classification
- grant-matching
- win-probability
- nonprofit
- foundation-grants
datasets:
- ArkMaster123/grantpilot-training-data
language:
- en
---

# GrantPilot Win Probability Classifier V2 (Federal + Foundation)

XGBoost classifier for predicting grant funding success. V2 extends coverage from federal-only (NIH/NSF) to include **37,684 private foundations**.

> **See also:** [V1 (federal-only)](https://huggingface.co/ArkMaster123/grantpilot-classifier)

## Performance

| Metric | V1 (Federal Only) | V2 (Combined) | Change |
|--------|-------------------|---------------|--------|
| **Overall AUC-ROC** | 0.837 | **0.997** | +19.1% |
| **Federal AUC** | 0.837 | **0.913** | +9.1% |
| Brier Score | 0.167 | **0.014** | -91.6% |
| Accuracy | 72.1% | **98.3%** | +26.2 pp |
| Precision | 47.4% | **97.1%** | +49.7 pp |
| Recall | 79.9% | **99.6%** | +19.7 pp |
| F1 Score | 0.595 | **0.983** | +65.2% |

### Federal Regression Check: PASS

Federal-only AUC improved from 0.837 to **0.913**, well above the 0.817 minimum threshold.

## Important Context

The classifier itself is strong, but the **embedding model feeding it is not**; see the [grantpilot-embedding-v2](https://huggingface.co/ArkMaster123/grantpilot-embedding-v2) benchmark results. The V2 embedding underperforms OpenAI on retrieval (unlike V1, which beat OpenAI). The classifier compensates because it uses multiple features beyond cosine similarity alone.

## Model Architecture

```
Input Features:
├── cosine_similarity (from grantpilot-embedding-v2)
├── funder_type (categorical: FOUNDATION, FEDERAL)
├── source (categorical: NIH, NSF, FOUNDATIONS)
├── log_amount (log-transformed grant amount)
├── org_text_length
└── grant_text_length

↓ XGBoost Classifier
↓ Isotonic Calibration
→ Win Probability (0-100%)
```

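The calibration stage can be sketched with scikit-learn's `IsotonicRegression`. This is a minimal illustration of the technique, not the repository's training code; the scores and labels below are synthetic stand-ins for real validation data:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Stand-ins for validation data: raw classifier scores and true outcomes,
# deliberately miscalibrated so that P(win | score s) is roughly s**2
raw_scores = rng.uniform(0.0, 1.0, size=2000)
labels = (rng.uniform(0.0, 1.0, size=2000) < raw_scores**2).astype(int)

# Fit the isotonic calibrator on the held-out scores
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_scores, labels)

# Map new raw scores to calibrated win probabilities on a 0-100% scale
win_probability = calibrator.predict(np.array([0.2, 0.5, 0.9])) * 100
```

Isotonic regression learns a monotone mapping from raw scores to observed win rates, which is what turns the raw XGBoost output into the calibrated 0-100% win probability this model reports.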
## Training Data

| Split | Foundation | NIH | NSF | Total |
|-------|-----------|-----|-----|-------|
| Train | 584,802 | 51,434 | 12,638 | 648,874 |
| Val | 73,240 | 6,445 | 1,599 | 81,284 |
| Test | 73,022 | 6,384 | 1,588 | 80,994 |

Foundation data sourced from IRS 990-PF e-filings via GivingTuesday (680,970 grants, 88% with purpose text).

## Training Details

- **Hardware**: NVIDIA H100 80GB
- **XGBoost**: max_depth=6, lr=0.1, n_estimators=200, subsample=0.8
- **Calibration**: Isotonic regression on validation set
- **Batch Size**: 256 for embedding feature computation

## Usage

```python
import pickle

import xgboost as xgb
from huggingface_hub import hf_hub_download

# Download model files from the Hub
model_path = hf_hub_download("ArkMaster123/grantpilot-classifier-v2", "xgboost_model.json")
scaler_path = hf_hub_download("ArkMaster123/grantpilot-classifier-v2", "scaler.pkl")
calibrator_path = hf_hub_download("ArkMaster123/grantpilot-classifier-v2", "isotonic_calibrator.pkl")

# Load the booster, feature scaler, and isotonic calibrator
model = xgb.Booster()
model.load_model(model_path)

with open(scaler_path, "rb") as f:
    scaler = pickle.load(f)
with open(calibrator_path, "rb") as f:
    calibrator = pickle.load(f)

# Predict: `features` is a 2D array of shape (n_samples, 6), with columns
# in the same order as the training features (see Model Architecture)
features_scaled = scaler.transform(features)
dmatrix = xgb.DMatrix(features_scaled)
raw_pred = model.predict(dmatrix)
win_probability = calibrator.predict(raw_pred) * 100
```

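The usage snippet leaves `features` undefined. One hypothetical way to assemble a feature row is shown below; the column order follows the Model Architecture section, but the integer codes for `funder_type` and `source` are assumptions, not the mapping used in training:

```python
import numpy as np

# Hypothetical categorical encodings -- the real training mapping may differ
FUNDER_TYPE = {"FEDERAL": 0, "FOUNDATION": 1}
SOURCE = {"NIH": 0, "NSF": 1, "FOUNDATIONS": 2}

def build_feature_row(cosine_similarity, funder_type, source,
                      amount, org_text, grant_text):
    """Assemble one row in the feature order listed under Model Architecture."""
    return np.array([
        cosine_similarity,
        FUNDER_TYPE[funder_type],
        SOURCE[source],
        np.log1p(amount),   # log_amount
        len(org_text),      # org_text_length
        len(grant_text),    # grant_text_length
    ], dtype=np.float64)

# One org/grant pair -> shape (1, 6), ready for scaler.transform
features = build_feature_row(0.73, "FOUNDATION", "FOUNDATIONS", 50_000,
                             "Org mission statement...",
                             "Grant purpose text...").reshape(1, -1)
```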
## Related Models

| Model | Description |
|-------|-------------|
| [grantpilot-embedding-v2](https://huggingface.co/ArkMaster123/grantpilot-embedding-v2) | V2 embedding (required for the cosine_similarity feature) |
| [grantpilot-embedding](https://huggingface.co/ArkMaster123/grantpilot-embedding) | V1 federal-only embedding; beats OpenAI on retrieval |
| [grantpilot-classifier](https://huggingface.co/ArkMaster123/grantpilot-classifier) | V1 federal-only classifier (AUC 0.837) |
| [grantpilot-training-data](https://huggingface.co/datasets/ArkMaster123/grantpilot-training-data) | Training data (V1 at training/, V2 at training_v2/) |