---
license: apache-2.0
tags:
- xgboost
- classification
- grant-matching
- win-probability
- nonprofit
- foundation-grants
datasets:
- ArkMaster123/grantpilot-training-data
language:
- en
---

# GrantPilot Win Probability Classifier V2 (Federal + Foundation)

XGBoost classifier for predicting grant funding success. V2 extends coverage from federal-only (NIH/NSF) to include **37,684 private foundations**.

> **See also:** [V1 (federal-only)](https://huggingface.co/ArkMaster123/grantpilot-classifier)

## Performance

| Metric | V1 (Federal Only) | V2 (Combined) | Change |
|--------|-------------------|---------------|--------|
| **Overall AUC-ROC** | 0.837 | **0.997** | +19.1% |
| **Federal AUC** | 0.837 | **0.913** | +9.1% |
| Brier Score | 0.167 | **0.014** | -91.6% |
| Accuracy | 72.1% | **98.3%** | +26.2% |
| Precision | 47.4% | **97.1%** | +49.7% |
| Recall | 79.9% | **99.6%** | +19.7% |
| F1 Score | 0.595 | **0.983** | +65.2% |

### Federal Regression Check: PASS

Federal-only AUC improved from 0.837 to **0.913**, well above the 0.817 minimum threshold.

## Important Context

The classifier is excellent, but the **embedding model feeding it is not** — see the [grantpilot-embedding-v2](https://huggingface.co/ArkMaster123/grantpilot-embedding-v2) benchmark results. The V2 embedding underperforms OpenAI on retrieval (unlike V1, which beat OpenAI). The classifier compensates because it uses multiple features beyond just cosine similarity.
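Because cosine similarity between the organization and grant embeddings is only one of six input features, a weaker retriever can still feed a strong classifier. For reference, the `cosine_similarity` feature is just the normalized dot product of the two embedding vectors — a minimal NumPy sketch (the toy vectors below stand in for real embeddings produced by grantpilot-embedding-v2):

```python
import numpy as np

def cosine_similarity(org_vec: np.ndarray, grant_vec: np.ndarray) -> float:
    """Normalized dot product of two embedding vectors, in [-1, 1]."""
    return float(
        np.dot(org_vec, grant_vec)
        / (np.linalg.norm(org_vec) * np.linalg.norm(grant_vec))
    )

# Toy 3-d vectors stand in for real embedding outputs
org = np.array([1.0, 0.0, 1.0])
grant = np.array([1.0, 1.0, 0.0])
print(round(cosine_similarity(org, grant), 3))  # 0.5
```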
## Model Architecture

```
Input Features:
├── cosine_similarity (from grantpilot-embedding-v2)
├── funder_type (categorical: FOUNDATION, FEDERAL)
├── source (categorical: NIH, NSF, FOUNDATIONS)
├── log_amount (grant amount)
├── org_text_length
└── grant_text_length

→ XGBoost Classifier
→ Isotonic Calibration
→ Win Probability (0-100%)
```

## Training Data

| Split | Foundation | NIH | NSF | Total |
|-------|-----------|-----|-----|-------|
| Train | 584,802 | 51,434 | 12,638 | 648,874 |
| Val | 73,240 | 6,445 | 1,599 | 81,284 |
| Test | 73,022 | 6,384 | 1,588 | 80,994 |

Foundation data sourced from IRS 990-PF e-filings via GivingTuesday (680,970 grants, 88% with purpose text).

## Training Details

- **Hardware**: NVIDIA H100 80GB
- **XGBoost**: max_depth=6, lr=0.1, n_estimators=200, subsample=0.8
- **Calibration**: Isotonic regression on validation set
- **Batch Size**: 256 for embedding feature computation

## Usage

```python
import pickle

import xgboost as xgb
from huggingface_hub import hf_hub_download

# Download model files
model_path = hf_hub_download("ArkMaster123/grantpilot-classifier-v2", "xgboost_model.json")
scaler_path = hf_hub_download("ArkMaster123/grantpilot-classifier-v2", "scaler.pkl")
calibrator_path = hf_hub_download("ArkMaster123/grantpilot-classifier-v2", "isotonic_calibrator.pkl")

# Load the booster, the feature scaler, and the isotonic calibrator
model = xgb.Booster()
model.load_model(model_path)
with open(scaler_path, "rb") as f:
    scaler = pickle.load(f)
with open(calibrator_path, "rb") as f:
    calibrator = pickle.load(f)

# Predict: `features` is a 2-D array with columns in the training feature order
# (cosine_similarity, funder_type, source, log_amount, org_text_length, grant_text_length)
features_scaled = scaler.transform(features)
dmatrix = xgb.DMatrix(features_scaled)
raw_pred = model.predict(dmatrix)
win_probability = calibrator.predict(raw_pred) * 100
```

## Related Models

| Model | Description |
|-------|-------------|
| [grantpilot-embedding-v2](https://huggingface.co/ArkMaster123/grantpilot-embedding-v2) | V2 embedding (required for the cosine_similarity feature) |
| [grantpilot-embedding](https://huggingface.co/ArkMaster123/grantpilot-embedding) | V1 — federal-only, beats OpenAI on retrieval |
| [grantpilot-classifier](https://huggingface.co/ArkMaster123/grantpilot-classifier) | V1 — federal-only classifier (AUC 0.837) |
| [grantpilot-training-data](https://huggingface.co/datasets/ArkMaster123/grantpilot-training-data) | Training data (V1 at training/, V2 at training_v2/) |
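The `features` array passed to `scaler.transform` in the Usage snippet must match the training feature order listed under Model Architecture. A hypothetical sketch of assembling one row — the integer codes for the two categorical features are assumptions for illustration, not the mapping shipped with the model:

```python
import math
import numpy as np

# Hypothetical category-to-integer codes; the actual encoding used at
# training time may differ, so check the repo's feature spec.
FUNDER_TYPE = {"FOUNDATION": 0, "FEDERAL": 1}
SOURCE = {"NIH": 0, "NSF": 1, "FOUNDATIONS": 2}

def build_feature_row(cosine_sim, funder_type, source, amount_usd, org_text, grant_text):
    """Assemble one row in the documented feature order."""
    return np.array([
        cosine_sim,                 # cosine_similarity
        FUNDER_TYPE[funder_type],   # funder_type (encoded)
        SOURCE[source],             # source (encoded)
        math.log1p(amount_usd),     # log_amount; log1p guards against zero amounts
        len(org_text),              # org_text_length
        len(grant_text),            # grant_text_length
    ], dtype=np.float64)

row = build_feature_row(0.72, "FOUNDATION", "FOUNDATIONS", 50_000,
                        "Org mission statement", "Grant purpose text")
features = row.reshape(1, -1)  # shape (1, 6), ready for scaler.transform
```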