---
license: apache-2.0
tags:
  - xgboost
  - classification
  - grant-matching
  - win-probability
  - nonprofit
  - foundation-grants
datasets:
  - ArkMaster123/grantpilot-training-data
language:
  - en
---

# GrantPilot Win Probability Classifier V2 (Federal + Foundation)

XGBoost classifier for predicting grant funding success. V2 extends coverage from federal-only (NIH/NSF) to include **37,684 private foundations**.

> **See also:** [V1 (federal-only)](https://huggingface.co/ArkMaster123/grantpilot-classifier)

## Performance

| Metric | V1 (Federal Only) | V2 (Combined) | Change |
|--------|-------------------|---------------|--------|
| **Overall AUC-ROC** | 0.837 | **0.997** | +19.1% |
| **Federal AUC** | 0.837 | **0.913** | +9.1% |
| Brier Score | 0.167 | **0.014** | -91.6% |
| Accuracy | 72.1% | **98.3%** | +26.2 pp |
| Precision | 47.4% | **97.1%** | +49.7 pp |
| Recall | 79.9% | **99.6%** | +19.7 pp |
| F1 Score | 0.595 | **0.983** | +65.2% |

(Changes for AUC, Brier, and F1 are relative; changes for accuracy, precision, and recall are in percentage points.)
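
For reference, metrics like these can be reproduced with scikit-learn. This is an illustrative sketch on synthetic labels, not the actual evaluation script:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss, f1_score

# Illustrative only: y_true are binary funded/not-funded labels,
# y_prob are calibrated win probabilities in [0, 1].
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(y_true * 0.8 + rng.normal(0.1, 0.1, size=1000), 0.0, 1.0)

auc = roc_auc_score(y_true, y_prob)        # AUC-ROC (higher is better)
brier = brier_score_loss(y_true, y_prob)   # Brier score (lower is better)
f1 = f1_score(y_true, (y_prob >= 0.5).astype(int))
```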

### Federal Regression Check: PASS

Federal-only AUC improved from 0.837 to **0.913**, well above the 0.817 minimum threshold.

## Important Context

The classifier is excellent, but the **embedding model feeding it is not** β€” see [grantpilot-embedding-v2](https://huggingface.co/ArkMaster123/grantpilot-embedding-v2) benchmark results. The V2 embedding underperforms OpenAI on retrieval (unlike V1 which beat OpenAI). The classifier compensates because it uses multiple features beyond just cosine similarity.

## Model Architecture

```
Input Features:
β”œβ”€β”€ cosine_similarity (from grantpilot-embedding-v2)
β”œβ”€β”€ funder_type (categorical: FOUNDATION, FEDERAL)
β”œβ”€β”€ source (categorical: NIH, NSF, FOUNDATIONS)
β”œβ”€β”€ log_amount (grant amount)
β”œβ”€β”€ org_text_length
└── grant_text_length

β†’ XGBoost Classifier
β†’ Isotonic Calibration
β†’ Win Probability (0-100%)
```
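
A hypothetical sketch of assembling one feature row in the order shown above. The integer encodings for the categorical features are assumptions here; verify against the preprocessing in the repo:

```python
import numpy as np

# Assumed encodings -- the actual integer mapping used in training may differ.
FUNDER_TYPE = {"FOUNDATION": 0, "FEDERAL": 1}
SOURCE = {"NIH": 0, "NSF": 1, "FOUNDATIONS": 2}

def build_features(cosine_similarity: float, funder_type: str, source: str,
                   amount: float, org_text: str, grant_text: str) -> np.ndarray:
    """Assemble one row in the (assumed) feature order used by the model."""
    return np.array([[
        cosine_similarity,            # from grantpilot-embedding-v2
        FUNDER_TYPE[funder_type],
        SOURCE[source],
        np.log1p(amount),             # log_amount: log(1 + grant amount)
        len(org_text),                # org_text_length
        len(grant_text),              # grant_text_length
    ]], dtype=float)

row = build_features(0.72, "FEDERAL", "NIH", 250_000,
                     "org mission text...", "grant abstract text...")
```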

## Training Data

| Split | Foundation | NIH | NSF | Total |
|-------|-----------|-----|-----|-------|
| Train | 584,802 | 51,434 | 12,638 | 648,874 |
| Val | 73,240 | 6,445 | 1,599 | 81,284 |
| Test | 73,022 | 6,384 | 1,588 | 80,994 |

Foundation data sourced from IRS 990-PF e-filings via GivingTuesday (680,970 grants, 88% with purpose text).

## Training Details

- **Hardware**: NVIDIA H100 80GB
- **XGBoost**: max_depth=6, lr=0.1, n_estimators=200, subsample=0.8
- **Calibration**: Isotonic regression on validation set
- **Batch Size**: 256 for embedding feature computation

## Usage

```python
import xgboost as xgb
import pickle
from huggingface_hub import hf_hub_download

# Download model files
model_path = hf_hub_download("ArkMaster123/grantpilot-classifier-v2", "xgboost_model.json")
scaler_path = hf_hub_download("ArkMaster123/grantpilot-classifier-v2", "scaler.pkl")
calibrator_path = hf_hub_download("ArkMaster123/grantpilot-classifier-v2", "isotonic_calibrator.pkl")

# Load
model = xgb.Booster()
model.load_model(model_path)

with open(scaler_path, "rb") as f:
    scaler = pickle.load(f)
with open(calibrator_path, "rb") as f:
    calibrator = pickle.load(f)

# Predict. `features` is a 2D array with columns in the model's training
# order: cosine_similarity, funder_type, source, log_amount,
# org_text_length, grant_text_length (see Model Architecture above).
features_scaled = scaler.transform(features)
dmatrix = xgb.DMatrix(features_scaled)
raw_pred = model.predict(dmatrix)
win_probability = calibrator.predict(raw_pred) * 100
```

## Related Models

| Model | Description |
|-------|-------------|
| [grantpilot-embedding-v2](https://huggingface.co/ArkMaster123/grantpilot-embedding-v2) | V2 embedding (required for cosine_similarity feature) |
| [grantpilot-embedding](https://huggingface.co/ArkMaster123/grantpilot-embedding) | V1 β€” federal-only, beats OpenAI on retrieval |
| [grantpilot-classifier](https://huggingface.co/ArkMaster123/grantpilot-classifier) | V1 β€” federal-only classifier (AUC 0.837) |
| [grantpilot-training-data](https://huggingface.co/datasets/ArkMaster123/grantpilot-training-data) | Training data (V1 at training/, V2 at training_v2/) |