ArkMaster123/grantpilot-classifier-v2

@@ -6,50 +6,72 @@ tags:
   - grant-matching
   - win-probability
   - nonprofit
 datasets:
   - ArkMaster123/grantpilot-training-data
 language:
   - en
 ---
-# GrantPilot Win Probability Classifier
-**XGBoost classifier for predicting grant funding success**
-This model predicts the probability that a nonprofit organization will win a specific grant, based on embedding similarity and structured features.
-## Performance Metrics
-| Metric | Score | Target |
-|--------|-------|--------|
-| **AUC-ROC** | **0.837** | > 0.75 |
-| Brier Score | 0.167 | < 0.15 |
-| Accuracy | 72.1% | - |
-| Precision | 47.4% | - |
-| Recall | 79.9% | - |
-| F1 Score | 0.595 | - |
-### Key Highlights:
-- **AUC-ROC of 0.837** exceeds target by 12%
-- **High recall (80%)** ensures we catch most winning opportunities
-- Calibrated with isotonic regression for reliable probability estimates
 ## Model Architecture
 ```
 Input Features:
-- cosine_similarity (from fine-tuned embedding model)
-- funder_type (categorical)
-- source (categorical: NIH, NSF)
-- log_amount (grant amount)
-- org_text_length
-- grant_text_length
--> XGBoost Classifier
--> Isotonic Calibration
--> Win Probability (0-100%)
 ```
 ## Usage
 ```python
@@ -58,94 +80,31 @@ import pickle
 from huggingface_hub import hf_hub_download
 # Download model files
-model_path = hf_hub_download("ArkMaster123/grantpilot-classifier", "xgboost_model.json")
-scaler_path = hf_hub_download("ArkMaster123/grantpilot-classifier", "scaler.pkl")
-calibrator_path = hf_hub_download("ArkMaster123/grantpilot-classifier", "isotonic_calibrator.pkl")
-# Load model
 model = xgb.Booster()
 model.load_model(model_path)
 with open(scaler_path, "rb") as f:
     scaler = pickle.load(f)
 with open(calibrator_path, "rb") as f:
     calibrator = pickle.load(f)
-# Predict (after generating features)
 features_scaled = scaler.transform(features)
 dmatrix = xgb.DMatrix(features_scaled)
 raw_pred = model.predict(dmatrix)
 win_probability = calibrator.predict(raw_pred) * 100
 ```
-## Training Details
-- **Hardware**: NVIDIA H100 80GB
-- **Training Data**: 59K training pairs, 7.4K validation, 6.6K test
-- **XGBoost Parameters**:
-  - max_depth: 6
-  - learning_rate: 0.1
-  - n_estimators: 200 (early stopped at 18)
-  - subsample: 0.8
-## Intended Use
-This model is designed to:
-- Predict win probability for grant-organization matches
-- Help nonprofits prioritize grant applications
-- Provide confidence scores for grant recommendations
-## Limitations
-- Trained on federal grants (NIH, NSF) - accuracy may vary for other funders
-- Requires the fine-tuned embedding model for cosine_similarity feature
-- Best used in conjunction with human judgment
 ## Related Models
-- [ArkMaster123/grantpilot-embedding](https://huggingface.co/ArkMaster123/grantpilot-embedding) - Fine-tuned embedding model (required for similarity feature)
----
-## V2.0 Update: Foundation Grants Support (February 2026)
-### What Changed
-V2 extends the model from **federal-only (NIH/NSF)** to also support **foundation grants** (990-PF data from 37,684 private foundations). The training data grew from ~42K federal pairs to **811K combined pairs** across three sources.
-### Training Data (V2)
-| Split | Foundation | NIH | NSF | Total |
-|-------|-----------|-----|-----|-------|
-| Train | 584,802 | 51,434 | 12,638 | 648,874 |
-| Val | 73,240 | 6,445 | 1,599 | 81,284 |
-| Test | 73,022 | 6,384 | 1,588 | 80,994 |
-Data is stratified by source so each split has proportional representation.
-### V2 Performance
-| Metric | V1 (Federal Only) | V2 (Combined) | Change |
-|--------|-------------------|---------------|--------|
-| **Overall AUC-ROC** | 0.837 | **0.997** | +19.1% |
-| **Federal AUC** | 0.837 | **0.913** | +9.1% |
-| Brier Score | 0.167 | **0.014** | -91.6% |
-| Accuracy | 72.1% | **98.3%** | +26.2% |
-| Precision | 47.4% | **97.1%** | +49.7% |
-| Recall | 79.9% | **99.6%** | +19.7% |
-| F1 Score | 0.595 | **0.983** | +65.2% |
-### Federal Regression Check: PASS
-Federal-only AUC improved from 0.837 to **0.913**, well above the 0.817 minimum threshold. Adding foundation data did not degrade federal performance - it improved it.
-### Version Tags
-- `v1.0-federal-only`: Original federal-only model (NIH + NSF)
-- `v2.0-with-foundations`: Combined federal + foundation model
-### Foundation Data Source
-Foundation grant data sourced from IRS 990-PF e-filings via GivingTuesday's open dataset, covering 680,970 grants from 37,684 private foundations (2024 filing year). 88% of grants include purpose text descriptions.

   - grant-matching
   - win-probability
   - nonprofit
+  - foundation-grants
 datasets:
   - ArkMaster123/grantpilot-training-data
 language:
   - en
 ---
+# GrantPilot Win Probability Classifier V2 (Federal + Foundation)
+XGBoost classifier for predicting grant funding success. V2 extends coverage from federal-only (NIH/NSF) to also include grants from **37,684 private foundations**.
+> **See also:** [V1 (federal-only)](https://huggingface.co/ArkMaster123/grantpilot-classifier)
+## Performance
+| Metric | V1 (Federal Only) | V2 (Combined) | Change |
+|--------|-------------------|---------------|--------|
+| **Overall AUC-ROC** | 0.837 | **0.997** | +19.1% |
+| **Federal AUC** | 0.837 | **0.913** | +9.1% |
+| Brier Score | 0.167 | **0.014** | -91.6% |
+| Accuracy | 72.1% | **98.3%** | +26.2 pp |
+| Precision | 47.4% | **97.1%** | +49.7 pp |
+| Recall | 79.9% | **99.6%** | +19.7 pp |
+| F1 Score | 0.595 | **0.983** | +65.2% |
+### Federal Regression Check: PASS
+Federal-only AUC improved from 0.837 to **0.913**, well above the 0.817 minimum threshold.
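The AUC, Brier, and F1 entries in the Change column are relative changes, and the regression check is a simple threshold comparison. The sketch below reproduces both. The helper names are hypothetical, and the 0.02 tolerance is an assumption inferred from the gap between the V1 federal AUC (0.837) and the stated 0.817 minimum; the card does not say how the threshold was chosen.

```python
def relative_change_pct(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def passes_regression_check(new_auc: float,
                            baseline_auc: float = 0.837,
                            tolerance: float = 0.02) -> bool:
    """True if new_auc is no more than `tolerance` below the baseline.

    The 0.02 tolerance is an assumption (0.837 - 0.817), not a documented value.
    """
    return new_auc >= baseline_auc - tolerance


if __name__ == "__main__":
    print(f"Overall AUC change: {relative_change_pct(0.837, 0.997):+.1f}%")
    print(f"Federal AUC change: {relative_change_pct(0.837, 0.913):+.1f}%")
    print(f"Brier score change: {relative_change_pct(0.167, 0.014):+.1f}%")
    print("Federal regression check passes:", passes_regression_check(0.913))
```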
+## Important Context
+The classifier scores well on the metrics above, but the **embedding model feeding it does not**: see the [grantpilot-embedding-v2](https://huggingface.co/ArkMaster123/grantpilot-embedding-v2) benchmark results. The V2 embedding underperforms OpenAI on retrieval, whereas the V1 embedding beat OpenAI. The classifier compensates because it draws on several features beyond cosine similarity.
 ## Model Architecture
 ```
 Input Features:
+├── cosine_similarity (from grantpilot-embedding-v2)
+├── funder_type (categorical: FOUNDATION, FEDERAL)
+├── source (categorical: NIH, NSF, FOUNDATIONS)
+├── log_amount (grant amount)
+├── org_text_length
+└── grant_text_length
+→ XGBoost Classifier
+→ Isotonic Calibration
+→ Win Probability (0-100%)
 ```
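A minimal sketch of how one feature row for the architecture above might be assembled. Everything here is an assumption for illustration: the column order, the integer encoding of the categorical fields, the use of `log1p` for `log_amount`, and character counts for the text lengths are not documented in the card, and `build_features` is a hypothetical helper. Check the training code before relying on any of these choices.

```python
import math

# Assumed category orderings; the real encoding is not documented.
FUNDER_TYPES = ["FOUNDATION", "FEDERAL"]
SOURCES = ["NIH", "NSF", "FOUNDATIONS"]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_features(org_emb, grant_emb, funder_type, source,
                   amount, org_text, grant_text):
    """Return a single feature row as a list of lists (shape 1 x 6)."""
    return [[
        cosine_similarity(org_emb, grant_emb),  # cosine_similarity
        float(FUNDER_TYPES.index(funder_type)),  # funder_type (assumed encoding)
        float(SOURCES.index(source)),            # source (assumed encoding)
        math.log1p(amount),                      # log_amount (log1p is an assumption)
        float(len(org_text)),                    # org_text_length (characters)
        float(len(grant_text)),                  # grant_text_length (characters)
    ]]
```

A row built this way has the two-dimensional shape that `scaler.transform` expects, provided the column order matches the one used in training.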
+## Training Data
+| Split | Foundation | NIH | NSF | Total |
+|-------|-----------|-----|-----|-------|
+| Train | 584,802 | 51,434 | 12,638 | 648,874 |
+| Val | 73,240 | 6,445 | 1,599 | 81,284 |
+| Test | 73,022 | 6,384 | 1,588 | 80,994 |
+Foundation data is sourced from IRS 990-PF e-filings via GivingTuesday (680,970 grants, 88% with purpose text).
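As a quick consistency check, the per-source counts in the table can be summed and compared with the Total column. The snippet below only restates the table's numbers; the variable names are illustrative.

```python
# Per-split pair counts copied from the Training Data table.
SPLITS = {
    "train": {"foundation": 584_802, "nih": 51_434, "nsf": 12_638},
    "val": {"foundation": 73_240, "nih": 6_445, "nsf": 1_599},
    "test": {"foundation": 73_022, "nih": 6_384, "nsf": 1_588},
}

# Row totals, and the grand total across all three splits.
split_totals = {name: sum(counts.values()) for name, counts in SPLITS.items()}
grand_total = sum(split_totals.values())

# Share of foundation pairs in each split.
foundation_share = {
    name: counts["foundation"] / split_totals[name]
    for name, counts in SPLITS.items()
}

if __name__ == "__main__":
    for name in SPLITS:
        print(f"{name}: {split_totals[name]:,} pairs, "
              f"{foundation_share[name]:.1%} foundation")
    print(f"all splits: {grand_total:,} pairs")
```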
+## Training Details
+- **Hardware**: NVIDIA H100 80GB
+- **XGBoost**: max_depth=6, learning_rate=0.1, n_estimators=200, subsample=0.8
+- **Calibration**: Isotonic regression on validation set
+- **Batch Size**: 256 for embedding feature computation
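To illustrate what isotonic calibration does, here is a dependency-free sketch of the pool-adjacent-violators algorithm, which fits a non-decreasing mapping from raw scores to observed outcome rates. This is not the project's calibration code: the released `isotonic_calibrator.pkl` is presumably a fitted library object (for example scikit-learn's `IsotonicRegression`, which is an assumption), and `pav_fit` is a hypothetical name.

```python
def pav_fit(scores, labels):
    """Fit a non-decreasing step function to (score, label) pairs.

    Returns (sorted_scores, fitted_values), where fitted_values[i] is the
    calibrated probability assigned to sorted_scores[i].
    """
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum_of_labels, count]
    for _, label in pairs:
        blocks.append([float(label), 1])
        # Merge backwards while the previous block's mean exceeds the
        # current block's mean, which would violate monotonicity.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            label_sum, count = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
    fitted = []
    for label_sum, count in blocks:
        fitted.extend([label_sum / count] * count)
    return [score for score, _ in pairs], fitted


if __name__ == "__main__":
    # Toy validation set: raw classifier scores and win/loss outcomes.
    xs, ys = pav_fit([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
    print(list(zip(xs, ys)))
```

In the toy example the middle two points violate monotonicity (a win at 0.2, a loss at 0.3), so they are pooled into a single block with a calibrated value of 0.5.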
 ## Usage
 ```python
 from huggingface_hub import hf_hub_download
 # Download model files
+model_path = hf_hub_download("ArkMaster123/grantpilot-classifier-v2", "xgboost_model.json")
+scaler_path = hf_hub_download("ArkMaster123/grantpilot-classifier-v2", "scaler.pkl")
+calibrator_path = hf_hub_download("ArkMaster123/grantpilot-classifier-v2", "isotonic_calibrator.pkl")
+# Load
 model = xgb.Booster()
 model.load_model(model_path)
 with open(scaler_path, "rb") as f:
     scaler = pickle.load(f)
 with open(calibrator_path, "rb") as f:
     calibrator = pickle.load(f)
+# Predict
 features_scaled = scaler.transform(features)
 dmatrix = xgb.DMatrix(features_scaled)
 raw_pred = model.predict(dmatrix)
 win_probability = calibrator.predict(raw_pred) * 100
 ```
 ## Related Models
+| Model | Description |
+|-------|-------------|
+| [grantpilot-embedding-v2](https://huggingface.co/ArkMaster123/grantpilot-embedding-v2) | V2 embedding (required for cosine_similarity feature) |
+| [grantpilot-embedding](https://huggingface.co/ArkMaster123/grantpilot-embedding) | V1 embedding (federal-only; beats OpenAI on retrieval) |
+| [grantpilot-classifier](https://huggingface.co/ArkMaster123/grantpilot-classifier) | V1 classifier (federal-only; AUC 0.837) |
+| [grantpilot-training-data](https://huggingface.co/datasets/ArkMaster123/grantpilot-training-data) | Training data (V1 at training/, V2 at training_v2/) |