---
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- grant-matching
- nonprofit
- foundation-grants
base_model: Qwen/Qwen3-Embedding-0.6B
datasets:
- ArkMaster123/grantpilot-training-data
language:
- en
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# GrantPilot Embedding V2 (Federal + Foundation)

Fine-tuned [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) for grant-organization semantic matching. V2 extends coverage from federal-only (NIH/NSF) to include **37,684 private foundations**.

> **See also:** [V1 (federal-only)](https://huggingface.co/ArkMaster123/grantpilot-embedding), which outperforms OpenAI on federal grant retrieval.

## Embedding Benchmark Results

Benchmarked on 998 test pairs (901 foundation, 78 NIH, 19 NSF) using retrieval and classification metrics.

### Retrieval Quality

| Model | Dim | R@1 | R@5 | R@10 | MRR | NDCG@10 |
|-------|-----|-----|-----|------|-----|---------|
| OpenAI text-embedding-3-small | 1536 | **0.343** | **0.570** | **0.682** | **0.453** | **0.499** |
| Qwen3-Embedding-0.6B (base) | 1024 | 0.295 | 0.514 | 0.630 | 0.403 | 0.449 |
| **GrantPilot V2 (this model)** | 1024 | 0.295 | 0.516 | 0.622 | 0.403 | 0.446 |

**Verdict: OpenAI wins on retrieval.** The fine-tuned V2 embedding performs on par with the base Qwen3 model: fine-tuning did not meaningfully improve retrieval on this mixed dataset. V1 (federal-only) significantly outperformed OpenAI on federal retrieval, but adding 90% foundation data diluted that specialization.
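The retrieval metrics above (R@k and MRR) can be reproduced from a query-by-candidate similarity matrix. A minimal NumPy sketch on toy data, not the actual benchmark:

```python
import numpy as np

def retrieval_metrics(sim, k=5):
    """Compute Recall@1, Recall@k, and MRR from a similarity matrix
    where the correct candidate for query i is column i."""
    ranks = []
    for i, row in enumerate(sim):
        # 1-based rank of the true candidate: count of strictly higher scores.
        ranks.append(1 + int(np.sum(row > row[i])))
    ranks = np.array(ranks)
    return {
        "R@1": float(np.mean(ranks == 1)),
        f"R@{k}": float(np.mean(ranks <= k)),
        "MRR": float(np.mean(1.0 / ranks)),
    }

# Toy 3x3 matrix: queries 0 and 2 rank their match first, query 1 second.
sim = np.array([
    [0.9, 0.2, 0.1],
    [0.8, 0.6, 0.1],
    [0.1, 0.2, 0.7],
])
print(retrieval_metrics(sim, k=5))
```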

### AUC as a Classifier Feature

| Model | Overall AUC | Foundation AUC | NIH AUC | NSF AUC |
|-------|-------------|----------------|---------|---------|
| OpenAI text-embedding-3-small | **0.886** | **0.972** | 0.473 | 0.524 |
| Qwen3-Embedding-0.6B (base) | 0.881 | 0.965 | 0.611 | 0.548 |
| **GrantPilot V2 (this model)** | 0.881 | 0.965 | **0.614** | 0.548 |

Notably, OpenAI has the best overall AUC but the **worst federal AUC** (0.473 on NIH, worse than random). Our fine-tuned model is the strongest on federal grants.
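Here, AUC measures how well a similarity score separates matching from non-matching org-grant pairs. A self-contained sketch of rank-based AUC (the Mann-Whitney U formulation) on toy scores, not the benchmark data:

```python
import numpy as np

def auc_from_scores(scores, labels):
    """ROC AUC via the Mann-Whitney U statistic: the probability that a
    random positive pair scores higher than a random negative pair
    (ties count as 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score against every negative score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy cosine similarities for 3 matching and 3 non-matching pairs.
scores = [0.82, 0.75, 0.40, 0.60, 0.30, 0.20]
labels = [1, 1, 1, 0, 0, 0]
print(auc_from_scores(scores, labels))
```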

### Inference Latency

| Model | Avg Latency | Cost |
|-------|-------------|------|
| OpenAI text-embedding-3-small | 43.9 ms | API cost |
| Qwen3-Embedding-0.6B (base) | 2.9 ms | Free (self-hosted) |
| **GrantPilot V2 (this model)** | **1.7 ms** | Free (self-hosted) |

**Over 25x faster than OpenAI** with zero API cost.
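Average latencies like these are typically measured by timing repeated calls after a warmup. A sketch of such a harness, with a stand-in workload where the real `model.encode` call would go:

```python
import time
import statistics

def avg_latency_ms(fn, warmup=3, runs=20):
    """Average wall-clock latency of fn() in milliseconds, after a few
    warmup calls to exclude one-time setup cost from the measurement."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(samples)

# Stand-in workload; substitute e.g. lambda: model.encode(text) when measuring.
latency = avg_latency_ms(lambda: sum(range(10_000)))
print(f"{latency:.3f} ms")
```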

### Comparison with V1

| Metric | V1 vs OpenAI | V2 vs OpenAI |
|--------|--------------|--------------|
| R@1 | **V1 wins (+46%)** | OpenAI wins |
| R@5 | **V1 wins (+22%)** | OpenAI wins |
| R@10 | **V1 wins (+28%)** | OpenAI wins |

V1 beat OpenAI decisively on federal grants. V2 lost that edge because its training set is 90% foundation data.

## Why Use This Model?

The embedding alone is not the star; the **XGBoost classifier built on top** is where the real value comes from:

| Classifier Metric | V1 | V2 |
|-------------------|-----|-----|
| Overall AUC | 0.837 | **0.997** |
| Federal AUC | 0.837 | **0.913** |
| Accuracy | 72.1% | **98.3%** |
| F1 | 0.595 | **0.983** |

See: [grantpilot-classifier-v2](https://huggingface.co/ArkMaster123/grantpilot-classifier-v2)

## Training Details

- **Hardware**: NVIDIA H100 80GB
- **Training Steps**: 1,000 (LoRA fine-tuning)
- **Training Pairs**: 324,479 positive pairs
- **LoRA Config**: r=16, alpha=32, target modules = q/k/v/o projections
- **Batch Size**: 32 (x4 gradient accumulation = 128 effective)
- **Learning Rate**: 2e-5
- **Final Val Loss**: 0.1458
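The hyperparameters above, collected in one place as config dicts. This is an illustrative sketch, not the published training script; the key names follow common peft/transformers conventions and are assumptions:

```python
# LoRA adapter settings from the table above (illustrative key names).
lora_config = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

# Optimizer / schedule settings from the table above.
train_config = {
    "max_steps": 1_000,
    "per_device_train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-5,
}

# Effective batch size = per-device batch x gradient accumulation steps.
effective_batch = (train_config["per_device_train_batch_size"]
                   * train_config["gradient_accumulation_steps"])
print(effective_batch)  # 128
```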

### Training Data Composition

| Source | Pairs | % |
|--------|-------|---|
| Foundation (990-PF) | 292,401 | 90.1% |
| NIH | 25,717 | 7.9% |
| NSF | 6,361 | 2.0% |

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ArkMaster123/grantpilot-embedding-v2", trust_remote_code=True)

org_text = "Organization: Ford Foundation\nLocation: New York, NY\nType: FOUNDATION"
grant_text = "Grant: Support for civil society organizations\nAmount: $500,000"

# Normalize so the dot product equals cosine similarity.
embeddings = model.encode([org_text, grant_text], normalize_embeddings=True)
similarity = embeddings[0] @ embeddings[1]
```
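Scaled up, matching reduces to encoding one organization and many candidate grants, then sorting by cosine similarity. A sketch with stubbed unit vectors standing in for `model.encode(..., normalize_embeddings=True)` output; the vectors and grant names below are hypothetical:

```python
import numpy as np

def rank_grants(org_vec, grant_vecs, grant_names, top_k=2):
    """Rank candidate grants for one organization by cosine similarity,
    assuming all embeddings are already L2-normalized."""
    scores = grant_vecs @ org_vec                 # cosine similarity per grant
    order = np.argsort(scores)[::-1][:top_k]      # highest scores first
    return [(grant_names[i], float(scores[i])) for i in order]

# Stubbed unit vectors in place of real 1024-dim embeddings.
org_vec = np.array([1.0, 0.0, 0.0])
grant_vecs = np.array([
    [0.8, 0.6, 0.0],   # close match
    [0.0, 1.0, 0.0],   # unrelated
    [0.6, 0.8, 0.0],   # partial match
])
names = ["Civil society support", "Marine biology fellowship", "Community arts fund"]
print(rank_grants(org_vec, grant_vecs, names))
```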

## Related Models

| Model | Description |
|-------|-------------|
| [grantpilot-embedding](https://huggingface.co/ArkMaster123/grantpilot-embedding) | V1: federal-only, beats OpenAI on retrieval |
| [grantpilot-classifier](https://huggingface.co/ArkMaster123/grantpilot-classifier) | V1: federal-only classifier (AUC 0.837) |
| [grantpilot-classifier-v2](https://huggingface.co/ArkMaster123/grantpilot-classifier-v2) | V2: combined classifier (AUC 0.997) |
| [grantpilot-training-data](https://huggingface.co/datasets/ArkMaster123/grantpilot-training-data) | Training data (V1 at training/, V2 at training_v2/) |