---
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- grant-matching
- nonprofit
- foundation-grants
base_model: Qwen/Qwen3-Embedding-0.6B
datasets:
- ArkMaster123/grantpilot-training-data
language:
- en
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# GrantPilot Embedding V2 (Federal + Foundation)

Fine-tuned [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) for grant-organization semantic matching. V2 extends coverage from federal-only (NIH/NSF) to include **37,684 private foundations**.

> **See also:** [V1 (federal-only)](https://huggingface.co/ArkMaster123/grantpilot-embedding), which outperforms OpenAI on federal grant retrieval.

## Embedding Benchmark Results

Benchmarked on 998 test pairs (901 foundation, 78 NIH, 19 NSF) using retrieval and classification metrics.

### Retrieval Quality

| Model | Dim | R@1 | R@5 | R@10 | MRR | NDCG@10 |
|-------|-----|-----|-----|------|-----|---------|
| OpenAI text-embedding-3-small | 1536 | **0.343** | **0.570** | **0.682** | **0.453** | **0.499** |
| Qwen3-Embedding-0.6B (base) | 1024 | 0.295 | 0.514 | 0.630 | 0.403 | 0.449 |
| **GrantPilot V2 (this model)** | 1024 | 0.295 | 0.516 | 0.622 | 0.403 | 0.446 |

**Verdict: OpenAI wins on retrieval.** The fine-tuned V2 embedding performs on par with the base Qwen3 model; fine-tuning did not meaningfully improve retrieval on this mixed dataset. V1 (federal-only) significantly outperformed OpenAI on federal retrieval, but adding 90% foundation data diluted that specialization.
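The retrieval numbers above come from ranking candidates by embedding similarity. As a reference point, a minimal numpy sketch of how Recall@k and MRR are typically computed from a query-by-candidate similarity matrix (toy data, not the actual benchmark harness):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute Recall@k and MRR, assuming the correct candidate
    for query i is candidate i (diagonal ground truth)."""
    n = sim.shape[0]
    # Sort candidates by descending similarity for each query
    order = np.argsort(-sim, axis=1)
    # Rank of the true candidate for each query (0 = top hit)
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(n)])
    recall = {f"R@{k}": float(np.mean(ranks < k)) for k in ks}
    mrr = float(np.mean(1.0 / (ranks + 1)))
    return recall, mrr

# Toy example: 3 queries, correct candidate on the diagonal
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.6]])
recall, mrr = retrieval_metrics(sim, ks=(1, 2))
```

NDCG@10 additionally discounts hits logarithmically by rank; with a single relevant candidate per query it reduces to `1 / log2(rank + 2)` averaged over queries.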
### AUC as Classifier Feature

| Model | Overall AUC | Foundation AUC | NIH AUC | NSF AUC |
|-------|-------------|----------------|---------|---------|
| OpenAI text-embedding-3-small | **0.886** | **0.972** | 0.473 | 0.524 |
| Qwen3-Embedding-0.6B (base) | 0.881 | 0.965 | 0.611 | 0.548 |
| **GrantPilot V2 (this model)** | 0.881 | 0.965 | **0.614** | 0.548 |

Notably, OpenAI has the best overall AUC but the **worst federal AUC** (0.473 on NIH, worse than random). Our fine-tuned model is the strongest on federal grants.

### Inference Latency

| Model | Avg Latency | Cost |
|-------|-------------|------|
| OpenAI text-embedding-3-small | 43.9ms | API cost |
| Qwen3-Embedding-0.6B (base) | 2.9ms | Free (self-hosted) |
| **GrantPilot V2 (this model)** | **1.7ms** | Free (self-hosted) |

**25x faster than OpenAI** with zero API cost.

### Comparison with V1

| Metric | V1 vs OpenAI | V2 vs OpenAI |
|--------|-------------|-------------|
| R@1 | **V1 wins (+46%)** | OpenAI wins |
| R@5 | **V1 wins (+22%)** | OpenAI wins |
| R@10 | **V1 wins (+28%)** | OpenAI wins |

V1 beat OpenAI decisively on federal grants. V2 lost that edge by training on a dataset that is 90% foundation data.

## Why Use This Model?
The embedding alone is not the star; the real value comes from the **XGBoost classifier built on top**:

| Classifier Metric | V1 | V2 |
|-------------------|----|----|
| Overall AUC | 0.837 | **0.997** |
| Federal AUC | 0.837 | **0.913** |
| Accuracy | 72.1% | **98.3%** |
| F1 | 0.595 | **0.983** |

See: [grantpilot-classifier-v2](https://huggingface.co/ArkMaster123/grantpilot-classifier-v2)

## Training Details

- **Hardware**: NVIDIA H100 80GB
- **Training Steps**: 1,000 (LoRA fine-tuning)
- **Training Pairs**: 324,479 positive pairs
- **LoRA Config**: r=16, alpha=32, target=q/k/v/o projections
- **Batch Size**: 32 (x4 gradient accumulation = 128 effective)
- **Learning Rate**: 2e-5
- **Final Val Loss**: 0.1458

### Training Data Composition

| Source | Pairs | % |
|--------|-------|---|
| Foundation (990-PF) | 292,401 | 90.1% |
| NIH | 25,717 | 7.9% |
| NSF | 6,361 | 2.0% |

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ArkMaster123/grantpilot-embedding-v2", trust_remote_code=True)

org_text = "Organization: Ford Foundation\nLocation: New York, NY\nType: FOUNDATION"
grant_text = "Grant: Support for civil society organizations\nAmount: $500,000"

# Normalize so the dot product below equals cosine similarity
embeddings = model.encode([org_text, grant_text], normalize_embeddings=True)
similarity = embeddings[0] @ embeddings[1]
```

## Related Models

| Model | Description |
|-------|-------------|
| [grantpilot-embedding](https://huggingface.co/ArkMaster123/grantpilot-embedding) | V1 — federal-only, beats OpenAI on retrieval |
| [grantpilot-classifier](https://huggingface.co/ArkMaster123/grantpilot-classifier) | V1 — federal-only classifier (AUC 0.837) |
| [grantpilot-classifier-v2](https://huggingface.co/ArkMaster123/grantpilot-classifier-v2) | V2 — combined classifier (AUC 0.997) |
| [grantpilot-training-data](https://huggingface.co/datasets/ArkMaster123/grantpilot-training-data) | Training data (V1 at training/, V2 at training_v2/) |
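For anyone reproducing the fine-tune, the LoRA hyperparameters listed under Training Details above map onto a `peft` configuration. This is a sketch under the assumption that the training script used the standard `peft` library and Qwen3's attention projection module names; it is not the exact training code:

```python
from peft import LoraConfig

# Mirrors the hyperparameters listed under Training Details.
# Module names assume Qwen3's attention projection naming (an assumption).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="FEATURE_EXTRACTION",
)
```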