---
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- grant-matching
- nonprofit
- foundation-grants
base_model: Qwen/Qwen3-Embedding-0.6B
datasets:
- ArkMaster123/grantpilot-training-data
language:
- en
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---
# GrantPilot Embedding V2 (Federal + Foundation)
Fine-tuned [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) for grant-organization semantic matching. V2 extends coverage from federal-only (NIH/NSF) to include **37,684 private foundations**.
> **See also:** [V1 (federal-only)](https://huggingface.co/ArkMaster123/grantpilot-embedding) which outperforms OpenAI on federal grant retrieval.
## Embedding Benchmark Results
Benchmarked on 998 test pairs (901 foundation, 78 NIH, 19 NSF) using retrieval and classification metrics.
### Retrieval Quality
| Model | Dim | R@1 | R@5 | R@10 | MRR | NDCG@10 |
|-------|-----|-----|-----|------|-----|---------|
| OpenAI text-embedding-3-small | 1536 | **0.343** | **0.570** | **0.682** | **0.453** | **0.499** |
| Qwen3-Embedding-0.6B (base) | 1024 | 0.295 | 0.514 | 0.630 | 0.403 | 0.449 |
| **GrantPilot V2 (this model)** | 1024 | 0.295 | 0.516 | 0.622 | 0.403 | 0.446 |
**Verdict: OpenAI wins on retrieval.** The fine-tuned V2 embedding performs on par with the base Qwen3 model; fine-tuning did not meaningfully improve retrieval on this mixed dataset. V1 (federal-only) significantly outperformed OpenAI on federal retrieval, but adding 90% foundation data diluted that specialization.
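For reproducibility, these metrics can be recomputed from a query-by-document similarity matrix. A minimal sketch (assuming, as in the paired test set, exactly one relevant grant per organization; the actual evaluation script is not published here):

```python
import numpy as np

def retrieval_metrics(sim, k_values=(1, 5, 10)):
    """Recall@k, MRR, and NDCG@10 for a query-by-doc similarity matrix
    where the relevant doc for query i is doc i (one positive per query,
    as in a paired test set)."""
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)                      # docs ranked best-first
    ranks = np.array([int(np.where(order[i] == i)[0][0])  # rank of the true doc
                      for i in range(n)])
    metrics = {f"R@{k}": float(np.mean(ranks < k)) for k in k_values}
    metrics["MRR"] = float(np.mean(1.0 / (ranks + 1)))
    # With a single relevant doc the ideal DCG is 1, so NDCG@10 reduces to:
    metrics["NDCG@10"] = float(np.mean(np.where(ranks < 10,
                                                1.0 / np.log2(ranks + 2), 0.0)))
    return metrics

# Toy check: every query ranks its own doc first, so all metrics are 1.0
sim = np.array([[0.9, 0.2, 0.1],
                [0.3, 0.8, 0.4],
                [0.2, 0.1, 0.7]])
print(retrieval_metrics(sim))
```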
### AUC as Classifier Feature
| Model | Overall AUC | Foundation AUC | NIH AUC | NSF AUC |
|-------|-------------|----------------|---------|---------|
| OpenAI text-embedding-3-small | **0.886** | **0.972** | 0.473 | 0.524 |
| Qwen3-Embedding-0.6B (base) | 0.881 | 0.965 | 0.611 | 0.548 |
| **GrantPilot V2 (this model)** | 0.881 | 0.965 | **0.614** | 0.548 |
Notably, OpenAI has the best overall AUC but the **worst federal AUC** (0.473 on NIH, worse than random). Our fine-tuned model is the strongest on federal grants.
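The AUC here treats embedding similarity as a score for separating matched from unmatched pairs, computed overall and per grant source. A hedged sketch with scikit-learn; the scores, labels, and `source` array below are illustrative toy data, not drawn from the test set:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data: similarity scores for candidate (org, grant) pairs,
# binary match labels, and the grant's source for subsetting.
scores = np.array([0.82, 0.31, 0.77, 0.45, 0.90, 0.12, 0.66, 0.40])
labels = np.array([1,    0,    1,    0,    1,    0,    1,    0])
source = np.array(["foundation"] * 4 + ["nih"] * 4)

print("Overall AUC:", roc_auc_score(labels, scores))
for s in ("foundation", "nih"):
    mask = source == s
    print(f"{s} AUC:", roc_auc_score(labels[mask], scores[mask]))
```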
### Inference Latency
| Model | Avg Latency | Cost |
|-------|-------------|------|
| OpenAI text-embedding-3-small | 43.9ms | API cost |
| Qwen3-Embedding-0.6B (base) | 2.9ms | Free (self-hosted) |
| **GrantPilot V2 (this model)** | **1.7ms** | Free (self-hosted) |
**25x faster than OpenAI** with zero API cost.
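Latency figures like these can be reproduced with a simple wall-clock timing harness. A sketch; the warmup and run counts are arbitrary choices, and `sum` stands in for `model.encode` so the snippet runs without downloading the model:

```python
import time

def avg_latency_ms(fn, *args, warmup=5, runs=100):
    """Average wall-clock latency of fn(*args) in milliseconds,
    after a few warmup calls to exclude one-time setup cost."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) / runs * 1000.0

# With the real model this would be:
#   model = SentenceTransformer("ArkMaster123/grantpilot-embedding-v2")
#   avg_latency_ms(model.encode, ["some grant text"])
print(avg_latency_ms(sum, list(range(1000))))
```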
### Comparison with V1
| Metric | V1 vs OpenAI | V2 vs OpenAI |
|--------|-------------|-------------|
| R@1 | **V1 wins (+46%)** | OpenAI wins |
| R@5 | **V1 wins (+22%)** | OpenAI wins |
| R@10 | **V1 wins (+28%)** | OpenAI wins |
V1 beat OpenAI decisively on federal grants. V2 lost that edge by training on a dataset that is 90% foundation data.
## Why Use This Model?
The embedding alone is not the star; the **XGBoost classifier built on top** is where the real value comes from:
| Classifier Metric | V1 | V2 |
|-------------------|----|----|
| Overall AUC | 0.837 | **0.997** |
| Federal AUC | 0.837 | **0.913** |
| Accuracy | 72.1% | **98.3%** |
| F1 | 0.595 | **0.983** |
See: [grantpilot-classifier-v2](https://huggingface.co/ArkMaster123/grantpilot-classifier-v2)
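To illustrate how a classifier can sit on top of pair embeddings, here is a hedged sketch. The feature construction (concatenating the two embeddings plus their cosine similarity) is an assumption, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost; the actual V2 features and model live in the grantpilot-classifier-v2 repo:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Hypothetical feature construction: concatenate org and grant embeddings
# and append their cosine similarity as a scalar feature.
def pair_features(org_emb, grant_emb):
    cos = np.sum(org_emb * grant_emb, axis=1, keepdims=True) / (
        np.linalg.norm(org_emb, axis=1, keepdims=True)
        * np.linalg.norm(grant_emb, axis=1, keepdims=True))
    return np.hstack([org_emb, grant_emb, cos])

dim = 16  # toy dimension; the real embeddings are 1024-d
org = rng.normal(size=(200, dim))
grant = rng.normal(size=(200, dim))
labels = rng.integers(0, 2, size=200)

X = pair_features(org, grant)
clf = GradientBoostingClassifier().fit(X, labels)  # XGBoost in production
print("train accuracy:", clf.score(X, labels))
```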
## Training Details
- **Hardware**: NVIDIA H100 80GB
- **Training Steps**: 1,000 (LoRA fine-tuning)
- **Training Pairs**: 324,479 positive pairs
- **LoRA Config**: r=16, alpha=32, target=q/k/v/o projections
- **Batch Size**: 32 (x4 gradient accumulation = 128 effective)
- **Learning Rate**: 2e-5
- **Final Val Loss**: 0.1458
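The LoRA setup above corresponds roughly to the following `peft` configuration. This is a sketch, not the published training script; the dropout value is an assumption not stated in the card:

```python
from peft import LoraConfig

# Illustrative reconstruction of the LoRA config listed above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,  # assumption: not stated in the model card
    task_type="FEATURE_EXTRACTION",
)
```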
### Training Data Composition
| Source | Pairs | % |
|--------|-------|---|
| Foundation (990-PF) | 292,401 | 90.1% |
| NIH | 25,717 | 7.9% |
| NSF | 6,361 | 2.0% |
## Usage
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ArkMaster123/grantpilot-embedding-v2", trust_remote_code=True)

org_text = "Organization: Ford Foundation\nLocation: New York, NY\nType: FOUNDATION"
grant_text = "Grant: Support for civil society organizations\nAmount: $500,000"

# Normalize so the dot product below is the cosine similarity
embeddings = model.encode([org_text, grant_text], normalize_embeddings=True)
similarity = embeddings[0] @ embeddings[1]
```
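To match one organization against a pool of grants, encode everything once and rank by cosine similarity. A sketch using random stand-in vectors so it runs without the model download; with the real model, the vectors come from `model.encode(..., normalize_embeddings=True)`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in unit vectors in place of real org/grant embeddings (1024-d)
org_emb = rng.normal(size=1024)
grant_embs = rng.normal(size=(5, 1024))
org_emb /= np.linalg.norm(org_emb)
grant_embs /= np.linalg.norm(grant_embs, axis=1, keepdims=True)

scores = grant_embs @ org_emb   # cosine similarities
ranking = np.argsort(-scores)   # best-matching grant first
print(ranking, scores[ranking])
```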
## Related Models
| Model | Description |
|-------|-------------|
| [grantpilot-embedding](https://huggingface.co/ArkMaster123/grantpilot-embedding) | V1: federal-only, beats OpenAI on retrieval |
| [grantpilot-classifier](https://huggingface.co/ArkMaster123/grantpilot-classifier) | V1: federal-only classifier (AUC 0.837) |
| [grantpilot-classifier-v2](https://huggingface.co/ArkMaster123/grantpilot-classifier-v2) | V2: combined classifier (AUC 0.997) |
| [grantpilot-training-data](https://huggingface.co/datasets/ArkMaster123/grantpilot-training-data) | Training data (V1 at training/, V2 at training_v2/) |