---
license: apache-2.0
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - grant-matching
  - nonprofit
  - foundation-grants
base_model: Qwen/Qwen3-Embedding-0.6B
datasets:
  - ArkMaster123/grantpilot-training-data
language:
  - en
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# GrantPilot Embedding V2 (Federal + Foundation)

Fine-tuned Qwen3-Embedding-0.6B for grant-organization semantic matching. V2 extends coverage from federal-only (NIH/NSF) to include 37,684 private foundations.

See also: V1 (federal-only), which outperforms OpenAI on federal grant retrieval.

## Embedding Benchmark Results

Benchmarked on 998 test pairs (901 foundation, 78 NIH, 19 NSF) using retrieval and classification metrics.
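The evaluation script is not included in this repo, but the per-query retrieval metrics can be reproduced with a sketch like the following, assuming each test pair has exactly one relevant grant:

```python
import math

def retrieval_metrics(ranked_ids, relevant_id, ks=(1, 5, 10)):
    """Metrics for one query with a single relevant item (as in the 998 test pairs)."""
    try:
        rank = ranked_ids.index(relevant_id) + 1  # 1-based rank of the true grant
    except ValueError:
        rank = None  # relevant grant not retrieved at all
    metrics = {f"R@{k}": float(rank is not None and rank <= k) for k in ks}
    metrics["MRR"] = 1.0 / rank if rank else 0.0
    # With a single relevant item, NDCG@10 reduces to 1/log2(rank + 1) for rank <= 10
    metrics["NDCG@10"] = 1.0 / math.log2(rank + 1) if rank and rank <= 10 else 0.0
    return metrics
```

Averaging these per-query values over the test set yields the table below.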

### Retrieval Quality

| Model | Dim | R@1 | R@5 | R@10 | MRR | NDCG@10 |
|---|---|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 0.343 | 0.570 | 0.682 | 0.453 | 0.499 |
| Qwen3-Embedding-0.6B (base) | 1024 | 0.295 | 0.514 | 0.630 | 0.403 | 0.449 |
| GrantPilot V2 (this model) | 1024 | 0.295 | 0.516 | 0.622 | 0.403 | 0.446 |

**Verdict:** OpenAI wins on retrieval. The fine-tuned V2 embedding performs on par with the base Qwen3 model; fine-tuning did not meaningfully improve retrieval on this mixed dataset. V1 (federal-only) significantly outperformed OpenAI on federal retrieval, but adding 90% foundation data diluted that specialization.

### AUC as Classifier Feature

| Model | Overall AUC | Foundation AUC | NIH AUC | NSF AUC |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 0.886 | 0.972 | 0.473 | 0.524 |
| Qwen3-Embedding-0.6B (base) | 0.881 | 0.965 | 0.611 | 0.548 |
| GrantPilot V2 (this model) | 0.881 | 0.965 | 0.614 | 0.548 |

**Interesting:** OpenAI has the best overall AUC but the worst federal AUC (0.473 on NIH, worse than random). Our fine-tuned model is best on federal grants.
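The per-subset AUC values can be computed by restricting positive/negative pairs to each funder type and scoring them with cosine similarity. A minimal, dependency-free sketch using the Mann-Whitney formulation of AUC (the actual evaluation code is not shown in this repo):

```python
def auc(pos_scores, neg_scores):
    """AUC via the Mann-Whitney U statistic: P(positive pair scores above negative pair)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))
```

An AUC of 0.5 means the similarity score carries no signal for that subset; below 0.5 (as with OpenAI on NIH) means the score is anti-correlated with the label.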

### Inference Latency

| Model | Avg Latency | Cost |
|---|---|---|
| OpenAI text-embedding-3-small | 43.9ms | API cost |
| Qwen3-Embedding-0.6B (base) | 2.9ms | Free (self-hosted) |
| GrantPilot V2 (this model) | 1.7ms | Free (self-hosted) |

25x faster than OpenAI with zero API cost.
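Latency numbers like these can be reproduced with a simple warmup-then-average timing loop. A sketch (the exact benchmarking harness, batch size, and hardware settings are assumptions; pass in e.g. `lambda: model.encode([text])`):

```python
import time

def mean_latency_ms(fn, n_warmup=3, n_runs=20):
    """Average wall-clock latency of fn() in milliseconds, after warmup."""
    for _ in range(n_warmup):
        fn()  # warmup runs excluded from timing (caches, lazy init)
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs * 1000.0
```

Warmup matters particularly for self-hosted models, where the first call pays for weight loading and kernel compilation.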

## Comparison with V1

| Metric | V1 vs OpenAI | V2 vs OpenAI |
|---|---|---|
| R@1 | V1 wins (+46%) | OpenAI wins |
| R@5 | V1 wins (+22%) | OpenAI wins |
| R@10 | V1 wins (+28%) | OpenAI wins |

V1 beat OpenAI decisively on federal grants. V2 lost that edge by training on a dataset that is 90% foundation data.

## Why Use This Model?

The embedding alone is not the star; the real value comes from the XGBoost classifier built on top:

| Classifier Metric | V1 | V2 |
|---|---|---|
| Overall AUC | 0.837 | 0.997 |
| Federal AUC | 0.837 | 0.913 |
| Accuracy | 72.1% | 98.3% |
| F1 | 0.595 | 0.983 |

See: grantpilot-classifier-v2

## Training Details

- **Hardware:** NVIDIA H100 80GB
- **Training Steps:** 1,000 (LoRA fine-tuning)
- **Training Pairs:** 324,479 positive pairs
- **LoRA Config:** r=16, alpha=32, target=q/k/v/o projections
- **Batch Size:** 32 (×4 gradient accumulation = 128 effective)
- **Learning Rate:** 2e-5
- **Final Val Loss:** 0.1458
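The LoRA settings above correspond to a `peft` configuration along these lines (a sketch only; the actual training script is not included here, and the `q_proj`/`k_proj`/`v_proj`/`o_proj` module names assume the standard Qwen3 attention projection naming):

```python
from peft import LoraConfig, TaskType

# LoRA hyperparameters from the list above: r=16, alpha=32,
# applied to the attention query/key/value/output projections.
lora_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```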

### Training Data Composition

| Source | Pairs | % |
|---|---|---|
| Foundation (990-PF) | 292,401 | 90.1% |
| NIH | 25,717 | 7.9% |
| NSF | 6,361 | 2.0% |

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ArkMaster123/grantpilot-embedding-v2", trust_remote_code=True)

org_text = "Organization: Ford Foundation\nLocation: New York, NY\nType: FOUNDATION"
grant_text = "Grant: Support for civil society organizations\nAmount: $500,000"

# Normalize so the dot product below is cosine similarity
embeddings = model.encode([org_text, grant_text], normalize_embeddings=True)
similarity = embeddings[0] @ embeddings[1]
```
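For retrieval over many grants, you would typically encode the grant texts once and rank them against each organization embedding. A NumPy-only sketch (the `rank_grants` helper is hypothetical, not part of this repo; it expects arrays such as those returned by `model.encode`):

```python
import numpy as np

def rank_grants(org_emb, grant_embs):
    """Rank grants by cosine similarity to a single organization embedding."""
    org = org_emb / np.linalg.norm(org_emb)
    grants = grant_embs / np.linalg.norm(grant_embs, axis=1, keepdims=True)
    scores = grants @ org               # cosine similarity per grant
    order = np.argsort(-scores)         # best match first
    return order, scores[order]
```

The indices in `order` map back into whatever list of grant texts was encoded, so the top entries are the best-matching grants for the organization.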

## Related Models

| Model | Description |
|---|---|
| grantpilot-embedding | V1: federal-only, beats OpenAI on retrieval |
| grantpilot-classifier | V1: federal-only classifier (AUC 0.837) |
| grantpilot-classifier-v2 | V2: combined classifier (AUC 0.997) |
| grantpilot-training-data | Training data (V1 at `training/`, V2 at `training_v2/`) |