ArkMaster123 committed
Commit 661d4eb · verified · 1 parent: 33709a2

Update README with honest benchmark results and V1 comparison

Files changed (1): README.md +82 -656

README.md CHANGED
@@ -1,695 +1,121 @@
  ---
  tags:
- - sentence-transformers
- - sentence-similarity
- - feature-extraction
- - dense
- - generated_from_trainer
- - dataset_size:324479
- - loss:MultipleNegativesRankingLoss
  base_model: Qwen/Qwen3-Embedding-0.6B
- widget:
- - source_sentence: 'Organization: PKF O''CONNOR DAVIES ADVISORY LLC
-
- Location: NEW YORK, NY
-
- Type: FOUNDATION'
- sentences:
- - 'Grant: Grant to OCEANA INC
-
- Funder: PKF O''CONNOR DAVIES ADVISORY LLC (FOUNDATION)
-
- Amount: $150,000
-
- Description: Purpose: TO SUPPORT OCEANA''S WORK IN THE UK
-
- Recipient Location: WASHINGTON, DC
-
- Recipient Type: Public Charity
-
- Amount: $150,000'
- - 'Grant: Grant to RAINFOREST FOUNDATION INC
-
- Funder: BPM LLP (FOUNDATION)
-
- Amount: $100,000
-
- Description: Purpose: RAPID RESPONSE ADDRESSING THE NEEDS OF COMMUNITIES AFFECTED
- BY THE FIRES IN BELIZE.
-
- Recipient Location: BROOKLYN, NY
-
- Recipient Type: Public Charity
-
- Amount: $100,000'
- - 'Grant: Grant to ALONDRA ALVAREZ MURILLO
-
- Funder: REDWITZ INC (FOUNDATION)
-
- Amount: $300
-
- Description: Purpose: TEACHER GRATITUDE GRANT
-
- Recipient Location: EL CERRITO, CA
-
- Recipient Type: EDUCATIONAL INSTITUT
-
- Amount: $300'
- - source_sentence: 'Organization: Forvis Mazars LLP
-
- Location: Asheville, NC
-
- Type: FOUNDATION'
- sentences:
- - 'Grant: Grant to Globe Santa - The Boston Globe Foundation
-
- Funder: Forvis Mazars LLP (FOUNDATION)
-
- Amount: $2,000
-
- Description: Purpose: To provide general support
-
- Recipient Location: Boston, MA
-
- Recipient Type: Public Charity
-
- Amount: $2,000'
- - 'Grant: Grant to TRIBAL ECO RESTORATION ALLIANCE
-
- Funder: Foundation Source (FOUNDATION)
-
- Amount: $20,000
-
- Description: Purpose: General & Unrestricted
-
- Recipient Location: UPPER LAKE, CA
-
- Recipient Type: Public Charity
-
- Amount: $20,000'
- - 'Grant: Assessing the spatial and temporal scales of attention effects and attention-dependent
- cholinergic release in macque V4.
-
- Funder: National Eye Institute (FEDERAL)
-
- Amount: $41,749
-
- Description: Explicitly or implicitly, there are currently three competing models
- for the role of the neuromodulator acetylcholine (ACh) in attention. The first
- asserts that the cholinergic system is spatially imprecise and contributes to
- a mechanism for arousal but not attention. The second states that the cholinergic
- system is spatially imprecise and is one component of the mechanism for attention.
- The third states that the cholinergic system is at the center of the mechanism
- for attention (implying the sy...'
- - source_sentence: 'Organization: WITHUMSMITHBROWNPC
-
- Location: NEW YORK, NY
-
- Type: FOUNDATION'
- sentences:
- - 'Grant: Grant to XERCES SOCIETY INC
-
- Funder: WEAVER AND TIDWELL LLP (FOUNDATION)
-
- Amount: $200
-
- Description: Purpose: TO FURTHER THE ORGANIZATIONS CHARITABLE OBJECTIVES
-
- Recipient Location: NEW YORK, NY
-
- Recipient Type: EXEMPT
-
- Amount: $200'
- - 'Grant: Grant to NOOGA QUEEN BEE COOPERATIVE
-
- Funder: HEMENWAY & BARNES LLP (FOUNDATION)
-
- Amount: $1,528
-
- Description: Purpose: FURTHERING EDUCATION WITH RESPECT TO SCIENCE POLICY AND
- BEEKEEPING.
-
- Recipient Location: CHATTANOOGA, TN
-
- Recipient Type: Non-Charity
-
- Amount: $1,528'
- - 'Grant: Grant to Institute for Ag & Trade Policy
-
- Funder: WITHUMSMITHBROWNPC (FOUNDATION)
-
- Amount: $30,000
-
- Description: Purpose: Transform Food Systems
-
- Recipient Location: Minneapolis, MN
-
- Recipient Type: Public Charity
-
- Amount: $30,000'
- - source_sentence: 'Organization: GRANT THORNTON ADVISORS LLC
-
- Location: BOSTON, MA
-
- Type: FOUNDATION'
- sentences:
- - 'Grant: Grant to SAN JUAN ROTARY FOUNDATION INC
-
- Funder: PKF O''CONNOR DAVIES ADVISORY LLC (FOUNDATION)
-
- Amount: $2,000
-
- Description: Purpose: VOLUNTEER INCENTIVE PROGRAM
-
- Recipient Location: FARMINGTON, NM
-
- Recipient Type: Public Charity
-
- Amount: $2,000'
- - 'Grant: Grant to BROWN UNIVERSITY
-
- Funder: GRANT THORNTON ADVISORS LLC (FOUNDATION)
-
- Amount: $400
-
- Description: Purpose: FIDELITY MATCHING GIFTS TO EDUCATION
-
- Recipient Location: PROVIDENCE, RI
-
- Recipient Type: Public Charity
-
- Amount: $400'
- - 'Grant: Experimental Study of a Model to Support Research Evidence Use for Protecting
- Children
-
- Funder: Eunice Kennedy Shriver National Institute of Child Health and Human Development
- (FEDERAL)
-
- Amount: $689,752
-
- Description: Project Summary Protecting children through the primary prevention
- of child abuse and neglect (CAN) is a major priority given that an estimated 1
- in 7 children are affected each year in the U.S. and the societal cost of CAN
- is of over $400 billion. Even though there are many evidence-based programs to
- prevent abuse, reduce harm, and treat trauma, there remain numerous barriers for
- policymakers to craft scientifically-informed policies to protect children. Accordingly,
- we propose an experimental ...'
- - source_sentence: 'Organization: WITHUMSMITHBROWNPC
-
- Location: IRVINE, CA
-
- Type: FOUNDATION'
- sentences:
- - 'Grant: Grant to CENTER FOR LEADERSHIP DEVELOPMENT
-
- Funder: BGBC ADVISORY LLC (FOUNDATION)
-
- Amount: $1,000
-
- Description: Purpose: TO FOSTER THE ADVANCEMENT OF MINORITY YOUTH IN CENTRAL INDIANA
- AS FUTURE PROFESSIONAL, BUSINESS AND COMMUNITY LEADERS BY PROVIDING EXPERIENCES
- THAT ENCOURAGE PERSONAL DEVELOPMENT AND EDUCATIONAL ATTAINMENT.
-
- Recipient Location: INDIANAPOLIS, IN
-
- Recipient Type: PUBLIC CHARITY
-
- Amount: $1,000'
- - 'Grant: Grant to Santa Barbara Botanic Garden
-
- Funder: WITHUMSMITHBROWNPC (FOUNDATION)
-
- Amount: $2,150
-
- Description: Purpose: TO FURTHER THE AGENDA OF THE ORGANIZATION.
-
- Recipient Location: Santa Barbara, CA
-
- Recipient Type: Public Charity
-
- Amount: $2,150'
- - 'Grant: Grant to INTERNATIONAL RESCUE COMMITTEE INC
-
- Funder: CLARK NUBER PS (FOUNDATION)
-
- Amount: $200,000
-
- Description: Purpose: ENSURING THE RIGHT TO HUMANITARIAN ASSISTANCE IN EAST AFRICA
-
- Recipient Location: NEW YORK, NY
-
- Recipient Type: Public Charity
-
- Amount: $200,000'
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
- metrics:
- - pearson_cosine
- - spearman_cosine
- model-index:
- - name: SentenceTransformer based on Qwen/Qwen3-Embedding-0.6B
- results:
- - task:
- type: semantic-similarity
- name: Semantic Similarity
- dataset:
- name: val similarity
- type: val-similarity
- metrics:
- - type: pearson_cosine
- value: .nan
- name: Pearson Cosine
- - type: spearman_cosine
- value: .nan
- name: Spearman Cosine
  ---

- # SentenceTransformer based on Qwen/Qwen3-Embedding-0.6B

- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B). It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

- ## Model Details

- ### Model Description
- - **Model Type:** Sentence Transformer
- - **Base model:** [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) <!-- at revision c54f2e6e80b2d7b7de06f51cec4959f6b3e03418 -->
- - **Maximum Sequence Length:** 512 tokens
- - **Output Dimensionality:** 1024 dimensions
- - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->

- ### Model Sources

- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/huggingface/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

- ### Full Model Architecture

- ```
- SentenceTransformer(
- (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'PeftModelForFeatureExtraction'})
- (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': True})
- (2): Normalize()
- )
- ```

- ## Usage
-
- ### Direct Usage (Sentence Transformers)
-
- First install the Sentence Transformers library:
-
- ```bash
- pip install -U sentence-transformers
- ```

- Then you can load this model and run inference.
- ```python
- from sentence_transformers import SentenceTransformer

- # Download from the 🤗 Hub
- model = SentenceTransformer("sentence_transformers_model_id")
- # Run inference
- queries = [
- "Organization: WITHUMSMITHBROWNPC\nLocation: IRVINE, CA\nType: FOUNDATION",
- ]
- documents = [
- 'Grant: Grant to Santa Barbara Botanic Garden\nFunder: WITHUMSMITHBROWNPC (FOUNDATION)\nAmount: $2,150\nDescription: Purpose: TO FURTHER THE AGENDA OF THE ORGANIZATION.\nRecipient Location: Santa Barbara, CA\nRecipient Type: Public Charity\nAmount: $2,150',
- 'Grant: Grant to INTERNATIONAL RESCUE COMMITTEE INC\nFunder: CLARK NUBER PS (FOUNDATION)\nAmount: $200,000\nDescription: Purpose: ENSURING THE RIGHT TO HUMANITARIAN ASSISTANCE IN EAST AFRICA\nRecipient Location: NEW YORK, NY\nRecipient Type: Public Charity\nAmount: $200,000',
- 'Grant: Grant to CENTER FOR LEADERSHIP DEVELOPMENT\nFunder: BGBC ADVISORY LLC (FOUNDATION)\nAmount: $1,000\nDescription: Purpose: TO FOSTER THE ADVANCEMENT OF MINORITY YOUTH IN CENTRAL INDIANA AS FUTURE PROFESSIONAL, BUSINESS AND COMMUNITY LEADERS BY PROVIDING EXPERIENCES THAT ENCOURAGE PERSONAL DEVELOPMENT AND EDUCATIONAL ATTAINMENT.\nRecipient Location: INDIANAPOLIS, IN\nRecipient Type: PUBLIC CHARITY\nAmount: $1,000',
- ]
- query_embeddings = model.encode_query(queries)
- document_embeddings = model.encode_document(documents)
- print(query_embeddings.shape, document_embeddings.shape)
- # [1, 1024] [3, 1024]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(query_embeddings, document_embeddings)
- print(similarities)
- # tensor([[0.7437, 0.0331, 0.0600]])
- ```

- <!--
- ### Direct Usage (Transformers)

- <details><summary>Click to see the direct usage in Transformers</summary>

- </details>
- -->

- <!--
- ### Downstream Usage (Sentence Transformers)

- You can finetune this model on your own dataset.

- <details><summary>Click to expand</summary>

- </details>
- -->

- <!--
- ### Out-of-Scope Use

- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- ## Evaluation
-
- ### Metrics
-
- #### Semantic Similarity
-
- * Dataset: `val-similarity`
- * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
-
- | Metric | Value |
- |:--------------------|:--------|
- | pearson_cosine | nan |
- | **spearman_cosine** | **nan** |
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations

- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->

  ## Training Details

- ### Training Dataset
-
- #### Unnamed Dataset
-
- * Size: 324,479 training samples
- * Columns: <code>anchor</code> and <code>positive</code>
- * Approximate statistics based on the first 1000 samples:
- | | anchor | positive |
- |:--------|:--------|:--------|
- | type | string | string |
- | details | <ul><li>min: 16 tokens</li><li>mean: 23.39 tokens</li><li>max: 41 tokens</li></ul> | <ul><li>min: 46 tokens</li><li>mean: 83.4 tokens</li><li>max: 192 tokens</li></ul> |
- * Samples:
- | anchor | positive |
- |:-------|:---------|
- | <code>Organization: DELOITTE TAX LLP<br>Location: MINNEAPOLIS, MN<br>Type: FOUNDATION</code> | <code>Grant: Grant to WORLD HEALTH ORGANIZATION<br>Funder: DELOITTE TAX LLP (FOUNDATION)<br>Amount: $450,000<br>Description: Purpose: RESEARCH AND LEARNING OPPORTUNITIES<br>Recipient Type: GOV: EXECUTIVE ORDER<br>Amount: $450,000</code> |
- | <code>Organization: Berry Dunn McNeil &amp; Parker LLC<br>Location: Portland, ME<br>Type: FOUNDATION</code> | <code>Grant: Grant to Museum of Fine Arts<br>Funder: Berry Dunn McNeil &amp; Parker LLC (FOUNDATION)<br>Amount: $3,000<br>Description: Purpose: Operations budget assistance<br>Recipient Location: Boston, MA<br>Recipient Type: Public Charity<br>Amount: $3,000</code> |
- | <code>Organization: Aprio Advisory Group LLC<br>Location: Greenwood Village, CO<br>Type: FOUNDATION</code> | <code>Grant: Grant to Safehouse Denver Inc<br>Funder: Aprio Advisory Group LLC (FOUNDATION)<br>Amount: $5,000<br>Description: Purpose: Survivors of domestic violence<br>Recipient Location: Denver, CO<br>Recipient Type: Public<br>Amount: $5,000</code> |
- * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
- ```json
- {
- "scale": 20.0,
- "similarity_fct": "cos_sim",
- "gather_across_devices": false
- }
- ```
-
- ### Evaluation Dataset
-
- #### Unnamed Dataset
-
- * Size: 40,559 evaluation samples
- * Columns: <code>anchor</code> and <code>positive</code>
- * Approximate statistics based on the first 1000 samples:
- | | anchor | positive |
- |:--------|:--------|:--------|
- | type | string | string |
- | details | <ul><li>min: 16 tokens</li><li>mean: 23.62 tokens</li><li>max: 37 tokens</li></ul> | <ul><li>min: 47 tokens</li><li>mean: 83.31 tokens</li><li>max: 191 tokens</li></ul> |
- * Samples:
- | anchor | positive |
- |:-------|:---------|
- | <code>Organization: O'CONNOR MALONEY &amp; CO CPA'S<br>Location: WORCESTER, MA<br>Type: FOUNDATION</code> | <code>Grant: Grant to NIKOLAS KOJOIAN<br>Funder: O'CONNOR MALONEY &amp; CO CPA'S (FOUNDATION)<br>Amount: $3,500<br>Description: Purpose: EDUCATIONAL SCHOLARSHIP<br>Recipient Location: NORTH ATTLEBORO, MA<br>Recipient Type: I<br>Amount: $3,500</code> |
- | <code>Organization: WALTON ENTERPRISES LLC<br>Location: BENTONVILLE, AR<br>Type: FOUNDATION</code> | <code>Grant: Grant to Student Achievement Partners Inc<br>Funder: WALTON ENTERPRISES LLC (FOUNDATION)<br>Amount: $429,272<br>Description: Purpose: To develop and disseminate high-quality math and literacy instructional materials to educators and publishers that accelerate student learning.<br>Recipient Location: New York, NY<br>Recipient Type: Public Charity<br>Amount: $429,272</code> |
- | <code>Organization: FRAZIER &amp; FRAZIER ATTYS<br>Location: Jacksonville, FL<br>Type: FOUNDATION</code> | <code>Grant: Grant to Cathedral Arts Project<br>Funder: FRAZIER &amp; FRAZIER ATTYS (FOUNDATION)<br>Amount: $2,500<br>Description: Purpose: To provide unrestricted general operating support to fulfill their mission<br>Recipient Location: Jacksonville, FL<br>Recipient Type: Public Charity<br>Amount: $2,500</code> |
- * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
- ```json
- {
- "scale": 20.0,
- "similarity_fct": "cos_sim",
- "gather_across_devices": false
- }
- ```
-
- ### Training Hyperparameters
- #### Non-Default Hyperparameters
-
- - `per_device_train_batch_size`: 32
- - `num_train_epochs`: 1
- - `max_steps`: 1000
- - `learning_rate`: 2e-05
- - `warmup_steps`: 0.1
- - `weight_decay`: 0.01
- - `gradient_accumulation_steps`: 4
- - `fp16`: True
- - `eval_strategy`: steps
- - `per_device_eval_batch_size`: 32
- - `dataloader_num_workers`: 4
- - `warmup_ratio`: 0.1
- - `batch_sampler`: no_duplicates
-
- #### All Hyperparameters
- <details><summary>Click to expand</summary>
-
- - `per_device_train_batch_size`: 32
- - `num_train_epochs`: 1
- - `max_steps`: 1000
- - `learning_rate`: 2e-05
- - `lr_scheduler_type`: linear
- - `lr_scheduler_kwargs`: None
- - `warmup_steps`: 0.1
- - `optim`: adamw_torch_fused
- - `optim_args`: None
- - `weight_decay`: 0.01
- - `adam_beta1`: 0.9
- - `adam_beta2`: 0.999
- - `adam_epsilon`: 1e-08
- - `optim_target_modules`: None
- - `gradient_accumulation_steps`: 4
- - `average_tokens_across_devices`: True
- - `max_grad_norm`: 1.0
- - `label_smoothing_factor`: 0.0
- - `bf16`: False
- - `fp16`: True
- - `bf16_full_eval`: False
- - `fp16_full_eval`: False
- - `tf32`: None
- - `gradient_checkpointing`: False
- - `gradient_checkpointing_kwargs`: None
- - `torch_compile`: False
- - `torch_compile_backend`: None
- - `torch_compile_mode`: None
- - `use_liger_kernel`: False
- - `liger_kernel_config`: None
- - `use_cache`: False
- - `neftune_noise_alpha`: None
- - `torch_empty_cache_steps`: None
- - `auto_find_batch_size`: False
- - `log_on_each_node`: True
- - `logging_nan_inf_filter`: True
- - `include_num_input_tokens_seen`: no
- - `log_level`: passive
- - `log_level_replica`: warning
- - `disable_tqdm`: False
- - `project`: huggingface
- - `trackio_space_id`: trackio
- - `eval_strategy`: steps
- - `per_device_eval_batch_size`: 32
- - `prediction_loss_only`: True
- - `eval_on_start`: False
- - `eval_do_concat_batches`: True
- - `eval_use_gather_object`: False
- - `eval_accumulation_steps`: None
- - `include_for_metrics`: []
- - `batch_eval_metrics`: False
- - `save_only_model`: False
- - `save_on_each_node`: False
- - `enable_jit_checkpoint`: False
- - `push_to_hub`: False
- - `hub_private_repo`: None
- - `hub_model_id`: None
- - `hub_strategy`: every_save
- - `hub_always_push`: False
- - `hub_revision`: None
- - `load_best_model_at_end`: False
- - `ignore_data_skip`: False
- - `restore_callback_states_from_checkpoint`: False
- - `full_determinism`: False
- - `seed`: 42
- - `data_seed`: None
- - `use_cpu`: False
- - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- - `parallelism_config`: None
- - `dataloader_drop_last`: False
- - `dataloader_num_workers`: 4
- - `dataloader_pin_memory`: True
- - `dataloader_persistent_workers`: False
- - `dataloader_prefetch_factor`: None
- - `remove_unused_columns`: True
- - `label_names`: None
- - `train_sampling_strategy`: random
- - `length_column_name`: length
- - `ddp_find_unused_parameters`: None
- - `ddp_bucket_cap_mb`: None
- - `ddp_broadcast_buffers`: False
- - `ddp_backend`: None
- - `ddp_timeout`: 1800
- - `fsdp`: []
- - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- - `deepspeed`: None
- - `debug`: []
- - `skip_memory_metrics`: True
- - `do_predict`: False
- - `resume_from_checkpoint`: None
- - `warmup_ratio`: 0.1
- - `local_rank`: -1
- - `prompts`: None
- - `batch_sampler`: no_duplicates
- - `multi_dataset_batch_sampler`: proportional
- - `router_mapping`: {}
- - `learning_rate_mapping`: {}
-
- </details>
-
- ### Training Logs
- | Epoch | Step | Training Loss | Validation Loss | val-similarity_spearman_cosine |
- |:------:|:----:|:-------------:|:---------------:|:------------------------------:|
- | 0.0099 | 25 | 1.7643 | - | - |
- | 0.0197 | 50 | 1.0715 | - | - |
- | 0.0296 | 75 | 0.4669 | - | - |
- | 0.0394 | 100 | 0.3204 | 0.2283 | nan |
- | 0.0493 | 125 | 0.3101 | - | - |
- | 0.0592 | 150 | 0.2830 | - | - |
- | 0.0690 | 175 | 0.3010 | - | - |
- | 0.0789 | 200 | 0.2790 | 0.2096 | nan |
- | 0.0888 | 225 | 0.2919 | - | - |
- | 0.0986 | 250 | 0.2608 | - | - |
- | 0.1085 | 275 | 0.2796 | - | - |
- | 0.1183 | 300 | 0.2559 | 0.1940 | nan |
- | 0.1282 | 325 | 0.2376 | - | - |
- | 0.1381 | 350 | 0.2491 | - | - |
- | 0.1479 | 375 | 0.2307 | - | - |
- | 0.1578 | 400 | 0.2233 | 0.1824 | nan |
- | 0.1677 | 425 | 0.2385 | - | - |
- | 0.1775 | 450 | 0.2356 | - | - |
- | 0.1874 | 475 | 0.2295 | - | - |
- | 0.1972 | 500 | 0.2104 | 0.1721 | nan |
- | 0.2071 | 525 | 0.2117 | - | - |
- | 0.2170 | 550 | 0.2100 | - | - |
- | 0.2268 | 575 | 0.2462 | - | - |
- | 0.2367 | 600 | 0.2402 | 0.1648 | nan |
- | 0.2465 | 625 | 0.1954 | - | - |
- | 0.2564 | 650 | 0.1890 | - | - |
- | 0.2663 | 675 | 0.2182 | - | - |
- | 0.2761 | 700 | 0.1878 | 0.1590 | nan |
- | 0.2860 | 725 | 0.2252 | - | - |
- | 0.2959 | 750 | 0.1886 | - | - |
- | 0.3057 | 775 | 0.1879 | - | - |
- | 0.3156 | 800 | 0.2009 | 0.1516 | nan |
- | 0.3254 | 825 | 0.1880 | - | - |
- | 0.3353 | 850 | 0.1872 | - | - |
- | 0.3452 | 875 | 0.1973 | - | - |
- | 0.3550 | 900 | 0.1944 | 0.1474 | nan |
- | 0.3649 | 925 | 0.1960 | - | - |
- | 0.3748 | 950 | 0.1993 | - | - |
- | 0.3846 | 975 | 0.1891 | - | - |
- | 0.3945 | 1000 | 0.1971 | 0.1458 | nan |
-
- ### Framework Versions
- - Python: 3.11.12
- - Sentence Transformers: 5.2.3
- - Transformers: 5.2.0
- - PyTorch: 2.10.0+cu128
- - Accelerate: 1.12.0
- - Datasets: 4.6.0
- - Tokenizers: 0.22.2
-
- ## Citation
-
- ### BibTeX
-
- #### Sentence Transformers
- ```bibtex
- @inproceedings{reimers-2019-sentence-bert,
- title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
- author = "Reimers, Nils and Gurevych, Iryna",
- booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
- month = "11",
- year = "2019",
- publisher = "Association for Computational Linguistics",
- url = "https://arxiv.org/abs/1908.10084",
- }
- ```
-
- #### MultipleNegativesRankingLoss
- ```bibtex
- @misc{henderson2017efficient,
- title={Efficient Natural Language Response Suggestion for Smart Reply},
- author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
- year={2017},
- eprint={1705.00652},
- archivePrefix={arXiv},
- primaryClass={cs.CL}
- }
- ```
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
-
- ---
-
- ## V2.0 Update: Foundation Grants Support (February 2026)
-
- ### What Changed
-
- V2 retrains the embedding model on **combined federal + foundation data**. The training set grew from federal-only pairs to **324,479 positive pairs** spanning NIH, NSF, and 37,684 private foundations.

- The model now understands the semantic relationship between:
- - **Federal grants**: Organization research profiles matched to NIH/NSF funding opportunities
- - **Foundation grants**: Foundation profiles matched to their actual grantmaking (recipient, purpose, amount)

- ### Training Details (V2)

- - **Hardware**: NVIDIA H100 80GB HBM3
- - **Training Steps**: 1,000 (LoRA fine-tuning)
- - **Base Model**: Qwen/Qwen3-Embedding-0.6B
- - **LoRA Config**: r=16, alpha=32, target=q/k/v/o projections
- - **Effective Batch Size**: 128 (32 x 4 gradient accumulation)
- - **Final Validation Loss**: 0.1458 (steadily decreasing from 0.2283)

- ### Downstream Impact

- When used as the similarity feature for the XGBoost classifier:

- | Metric | V1 (Federal Only) | V2 (Combined) |
- |--------|-------------------|---------------|
- | Overall AUC | 0.837 | **0.997** |
- | Federal AUC | 0.837 | **0.913** |

- The foundation-aware embeddings improved performance across the board, including on federal-only test data.

- ### Version Tags

- - `v1.0-federal-only`: Trained on NIH + NSF data only
- - `v2.0-with-foundations`: Trained on NIH + NSF + 37K foundation grants
  ---
+ license: apache-2.0
  tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - grant-matching
+ - nonprofit
+ - foundation-grants
  base_model: Qwen/Qwen3-Embedding-0.6B
+ datasets:
+ - ArkMaster123/grantpilot-training-data
+ language:
+ - en
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  ---

+ # GrantPilot Embedding V2 (Federal + Foundation)

+ Fine-tuned [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) for grant-organization semantic matching. V2 extends coverage from federal-only (NIH/NSF) to include **37,684 private foundations**.

+ > **See also:** [V1 (federal-only)](https://huggingface.co/ArkMaster123/grantpilot-embedding), which outperforms OpenAI on federal grant retrieval.

+ ## Embedding Benchmark Results

+ Benchmarked on 998 test pairs (901 foundation, 78 NIH, 19 NSF) using retrieval and classification metrics.

+ ### Retrieval Quality

+ | Model | Dim | R@1 | R@5 | R@10 | MRR | NDCG@10 |
+ |-------|-----|-----|-----|------|-----|---------|
+ | OpenAI text-embedding-3-small | 1536 | **0.343** | **0.570** | **0.682** | **0.453** | **0.499** |
+ | Qwen3-Embedding-0.6B (base) | 1024 | 0.295 | 0.514 | 0.630 | 0.403 | 0.449 |
+ | **GrantPilot V2 (this model)** | 1024 | 0.295 | 0.516 | 0.622 | 0.403 | 0.446 |

+ **Verdict: OpenAI wins on retrieval.** The fine-tuned V2 embedding performs on par with the base Qwen3 model — fine-tuning did not meaningfully improve retrieval on this mixed dataset. V1 (federal-only) significantly outperformed OpenAI on federal retrieval, but adding 90% foundation data diluted that specialization.
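
+ For reference, the retrieval numbers above follow the usual one-positive-per-query protocol: each test organization has exactly one paired grant, and all candidate grants are ranked by cosine similarity. A minimal sketch of how such metrics can be computed (illustrative only, not the exact benchmark script; assumes L2-normalized embedding matrices):

+ ```python
+ import numpy as np
+ 
+ def retrieval_metrics(query_emb, doc_emb, ks=(1, 5, 10)):
+     """query_emb[i] is relevant to doc_emb[i] only (one positive per query)."""
+     sims = query_emb @ doc_emb.T  # cosine similarity; embeddings are pre-normalized
+     order = (-sims).argsort(axis=1)  # doc indices, best match first
+     # 0-based rank of the true document for each query
+     ranks = (order == np.arange(len(sims))[:, None]).argmax(axis=1)
+     metrics = {f"R@{k}": float((ranks < k).mean()) for k in ks}
+     metrics["MRR"] = float((1.0 / (ranks + 1)).mean())
+     # one relevant doc => ideal DCG is 1, so NDCG@10 averages 1/log2(rank + 2) over hits
+     metrics["NDCG@10"] = float(np.where(ranks < 10, 1.0 / np.log2(ranks + 2), 0.0).mean())
+     return metrics
+ ```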

+ ### AUC as Classifier Feature

+ | Model | Overall AUC | Foundation AUC | NIH AUC | NSF AUC |
+ |-------|-------------|----------------|---------|---------|
+ | OpenAI text-embedding-3-small | **0.886** | **0.972** | 0.473 | 0.524 |
+ | Qwen3-Embedding-0.6B (base) | 0.881 | 0.965 | 0.611 | 0.548 |
+ | **GrantPilot V2 (this model)** | 0.881 | 0.965 | **0.614** | 0.548 |

+ Interestingly, OpenAI has the best overall AUC but the **worst federal AUC** (0.47 on NIH, worse than random). Our fine-tuned model is the strongest on federal grants.
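
+ "AUC as classifier feature" here means scoring each labeled (organization, grant) pair by embedding cosine similarity alone and checking how well that single score separates true matches from negatives. A hedged sketch of that measurement (names are illustrative, not the benchmark code):

+ ```python
+ import numpy as np
+ from sklearn.metrics import roc_auc_score
+ 
+ def similarity_auc(org_vecs: np.ndarray, grant_vecs: np.ndarray, labels: np.ndarray) -> float:
+     """org_vecs/grant_vecs are row-aligned pair embeddings; labels are 1 (match) / 0 (negative)."""
+     scores = (org_vecs * grant_vecs).sum(axis=1)  # dot product = cosine for normalized embeddings
+     return roc_auc_score(labels, scores)
+ ```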

+ ### Inference Latency

+ | Model | Avg Latency | Cost |
+ |-------|-------------|------|
+ | OpenAI text-embedding-3-small | 43.9ms | API cost |
+ | Qwen3-Embedding-0.6B (base) | 2.9ms | Free (self-hosted) |
+ | **GrantPilot V2 (this model)** | **1.7ms** | Free (self-hosted) |

+ **25x faster than OpenAI** with zero API cost.
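
+ Latency above is per single-text encode. A sketch of the kind of harness behind such numbers (absolute values depend heavily on hardware, batch size, and sequence length):

+ ```python
+ import time
+ 
+ def avg_encode_latency_ms(model, text: str, n: int = 100) -> float:
+     model.encode(text)  # warm-up: the first call pays tokenizer/CUDA initialization
+     start = time.perf_counter()
+     for _ in range(n):
+         model.encode(text)
+     return (time.perf_counter() - start) * 1000.0 / n
+ ```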

+ ### Comparison with V1

+ | Metric | V1 vs OpenAI | V2 vs OpenAI |
+ |--------|--------------|--------------|
+ | R@1 | **V1 wins (+46%)** | OpenAI wins |
+ | R@5 | **V1 wins (+22%)** | OpenAI wins |
+ | R@10 | **V1 wins (+28%)** | OpenAI wins |

+ V1 beat OpenAI decisively on federal grants. V2 lost that edge by training on a dataset that is 90% foundation data.

+ ## Why Use This Model?

+ The embedding alone is not the star — the **XGBoost classifier built on top** is where the real value comes from:

+ | Classifier Metric | V1 | V2 |
+ |-------------------|----|----|
+ | Overall AUC | 0.837 | **0.997** |
+ | Federal AUC | 0.837 | **0.913** |
+ | Accuracy | 72.1% | **98.3%** |
+ | F1 | 0.595 | **0.983** |

+ See: [grantpilot-classifier-v2](https://huggingface.co/ArkMaster123/grantpilot-classifier-v2)
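
+ To make the division of labor concrete: the embedding supplies a similarity score, and the classifier learns when that score (possibly alongside other pair features) indicates a real match. A sketch of that setup (the feature set and hyperparameters here are illustrative, not the published classifier's):

+ ```python
+ import numpy as np
+ import xgboost as xgb
+ 
+ def train_match_classifier(sims: np.ndarray, extra_features: np.ndarray, y: np.ndarray):
+     """sims: embedding cosine similarity per pair; extra_features: any additional pair features."""
+     X = np.column_stack([sims, extra_features])
+     clf = xgb.XGBClassifier(n_estimators=300, max_depth=6, eval_metric="auc")
+     clf.fit(X, y)  # y: 1 for a true org-grant match, 0 for a negative pair
+     return clf
+ ```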

  ## Training Details

+ - **Hardware**: NVIDIA H100 80GB
+ - **Training Steps**: 1,000 (LoRA fine-tuning)
+ - **Training Pairs**: 324,479 positive pairs
+ - **LoRA Config**: r=16, alpha=32, target=q/k/v/o projections
+ - **Batch Size**: 32 (×4 gradient accumulation = 128 effective)
+ - **Learning Rate**: 2e-5
+ - **Final Val Loss**: 0.1458
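
+ A minimal sketch of this recipe, assuming the sentence-transformers PEFT integration (`model.add_adapter`) and the `MultipleNegativesRankingLoss` (in-batch negatives, cosine scaled by 20) listed in the original card; the actual training script may differ:

+ ```python
+ from peft import LoraConfig, TaskType
+ from sentence_transformers import SentenceTransformer, losses
+ 
+ model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
+ # LoRA on the attention projections only, matching the config above
+ model.add_adapter(LoraConfig(
+     task_type=TaskType.FEATURE_EXTRACTION,
+     r=16,
+     lora_alpha=32,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+ ))
+ # (anchor, positive) pairs; other positives in the batch serve as negatives
+ loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)
+ ```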

+ ### Training Data Composition

+ | Source | Pairs | % |
+ |--------|-------|---|
+ | Foundation (990-PF) | 292,401 | 90.1% |
+ | NIH | 25,717 | 7.9% |
+ | NSF | 6,361 | 2.0% |

+ ## Usage

+ ```python
+ from sentence_transformers import SentenceTransformer
+ 
+ model = SentenceTransformer("ArkMaster123/grantpilot-embedding-v2", trust_remote_code=True)
+ 
+ org_text = "Organization: Ford Foundation\nLocation: New York, NY\nType: FOUNDATION"
+ grant_text = "Grant: Support for civil society organizations\nAmount: $500,000"
+ 
+ # Embeddings come out L2-normalized, so a plain dot product is a cosine score
+ embeddings = model.encode([org_text, grant_text])
+ similarity = embeddings[0] @ embeddings[1]
+ ```
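
+ For retrieval-style use, the dedicated query/document encoders and the built-in similarity utility from the previous model card also apply here:

+ ```python
+ # Alternative: query/document helpers plus the similarity util (returns cosine scores)
+ query_embeddings = model.encode_query([org_text])
+ document_embeddings = model.encode_document([grant_text])
+ print(model.similarity(query_embeddings, document_embeddings))
+ ```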

+ ## Related Models

+ | Model | Description |
+ |-------|-------------|
+ | [grantpilot-embedding](https://huggingface.co/ArkMaster123/grantpilot-embedding) | V1 — federal-only, beats OpenAI on retrieval |
+ | [grantpilot-classifier](https://huggingface.co/ArkMaster123/grantpilot-classifier) | V1 — federal-only classifier (AUC 0.837) |
+ | [grantpilot-classifier-v2](https://huggingface.co/ArkMaster123/grantpilot-classifier-v2) | V2 — combined classifier (AUC 0.997) |
+ | [grantpilot-training-data](https://huggingface.co/datasets/ArkMaster123/grantpilot-training-data) | Training data (V1 at training/, V2 at training_v2/) |