jjgarciac committed on
Commit
989191e
·
verified ·
1 Parent(s): 9da3822

Update README.md

  1. README.md +115 -449
README.md CHANGED
@@ -1,128 +1,62 @@
1
  ---
2
- tags:
3
- - sentence-transformers
4
- - sentence-similarity
5
- - feature-extraction
6
- - generated_from_trainer
7
- - dataset_size:438516
8
- - loss:CoSENTLoss
9
- base_model: sentence-transformers/all-mpnet-base-v2
10
- widget:
11
- - source_sentence: 'Ventral humeral ridge: or not'
12
- sentences:
13
- - >-
14
- If metasternum ossified, shape: long, narrow and tapering markedly
15
- anteriorly to posteriorly, length up to 3.5 times maximum width
16
- - >-
17
- Astragalus, dorsolateral margin:: overlaps the anterior and posterior
18
- portions of the calcaneum equally
19
- - 'Ulna size: does not apply'
20
- - source_sentence: >-
21
- Form of distal portion of anteroventral process of ectopterygoid: varyingly
22
- falcate
23
- sentences:
24
- - 'Middle and distal radials in dorsal and anal fins: absent'
25
- - >-
26
- Degree of development of primitively medial portion of fourth upper
27
- pharyngeal tooth-plate: fourth upper pharyngeal tooth-plate covers ventral,
28
- posterior, dorsal and sometimes anterior surfaces of fourth
29
- infrapharyngobranchial
30
- - 'Shape of pharyngeal apophysis (basioccipital): forked anteriorly'
31
- - source_sentence: >-
32
- Form of distal portion of anteroventral process of ectopterygoid: varyingly
33
- falcate
34
- sentences:
35
- - 'parhypural: present'
36
- - 'Epural: heavy'
37
- - 'First infraorbital: short'
38
- - source_sentence: >-
39
- Form of distal portion of anteroventral process of ectopterygoid: varyingly
40
- falcate
41
- sentences:
42
- - 'Dentary and angular: touch'
43
- - 'Urohyal and first basibranchial: firmly attached'
44
- - 'Supraneural 3-4 (nonadditive): absent'
45
- - source_sentence: >-
46
- Form of distal portion of anteroventral process of ectopterygoid: varyingly
47
- falcate
48
- sentences:
49
- - 'Ventral diverging lamellae of mesethmoid: lamellae reduced or absent'
50
- - 'Ventral ridge of the coracoid with a posterior process: absent'
51
- - 'carpals: fully or partially ossified'
52
- pipeline_tag: sentence-similarity
53
- library_name: sentence-transformers
54
- metrics:
55
- - pearson_cosine
56
- - spearman_cosine
57
- model-index:
58
- - name: SentenceTransformer based on sentence-transformers/all-mpnet-base-v2
59
- results:
60
- - task:
61
- type: semantic-similarity
62
- name: Semantic Similarity
63
- dataset:
64
- name: pheno dev
65
- type: pheno-dev
66
- metrics:
67
- - type: pearson_cosine
68
- value: 0.6082332469417436
69
- name: Pearson Cosine
70
- - type: spearman_cosine
71
- value: 0.6250387873495056
72
- name: Spearman Cosine
73
- - task:
74
- type: semantic-similarity
75
- name: Semantic Similarity
76
- dataset:
77
- name: pheno test
78
- type: pheno-test
79
- metrics:
80
- - type: pearson_cosine
81
- value: 0.6822053314599665
82
- name: Pearson Cosine
83
- - type: spearman_cosine
84
- value: 0.705688010939619
85
- name: Spearman Cosine
86
  license: mit
87
  language:
88
  - en
89
  ---
90
 
91
- # SentenceTransformer based on sentence-transformers/all-mpnet-base-v2
92
 
93
- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2). It maps sentences & paragraphs to a 256-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
 
94
 
95
  ## Model Details
96
 
97
  ### Model Description
98
- - **Model Type:** Sentence Transformer
99
- - **Base model:** [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) <!-- at revision 9a3225965996d404b775526de6dbfe85d3368642 -->
100
- - **Maximum Sequence Length:** 256 tokens
101
- - **Output Dimensionality:** 256 dimensions
102
- - **Similarity Function:** Cosine Similarity
103
- <!-- - **Training Dataset:** Unknown -->
104
- <!-- - **Language:** Unknown -->
105
- <!-- - **License:** Unknown -->
106
 
107
  ### Model Sources
108
 
109
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
110
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
111
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
112
 
113
- ### Full Model Architecture
114
 
115
- ```
116
- SentenceTransformer(
117
- (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: MPNetModel
118
- (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
119
- (2): Dense({'in_features': 768, 'out_features': 256, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
120
- )
121
- ```
122
 
123
- ## Usage
124
 
125
- ### Direct Usage (Sentence Transformers)
126
 
127
  First install the Sentence Transformers library:
128
 
@@ -152,62 +86,11 @@ print(similarities.shape)
152
  # [3, 3]
153
  ```
154
 
155
- <!--
156
- ### Direct Usage (Transformers)
157
-
158
- <details><summary>Click to see the direct usage in Transformers</summary>
159
-
160
- </details>
161
- -->
162
-
163
- <!--
164
- ### Downstream Usage (Sentence Transformers)
165
-
166
- You can finetune this model on your own dataset.
167
-
168
- <details><summary>Click to expand</summary>
169
-
170
- </details>
171
- -->
172
-
173
- <!--
174
- ### Out-of-Scope Use
175
-
176
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
177
- -->
178
-
179
- ## Evaluation
180
-
181
- ### Metrics
182
-
183
- #### Semantic Similarity
184
-
185
- * Datasets: `pheno-dev` and `pheno-test`
186
- * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
187
-
188
- | Metric | pheno-dev | pheno-test |
189
- |:--------------------|:----------|:-----------|
190
- | pearson_cosine | 0.6082 | 0.6822 |
191
- | **spearman_cosine** | **0.625** | **0.7057** |
192
-
193
- <!--
194
- ## Bias, Risks and Limitations
195
-
196
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
197
- -->
198
-
199
- <!--
200
- ### Recommendations
201
-
202
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
203
- -->
204
-
205
  ## Training Details
206
 
207
- ### Training Dataset
208
-
209
- #### Unnamed Dataset
210
 
 
211
 
212
  * Size: 438,516 training samples
213
  * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
@@ -230,11 +113,27 @@ You can finetune this model on your own dataset.
230
  }
231
  ```
232
 
233
- ### Evaluation Dataset
 
234
 
235
- #### Unnamed Dataset
 
 
 
 
 
 
 
236
 
237
 
 
 
 
 
 
 
 
 
238
  * Size: 111,628 evaluation samples
239
  * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
240
  * Approximate statistics based on the first 1000 samples:
@@ -256,284 +155,57 @@ You can finetune this model on your own dataset.
256
  }
257
  ```
258
 
259
- ### Training Hyperparameters
260
- #### Non-Default Hyperparameters
261
 
262
- - `eval_strategy`: steps
263
- - `per_device_train_batch_size`: 64
264
- - `per_device_eval_batch_size`: 64
265
- - `learning_rate`: 2e-05
266
- - `num_train_epochs`: 10
267
- - `warmup_ratio`: 1e-06
268
 
269
- #### All Hyperparameters
270
- <details><summary>Click to expand</summary>
271
 
272
- - `overwrite_output_dir`: False
273
- - `do_predict`: False
274
- - `eval_strategy`: steps
275
- - `prediction_loss_only`: True
276
- - `per_device_train_batch_size`: 64
277
- - `per_device_eval_batch_size`: 64
278
- - `per_gpu_train_batch_size`: None
279
- - `per_gpu_eval_batch_size`: None
280
- - `gradient_accumulation_steps`: 1
281
- - `eval_accumulation_steps`: None
282
- - `torch_empty_cache_steps`: None
283
- - `learning_rate`: 2e-05
284
- - `weight_decay`: 0.0
285
- - `adam_beta1`: 0.9
286
- - `adam_beta2`: 0.999
287
- - `adam_epsilon`: 1e-08
288
- - `max_grad_norm`: 1.0
289
- - `num_train_epochs`: 10
290
- - `max_steps`: -1
291
- - `lr_scheduler_type`: linear
292
- - `lr_scheduler_kwargs`: {}
293
- - `warmup_ratio`: 1e-06
294
- - `warmup_steps`: 0
295
- - `log_level`: passive
296
- - `log_level_replica`: warning
297
- - `log_on_each_node`: True
298
- - `logging_nan_inf_filter`: True
299
- - `save_safetensors`: True
300
- - `save_on_each_node`: False
301
- - `save_only_model`: False
302
- - `restore_callback_states_from_checkpoint`: False
303
- - `no_cuda`: False
304
- - `use_cpu`: False
305
- - `use_mps_device`: False
306
- - `seed`: 42
307
- - `data_seed`: None
308
- - `jit_mode_eval`: False
309
- - `use_ipex`: False
310
- - `bf16`: False
311
- - `fp16`: False
312
- - `fp16_opt_level`: O1
313
- - `half_precision_backend`: auto
314
- - `bf16_full_eval`: False
315
- - `fp16_full_eval`: False
316
- - `tf32`: None
317
- - `local_rank`: 0
318
- - `ddp_backend`: None
319
- - `tpu_num_cores`: None
320
- - `tpu_metrics_debug`: False
321
- - `debug`: []
322
- - `dataloader_drop_last`: False
323
- - `dataloader_num_workers`: 0
324
- - `dataloader_prefetch_factor`: None
325
- - `past_index`: -1
326
- - `disable_tqdm`: False
327
- - `remove_unused_columns`: True
328
- - `label_names`: None
329
- - `load_best_model_at_end`: False
330
- - `ignore_data_skip`: False
331
- - `fsdp`: []
332
- - `fsdp_min_num_params`: 0
333
- - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
334
- - `fsdp_transformer_layer_cls_to_wrap`: None
335
- - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
336
- - `deepspeed`: None
337
- - `label_smoothing_factor`: 0.0
338
- - `optim`: adamw_torch
339
- - `optim_args`: None
340
- - `adafactor`: False
341
- - `group_by_length`: False
342
- - `length_column_name`: length
343
- - `ddp_find_unused_parameters`: None
344
- - `ddp_bucket_cap_mb`: None
345
- - `ddp_broadcast_buffers`: False
346
- - `dataloader_pin_memory`: True
347
- - `dataloader_persistent_workers`: False
348
- - `skip_memory_metrics`: True
349
- - `use_legacy_prediction_loop`: False
350
- - `push_to_hub`: False
351
- - `resume_from_checkpoint`: None
352
- - `hub_model_id`: None
353
- - `hub_strategy`: every_save
354
- - `hub_private_repo`: None
355
- - `hub_always_push`: False
356
- - `gradient_checkpointing`: False
357
- - `gradient_checkpointing_kwargs`: None
358
- - `include_inputs_for_metrics`: False
359
- - `include_for_metrics`: []
360
- - `eval_do_concat_batches`: True
361
- - `fp16_backend`: auto
362
- - `push_to_hub_model_id`: None
363
- - `push_to_hub_organization`: None
364
- - `mp_parameters`:
365
- - `auto_find_batch_size`: False
366
- - `full_determinism`: False
367
- - `torchdynamo`: None
368
- - `ray_scope`: last
369
- - `ddp_timeout`: 1800
370
- - `torch_compile`: False
371
- - `torch_compile_backend`: None
372
- - `torch_compile_mode`: None
373
- - `dispatch_batches`: None
374
- - `split_batches`: None
375
- - `include_tokens_per_second`: False
376
- - `include_num_input_tokens_seen`: False
377
- - `neftune_noise_alpha`: None
378
- - `optim_target_modules`: None
379
- - `batch_eval_metrics`: False
380
- - `eval_on_start`: False
381
- - `use_liger_kernel`: False
382
- - `eval_use_gather_object`: False
383
- - `average_tokens_across_devices`: False
384
- - `prompts`: None
385
- - `batch_sampler`: batch_sampler
386
- - `multi_dataset_batch_sampler`: proportional
387
-
388
- </details>
389
-
390
- ### Training Logs
391
- <details><summary>Click to expand</summary>
392
-
393
- | Epoch | Step | Training Loss | Validation Loss | pheno-dev_spearman_cosine | pheno-test_spearman_cosine |
394
- |:------:|:-----:|:-------------:|:---------------:|:-------------------------:|:--------------------------:|
395
- | 0.0730 | 500 | 7.3492 | - | - | - |
396
- | 0.1459 | 1000 | 6.9718 | - | - | - |
397
- | 0.2189 | 1500 | 6.7986 | - | - | - |
398
- | 0.2919 | 2000 | 6.7157 | 8.8773 | 0.6305 | - |
399
- | 0.3649 | 2500 | 6.6327 | - | - | - |
400
- | 0.4378 | 3000 | 6.5661 | - | - | - |
401
- | 0.5108 | 3500 | 6.5309 | - | - | - |
402
- | 0.5838 | 4000 | 6.4737 | 10.0841 | 0.6116 | - |
403
- | 0.6567 | 4500 | 6.4516 | - | - | - |
404
- | 0.7297 | 5000 | 6.4235 | - | - | - |
405
- | 0.8027 | 5500 | 6.3908 | - | - | - |
406
- | 0.8757 | 6000 | 6.3602 | 10.8098 | 0.6071 | - |
407
- | 0.9486 | 6500 | 6.3315 | - | - | - |
408
- | 1.0216 | 7000 | 6.3236 | - | - | - |
409
- | 1.0946 | 7500 | 6.2753 | - | - | - |
410
- | 1.1675 | 8000 | 6.2845 | 11.9185 | 0.6263 | - |
411
- | 1.2405 | 8500 | 6.254 | - | - | - |
412
- | 1.3135 | 9000 | 6.2351 | - | - | - |
413
- | 1.3865 | 9500 | 6.2017 | - | - | - |
414
- | 1.4594 | 10000 | 6.2138 | 12.3766 | 0.6161 | - |
415
- | 1.5324 | 10500 | 6.2066 | - | - | - |
416
- | 1.6054 | 11000 | 6.1834 | - | - | - |
417
- | 1.6783 | 11500 | 6.1937 | - | - | - |
418
- | 1.7513 | 12000 | 6.1661 | 12.9426 | 0.6113 | - |
419
- | 1.8243 | 12500 | 6.1362 | - | - | - |
420
- | 1.8973 | 13000 | 6.1065 | - | - | - |
421
- | 1.9702 | 13500 | 6.1371 | - | - | - |
422
- | 2.0432 | 14000 | 6.0983 | 13.5966 | 0.6156 | - |
423
- | 2.1162 | 14500 | 6.0978 | - | - | - |
424
- | 2.1891 | 15000 | 6.0767 | - | - | - |
425
- | 2.2621 | 15500 | 6.066 | - | - | - |
426
- | 2.3351 | 16000 | 6.0739 | 13.9316 | 0.6260 | - |
427
- | 2.4081 | 16500 | 6.0635 | - | - | - |
428
- | 2.4810 | 17000 | 6.0616 | - | - | - |
429
- | 2.5540 | 17500 | 6.0219 | - | - | - |
430
- | 2.6270 | 18000 | 6.0129 | 14.3098 | 0.6158 | - |
431
- | 2.6999 | 18500 | 6.0414 | - | - | - |
432
- | 2.7729 | 19000 | 6.0317 | - | - | - |
433
- | 2.8459 | 19500 | 6.0158 | - | - | - |
434
- | 2.9189 | 20000 | 6.0078 | 14.6487 | 0.6188 | - |
435
- | 2.9918 | 20500 | 6.0295 | - | - | - |
436
- | 3.0648 | 21000 | 5.9664 | - | - | - |
437
- | 3.1378 | 21500 | 5.9682 | - | - | - |
438
- | 3.2107 | 22000 | 5.9755 | 15.2314 | 0.6202 | - |
439
- | 3.2837 | 22500 | 5.9608 | - | - | - |
440
- | 3.3567 | 23000 | 5.9469 | - | - | - |
441
- | 3.4297 | 23500 | 5.9673 | - | - | - |
442
- | 3.5026 | 24000 | 5.9496 | 15.4385 | 0.6237 | - |
443
- | 3.5756 | 24500 | 5.9148 | - | - | - |
444
- | 3.6486 | 25000 | 5.9568 | - | - | - |
445
- | 3.7215 | 25500 | 5.9135 | - | - | - |
446
- | 3.7945 | 26000 | 5.9363 | 15.3029 | 0.6217 | - |
447
- | 3.8675 | 26500 | 5.9096 | - | - | - |
448
- | 3.9405 | 27000 | 5.9171 | - | - | - |
449
- | 4.0134 | 27500 | 5.8955 | - | - | - |
450
- | 4.0864 | 28000 | 5.861 | 15.3221 | 0.6265 | - |
451
- | 4.1594 | 28500 | 5.8726 | - | - | - |
452
- | 4.2323 | 29000 | 5.8835 | - | - | - |
453
- | 4.3053 | 29500 | 5.8823 | - | - | - |
454
- | 4.3783 | 30000 | 5.8702 | 15.7276 | 0.6266 | - |
455
- | 4.4513 | 30500 | 5.8721 | - | - | - |
456
- | 4.5242 | 31000 | 5.8988 | - | - | - |
457
- | 4.5972 | 31500 | 5.8671 | - | - | - |
458
- | 4.6702 | 32000 | 5.8705 | 15.9223 | 0.6212 | - |
459
- | 4.7431 | 32500 | 5.8905 | - | - | - |
460
- | 4.8161 | 33000 | 5.8634 | - | - | - |
461
- | 4.8891 | 33500 | 5.8637 | - | - | - |
462
- | 4.9621 | 34000 | 5.8385 | 16.1225 | 0.6045 | - |
463
- | 5.0350 | 34500 | 5.8583 | - | - | - |
464
- | 5.1080 | 35000 | 5.821 | - | - | - |
465
- | 5.1810 | 35500 | 5.8219 | - | - | - |
466
- | 5.2539 | 36000 | 5.8367 | 15.6937 | 0.6240 | - |
467
- | 5.3269 | 36500 | 5.8245 | - | - | - |
468
- | 5.3999 | 37000 | 5.8161 | - | - | - |
469
- | 5.4729 | 37500 | 5.8138 | - | - | - |
470
- | 5.5458 | 38000 | 5.815 | 15.7507 | 0.6279 | - |
471
- | 5.6188 | 38500 | 5.8238 | - | - | - |
472
- | 5.6918 | 39000 | 5.8235 | - | - | - |
473
- | 5.7647 | 39500 | 5.8407 | - | - | - |
474
- | 5.8377 | 40000 | 5.8258 | 15.8875 | 0.6213 | - |
475
- | 5.9107 | 40500 | 5.7941 | - | - | - |
476
- | 5.9837 | 41000 | 5.8301 | - | - | - |
477
- | 6.0566 | 41500 | 5.7734 | - | - | - |
478
- | 6.1296 | 42000 | 5.7759 | 16.0155 | 0.6212 | - |
479
- | 6.2026 | 42500 | 5.7951 | - | - | - |
480
- | 6.2755 | 43000 | 5.8023 | - | - | - |
481
- | 6.3485 | 43500 | 5.7848 | - | - | - |
482
- | 6.4215 | 44000 | 5.7774 | 16.0796 | 0.6152 | - |
483
- | 6.4945 | 44500 | 5.7719 | - | - | - |
484
- | 6.5674 | 45000 | 5.7822 | - | - | - |
485
- | 6.6404 | 45500 | 5.7734 | - | - | - |
486
- | 6.7134 | 46000 | 5.7856 | 16.2461 | 0.6142 | - |
487
- | 6.7863 | 46500 | 5.7949 | - | - | - |
488
- | 6.8593 | 47000 | 5.8346 | - | - | - |
489
- | 6.9323 | 47500 | 5.7606 | - | - | - |
490
- | 7.0053 | 48000 | 5.7839 | 16.0556 | 0.6249 | - |
491
- | 7.0782 | 48500 | 5.7581 | - | - | - |
492
- | 7.1512 | 49000 | 5.7472 | - | - | - |
493
- | 7.2242 | 49500 | 5.7443 | - | - | - |
494
- | 7.2971 | 50000 | 5.7481 | 16.1126 | 0.6248 | - |
495
- | 7.3701 | 50500 | 5.7487 | - | - | - |
496
- | 7.4431 | 51000 | 5.7443 | - | - | - |
497
- | 7.5161 | 51500 | 5.76 | - | - | - |
498
- | 7.5890 | 52000 | 5.7353 | 16.0932 | 0.6312 | - |
499
- | 7.6620 | 52500 | 5.7632 | - | - | - |
500
- | 7.7350 | 53000 | 5.7788 | - | - | - |
501
- | 7.8079 | 53500 | 5.758 | - | - | - |
502
- | 7.8809 | 54000 | 5.7324 | 16.1470 | 0.6247 | - |
503
- | 7.9539 | 54500 | 5.7425 | - | - | - |
504
- | 8.0269 | 55000 | 5.7416 | - | - | - |
505
- | 8.0998 | 55500 | 5.7696 | - | - | - |
506
- | 8.1728 | 56000 | 5.7493 | 16.2547 | 0.6313 | - |
507
- | 8.2458 | 56500 | 5.7348 | - | - | - |
508
- | 8.3187 | 57000 | 5.7173 | - | - | - |
509
- | 8.3917 | 57500 | 5.7215 | - | - | - |
510
- | 8.4647 | 58000 | 5.7163 | 16.3313 | 0.6237 | - |
511
- | 8.5377 | 58500 | 5.722 | - | - | - |
512
- | 8.6106 | 59000 | 5.7292 | - | - | - |
513
- | 8.6836 | 59500 | 5.7295 | - | - | - |
514
- | 8.7566 | 60000 | 5.7267 | 16.3434 | 0.6261 | - |
515
- | 8.8295 | 60500 | 5.7207 | - | - | - |
516
- | 8.9025 | 61000 | 5.7252 | - | - | - |
517
- | 8.9755 | 61500 | 5.7061 | - | - | - |
518
- | 9.0485 | 62000 | 5.7113 | 16.2999 | 0.6279 | - |
519
- | 9.1214 | 62500 | 5.695 | - | - | - |
520
- | 9.1944 | 63000 | 5.7152 | - | - | - |
521
- | 9.2674 | 63500 | 5.7045 | - | - | - |
522
- | 9.3403 | 64000 | 5.6907 | 16.2782 | 0.6264 | - |
523
- | 9.4133 | 64500 | 5.7185 | - | - | - |
524
- | 9.4863 | 65000 | 5.6903 | - | - | - |
525
- | 9.5593 | 65500 | 5.705 | - | - | - |
526
- | 9.6322 | 66000 | 5.7165 | 16.3625 | 0.6249 | - |
527
- | 9.7052 | 66500 | 5.7027 | - | - | - |
528
- | 9.7782 | 67000 | 5.7048 | - | - | - |
529
- | 9.8511 | 67500 | 5.728 | - | - | - |
530
- | 9.9241 | 68000 | 5.7111 | 16.3087 | 0.6250 | - |
531
- | 9.9971 | 68500 | 5.7144 | - | - | - |
532
- | 10.0 | 68520 | - | - | - | 0.7057 |
533
-
534
- </details>
535
-
536
- ### Framework Versions
537
  - Python: 3.10.16
538
  - Sentence Transformers: 3.3.1
539
  - Transformers: 4.48.1
@@ -544,8 +216,7 @@ You can finetune this model on your own dataset.
544
 
545
  ## Citation
546
 
547
- ### BibTeX
548
-
549
  #### Sentence Transformers
550
  ```bibtex
551
  @inproceedings{reimers-2019-sentence-bert,
@@ -570,20 +241,15 @@ You can finetune this model on your own dataset.
570
  }
571
  ```
572
 
573
- <!--
574
- ## Glossary
575
 
576
- *Clearly define terms in order to be accessible across audiences.*
577
- -->
 
578
 
579
- <!--
580
  ## Model Card Authors
581
 
582
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
583
- -->
584
 
585
- <!--
586
  ## Model Card Contact
587
 
588
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
589
- -->
 
1
  ---
2
  license: mit
3
  language:
4
  - en
5
+ library_name: sentence-transformers
6
+ tags:
7
+ - ontology
8
+ - nlp
9
+ - biology
10
+ - animals
11
+ - fish
12
+ - embedding
13
+ - trait
14
+ datasets:
15
+ - imageomics/char-sim-data
16
+ metrics: # key list: https://hf.co/metrics
17
+ model_name: Trait2Vec
18
+ model_description: "Language model for embedding organismal trait descriptions. Built using Sentence-Transformer architecture and trained with trait descriptions from char-sim-data."
19
  ---
20
 
21
+ # Model Card for Trait2Vec
22
 
23
+ Trait2Vec is a model for the tree of life, built on the Sentence Transformer architecture as a language model that embeds organismal trait descriptions in a way that preserves the structure induced by a semantic similarity metric (e.g., SimGIC). The model was trained on the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data/edit/main/README.md) dataset.
24
+ Through qualitative data exploration, we observe that the cosine similarity between embeddings of raw trait descriptions is roughly proportional to the semantic similarity of their corresponding ontological representations.
25
 
26
  ## Model Details
27
 
28
  ### Model Description
29
+
30
+ <!-- Provide a longer summary of what this model is. -->
31
+
32
+ - **Developed by:** Jim Balhoff, Soumyashree Kar, Hilmar Lapp, Juan Garcia
33
+ - **Model type:** Sentence Transformer
34
+ - **Language(s) (NLP):** English
35
+ - **License:** MIT
36
+ - **Fine-tuned from model:** [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
37
 
38
  ### Model Sources
39
 
40
+ - **Repository:** [Trait2Vec](https://github.com/Imageomics/char-sim/tree/main)
 
 
41
 
42
+ ## Uses
43
 
44
+ Trait2Vec has been qualitatively evaluated for its ability to embed raw trait descriptions in a way that preserves the structure of an ontology. Accordingly, we expect it to provide an alternative computational representation of an organism's traits.
45
+
46
+ ### Direct Use
47
+
48
+ It can be used to embed the textual trait descriptions associated with an organism.
 
 
49
 
 
50
 
51
+ ## Bias, Risks, and Limitations
52
+
53
+ This model is fine-tuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) and therefore inherits its biases and risks. The training dataset, [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data/edit/main/README.md), introduces the biases of a single similarity metric and ontology. This means the embedding inherits that metric’s inductive biases, coverage gaps, and evolving definitions. Biological conclusions may differ under alternative metrics (e.g., Resnik, Jaccard) or other phenotype ontologies.
54
+
55
+ ### Recommendations
56
+
57
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
58
+
59
+ ## How to Get Started with the Model
60
 
61
  First install the Sentence Transformers library:
62
 
 
86
  # [3, 3]
87
  ```
88
 
89
  ## Training Details
90
 
91
+ ### Training Data
 
 
92
 
93
+ This model was trained on the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data/edit/main/README.md) dataset.
94
 
95
  * Size: 438,516 training samples
96
  * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
 
113
  }
114
  ```
115
 
116
+ #### Training Hyperparameters
117
+ #### Non-Default Hyperparameters
118
 
119
+ - `eval_strategy`: steps
120
+ - `per_device_train_batch_size`: 64
121
+ - `per_device_eval_batch_size`: 64
122
+ - `learning_rate`: 2e-05
123
+ - `num_train_epochs`: 10
124
+ - `warmup_ratio`: 1e-06
125
+
126
+ - **Training regime:** fp32
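As a rough sanity check on the hyperparameters above, the number of optimizer steps can be derived from the training set size and batch size stated in this card. This is a minimal sketch that assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; none of these are stated in this section.

```python
import math

# Values stated in the model card.
train_samples = 438_516
batch_size = 64        # per_device_train_batch_size
epochs = 10            # num_train_epochs
warmup_ratio = 1e-06

# Assumptions: one device, gradient accumulation of 1, partial batches kept.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Fraction of the run nominally reserved for learning-rate warmup, in steps.
warmup_budget = warmup_ratio * total_steps

print(steps_per_epoch)    # 6852
print(total_steps)        # 68520
print(warmup_budget < 1)  # True: the warmup ratio amounts to less than one step
```

Under these assumptions the run is about 68,520 steps long, which agrees with the final step in the training logs of the previous README revision, and the configured warmup ratio is small enough to be effectively no warmup.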
127
 
128
 
129
+ ## Evaluation
130
+
131
+ We tested Trait2Vec on a hold-out split of 20% of the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data/tree/main) dataset. The split was constructed so that no descriptors overlap between the training and hold-out sets.
132
+
133
+ ### Testing Data, Factors & Metrics
134
+
135
+ #### Testing Data
136
+
137
  * Size: 111,628 evaluation samples
138
  * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
139
  * Approximate statistics based on the first 1000 samples:
 
155
  }
156
  ```
157
 
158
+ #### Metrics
 
159
 
160
+ * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
 
 
 
 
 
161
 
 
 
162
 
163
+ ### Results
164
+
165
+ | Metric | Validation set | Test set |
166
+ |:--------------------|:----------|:-----------|
167
+ | pearson_cosine | 0.6082 | 0.6822 |
168
+ | **spearman_cosine** | **0.625** | **0.7057** |
169
+
170
+ #### Summary
171
+
172
+ Trait2Vec embeds organismal trait descriptors in a way that preserves some of the ranking structure induced by the similarity metric of the ontology.
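To make the two reported metrics concrete, the sketch below computes Pearson and Spearman correlation between predicted cosine similarities and gold similarity scores in plain Python. The four example values are invented for illustration and are not taken from char-sim-data; the helper names are likewise hypothetical and do not reproduce the evaluator's implementation.

```python
import math

def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical cosine similarities from a model and gold similarity scores.
predicted = [0.20, 0.50, 0.90, 0.95]
gold = [0.0, 0.3, 0.4, 1.0]

print(round(spearman(predicted, gold), 4))  # 1.0: the ordering is fully preserved
print(round(pearson(predicted, gold), 2))   # below 1: the relation is not linear
```

Spearman correlation only looks at ordering, which is why it is the natural metric for a claim about preserving ranking structure; Pearson additionally penalizes deviations from a linear relationship.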
173
+
174
+ ## Environmental Impact
175
+
176
+ Experiments were conducted on private infrastructure with a carbon efficiency of 0.432 kgCO$_2$eq/kWh. A cumulative 20 hours of computation was performed on A100 PCIe 40/80GB hardware (TDP of 250 W).
177
+
178
+ Total emissions are estimated to be 2.16 kgCO$_2$eq, of which 0 percent was directly offset.
179
+
180
+ Estimates were produced using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in:
181
+ ```bibtex
182
+ @article{lacoste2019quantifying,
183
+ title={Quantifying the Carbon Emissions of Machine Learning},
184
+ author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
185
+ journal={arXiv preprint arXiv:1910.09700},
186
+ year={2019}
187
+ }
188
+ ```
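The emissions figure in the Environmental Impact section can be reproduced from the stated quantities. This sketch assumes a single GPU drawing its full TDP for the entire 20 hours, which is a simplification rather than a measured value.

```python
# Quantities stated in the Environmental Impact section.
hours = 20
tdp_watts = 250
carbon_efficiency = 0.432  # kgCO2eq per kWh

# Assumption: one GPU running at its TDP for the whole duration.
energy_kwh = hours * tdp_watts / 1000
emissions_kg = energy_kwh * carbon_efficiency

print(energy_kwh)              # 5.0
print(round(emissions_kg, 2))  # 2.16
```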
189
+
190
+ ### Model Architecture and Objective
191
+
192
+ ```
193
+ SentenceTransformer(
194
+ (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: MPNetModel
195
+ (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
196
+ (2): Dense({'in_features': 768, 'out_features': 256, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
197
+ )
198
+ ```
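The printed module stack reads as three stages: the transformer produces one vector per token, pooling keeps the CLS token's vector (`pooling_mode_cls_token` is the only pooling mode set to `True`), and a dense layer with a Tanh activation projects it from 768 to 256 dimensions. The sketch below illustrates the last two stages in plain Python with toy dimensions (3 in, 2 out) and made-up weights; it is an illustration of the data flow, not the library's implementation.

```python
import math

def cls_pool(token_embeddings):
    """CLS pooling: keep the vector of the first token only."""
    return token_embeddings[0]

def dense_tanh(x, weight, bias):
    """Dense layer followed by Tanh: tanh(W x + b)."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weight, bias)
    ]

# Toy stand-ins: 2 tokens with 3 features each (768 in the real model),
# projected to 2 outputs (256 in the real model). All values are made up.
token_embeddings = [[1.0, 0.0, -1.0], [0.3, 0.7, 0.1]]
weight = [[0.5, 0.0, 0.0], [0.0, 0.0, 2.0]]
bias = [0.0, 0.0]

sentence_embedding = dense_tanh(cls_pool(token_embeddings), weight, bias)
print(len(sentence_embedding))  # 2
```

Because of the Tanh activation, every component of the final embedding lies strictly between -1 and 1.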
199
+ * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
200
+ ```json
201
+ {
202
+ "scale": 20.0,
203
+ "similarity_fct": "pairwise_cos_sim"
204
+ }
205
+ ```
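For intuition about the objective, the following is a minimal plain-Python sketch of a CoSENT-style ranking loss over a batch of sentence pairs, using the `scale` of 20.0 listed above. For every two pairs where the first has a higher gold score than the second, it penalizes the model when the second pair's cosine similarity exceeds the first's. The function name and the example values are hypothetical, and this is a simplified illustration rather than the library's code.

```python
import math

def cosent_loss(cos_sims, labels, scale=20.0):
    """log(1 + sum exp(scale * (cos_j - cos_i))) over all i, j with label_i > label_j."""
    terms = [0.0]  # exp(0) contributes the "1 +" inside the logarithm
    for sim_i, label_i in zip(cos_sims, labels):
        for sim_j, label_j in zip(cos_sims, labels):
            if label_i > label_j:
                # Pair i should be more similar than pair j.
                terms.append(scale * (sim_j - sim_i))
    # Numerically stable log-sum-exp.
    largest = max(terms)
    return largest + math.log(sum(math.exp(t - largest) for t in terms))

# Hypothetical batch of two sentence pairs with their cosine similarities.
cos_sims = [0.9, 0.1]
well_ordered = cosent_loss(cos_sims, labels=[1.0, 0.0])  # ranking agrees with labels
mis_ordered = cosent_loss(cos_sims, labels=[0.0, 1.0])   # ranking contradicts labels

print(well_ordered < mis_ordered)  # True
```

The loss depends only on the relative ordering of cosine similarities, not on matching the gold scores exactly, which is consistent with evaluating the model by rank correlation.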
206
+
207
+ #### Software
208
+
209
  - Python: 3.10.16
210
  - Sentence Transformers: 3.3.1
211
  - Transformers: 4.48.1
 
216
 
217
  ## Citation
218
 
219
+ **BibTeX:**
 
220
  #### Sentence Transformers
221
  ```bibtex
222
  @inproceedings{reimers-2019-sentence-bert,
 
241
  }
242
  ```
243
 
 
 
244
 
245
+ ## Acknowledgements
246
+
247
+ This work was supported by the [Imageomics Institute](https://imageomics.org), which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under [Award #2118240](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2118240) (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
248
 
 
249
  ## Model Card Authors
250
 
251
+ Juan Garcia
 
252
 
 
253
  ## Model Card Contact
254
 
255
+ [jjgarcia@cs.unc.edu](mailto:jjgarcia@cs.unc.edu)