Commit 6991a45 (verified), committed by jjgarciac and egrace479 · 1 Parent(s): 8f9a87f

Fix some URLs, re-add widgets, add citation (#1)

- Fix some URLs, re-add widgets, add citation (13065327e5088ec7f5d604e0d2a3d56e1db1ca65)


Co-authored-by: Elizabeth Campolongo <egrace479@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +108 -9
README.md CHANGED
@@ -2,7 +2,83 @@
2
  license: mit
3
  language:
4
  - en
5
- library_name: sentence-transformers
6
  tags:
7
  - ontology
8
  - nlp
@@ -11,16 +87,19 @@ tags:
11
  - fish
12
  - embedding
13
  - trait
14
  datasets:
15
  - imageomics/char-sim-data
16
- metrics: # key list: https://hf.co/metrics
17
  model_name: Trait2Vec
18
- model_description: "Language model for embedding organismal trait descriptions. Built using Sentence-Transformer architecture and trained with trait descriptions from char-sim-data."
19
  ---
20
 
21
  # Model Card for Trait2Vec
22
 
23
- Trait2Vec is a language model to embed organismal trait descriptions in a way that preserves the structure induced by a semantic similarity metric (e.g. SimGIC). The model was trained on the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data/edit/main/README.md).
24
  Through qualitative data exploration we observe the cosine similarity between embeddings of raw trait description is proportional to the semantic similarity of their corresponding ontological representations.
25
 
26
  ## Model Details
@@ -37,7 +116,7 @@ Through qualitative data exploration we observe the cosine similarity between em
37
 
38
  ### Model Sources
39
 
40
- - **Repository:** [Trait2Vec](https://github.com/Imageomics/char-sim/tree/main)
41
 
42
  ## Uses
43
 
@@ -50,7 +129,7 @@ It can be used to embed the textual trait descriptions associated with an organi
50
 
51
  ## Bias, Risks, and Limitations
52
 
53
- This model is finetuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), therefore it inherets its corresponding biases and risks. The training dataset[char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data/edit/main/README.md) introduces the biases of the single similarity metric and ontology. This means the embedding inherits that metric’s inductive biases, coverage gaps, and evolving definitions. Biological conclusions may differ under alternative metrics (e.g., Resnik, Jaccard) or other phenotype ontologies.
54
 
55
  ### Recommendations
56
 
@@ -90,7 +169,7 @@ print(similarities.shape)
90
 
91
  ### Training Data
92
 
93
- This model was trained on the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data/edit/main/README.md) dataset.
94
 
95
  * Size: 438,516 training samples
96
  * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
@@ -128,7 +207,7 @@ This model was trained on the [char-sim-data](https://huggingface.co/datasets/im
128
 
129
  ## Evaluation
130
 
131
- We tested Trait2Vec on a hold-out split of 20\% of the ['char-sim-data'](https://huggingface.co/datasets/imageomics/char-sim-data/tree/main) dataset. No descriptor overlap was ensured.
132
 
133
  ### Testing Data, Factors & Metrics
134
 
@@ -157,6 +236,9 @@ We tested Trait2Vec on a hold-out split of 20\% of the ['char-sim-data'](https:/
157
 
158
  #### Metrics
159
 
160
  * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
161
 
162
 
@@ -216,7 +298,24 @@ SentenceTransformer(
216
 
217
  ## Citation
218
 
219
- **BibTeX:**
220
  #### Sentence Transformers
221
  ```bibtex
222
  @inproceedings{reimers-2019-sentence-bert,
 
2
  license: mit
3
  language:
4
  - en
5
+ widget:
6
+ - source_sentence: 'Ventral humeral ridge: or not'
7
+ sentences:
8
+ - >-
9
+ If metasternum ossified, shape: long, narrow and tapering markedly
10
+ anteriorly to posteriorly, length up to 3.5 times maximum width
11
+ - >-
12
+ Astragalus, dorsolateral margin:: overlaps the anterior and posterior
13
+ portions of the calcaneum equally
14
+ - 'Ulna size: does not apply'
15
+ - source_sentence: >-
16
+ Form of distal portion of anteroventral process of ectopterygoid: varyingly
17
+ falcate
18
+ sentences:
19
+ - 'Middle and distal radials in dorsal and anal fins: absent'
20
+ - >-
21
+ Degree of development of primitively medial portion of fourth upper
22
+ pharyngeal tooth-plate: fourth upper pharyngeal tooth-plate covers ventral,
23
+ posterior, dorsal and sometimes anterior surfaces of fourth
24
+ infrapharyngobranchial
25
+ - 'Shape of pharyngeal apophysis (basioccipital): forked anteriorly'
26
+ - source_sentence: >-
27
+ Form of distal portion of anteroventral process of ectopterygoid: varyingly
28
+ falcate
29
+ sentences:
30
+ - 'parhypural: present'
31
+ - 'Epural: heavy'
32
+ - 'First infraorbital: short'
33
+ - source_sentence: >-
34
+ Form of distal portion of anteroventral process of ectopterygoid: varyingly
35
+ falcate
36
+ sentences:
37
+ - 'Dentary and angular: touch'
38
+ - 'Urohyal and first basibranchial: firmly attached'
39
+ - 'Supraneural 3-4 (nonadditive): absent'
40
+ - source_sentence: >-
41
+ Form of distal portion of anteroventral process of ectopterygoid: varyingly
42
+ falcate
43
+ sentences:
44
+ - 'Ventral diverging lamellae of mesethmoid: lamellae reduced or absent'
45
+ - 'Ventral ridge of the coracoid with a posterior process: absent'
46
+ - 'carpals: fully or partially ossified'
47
+ pipeline_tag: sentence-similarity
48
+ library_name: sentence-transformers
49
+ base_model: sentence-transformers/all-mpnet-base-v2
50
+ metrics:
51
+ - pearson_cosine
52
+ - spearman_cosine
53
+ model-index:
54
+ - name: SentenceTransformer based on sentence-transformers/all-mpnet-base-v2
55
+ results:
56
+ - task:
57
+ type: semantic-similarity
58
+ name: Semantic Similarity
59
+ dataset:
60
+ name: pheno dev
61
+ type: pheno-dev
62
+ metrics:
63
+ - type: pearson_cosine
64
+ value: 0.6082332469417436
65
+ name: Pearson Cosine
66
+ - type: spearman_cosine
67
+ value: 0.6250387873495056
68
+ name: Spearman Cosine
69
+ - task:
70
+ type: semantic-similarity
71
+ name: Semantic Similarity
72
+ dataset:
73
+ name: pheno test
74
+ type: pheno-test
75
+ metrics:
76
+ - type: pearson_cosine
77
+ value: 0.6822053314599665
78
+ name: Pearson Cosine
79
+ - type: spearman_cosine
80
+ value: 0.705688010939619
81
+ name: Spearman Cosine
82
  tags:
83
  - ontology
84
  - nlp
 
87
  - fish
88
  - embedding
89
  - trait
90
+ - sentence-transformers
91
+ - sentence-similarity
92
+ - feature-extraction
93
+ - loss:CoSENTLoss
94
  datasets:
95
  - imageomics/char-sim-data
 
96
  model_name: Trait2Vec
97
+ model_description: "Language model for embedding organismal trait descriptions. Built using Sentence-Transformer architecture and trained with trait descriptions from Imageomics/char-sim-data."
98
  ---
99
 
100
  # Model Card for Trait2Vec
101
 
102
+ Trait2Vec is a language model to embed organismal trait descriptions in a way that preserves the structure induced by a semantic similarity metric (e.g. SimGIC). The model was trained on the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data) dataset. It is fine-tuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2). This was removed, should it have been?>>>It maps sentences & paragraphs to a 256-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
103
  Through qualitative data exploration we observe the cosine similarity between embeddings of raw trait description is proportional to the semantic similarity of their corresponding ontological representations.
104
 
105
  ## Model Details
 
116
 
117
  ### Model Sources
118
 
119
+ - **Repository:** [Trait2Vec](https://github.com/Imageomics/char-sim/)
120
 
121
  ## Uses
122
 
 
129
 
130
  ## Bias, Risks, and Limitations
131
 
132
+ This model is finetuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), therefore it inherits its corresponding biases and risks. The training dataset([char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data)) introduces the biases of the single similarity metric and ontology. This means the embedding inherits that metric’s inductive biases, coverage gaps, and evolving definitions. Biological conclusions may differ under alternative metrics (e.g., Resnik, Jaccard) or other phenotype ontologies.
133
 
134
  ### Recommendations
135
 
 
169
 
170
  ### Training Data
171
 
172
+ This model was trained on the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data) dataset.
173
 
174
  * Size: 438,516 training samples
175
  * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
 
207
 
208
  ## Evaluation
209
 
210
+ We tested Trait2Vec on a hold-out split of 20\% of the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data/) dataset. No descriptor overlap was ensured.
211
 
212
  ### Testing Data, Factors & Metrics
213
 
 
236
 
237
  #### Metrics
238
 
239
+ **Semantic Similarity:**
240
+
241
+ * Datasets: `pheno-dev` and `pheno-test`
242
  * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
243
 
244
 
 
298
 
299
  ## Citation
300
 
301
+ If you use this model in your research, please cite both it and the source model & method from which it was fine-tuned:
302
+
303
+ ### Model
304
+
305
+ ```bibtex
306
+ @software{trait2vec2025,
307
+ author = {Jim Balhoff and Soumyashree Kar and Hilmar Lapp and Juan Garcia},
308
+ doi = {<doi once generated>},
309
+ title = {Trait2Vec},
310
+ version = {1.0.0},
311
+ year = {2025},
312
+ url = {https://huggingface.co/imageomics/trait2vec}
313
+ }
314
+ ```
315
+
316
+
317
+ ### Source Model & Method
318
+
319
  #### Sentence Transformers
320
  ```bibtex
321
  @inproceedings{reimers-2019-sentence-bert,