jjgarciac and egrace479 committed
Commit f39747b · verified · 1 Parent(s): cb9b871

Add published reference citation (instead of preprint) (#2)


- Add published reference citation (instead of preprint) (67fd1bb2bec62b4c85afbea6f6abe943c029f820)
- clarify citation and fix name ref to dataset (f1002043b4cab640bb03dc171b5165a6030a4e1c)
- Update bibtex reference of Sentence-BERT (89d93998cef0b6c287e69acd4c94fa4f0385f225)


Co-authored-by: Elizabeth Campolongo <egrace479@users.noreply.huggingface.co>

Files changed (1):
  1. README.md (+11 −9)
README.md CHANGED
@@ -99,7 +99,7 @@ model_description: "Language model for embedding organismal trait descriptions.
 
 # Model Card for Trait2Vec
 
-Trait2Vec is a language model to embed organismal trait descriptions in a way that preserves the structure induced by a semantic similarity metric (e.g. SimGIC). The model was trained on the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data) dataset. It is fine-tuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2).
+Trait2Vec is a language model to embed organismal trait descriptions in a way that preserves the structure induced by a semantic similarity metric (e.g. SimGIC). The model was trained on the [Character Similarity Dataset](https://huggingface.co/datasets/imageomics/char-sim-data). It is fine-tuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2).
 Through qualitative data exploration we observe the cosine similarity between embeddings of raw trait description is proportional to the semantic similarity of their corresponding ontological representations.
 
 ## Model Details
@@ -108,7 +108,7 @@ Through qualitative data exploration we observe the cosine similarity between em
 
 <!-- Provide a longer summary of what this model is. -->
 
-- **Developed by:** Jim Balhoff, Soumyashree Kar, Hilmar Lapp, Juan Garcia
+- **Developed by:** Juan Garcia, Soumyashree Kar, Jim Balhoff, Hilmar Lapp
 - **Model type:** Sentence Transformer
 - **Language(s) (NLP):** English
 - **License:** MIT
@@ -116,7 +116,7 @@ Through qualitative data exploration we observe the cosine similarity between em
 
 ### Model Sources
 
-- **Repository:** [Trait2Vec](https://github.com/Imageomics/char-sim/)
+- **Repository:** [Imageomics/char-sim](https://github.com/Imageomics/char-sim/)
 
 ## Uses
 
@@ -129,7 +129,7 @@ It can be used to embed the textual trait descriptions associated with an organi
 
 ## Bias, Risks, and Limitations
 
-This model is finetuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), therefore it inherits its corresponding biases and risks. The training dataset([char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data)) introduces the biases of the single similarity metric and ontology. This means the embedding inherits that metric’s inductive biases, coverage gaps, and evolving definitions. Biological conclusions may differ under alternative metrics (e.g., Resnik, Jaccard) or other phenotype ontologies.
+This model is finetuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), therefore it inherits its corresponding biases and risks. The training dataset([Character Similarity Dataset](https://huggingface.co/datasets/imageomics/char-sim-data)) introduces the biases of the single similarity metric and ontology. This means the embedding inherits that metric’s inductive biases, coverage gaps, and evolving definitions. Biological conclusions may differ under alternative metrics (e.g., Resnik, Jaccard) or other phenotype ontologies.
 
 ### Recommendations
 
@@ -169,7 +169,7 @@ print(similarities.shape)
 
 ### Training Data
 
-This model was trained on the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data) dataset.
+This model was trained on the [Character Similarity Dataset](https://huggingface.co/datasets/imageomics/char-sim-data).
 
 * Size: 438,516 training samples
 * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
@@ -207,7 +207,7 @@ This model was trained on the [char-sim-data](https://huggingface.co/datasets/im
 
 ## Evaluation
 
-We tested Trait2Vec on a hold-out split of 20\% of the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data/) dataset. No descriptor overlap was ensured.
+We tested Trait2Vec on a hold-out split of 20\% of the [Character Similarity Dataset](https://huggingface.co/datasets/imageomics/char-sim-data/). No descriptor overlap was ensured.
 
 ### Testing Data, Factors & Metrics
 
@@ -304,7 +304,7 @@ If you use this model in your research, please cite both it and the source model
 
 ```bibtex
 @software{trait2vec2025,
-author = {Jim Balhoff and Soumyashree Kar and Hilmar Lapp and Juan Garcia},
+author = {Juan Garcia and Soumyashree Kar and Jim Balhoff and Hilmar Lapp},
 doi = {<doi once generated>},
 title = {Trait2Vec},
 version = {1.0.0},
@@ -324,8 +324,10 @@ If you use this model in your research, please cite both it and the source model
 booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
 month = "11",
 year = "2019",
+pages = "3982-3992",
 publisher = "Association for Computational Linguistics",
-url = "https://arxiv.org/abs/1908.10084",
+url = "https://aclanthology.org/D19-1410/",
+doi = "10.18653/v1/D19-1410"
 }
 ```
 
@@ -351,4 +353,4 @@ Juan Garcia
 
 ## Model Card Contact
 
-[jjgarcia@cs.unc.edu](mailto:jjgarcia@cs.unc.edu)
+Please open a [Discussion on the Community Tab](https://huggingface.co/imageomics/trait2vec/discussions) with any questions on the model.
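The card text in the diff states that the cosine similarity between Trait2Vec embeddings of raw trait descriptions tracks the semantic similarity of their ontological representations. A minimal sketch of that comparison, using toy vectors in place of real embeddings (with the actual model, the vectors would come from something like `SentenceTransformer("imageomics/trait2vec").encode(...)` — the model id and example descriptions here are assumptions, not from the card):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for Trait2Vec embeddings of two trait descriptions,
# e.g. "pectoral fin rounded" vs. "pectoral fin pointed" (hypothetical).
emb_a = [0.20, 0.70, 0.10]
emb_b = [0.25, 0.65, 0.05]

# Per the card, a higher cosine similarity should correspond to higher
# SimGIC-style semantic similarity of the ontological representations.
print(f"cosine similarity: {cosine_similarity(emb_a, emb_b):.3f}")
```

Since all-mpnet-base-v2 derivatives produce normalized-magnitude-friendly dense vectors, cosine similarity is the natural comparison metric, which is why the card's evaluation is framed in those terms.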