Add published reference citation (instead of preprint) (#2)

Commits:
- Add published reference citation (instead of preprint) (67fd1bb2bec62b4c85afbea6f6abe943c029f820)
- clarify citation and fix name ref to dataset (f1002043b4cab640bb03dc171b5165a6030a4e1c)
- Update bibtex reference of Sentence-BERT (89d93998cef0b6c287e69acd4c94fa4f0385f225)

Co-authored-by: Elizabeth Campolongo <egrace479@users.noreply.huggingface.co>
README.md CHANGED
````diff
@@ -99,7 +99,7 @@ model_description: "Language model for embedding organismal trait descriptions.
 
 # Model Card for Trait2Vec
 
-Trait2Vec is a language model to embed organismal trait descriptions in a way that preserves the structure induced by a semantic similarity metric (e.g. SimGIC). The model was trained on the [
+Trait2Vec is a language model to embed organismal trait descriptions in a way that preserves the structure induced by a semantic similarity metric (e.g. SimGIC). The model was trained on the [Character Similarity Dataset](https://huggingface.co/datasets/imageomics/char-sim-data). It is fine-tuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2).
 Through qualitative data exploration we observe the cosine similarity between embeddings of raw trait description is proportional to the semantic similarity of their corresponding ontological representations.
 
 ## Model Details
@@ -108,7 +108,7 @@ Through qualitative data exploration we observe the cosine similarity between em
 
 <!-- Provide a longer summary of what this model is. -->
 
-- **Developed by:**
+- **Developed by:** Juan Garcia, Soumyashree Kar, Jim Balhoff, Hilmar Lapp
 - **Model type:** Sentence Transformer
 - **Language(s) (NLP):** English
 - **License:** MIT
@@ -116,7 +116,7 @@ Through qualitative data exploration we observe the cosine similarity between em
 
 ### Model Sources
 
-- **Repository:** [
+- **Repository:** [Imageomics/char-sim](https://github.com/Imageomics/char-sim/)
 
 ## Uses
 
@@ -129,7 +129,7 @@ It can be used to embed the textual trait descriptions associated with an organi
 
 ## Bias, Risks, and Limitations
 
-This model is finetuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), therefore it inherits its corresponding biases and risks. The training dataset([
+This model is finetuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), therefore it inherits its corresponding biases and risks. The training dataset ([Character Similarity Dataset](https://huggingface.co/datasets/imageomics/char-sim-data)) introduces the biases of the single similarity metric and ontology. This means the embedding inherits that metric’s inductive biases, coverage gaps, and evolving definitions. Biological conclusions may differ under alternative metrics (e.g., Resnik, Jaccard) or other phenotype ontologies.
 
 ### Recommendations
 
@@ -169,7 +169,7 @@ print(similarities.shape)
 
 ### Training Data
 
-This model was trained on the [
+This model was trained on the [Character Similarity Dataset](https://huggingface.co/datasets/imageomics/char-sim-data).
 
 * Size: 438,516 training samples
 * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
@@ -207,7 +207,7 @@ This model was trained on the [char-sim-data](https://huggingface.co/datasets/im
 
 ## Evaluation
 
-We tested Trait2Vec on a hold-out split of 20\% of the [
+We tested Trait2Vec on a hold-out split of 20\% of the [Character Similarity Dataset](https://huggingface.co/datasets/imageomics/char-sim-data/). No descriptor overlap was ensured.
 
 ### Testing Data, Factors & Metrics
 
@@ -304,7 +304,7 @@ If you use this model in your research, please cite both it and the source model
 
 ```bibtex
 @software{trait2vec2025,
-  author = {
+  author = {Juan Garcia and Soumyashree Kar and Jim Balhoff and Hilmar Lapp},
   doi = {<doi once generated>},
   title = {Trait2Vec},
   version = {1.0.0},
@@ -324,8 +324,10 @@ If you use this model in your research, please cite both it and the source model
   booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
   month = "11",
   year = "2019",
+  pages = "3982-3992",
   publisher = "Association for Computational Linguistics",
-  url = "https://
+  url = "https://aclanthology.org/D19-1410/",
+  doi = "10.18653/v1/D19-1410"
 }
 ```
 
@@ -351,4 +353,4 @@ Juan Garcia
 
 ## Model Card Contact
 
-[
+Please open a [Discussion on the Community Tab](https://huggingface.co/imageomics/trait2vec/discussions) with any questions on the model.
````