imageomics/trait2vec

- Fix some URLs, re-add widgets, add citation (13065327e5088ec7f5d604e0d2a3d56e1db1ca65)

Co-authored-by: Elizabeth Campolongo <egrace479@users.noreply.huggingface.co>

Files changed (1)

README.md +108 -9

@@ -2,7 +2,83 @@
 license: mit
 language:
 - en
-library_name: sentence-transformers
 tags:
 - ontology
 - nlp
@@ -11,16 +87,19 @@ tags:
 - fish
 - embedding
 - trait
 datasets:
 - imageomics/char-sim-data
-metrics: # key list: https://hf.co/metrics
 model_name: Trait2Vec
-model_description: "Language model for embedding organismal trait descriptions. Built using Sentence-Transformer architecture and trained with trait descriptions from char-sim-data."
 ---
 # Model Card for Trait2Vec
-Trait2Vec is a language model to embed organismal trait descriptions in a way that preserves the structure induced by a semantic similarity metric (e.g. SimGIC). The model was trained on the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data/edit/main/README.md).
 Through qualitative data exploration we observe the cosine similarity between embeddings of raw trait description is proportional to the semantic similarity of their corresponding ontological representations.
 ## Model Details
@@ -37,7 +116,7 @@ Through qualitative data exploration we observe the cosine similarity between em
 ### Model Sources
-- **Repository:** [Trait2Vec](https://github.com/Imageomics/char-sim/tree/main)
 ## Uses
@@ -50,7 +129,7 @@ It can be used to embed the textual trait descriptions associated with an organi
 ## Bias, Risks, and Limitations
-This model is finetuned from  [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), therefore it inherets its corresponding biases and risks. The training dataset[char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data/edit/main/README.md) introduces the biases of the single similarity metric and ontology. This means the embedding inherits that metric’s inductive biases, coverage gaps, and evolving definitions. Biological conclusions may differ under alternative metrics (e.g., Resnik, Jaccard) or other phenotype ontologies.
 ### Recommendations
@@ -90,7 +169,7 @@ print(similarities.shape)
 ### Training Data
-This model was trained on the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data/edit/main/README.md) dataset.
 * Size: 438,516 training samples
 * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
@@ -128,7 +207,7 @@ This model was trained on the [char-sim-data](https://huggingface.co/datasets/im
 ## Evaluation
-We tested Trait2Vec on a hold-out split of 20\% of the ['char-sim-data'](https://huggingface.co/datasets/imageomics/char-sim-data/tree/main) dataset. No descriptor overlap was ensured.
 ### Testing Data, Factors & Metrics
@@ -157,6 +236,9 @@ We tested Trait2Vec on a hold-out split of 20\% of the ['char-sim-data'](https:/
 #### Metrics
 * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
@@ -216,7 +298,24 @@ SentenceTransformer(
 ## Citation
-**BibTeX:**
 #### Sentence Transformers
 ```bibtex
 @inproceedings{reimers-2019-sentence-bert,

 license: mit
 language:
 - en
+widget:
+- source_sentence: 'Ventral humeral ridge: or not'
+  sentences:
+  - >-
+    If metasternum ossified, shape: long, narrow and tapering markedly
+    anteriorly to posteriorly, length up to 3.5 times maximum width
+  - >-
+    Astragalus, dorsolateral margin:: overlaps the anterior and posterior
+    portions of the calcaneum equally
+  - 'Ulna size: does not apply'
+- source_sentence: >-
+    Form of distal portion of anteroventral process of ectopterygoid: varyingly
+    falcate
+  sentences:
+  - 'Middle and distal radials in dorsal and anal fins: absent'
+  - >-
+    Degree of development of primitively medial portion of fourth upper
+    pharyngeal tooth-plate: fourth upper pharyngeal tooth-plate covers ventral,
+    posterior, dorsal and sometimes anterior surfaces of fourth
+    infrapharyngobranchial
+  - 'Shape of pharyngeal apophysis (basioccipital): forked anteriorly'
+- source_sentence: >-
+    Form of distal portion of anteroventral process of ectopterygoid: varyingly
+    falcate
+  sentences:
+  - 'parhypural: present'
+  - 'Epural: heavy'
+  - 'First infraorbital: short'
+- source_sentence: >-
+    Form of distal portion of anteroventral process of ectopterygoid: varyingly
+    falcate
+  sentences:
+  - 'Dentary and angular: touch'
+  - 'Urohyal and first basibranchial: firmly attached'
+  - 'Supraneural 3-4 (nonadditive): absent'
+- source_sentence: >-
+    Form of distal portion of anteroventral process of ectopterygoid: varyingly
+    falcate
+  sentences:
+  - 'Ventral diverging lamellae of mesethmoid: lamellae reduced or absent'
+  - 'Ventral ridge of the coracoid with a posterior process: absent'
+  - 'carpals: fully or partially ossified'
+pipeline_tag: sentence-similarity
+library_name: sentence-transformers
+base_model: sentence-transformers/all-mpnet-base-v2
+metrics:
+- pearson_cosine
+- spearman_cosine
+model-index:
+- name: SentenceTransformer based on sentence-transformers/all-mpnet-base-v2
+  results:
+  - task:
+      type: semantic-similarity
+      name: Semantic Similarity
+    dataset:
+      name: pheno dev
+      type: pheno-dev
+    metrics:
+    - type: pearson_cosine
+      value: 0.6082332469417436
+      name: Pearson Cosine
+    - type: spearman_cosine
+      value: 0.6250387873495056
+      name: Spearman Cosine
+  - task:
+      type: semantic-similarity
+      name: Semantic Similarity
+    dataset:
+      name: pheno test
+      type: pheno-test
+    metrics:
+    - type: pearson_cosine
+      value: 0.6822053314599665
+      name: Pearson Cosine
+    - type: spearman_cosine
+      value: 0.705688010939619
+      name: Spearman Cosine
 tags:
 - ontology
 - nlp
 - fish
 - embedding
 - trait
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+- loss:CoSENTLoss
 datasets:
 - imageomics/char-sim-data
 model_name: Trait2Vec
+model_description: "Language model for embedding organismal trait descriptions. Built using Sentence-Transformer architecture and trained with trait descriptions from Imageomics/char-sim-data."
 ---
 # Model Card for Trait2Vec
+Trait2Vec is a language model to embed organismal trait descriptions in a way that preserves the structure induced by a semantic similarity metric (e.g. SimGIC). The model was trained on the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data) dataset. It is fine-tuned from [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2). This was removed, should it have been?>>>It maps sentences & paragraphs to a 256-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
 Through qualitative data exploration we observe the cosine similarity between embeddings of raw trait description is proportional to the semantic similarity of their corresponding ontological representations.
 ## Model Details
 ### Model Sources
+- **Repository:** [Trait2Vec](https://github.com/Imageomics/char-sim/)
 ## Uses
 ## Bias, Risks, and Limitations
+This model is finetuned from  [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), therefore it inherits its corresponding biases and risks. The training dataset([char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data)) introduces the biases of the single similarity metric and ontology. This means the embedding inherits that metric’s inductive biases, coverage gaps, and evolving definitions. Biological conclusions may differ under alternative metrics (e.g., Resnik, Jaccard) or other phenotype ontologies.
 ### Recommendations
 ### Training Data
+This model was trained on the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data) dataset.
 * Size: 438,516 training samples
 * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>score</code>
 ## Evaluation
+We tested Trait2Vec on a hold-out split of 20\% of the [char-sim-data](https://huggingface.co/datasets/imageomics/char-sim-data/) dataset. No descriptor overlap was ensured.
 ### Testing Data, Factors & Metrics
 #### Metrics
+**Semantic Similarity:**
+* Datasets: `pheno-dev` and `pheno-test`
 * Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)
 ## Citation
+If you use this model in your research, please cite both it and the source model & method from which it was fine-tuned:
+### Model
+```bibtex
+@software{trait2vec2025,
+  author = {Jim Balhoff and Soumyashree Kar and Hilmar Lapp and Juan Garcia},
+  doi = {<doi once generated>},
+  title = {Trait2Vec},
+  version = {1.0.0},
+  year = {2025},
+  url = {https://huggingface.co/imageomics/trait2vec}
+}
+```
+### Source Model & Method
 #### Sentence Transformers
 ```bibtex
 @inproceedings{reimers-2019-sentence-bert,