yosshstd committed on
Commit a5bbba4 · verified · 1 Parent(s): 3951c2a

Update README.md

Files changed (1):
  1. README.md +46 -26
README.md CHANGED
@@ -1,4 +1,5 @@
  ---
  tags:
  - sentence-transformers
  - sentence-similarity
@@ -6,11 +7,15 @@ tags:
  - dense
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  ---

- # SentenceTransformer

- This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

  ## Model Details

@@ -24,12 +29,6 @@ This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps
  <!-- - **Language:** Unknown -->
  <!-- - **License:** Unknown -->

- ### Model Sources
-
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-
  ### Full Model Architecture

  ```
@@ -56,25 +55,38 @@ Then you can load this model and run inference.
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
- model = SentenceTransformer("sentence_transformers_model_id")
  # Run inference
- sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 1024]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities)
- # tensor([[1.0000, 0.7137, 0.6395],
- #         [0.7137, 1.0000, 0.5949],
- #         [0.6395, 0.5949, 1.0000]])
  ```

  <!--
  ### Direct Usage (Transformers)

@@ -123,8 +135,16 @@ You can finetune this model on your own dataset.
  - Tokenizers: 0.21.2

  ## Citation
-
- ### BibTeX

  <!--
  ## Glossary
 
  ---
+ license: mit
  tags:
  - sentence-transformers
  - sentence-similarity

  - dense
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
+ base_model:
+ - westlake-repl/ProTrek_650M_UniRef50
  ---

+ # ProTrek_650M_UniRef50_text_encoder
+ This model is a SentenceTransformer-compatible version of the text encoder from ProTrek_650M_UniRef50. It has been converted for use with the sentence-transformers library, enabling easy integration into semantic similarity tasks such as semantic search, clustering, and feature extraction.

+ **GitHub repo:** https://github.com/westlake-repl/ProTrek
+ **Hugging Face repo:** https://huggingface.co/westlake-repl/ProTrek_650M_UniRef50

  ## Model Details

  <!-- - **Language:** Unknown -->
  <!-- - **License:** Unknown -->

  ### Full Model Architecture

  ```
 
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
+ protein_encoder = SentenceTransformer("yosshstd/ProTrek_650M_UniRef50_protein_encoder")
+ text_encoder = SentenceTransformer("yosshstd/ProTrek_650M_UniRef50_text_encoder")
+ structure_encoder = SentenceTransformer("yosshstd/ProTrek_650M_UniRef50_structure_encoder")
  # Run inference
+ ProTrek_650M_UniRef50_temperature = 0.0186767578
+ def sim(a, b): return (a @ b.T / ProTrek_650M_UniRef50_temperature).item()
+
+ aa_seq = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
+ text = "Insulin decreases blood glucose concentration. It increases cell permeability to monosaccharides, amino acids and fatty acids. It accelerates glycolysis, the pentose phosphate cycle, and glycogen synthesis in liver."
+ foldseek_seq = "DVVVVVVVVVVVVCVVPPDDPVPPFDFDFDADVVLVVLLCVLLVPLAFDDDDPDPVVVVVVVVDDDPPDDDPPDPDPDPPVVVVVVVVDDCSVVRRVGIDGSVSSNVRGD".lower()  # 3Di structure string, lowercased as the structure tokenizer expects
+
+ seq_emb = protein_encoder.encode([aa_seq], convert_to_tensor=True)
+ text_emb = text_encoder.encode([text], convert_to_tensor=True)
+ struc_emb = structure_encoder.encode([foldseek_seq], convert_to_tensor=True)
+ print("Seq-Text similarity:", sim(seq_emb, text_emb))
+ print("Seq-Structure similarity:", sim(seq_emb, struc_emb))
+ print("Text-Structure similarity:", sim(text_emb, struc_emb))
  ```
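The `sim` helper above divides the dot product of two embeddings by the learned temperature (0.0186767578), turning cosine similarities into sharp logits; taking a softmax over several candidates then yields CLIP-style retrieval probabilities. Below is a minimal, dependency-free sketch of that scoring step, using toy unit vectors in place of real encoder outputs (the function names are illustrative, not part of the model's API):

```python
import math

# Temperature from the usage snippet above; embeddings are assumed L2-normalized.
TEMPERATURE = 0.0186767578

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieval_probs(query, candidates, temperature=TEMPERATURE):
    """Softmax over temperature-scaled similarities (CLIP-style scoring)."""
    logits = [dot(query, c) / temperature for c in candidates]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy unit vectors standing in for encoder outputs: the first candidate
# is identical to the query, the second is orthogonal.
probs = retrieval_probs([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Because the temperature is small, even modest similarity gaps produce near-one-hot probabilities, which is why the raw `sim` scores in the snippet above can be large numbers.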

+ ## Overview
+ ProTrek is a tri-modal model that integrates protein sequence, protein structure, and text information for better
+ protein understanding. It adopts contrastive learning to align the representations of the three modalities:
+ during the pre-training phase, an InfoNCE loss is calculated for each pair of modalities, as [CLIP](https://arxiv.org/abs/2103.00020)
+ does.
+
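For each modality pair, the CLIP-style InfoNCE objective treats the matched pair in a batch as the positive and all other combinations as negatives, averaging the cross-entropy over both directions. A rough, dependency-free sketch of that loss, assuming pre-normalized embeddings (the function and the temperature value here are illustrative, not the trained ones):

```python
import math

def info_nce(emb_a, emb_b, temperature=0.05):
    """Symmetric InfoNCE: average of a->b and b->a cross-entropy,
    where index i in one batch is the positive for index i in the other."""
    logits = [[sum(x * y for x, y in zip(a, b)) / temperature for b in emb_b]
              for a in emb_a]

    def xent(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # log-sum-exp with max subtraction for stability
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*logits)]  # transpose for the b->a direction
    return 0.5 * (xent(logits) + xent(cols))

# Toy batches of unit embeddings: aligned pairs give a near-zero loss,
# shuffled (mismatched) pairs give a large one.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```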
+ ## Model architecture
+ **Protein sequence encoder**: [esm2_t33_650M_UR50D](https://huggingface.co/facebook/esm2_t33_650M_UR50D)
+
+ **Protein structure encoder**: foldseek_t30_150M (identical architecture to ESM-2, except that the vocabulary contains only 3Di structure tokens)
+
+ **Text encoder**: [BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext)
+
  <!--
  ### Direct Usage (Transformers)
 
 
  - Tokenizers: 0.21.2

  ## Citation
+ ```
+ @article{su2024protrek,
+   title={ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning},
+   author={Su, Jin and Zhou, Xibin and Zhang, Xuting and Yuan, Fajie},
+   journal={bioRxiv},
+   pages={2024--05},
+   year={2024},
+   publisher={Cold Spring Harbor Laboratory}
+ }
+ ```

  <!--
  ## Glossary