pranamanam committed on
Commit 580ec07 · verified · 1 Parent(s): ca19fb2

Update README.md

Files changed (1)
  1. README.md +25 -5
README.md CHANGED
@@ -6,12 +6,32 @@ license: cc-by-nc-nd-4.0
  In this work, we introduce **FusOn-pLM**, a novel pLM that fine-tunes state-of-the-art ESM-2 embeddings via masked language modeling (MLM) on fusion oncoprotein sequences, which drive a large portion of pediatric cancers but are heavily disordered and undruggable. We specifically introduce a novel MLM strategy, employing a binding-site probability predictor to focus masking on key amino acid residues, thereby yielding improved fusion oncoprotein-aware embeddings. Our model improves performance on both fusion oncoprotein-specific benchmarks and disorder prediction tasks in comparison to baseline ESM-2 representations, as well as manually constructed biophysical embeddings, motivating downstream usage of FusOn-pLM embeddings for therapeutic design tasks targeting these fusions.
 
 
- # How to Use FusOn-pLM
 
  ```
- # Load model directly
- from transformers import AutoTokenizer, AutoModelForMaskedLM
 
- tokenizer = AutoTokenizer.from_pretrained("ChatterjeeLab/FusOn-pLM")
- model = AutoModelForMaskedLM.from_pretrained("ChatterjeeLab/FusOn-pLM")
  ```
 
+ # How to generate FusOn-pLM embeddings for your fusion oncoprotein
 
  ```
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+
+ # Load the tokenizer and model
+ model_name = "ChatterjeeLab/FusOn-pLM"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModel.from_pretrained(model_name)
+
+ # Example fusion oncoprotein sequence
+ sequence = "MKTAYIAKQRQISFVKSHFSRQDILDLWIYHTQGYFPDWQNYTPGLLVEVEVMEVAYGAKMKEGVLI"
+
+ # Tokenize the input sequence
+ inputs = tokenizer(sequence, return_tensors="pt")
+
+ # Get the embeddings
+ with torch.no_grad():
+     outputs = model(**inputs)
+     # The embeddings are in the last_hidden_state tensor
+     embeddings = outputs.last_hidden_state
+
+ # Convert embeddings to numpy array (if needed)
+ embeddings = embeddings.squeeze(0).numpy()
+
+ print("Per-residue embeddings shape:", embeddings.shape)
 
  ```
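
Reviewer note: the added snippet prints a "per-residue" shape, but ESM-style tokenizers typically wrap the sequence in a start token (`<cls>`) and an end token (`<eos>`), so the first dimension is likely `len(sequence) + 2` rather than `len(sequence)`. The sketch below shows one way a reader might strip those two rows and mean-pool the rest into a single sequence-level vector. It is not part of this commit: the helper name `pool_residue_embeddings` is hypothetical, the special-token layout is an assumption to verify against the actual tokenizer output, and a random array stands in for the real `embeddings` so the example runs without downloading the model.

```python
import numpy as np


def pool_residue_embeddings(token_embeddings: np.ndarray, sequence_length: int) -> np.ndarray:
    """Mean-pool residue vectors into one sequence-level vector.

    Assumption (not confirmed by the README): row 0 is a start token and
    row sequence_length + 1 is an end token, so residues occupy rows
    1..sequence_length inclusive.
    """
    if token_embeddings.ndim != 2:
        raise ValueError("expected a 2-D array of shape (tokens, hidden_dim)")
    if sequence_length < 1:
        raise ValueError("sequence_length must be at least 1")
    if token_embeddings.shape[0] < sequence_length + 2:
        raise ValueError("array is too short to hold the residues plus two special tokens")
    # Keep only the rows that correspond to amino acid residues.
    residues = token_embeddings[1 : sequence_length + 1]
    return residues.mean(axis=0)


# Random stand-in for the `embeddings` array from the README snippet.
# The hidden size of 8 is arbitrary and chosen only to keep the example small.
rng = np.random.default_rng(0)
sequence = "MKTAYIAKQR"
fake_embeddings = rng.normal(size=(len(sequence) + 2, 8)).astype(np.float32)

pooled = pool_residue_embeddings(fake_embeddings, len(sequence))
print("Sequence-level embedding shape:", pooled.shape)
```

With the real model, the same call would be `pool_residue_embeddings(embeddings, len(sequence))`. Mean pooling is only one common choice; whether it suits a given downstream task is not something this commit addresses.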