ChatterjeeLab/FusOn-pLM

pranamanam committed on Jun 3, 2024

Commit 580ec07 · verified · 1 parent: ca19fb2

Update README.md

Files changed (1)

README.md +25 -5

@@ -6,12 +6,32 @@ license: cc-by-nc-nd-4.0
 In this work, we introduce **FusOn-pLM**, a novel pLM that fine-tunes state-of-the-art ESM-2 embeddings on fusion oncoprotein sequences, those that drive a large portion of pediatric cancers but are heavily disordered and undruggable, via masked language modeling (MLM). We specifically introduce a novel MLM strategy, employing a binding-site probability predictor to focus masking on key amino acid residues, thereby generating more optimal fusion oncoprotein-aware embeddings. Our model improves performance on both fusion oncoprotein-specific benchmarks and disorder prediction tasks in comparison to baseline ESM-2 representations, as well as manually-constructed biophysical embeddings, motivating downstream usage of FusOn-pLM embeddings for therapeutic design tasks targeting these fusions.
-# How to Use FusOn-pLM
 ```
-# Load model directly
-from transformers import AutoTokenizer, AutoModelForMaskedLM
-tokenizer = AutoTokenizer.from_pretrained("ChatterjeeLab/FusOn-pLM")
-model = AutoModelForMaskedLM.from_pretrained("ChatterjeeLab/FusOn-pLM")
 ```

 In this work, we introduce **FusOn-pLM**, a novel pLM that fine-tunes state-of-the-art ESM-2 embeddings on fusion oncoprotein sequences, those that drive a large portion of pediatric cancers but are heavily disordered and undruggable, via masked language modeling (MLM). We specifically introduce a novel MLM strategy, employing a binding-site probability predictor to focus masking on key amino acid residues, thereby generating more optimal fusion oncoprotein-aware embeddings. Our model improves performance on both fusion oncoprotein-specific benchmarks and disorder prediction tasks in comparison to baseline ESM-2 representations, as well as manually-constructed biophysical embeddings, motivating downstream usage of FusOn-pLM embeddings for therapeutic design tasks targeting these fusions.
+# How to generate FusOn-pLM embeddings for your fusion oncoprotein
 ```
+from transformers import AutoTokenizer, AutoModel
+import torch
+# Load the tokenizer and model
+model_name = "ChatterjeeLab/FusOn-pLM"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModel.from_pretrained(model_name)
+# Example fusion oncoprotein sequence
+sequence = "MKTAYIAKQRQISFVKSHFSRQDILDLWIYHTQGYFPDWQNYTPGLLVEVEVMEVAYGAKMKEGVLI"
+# Tokenize the input sequence
+inputs = tokenizer(sequence, return_tensors="pt")
+# Get the embeddings
+with torch.no_grad():
+    outputs = model(**inputs)
+    # The embeddings are in the last_hidden_state tensor
+    embeddings = outputs.last_hidden_state
+# Convert embeddings to numpy array (if needed)
+embeddings = embeddings.squeeze(0).numpy()
+print("Per-residue embeddings shape:", embeddings.shape)
 ```
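
The array printed by the snippet above has one row per token, not strictly one row per residue. Assuming the FusOn-pLM tokenizer follows the ESM-2 convention of adding a start token and an end token around the sequence, the first and last rows belong to those special tokens. The sketch below shows one possible way to drop them and, if a single vector per protein is wanted, mean-pool the remaining rows. The helper names `strip_special_tokens` and `mean_pool` are hypothetical, and a random array stands in for the real model output so the example runs without downloading the model.

```python
import numpy as np


def strip_special_tokens(token_embeddings: np.ndarray) -> np.ndarray:
    """Drop the first and last rows of a (tokens, hidden_size) array.

    Assumption: the tokenizer adds exactly one start token and one end
    token around the sequence, as the ESM-2 tokenizer does.
    """
    if token_embeddings.ndim != 2 or token_embeddings.shape[0] < 3:
        raise ValueError("expected a (sequence_length + 2, hidden_size) array")
    return token_embeddings[1:-1]


def mean_pool(residue_embeddings: np.ndarray) -> np.ndarray:
    """Average over the residue axis to get one vector per sequence."""
    return residue_embeddings.mean(axis=0)


# Stand-in for `embeddings` from the snippet above (hypothetical sizes).
rng = np.random.default_rng(0)
sequence_length, hidden_size = 5, 8
token_embeddings = rng.normal(size=(sequence_length + 2, hidden_size)).astype(np.float32)

residue_embeddings = strip_special_tokens(token_embeddings)
sequence_embedding = mean_pool(residue_embeddings)

print("Per-residue embeddings shape:", residue_embeddings.shape)
print("Sequence-level embedding shape:", sequence_embedding.shape)
```

With the real model output, pass the `embeddings` array from the snippet above in place of `token_embeddings`. Mean pooling is only one common choice for a sequence-level summary; the commit itself does not prescribe a pooling method.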