isikz/esm1b_phosphosite_msa_data_mlm_objective


isikz committed on Feb 16, 2025

Commit 71d5cfe · verified · 1 Parent(s): 5b6f5e1

Update README.md

Files changed (1)

README.md +5 -4

README.md CHANGED

@@ -9,7 +9,7 @@ base_model:
 ## **Fine-Tuning ESM-1b with Multiple Sequence Alignment (MSA) for Phosphosites**
-This repository provides a fine-tuned version of ESM-1b, incorporating genomic information by leveraging long phosphosite sequences from [DARKIN dataset](https://openreview.net/pdf?id=a4x5tbYRYV) and Multiple Sequence Alignment (MSA) of those phosphosites. The goal is to enhance the model's understanding of phosphorylation by integrating sequence conservation patterns.
 ### Developed by:
@@ -26,19 +26,20 @@ To construct a robust dataset, we extracted 256 MSA sequences per phosphosite fr
   - 10% of the data was reserved for validation.
   - The remaining 90% was used for fine-tuning with the Masked Language Modeling (MLM) objective.
 3. Data Processing & Preprocessing
-  - Special attention was given to retaining phosphorylation residues within sequences.
   - To optimize memory efficiency, sequence lengths were truncated to 128 amino acids.
 ### Evaluation
 Perplexity: 2.69 (decreased from 7.05)
-from transformers import AutoTokenizer, AutoModelForMaskedLM
-import torch
 ### Usage
 ```
 # Load the model and tokenizer
 model_name = "isikz/phosphosite_msa_finetuned_esm1b"
 tokenizer = AutoTokenizer.from_pretrained(model_name)

 ## **Fine-Tuning ESM-1b with Multiple Sequence Alignment (MSA) for Phosphosites**
+This repository provides a fine-tuned version of ESM-1b with the Masked Language Modeling (MLM) objective, incorporating genomic information by leveraging long phosphosite sequences from [DARKIN dataset](https://openreview.net/pdf?id=a4x5tbYRYV) and Multiple Sequence Alignment (MSA) of those phosphosites. The goal is to enhance the model's understanding of phosphorylation by integrating sequence conservation patterns.
 ### Developed by:
   - 10% of the data was reserved for validation.
   - The remaining 90% was used for fine-tuning with the Masked Language Modeling (MLM) objective.
 3. Data Processing & Preprocessing
+  - Special attention was given to conserving phosphorylation residues within sequences.
   - To optimize memory efficiency, sequence lengths were truncated to 128 amino acids.
 ### Evaluation
 Perplexity: 2.69 (decreased from 7.05)
 ### Usage
 ```
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+import torch
 # Load the model and tokenizer
 model_name = "isikz/phosphosite_msa_finetuned_esm1b"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
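
The README text in this diff describes three steps without showing them: a 90/10 train/validation split, truncation to 128 amino acids while keeping the phosphorylation residue, and a perplexity score. The following is a minimal, standard-library sketch of those steps, not the authors' pipeline. The function names are hypothetical, the centred-window truncation is only one plausible way to keep the phosphosite inside the window, and the perplexity definition (exponential of the mean per-token cross-entropy in nats) is the usual convention, which the README does not confirm.

```python
import math
import random


def truncate_around_site(seq, site_idx, max_len=128):
    """Return at most max_len residues that still contain seq[site_idx].

    Assumption: the window is centred on the phosphosite and clamped
    to the sequence bounds. The README does not say how it was done.
    """
    if len(seq) <= max_len:
        return seq
    start = max(0, min(site_idx - max_len // 2, len(seq) - max_len))
    return seq[start:start + max_len]


def split_train_val(items, val_fraction=0.1, seed=0):
    """Shuffle a copy of items and hold out val_fraction for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def perplexity(token_losses):
    """exp of the mean per-token cross-entropy (natural-log losses)."""
    return math.exp(sum(token_losses) / len(token_losses))
```

Under that definition, the reported drop from 7.05 to 2.69 would correspond to the mean masked-token loss falling from about 1.95 to about 0.99 nats.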