|
|
--- |
|
|
license: mit |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- andrewdalpino/ESM2-150M-Protein-Molecular-Function |
|
|
pipeline_tag: text-classification |
|
|
tags: |
|
|
- biology |
|
|
- protein |
|
|
--- |
|
|
|
|
|
# Contrastively Learned Attention based Stratified PTM Predictor (CLASPP) a unified PTM prediction model |
|
|
|
|
|
|
|
|
|
|
|
<p align="center"> |
|
|
<img width="100%" src= "figures/Screenshot%20from%202025-07-11%2014-10-57.png"> |
|
|
</p> |
|
|
|
|
|
|
|
|
CLASPP is a ESM2-150m protein lanuguage model that can pred PTM envents occuring on the substrate based |
|
|
off primary protein sequence. This is done on multiple differnt PTM types (12) as a form of multi-label |
|
|
classifcation. The encoder is training on a supervised Contrastive learing task then the classifcation |
|
|
head is finetunted on the multi-label classifcation. Existing PTM prediction models predominantly focus |
|
|
on either single PTM types or employ ensemble methods that combine multiple models to predict different |
|
|
PTM types. This fragmentation is largely driven by the vast imbalance in data availability across PTM |
|
|
types making it difficult to predict multiple PTM types with a single model. To address this |
|
|
limitation, we present the Contrastively Learned Attention-Based Stratified PTM Predictor (CLASPP), |
|
|
a unified PTM prediction model. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Quick overview of the dependencies |
|
|
|
|
|
 |
|
|
 |
|
|
 |
|
|
 |
|
|
|
|
|
### From conda: |
|
|
|
|
|
 |
|
|
|
|
|
|
|
|
### From pip: |
|
|
|
|
|
    |
|
|
|
|
|
|
|
|
### For torch/PyTorch |
|
|
|
|
|
Make sure you go to this website [pytorch](https://pytorch.org/get-started/locally/) |
|
|
|
|
|
Follow along with its recommendation |
|
|
|
|
|
Installing torch can be the most complex part |
|
|
|
|
|
# How to Get Started with the Model |
|
|
|
|
|
|
|
|
### Downloading this repository |
|
|
|
|
|
make sure [git lfs](https://git-lfs.com/) is installed |
|
|
|
|
|
Can not store weight files here (too big) |
|
|
|
|
|
``` |
|
|
git clone https://huggingface.co/esbglab/Claspp_forward |
|
|
``` |
|
|
|
|
|
|
|
|
``` |
|
|
cd Claspp_forward |
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### Creating this conda environment |
|
|
|
|
|
|
|
|
Just type these lines of code into the terminal after you download this repository (this assumes you have [anaconda](https://www.anaconda.com/) already installed) |
|
|
|
|
|
``` |
|
|
conda create -n claspp_forward python=3.9.23 |
|
|
``` |
|
|
|
|
|
``` |
|
|
conda deactivate |
|
|
``` |
|
|
|
|
|
``` |
|
|
conda activate claspp_forward |
|
|
``` |
|
|
|
|
|
``` |
|
|
pip3 install numpy==2.0.2 |
|
|
``` |
|
|
|
|
|
``` |
|
|
pip3 install transformers==4.53.2 |
|
|
``` |
|
|
|
|
|
``` |
|
|
pip3 install datasets==4.0.0 |
|
|
``` |
|
|
|
|
|
|
|
|
### **For torch you will have to download to the torch's specification if you want gpu acceleration from this website ** [pytroch](https://pytorch.org/get-started/locally/) ** |
|
|
|
|
|
|
|
|
|
|
|
``` |
|
|
pip3 install torch torchvision torchaudio |
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### the terminal line above might look different for you |
|
|
|
|
|
|
|
|
|
|
|
We provided code to test CLASPP (see section below) |
|
|
|
|
|
|
|
|
:tada: you are know ready to run the code :tada: |
|
|
|
|
|
|
|
|
Use the code below to get started with the model. |
|
|
|
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
|
|
|
|
|
|
<p align="center"> |
|
|
<img width="100%" src= "figures/Screenshot%20from%202025-08-05%2011-49-49.png"> |
|
|
</p> |
|
|
|
|
|
Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., ... & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123-1130. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| PTM type | Residue trained on | Number of clusters allocated|output indexes|input label indexes (training)| |
|
|
| -------------------- | ------------- |--------------------------|------------|-------------| |
|
|
| ST_Phosphorylation | S,T | 5 | 0 or 1 | 0-4 | |
|
|
| Y_Phosphorylation | Y | 1 | 3 | 25 | |
|
|
| K_Ubiquitination | K | 20 | 2 | 5-24 | |
|
|
| K_Acetylation | K | 10 | 4 | 26-35 | |
|
|
| AM_Acetylation | A,M | 1 | 13 or 14 | 49 | |
|
|
| N_N-linked-Glycosylation | N | 1 | 5 | 36 | |
|
|
| ST_O-linked-Glycosylation | S,T | 5 | 6 or 7 | 37-41 | |
|
|
| RK_Methylation | R,K | 4 | 8 or 9 | 42-45 | |
|
|
| K_Sumoylation | K | 1 | 10 | 46 | |
|
|
| K_Malonylation | K | 1 | 11 | 53 | |
|
|
| M_Sulfoxidation | M | 1 | 12 | 48 | |
|
|
| C_Glutathionylation | C | 1 | 15 | 50 | |
|
|
| C_S-palmitoylation | C | 1 | 16 | 51 | |
|
|
| PK_Hydroxylation | P,K | 1 | 17 or 18 | 52 | |
|
|
|negitve| all res | N/A | 19 | 53| |
|
|
|
|
|
|
|
|
## Data organization and number of clusters |
|
|
|
|
|
<p align="center"> |
|
|
<img width="100%" src= "figures/Screenshot%20from%202025-08-05%2011-48-48.png"> |
|
|
</p> |
|
|
|
|
|
|
|
|
| Repo | Link | Discription| |
|
|
| ------------- | ------------- |------------------------------------------| |
|
|
| GitHub | [github version Data_cur](https://github.com/gravelCompBio/Claspp_data_cur) | This verstion contains code but but no data. It needs you to run the code to generate all the helper-files (will take some time run this code)| |
|
|
| GitHub | [github version Forward](https://github.com/gravelCompBio/Claspp_forward) | This verstion contains code but NOT any weights (file too big for github)| |
|
|
| Huggingface | [huggingface version Forward](https://huggingface.co/esbglab/Claspp_forward) | This verstion contains code and training weights | |
|
|
| Zenodo | [zenodo version training_data](https://zenodo.org/records/16739128?token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6IjY4MDBkYjUwLTVlZjktNDZlNy04MjZjLTgzZjA0NjZiYmZlYyIsImRhdGEiOnt9LCJyYW5kb20iOiJlMThhZGNlMWUxN2EzNjYxNzllYjg5MWRiZjhiMWYxNSJ9.7Os5ZzQLT3TJu3Clv1Sxvh8oVtFTxxoeYLACFgKwZRjCApQdfQO2-AvctQ-eIIEojKTBGHLCcHlMTDG38AKn8A) | zenodo version of training/testing/validation data| |
|
|
| webtool | [website version of webtool](https://esbg.bmb.uga.edu/claspp/) | webtool hosted on a server| |
|
|
|
|
|
|
|
|
- **Repository:** [More Information Needed] |
|
|
- **Paper [optional]:** [More Information Needed] |
|
|
- **Demo [optional]:** [More Information Needed] |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Uses |
|
|
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
|
|
|
|
### Direct Use |
|
|
``` |
|
|
Usage: python3 claspp_forward.py [OPTION]... --input INPUT [FASTA_FILE or TXT_FILE]... |
|
|
predict PTM events on peptides or full sequences |
|
|
|
|
|
Example 1: python3 claspp_forward.py -B 100 -S 0 -i random.txt |
|
|
Example 2: python3 claspp_forward.py -B 50 -S 1 -i random.fasta |
|
|
|
|
|
FASTA_FILE contain protein sequences in proper fasta or a2m format |
|
|
TXT_FILE cointain protien peptides 21 in length with the center |
|
|
residue being the PTM modification site |
|
|
|
|
|
|
|
|
Pattern selection and interpretation: |
|
|
-B, --batch_size (int) that describes how many predictions |
|
|
can be predicted at a time on the GPU |
|
|
(reduce if you get run out of GPU space) |
|
|
|
|
|
-S --scrape_fasta (int) should be a 1 or a 0 |
|
|
1 = read a fasta and scrape posible 21 peptides |
|
|
that can be modified by a PTM |
|
|
0 = read a txt file that has the 21mer already |
|
|
sperated and all peptides should be sperated by |
|
|
a '\\n' (can be faster) than fasta option |
|
|
|
|
|
-h --help your reading it right now |
|
|
|
|
|
-i --input location of the input fasta or txt |
|
|
|
|
|
-o --output location of the output csv |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
- **Developed by:** Major author for most code Nathan Gravel. |
|
|
- Finetuning code inspried by Zhongliang Zhou. |
|
|
- Contrastive learing code inspried by Ruili Fang. |
|
|
- Codebase testing and verstion controle by Austin Downes. |
|
|
- Webtool dev Saber Soleymani. |
|
|
- **Funded by [optional]:** [NIH] |
|
|
- **Shared by [optional]:** [More Information Neede] |
|
|
- **Model type:** [Text classication] |
|
|
- **Language(s) (NLP):** [Protein Sequence] |
|
|
- **License:** [MIT] |
|
|
- **Finetuned from model [optional]:** [ESM-2 150M] |
|
|
|
|
|
|