---
license: mit
language:
- en
base_model:
- andrewdalpino/ESM2-150M-Protein-Molecular-Function
pipeline_tag: text-classification
tags:
- biology
- protein
---
# Contrastively Learned Attention-Based Stratified PTM Predictor (CLASPP): a unified PTM prediction model
<p align="center">
<img width="100%" src= "figures/Screenshot%20from%202025-07-11%2014-10-57.png">
</p>
CLASPP is an ESM2-150M protein language model that predicts PTM events occurring on a substrate from
its primary protein sequence. Prediction is performed across 12 different PTM types as a form of
multi-label classification. The encoder is trained on a supervised contrastive learning task, then the
classification head is fine-tuned on the multi-label classification task. Existing PTM prediction
models predominantly focus on either single PTM types or employ ensemble methods that combine multiple
models to predict different PTM types. This fragmentation is largely driven by the vast imbalance in
data availability across PTM types, making it difficult to predict multiple PTM types with a single
model. To address this limitation, we present the Contrastively Learned Attention-Based Stratified PTM
Predictor (CLASPP), a unified PTM prediction model.
## Quick overview of the dependencies

### From conda:
- python 3.9.23

### From pip:
- numpy 2.0.2
- transformers 4.53.2
- datasets 4.0.0
- torch (see below)

### For torch/PyTorch
Go to the official [pytorch](https://pytorch.org/get-started/locally/) install page
and follow its recommendation for your platform.
Installing torch can be the most complex part of the setup.
# How to Get Started with the Model
### Downloading this repository
Make sure [git lfs](https://git-lfs.com/) is installed first;
the weight files are too big to fetch without it.
```
git clone https://huggingface.co/esbglab/Claspp_forward
```
```
cd Claspp_forward
```
### Creating this conda environment
Type these lines into the terminal after downloading this repository (this assumes you already have [anaconda](https://www.anaconda.com/) installed)
```
conda create -n claspp_forward python=3.9.23
```
```
conda deactivate
```
```
conda activate claspp_forward
```
```
pip3 install numpy==2.0.2
```
```
pip3 install transformers==4.53.2
```
```
pip3 install datasets==4.0.0
```
### For torch, install according to PyTorch's specification for your system if you want GPU acceleration: [pytorch](https://pytorch.org/get-started/locally/)
```
pip3 install torch torchvision torchaudio
```
### The terminal line above might look different for you
We provide code to test CLASPP (see the section below).
:tada: You are now ready to run the code :tada:
Use the code below to get started with the model.
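After building the environment, a quick sanity check of the pinned versions can save debugging time later. This helper is not part of the repository; it is an optional convenience using only the standard library:

```python
# Optional sanity check (not part of the CLASPP repository): confirm the
# pinned dependency versions from the install steps above are present.
import importlib.metadata as md

REQUIRED = {"numpy": "2.0.2", "transformers": "4.53.2", "datasets": "4.0.0"}

def check_versions(required):
    """Return {package: (installed_version or None, expected_version)}."""
    report = {}
    for pkg, expected in required.items():
        try:
            installed = md.version(pkg)
        except md.PackageNotFoundError:
            installed = None  # package missing from this environment
        report[pkg] = (installed, expected)
    return report

if __name__ == "__main__":
    for pkg, (installed, expected) in check_versions(REQUIRED).items():
        status = "OK" if installed == expected else "MISMATCH"
        print(f"{pkg}: installed={installed} expected={expected} [{status}]")
```

A mismatch here does not necessarily break CLASPP, but it means you are off the tested configuration.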
## Model Details
<p align="center">
<img width="100%" src= "figures/Screenshot%20from%202025-08-05%2011-49-49.png">
</p>
Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., ... & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123-1130.
| PTM type | Residue trained on | Number of clusters allocated|output indexes|input label indexes (training)|
| -------------------- | ------------- |--------------------------|------------|-------------|
| ST_Phosphorylation | S,T | 5 | 0 or 1 | 0-4 |
| Y_Phosphorylation | Y | 1 | 3 | 25 |
| K_Ubiquitination | K | 20 | 2 | 5-24 |
| K_Acetylation | K | 10 | 4 | 26-35 |
| AM_Acetylation | A,M | 1 | 13 or 14 | 49 |
| N_N-linked-Glycosylation | N | 1 | 5 | 36 |
| ST_O-linked-Glycosylation | S,T | 5 | 6 or 7 | 37-41 |
| RK_Methylation | R,K | 4 | 8 or 9 | 42-45 |
| K_Sumoylation | K | 1 | 10 | 46 |
| K_Malonylation | K | 1 | 11 | 53 |
| M_Sulfoxidation | M | 1 | 12 | 48 |
| C_Glutathionylation | C | 1 | 15 | 50 |
| C_S-palmitoylation | C | 1 | 16 | 51 |
| PK_Hydroxylation | P,K | 1 | 17 or 18 | 52 |
| negative | all residues | N/A | 19 | 53 |
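The output-index column above can be transcribed into a small decoding table for mapping model outputs back to PTM types. The dict below is a hypothetical convenience built from this table, not a file shipped with the repository; for types spanning two indexes (e.g. 0 or 1 for S/T phosphorylation), both indexes map to the same type name here:

```python
# Hypothetical lookup transcribed from the PTM table above (not shipped
# with the repository): model output index -> PTM type name.
OUTPUT_INDEX_TO_PTM = {
    0: "ST_Phosphorylation",        1: "ST_Phosphorylation",
    2: "K_Ubiquitination",          3: "Y_Phosphorylation",
    4: "K_Acetylation",             5: "N_N-linked-Glycosylation",
    6: "ST_O-linked-Glycosylation", 7: "ST_O-linked-Glycosylation",
    8: "RK_Methylation",            9: "RK_Methylation",
    10: "K_Sumoylation",            11: "K_Malonylation",
    12: "M_Sulfoxidation",          13: "AM_Acetylation",
    14: "AM_Acetylation",           15: "C_Glutathionylation",
    16: "C_S-palmitoylation",       17: "PK_Hydroxylation",
    18: "PK_Hydroxylation",         19: "negative",
}

def decode(indexes):
    """Map predicted output indexes (0-19) to PTM type names."""
    return [OUTPUT_INDEX_TO_PTM[i] for i in indexes]
```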
## Data organization and number of clusters
<p align="center">
<img width="100%" src= "figures/Screenshot%20from%202025-08-05%2011-48-48.png">
</p>
| Repo | Link | Description |
| ------------- | ------------- |------------------------------------------|
| GitHub | [github version Data_cur](https://github.com/gravelCompBio/Claspp_data_cur) | This version contains code but no data. You need to run the code to generate all the helper files (this will take some time) |
| GitHub | [github version Forward](https://github.com/gravelCompBio/Claspp_forward) | This version contains code but NOT the weights (file too big for GitHub) |
| Huggingface | [huggingface version Forward](https://huggingface.co/esbglab/Claspp_forward) | This version contains code and trained weights |
| Zenodo | [zenodo version training_data](https://zenodo.org/records/16739128?token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6IjY4MDBkYjUwLTVlZjktNDZlNy04MjZjLTgzZjA0NjZiYmZlYyIsImRhdGEiOnt9LCJyYW5kb20iOiJlMThhZGNlMWUxN2EzNjYxNzllYjg5MWRiZjhiMWYxNSJ9.7Os5ZzQLT3TJu3Clv1Sxvh8oVtFTxxoeYLACFgKwZRjCApQdfQO2-AvctQ-eIIEojKTBGHLCcHlMTDG38AKn8A) | zenodo version of training/testing/validation data|
| webtool | [website version of webtool](https://esbg.bmb.uga.edu/claspp/) | webtool hosted on a server|
- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
```
Usage: python3 claspp_forward.py [OPTION]... --input INPUT [FASTA_FILE or TXT_FILE]...
predict PTM events on peptides or full sequences
Example 1: python3 claspp_forward.py -B 100 -S 0 -i random.txt
Example 2: python3 claspp_forward.py -B 50 -S 1 -i random.fasta
FASTA_FILE should contain protein sequences in proper fasta or a2m format
TXT_FILE should contain protein peptides 21 residues in length, with the
center residue being the PTM modification site
Pattern selection and interpretation:
  -B, --batch_size    (int) how many predictions are made at a time
                      on the GPU (reduce if you run out of GPU memory)
  -S, --scrape_fasta  (int) should be a 1 or a 0
                      1 = read a fasta and scrape all possible 21-mer
                          peptides that can be modified by a PTM
                      0 = read a txt file that already has the 21-mers
                          separated, one peptide per '\n' (can be
                          faster than the fasta option)
  -h, --help          you're reading it right now
  -i, --input         location of the input fasta or txt
  -o, --output        location of the output csv
```
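For the `-S 1` option above, 21-mer windows must be scraped out of a full sequence. The sketch below illustrates that idea in plain Python; the candidate-residue set (taken from the PTM table) and the choice to skip sites whose window runs past a sequence end are assumptions, and the repository's own scraper may behave differently:

```python
# Illustrative sketch of 21-mer scraping. Assumptions: the candidate
# residue set and edge handling may differ from CLASPP's bundled scraper.
CANDIDATE_RESIDUES = set("STYKNAMRCP")  # residues from the PTM table

def scrape_21mers(sequence, flank=10):
    """Return (position, peptide) pairs for every candidate residue
    whose full 21-residue window fits inside the sequence."""
    windows = []
    for i, residue in enumerate(sequence):
        if residue not in CANDIDATE_RESIDUES:
            continue
        if i < flank or i + flank >= len(sequence):
            continue  # window runs off the end; skipped in this sketch
        windows.append((i, sequence[i - flank : i + flank + 1]))
    return windows

if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    # One peptide per line matches the TXT input format described above.
    print("\n".join(pep for _, pep in scrape_21mers(seq)))
```

Each emitted line is a 21-mer whose center (index 10) is a candidate modification site, matching the TXT format `claspp_forward.py` expects with `-S 0`.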
- **Developed by:** Major author for most of the code: Nathan Gravel.
- Fine-tuning code inspired by Zhongliang Zhou.
- Contrastive learning code inspired by Ruili Fang.
- Codebase testing and version control by Austin Downes.
- Webtool development by Saber Soleymani.
- **Funded by [optional]:** [NIH]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [Text classification]
- **Language(s) (NLP):** [Protein Sequence]
- **License:** [MIT]
- **Finetuned from model [optional]:** [ESM-2 150M]