fondress committed on
Commit
2033bef
·
verified ·
1 Parent(s): bffc285

Update README.md

Files changed (1)
  1. README.md +89 -71
README.md CHANGED
@@ -1,94 +1,112 @@
  ---
- pipeline_tag: text-classification
  tags:
  - protein language model
- ------
-
- # TransHLA model
 
- `TransHLA` is a tool designed to discern whether a peptide will be recognized by HLA as an epitope. `TransHLA` is the first tool capable of directly identifying peptides as epitopes without requiring HLA alleles as input. Because epitopes vary in length, we trained two models: TransHLA_I, which detects HLA class I epitopes, and TransHLA_II, which detects HLA class II epitopes.
 
  ## Model description
- `TransHLA` is a hybrid transformer model that combines a transformer encoder module with a deep CNN module. It is trained using pretrained sequence embeddings from `ESM2` and contact-map structural features as inputs. It can serve as a preliminary screen ahead of the currently popular tools that predict binding affinity for specific HLA-epitope pairs.
 
  ## Intended uses
 
- Due to variations in peptide length, TransHLA is divided into TransHLA_I and TransHLA_II, which identify epitopes presented by HLA class I and class II molecules, respectively. TransHLA_I is designed for shorter peptides of 8 to 14 amino acids, while TransHLA_II targets longer peptides of 13 to 21 amino acids. The output consists of two parts. The first indicates whether the peptide is an epitope, presented in a two-column format where each row contains two probabilities that sum to 1. If the value in the second column is greater than or equal to 0.5, the peptide is classified as an epitope; otherwise, it is considered a normal peptide.
- The second output is the sequence embedding generated by the model.
- For both models, we provide separate tutorials below to make them easy to use.
 
- ### How to use
- First, install the required packages: `pytorch`, `fair-esm`, and `transformers`. The CUDA version must be 11.8 or higher; otherwise, the model will need to run on CPU.
- ```
- pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
- pip install transformers
- pip install fair-esm
- ```
- Here is how to use the TransHLA_I model to predict whether a peptide is an epitope:
 
- ```python
- from transformers import AutoTokenizer
- from transformers import AutoModel
- import torch
-
-
- def pad_inner_lists_to_length(outer_list, target_length=16):
-     # Pad each token-id list with the ESM-2 pad id (1) up to target_length.
-     for inner_list in outer_list:
-         padding_length = target_length - len(inner_list)
-         if padding_length > 0:
-             inner_list.extend([1] * padding_length)
-     return outer_list
-
-
- if __name__ == "__main__":
-     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-     print(f"Using {device} device")
-     tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
-     model = AutoModel.from_pretrained("SkywalkerLu/TransHLA_I", trust_remote_code=True)
-     model.to(device)
-     model.eval()
-     peptide_examples = ['EDSAIVTPSR', 'SVWEPAKAKYVFR']
-     peptide_encoding = tokenizer(peptide_examples)['input_ids']
-     peptide_encoding = pad_inner_lists_to_length(peptide_encoding)
-     print(peptide_encoding)
-     peptide_encoding = torch.tensor(peptide_encoding)
-     outputs, representations = model(peptide_encoding.to(device))
-     print(outputs)
-     print(representations)
- ```
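As described under "Intended uses", `outputs` holds two probability columns per peptide, and the second column is thresholded at 0.5. A minimal post-processing sketch; the probability values below are made up for illustration, not real TransHLA predictions:

```python
import torch

# Mock model output: row i is [P(not epitope), P(epitope)] for peptide i.
# These numbers are illustrative only.
outputs = torch.tensor([[0.91, 0.09],
                        [0.23, 0.77]])

# A peptide is classified as an epitope when the second column is >= 0.5.
is_epitope = outputs[:, 1] >= 0.5
labels = ["epitope" if flag else "normal peptide" for flag in is_epitope]
print(labels)  # ['normal peptide', 'epitope']
```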
- And here is how to use the TransHLA_II model to predict whether a peptide is an epitope:
-
- ```python
- from transformers import AutoTokenizer
- from transformers import AutoModel
- import torch
-
-
- def pad_inner_lists_to_length(outer_list, target_length=23):
-     # Pad each token-id list with the ESM-2 pad id (1) up to target_length.
-     for inner_list in outer_list:
-         padding_length = target_length - len(inner_list)
-         if padding_length > 0:
-             inner_list.extend([1] * padding_length)
-     return outer_list
-
-
- if __name__ == "__main__":
-     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-     print(f"Using {device} device")
-     tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
-     model = AutoModel.from_pretrained("SkywalkerLu/TransHLA_II", trust_remote_code=True)
-     model.to(device)
-     model.eval()
-     peptide_examples = ['KMIYSYSSHAASSL', 'ARGDFFRATSRLTTDFG']
-     peptide_encoding = tokenizer(peptide_examples)['input_ids']
-     peptide_encoding = pad_inner_lists_to_length(peptide_encoding)
-     peptide_encoding = torch.tensor(peptide_encoding)
-     outputs, representations = model(peptide_encoding.to(device))
-     print(outputs)
-     print(representations)
- ```
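Both tutorials above rely on the same padding helper, which can be sanity-checked on its own. Token id 1 is the ESM-2 padding id; the other ids below are mock values rather than real tokenizer output:

```python
def pad_inner_lists_to_length(outer_list, target_length=16):
    # Pad each inner token-id list in place with the ESM-2 pad id (1)
    # until it reaches target_length; longer lists are left untouched.
    for inner_list in outer_list:
        padding_length = target_length - len(inner_list)
        if padding_length > 0:
            inner_list.extend([1] * padding_length)
    return outer_list

# Two mock token-id lists of different lengths.
ids = [[0, 5, 6, 2], [0, 7, 8, 9, 2]]
padded = pad_inner_lists_to_length(ids, target_length=6)
print(padded)  # [[0, 5, 6, 2, 1, 1], [0, 7, 8, 9, 2, 1]]
```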
  ---
  tags:
  - protein language model
+ datasets:
+ - IEDB
+ ---
 
+ # PDeepPP model
 
+ `PDeepPP` is a hybrid protein language model designed to predict post-translational modification (PTM) sites and extract biologically relevant features from protein sequences. By leveraging pretrained embeddings from `ESM` and incorporating both transformer and convolutional neural network (CNN) architectures, `PDeepPP` provides a robust framework for analyzing protein sequences in various contexts.
 
  ## Model description
+
+ `PDeepPP` is a flexible model architecture that integrates transformer-based self-attention mechanisms with convolutional operations to capture both local and global sequence features. The model consists of:
+
+ 1. A **Self-Attention Global Features module** for capturing long-range dependencies.
+ 2. A **TransConv1d module**, combining transformers and convolutional layers.
+ 3. A **PosCNN module**, which applies position-aware convolutional operations for feature extraction.
+
+ The model is trained with a loss function that combines classification loss and additional regularization terms to enhance generalization and interpretability. It is compatible with Hugging Face's `transformers` library, allowing seamless integration with other tools and workflows.
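The combined loss described above can be sketched generically in PyTorch. This is an assumption-laden illustration: the README does not specify PDeepPP's actual regularization terms, so a plain L2 penalty stands in for them here, and the tiny linear classifier is a stand-in, not the real model:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def combined_loss(logits, labels, model, reg_weight=1e-4):
    # Classification term (as stated in the model description).
    cls_loss = criterion(logits, labels)
    # Stand-in regularization term: L2 penalty over parameters (an assumption).
    reg = sum(p.pow(2).sum() for p in model.parameters())
    return cls_loss + reg_weight * reg

# Tiny demo with a stand-in classifier, not the real PDeepPP model.
toy_model = nn.Linear(4, 2)
logits = toy_model(torch.randn(3, 4))
labels = torch.tensor([0, 1, 0])
print(combined_loss(logits, labels, toy_model))
```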
 
  ## Intended uses
 
+ `PDeepPP` is designed for two primary tasks:
+
+ 1. **PTM site prediction**: Identifying post-translational modification sites (e.g., phosphorylation) in protein sequences, focusing on serine (S), threonine (T), and tyrosine (Y) residues.
+ 2. **Biologically active sequence analysis (BPS)**: Extracting biologically relevant regions from protein sequences for downstream analysis.
+
+ The model processes protein sequences and outputs:
+
+ - Embedded representations of the sequences, which can be used for various downstream tasks.
+ - Predicted probabilities for PTM or other sequence-specific features.
+
+ ### Key features
+ - **PTM mode**: Focuses on sequences centered around specific residues (S, T, Y) to predict PTM activity.
+ - **BPS mode**: Analyzes overlapping or non-overlapping subsequences of a protein for broader biological insights.
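The overlapping-subsequence idea behind BPS mode can be illustrated with a small framework-free sketch; the window and step sizes here are illustrative, not the model's actual parameters:

```python
def sliding_windows(sequence, window=5, step=2):
    """Split a sequence into subsequences of length `window`, advancing
    by `step` residues. step < window gives overlapping windows;
    step == window gives non-overlapping chunks."""
    windows = []
    for start in range(0, len(sequence), step):
        windows.append(sequence[start:start + window])
        if start + window >= len(sequence):
            break
    return windows

print(sliding_windows("MKVSTYSTQ", window=5, step=2))
# ['MKVST', 'VSTYS', 'TYSTQ']
```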
 
+ ## How to use
+
+ To use `PDeepPP`, you need to install the required dependencies, including `torch` and `transformers`:
+
+ ```bash
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+ pip install transformers
+ ```
 
+ Here is an example of how to use PDeepPP to process protein sequences and obtain predictions.
+
+ Example for PTM mode:
+
+ ```python
+ import torch
+ from transformers import AutoModel
+ from processing_pdeeppp import PDeepPPProcessor
+
+ # Load PDeepPP model
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ print(f"Using {device} device")
+ model = AutoModel.from_pretrained("YourModelName/PDeepPP", trust_remote_code=True)
+ model.to(device)
+
+ # Example protein sequences
+ protein_sequences = ["MKVSTYSTQ", "MSRSTYV"]
+
+ # Preprocess sequences (PTM mode)
+ processor = PDeepPPProcessor(pad_char="X", target_length=33)
+ inputs = processor(sequences=protein_sequences, ptm_mode=True, return_tensors="pt")
+
+ # Make predictions
+ model.eval()
+ outputs = model(**inputs)
+ print(outputs["logits"])
+ ```
 
+ Example for BPS mode:
+
+ ```python
+ import torch
+ from transformers import AutoModel
+ from processing_pdeeppp import PDeepPPProcessor
+
+ # Load PDeepPP model
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ print(f"Using {device} device")
+ model = AutoModel.from_pretrained("YourModelName/PDeepPP", trust_remote_code=True)
+ model.to(device)
+
+ # Example protein sequences
+ protein_sequences = ["MKVSTYSTQ", "MSRSTYV"]
+
+ # Preprocess sequences (BPS mode)
+ processor = PDeepPPProcessor(pad_char="X", target_length=33)
+ inputs = processor(sequences=protein_sequences, ptm_mode=False, overlapping=True, step_size=5, return_tensors="pt")
+
+ # Make predictions
+ model.eval()
+ outputs = model(**inputs)
+ print(outputs["logits"])
+ ```
+ ## Training and customization
+
+ PDeepPP supports fine-tuning on custom datasets. The model uses a configuration class (`PDeepPPConfig`) to specify hyperparameters such as:
+
+ - Number of transformer layers
+ - Hidden layer size
+ - Dropout rate
+ - PTM type and other task-specific parameters
+
+ Refer to `PDeepPPConfig` for details.
+
+ ## Citation
+
+ If you use PDeepPP in your research, please cite the associated paper or repository:
+
+ ```bibtex
+ @article{your_reference,
+   title={PDeepPP: A Hybrid Model for Protein Sequence Analysis},
+   author={Author Name},
+   journal={Journal Name},
+   year={2025}
+ }
+ ```