fondress/PDeepPP_ACE

@@ -1,94 +1,112 @@
 ---
-pipeline_tag: text-classification
 tags:
 - protein language model
-------
-# TransHLA model
-`TransHLA` is a tool designed to discern whether a peptide will be recognized by HLA as an epitope. `TransHLA` is the first tool capable of directly identifying peptides as epitopes without the need for inputting HLA alleles. Because epitopes differ in length, we trained two models: TransHLA_I, which detects HLA-I epitopes, and TransHLA_II, which detects HLA-II epitopes.
 ## Model description
-   `TransHLA` is a hybrid transformer model that utilizes a transformer encoder module and a deep CNN module. It is trained using pretrained sequence embeddings from `ESM2` and contact map structural features as inputs. It can serve as a preliminary screening for the currently popular tools that are specific for HLA-epitope binding affinity.
 ## Intended uses
-Due to variations in peptide lengths, our TransHLA is divided into TransHLA_I and TransHLA_II, which are used to separately identify epitopes presented by HLA class I and class II molecules, respectively. Specifically, TransHLA_I is designed for shorter peptides ranging from 8 to 14 amino acids in length, while TransHLA_II targets longer peptides with lengths of 13 to 21 amino acids. The output consists of two parts. The first output indicates whether the peptide is an epitope, presented in a two-column format where each row contains two numbers that sum to 1, representing probabilities. If the number in the second column is greater than or equal to 0.5, the peptide is classified as an epitope; otherwise, it is considered a normal peptide.
-The second output is the sequence embedding generated by the model.
- For both models, we have written separate tutorials in this file to facilitate ease of use.
-### How to use
-First, users need to download the following packages: `pytorch`, `fair-esm`, and `transformers`. Additionally, the CUDA version must be 11.8 or higher; otherwise, the model will need to be run on CPU.
-```
-pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
-pip install transformers
-pip install fair-esm
-```
-Here is how to use the TransHLA_I model to predict whether a peptide is an epitope:
-```python
-from transformers import AutoTokenizer
-from transformers import AutoModel
-import torch
-def pad_inner_lists_to_length(outer_list,target_length=16):
-    for inner_list in outer_list:
-        padding_length = target_length - len(inner_list)
-        if padding_length > 0:
-            inner_list.extend([1] * padding_length)
-    return outer_list
-if __name__ == "__main__":
-    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-    print(f"Using {device} device")
-    tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
-    model = AutoModel.from_pretrained("SkywalkerLu/TransHLA_I", trust_remote_code=True)
-    model.to(device)
-    peptide_examples = ['EDSAIVTPSR','SVWEPAKAKYVFR']
-    peptide_encoding = tokenizer(peptide_examples)['input_ids']
-    peptide_encoding = pad_inner_lists_to_length(peptide_encoding)
-    print(peptide_encoding)
-    peptide_encoding = torch.tensor(peptide_encoding)
-    outputs,representations = model(peptide_encoding.to(device))
-    print(outputs)
-    print(representations)
-```
-And here is how to use the TransHLA_II model to predict whether a peptide is an epitope:
-```python
-from transformers import AutoTokenizer
-from transformers import AutoModel
-import torch
-def pad_inner_lists_to_length(outer_list,target_length=23):
-    for inner_list in outer_list:
-        padding_length = target_length - len(inner_list)
-        if padding_length > 0:
-            inner_list.extend([1] * padding_length)
-    return outer_list
-if __name__ == "__main__":
-    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-    print(f"Using {device} device")
-    tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
-    model = AutoModel.from_pretrained("SkywalkerLu/TransHLA_II", trust_remote_code=True)
-    model.to(device)
-    model.eval()
-    peptide_examples = ['KMIYSYSSHAASSL','ARGDFFRATSRLTTDFG']
-    peptide_encoding = tokenizer(peptide_examples)['input_ids']
-    peptide_encoding = pad_inner_lists_to_length(peptide_encoding)
-    peptide_encoding = torch.tensor(peptide_encoding)
-    outputs,representations = model(peptide_encoding.to(device))
-    print(outputs)
-    print(representations)
-```

 ---
 tags:
 - protein language model
+datasets:
+- IEDB
+---
+# PDeepPP model
+`PDeepPP` is a hybrid protein language model designed to predict post-translational modification (PTM) sites and extract biologically relevant features from protein sequences. By leveraging pretrained embeddings from `ESM` and incorporating both transformer and convolutional neural network (CNN) architectures, `PDeepPP` provides a robust framework for analyzing protein sequences in various contexts.
 ## Model description
+`PDeepPP` combines transformer-based self-attention with convolutional operations to capture both global and local sequence features. The model consists of:
+1. A **Self-Attention Global Features module** for capturing long-range dependencies.
+2. A **TransConv1d module**, combining transformers and convolutional layers.
+3. A **PosCNN module**, which applies position-aware convolutional operations for feature extraction.
+The model is trained with a loss function that combines classification loss and additional regularization terms to enhance generalization and interpretability. It is compatible with Hugging Face's `transformers` library, allowing seamless integration with other tools and workflows.
 ## Intended uses
+`PDeepPP` is designed for two primary tasks:
+1. **PTM site prediction**: Identifying post-translational modification sites (e.g., phosphorylation) in protein sequences, focusing on serine (S), threonine (T), and tyrosine (Y) residues.
+2. **Biologically active sequence analysis (BPS)**: Extracting biologically relevant regions from protein sequences for downstream analysis.
+The model processes protein sequences and outputs:
+- Embedded representations of the sequences, which can be used for various downstream tasks.
+- Predicted probabilities for PTM or other sequence-specific features.
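The usage examples in this card print raw `logits`, while the list above describes predicted probabilities. A minimal sketch of one way to convert between the two, assuming (this is not confirmed by the card) that the model returns two logits per input window, ordered as negative then positive class. The helper names `softmax` and `classify` and the 0.5 threshold are illustrative choices, not part of `PDeepPP`:

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def classify(logit_rows, threshold=0.5):
    """Return (positive-class probability, predicted label) for each row.

    Assumes each row holds two logits: [negative, positive]. This layout
    is an assumption; check the model's actual output shape first.
    """
    results = []
    for row in logit_rows:
        prob_positive = softmax(row)[1]
        results.append((prob_positive, prob_positive >= threshold))
    return results


if __name__ == "__main__":
    # Hypothetical logits for two windows, not real model output
    for prob, is_site in classify([[0.3, 1.7], [2.0, -1.0]]):
        print(f"p(positive)={prob:.3f} predicted_site={is_site}")
```

If the model instead returns a single logit per window, a sigmoid would be the corresponding transformation.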
+### Key features:
+- **PTM mode**: Focuses on sequences centered around specific residues (S, T, Y) to predict PTM activity.
+- **BPS mode**: Analyzes overlapping or non-overlapping subsequences of a protein for broader biological insights.
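The two modes above can be illustrated with a small windowing sketch. This is not the implementation of `PDeepPPProcessor`; the functions `ptm_windows` and `bps_windows` are hypothetical, and they only borrow the parameter names (`pad_char`, `target_length`, `overlapping`, `step_size`) that appear in this card's usage examples. The sketch assumes an odd `target_length` so that each PTM window has a single central residue:

```python
def ptm_windows(sequence, target_length=33, pad_char="X", residues="STY"):
    """Return (position, window) pairs, one window centered on each S/T/Y."""
    half = target_length // 2
    # Pad both ends so residues near the termini still get full-length windows
    padded = pad_char * half + sequence + pad_char * half
    windows = []
    for index, residue in enumerate(sequence):
        if residue in residues:
            # Residue `index` sits at `index + half` in the padded string,
            # so this slice places it in the middle of the window
            windows.append((index, padded[index:index + target_length]))
    return windows


def bps_windows(sequence, target_length=33, pad_char="X",
                overlapping=True, step_size=5):
    """Split a sequence into fixed-length windows, padding the last one."""
    step = step_size if overlapping else target_length
    windows = []
    for start in range(0, max(len(sequence), 1), step):
        chunk = sequence[start:start + target_length]
        windows.append(chunk.ljust(target_length, pad_char))
        if start + target_length >= len(sequence):
            break  # the sequence end has been covered
    return windows


if __name__ == "__main__":
    for position, window in ptm_windows("MKVSTYSTQ"):
        print(position, window)
    print(bps_windows("MKVSTYSTQ" * 5, step_size=5))
```

Padding with a placeholder character keeps every window the same length, which is what lets short sequences and terminal residues share one fixed-size model input.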
+## How to use
+To use `PDeepPP`, install the required dependencies, including `torch` and `transformers`:
+```bash
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+pip install transformers
+```
+Here is an example of how to use `PDeepPP` to process protein sequences and obtain predictions.
+Example for PTM mode:
+```python
+import torch
+from transformers import AutoModel
+# processing_pdeeppp.py must be importable, e.g. placed in the working directory
+from processing_pdeeppp import PDeepPPProcessor
+# Load the PDeepPP model
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+print(f"Using {device} device")
+model = AutoModel.from_pretrained("YourModelName/PDeepPP", trust_remote_code=True)
+model.to(device)
+model.eval()
+# Example protein sequences
+protein_sequences = ["MKVSTYSTQ", "MSRSTYV"]
+# Preprocess sequences (PTM mode)
+processor = PDeepPPProcessor(pad_char="X", target_length=33)
+inputs = processor(sequences=protein_sequences, ptm_mode=True, return_tensors="pt")
+# Move any tensor inputs to the same device as the model
+inputs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in inputs.items()}
+# Make predictions without tracking gradients
+with torch.no_grad():
+    outputs = model(**inputs)
+print(outputs["logits"])
+```
+Example for BPS mode:
+```python
+import torch
+from transformers import AutoModel
+# processing_pdeeppp.py must be importable, e.g. placed in the working directory
+from processing_pdeeppp import PDeepPPProcessor
+# Load the PDeepPP model
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+print(f"Using {device} device")
+model = AutoModel.from_pretrained("YourModelName/PDeepPP", trust_remote_code=True)
+model.to(device)
+model.eval()
+# Example protein sequences
+protein_sequences = ["MKVSTYSTQ", "MSRSTYV"]
+# Preprocess sequences (BPS mode)
+processor = PDeepPPProcessor(pad_char="X", target_length=33)
+inputs = processor(sequences=protein_sequences, ptm_mode=False, overlapping=True, step_size=5, return_tensors="pt")
+# Move any tensor inputs to the same device as the model
+inputs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in inputs.items()}
+# Make predictions without tracking gradients
+with torch.no_grad():
+    outputs = model(**inputs)
+print(outputs["logits"])
+```
+## Training and customization
+`PDeepPP` supports fine-tuning on custom datasets. The model uses a configuration class (`PDeepPPConfig`) to specify hyperparameters such as:
+- Number of transformer layers
+- Hidden layer size
+- Dropout rate
+- PTM type and other task-specific parameters
+Refer to `PDeepPPConfig` for details.
+## Citation
+If you use `PDeepPP` in your research, please cite the associated paper or repository:
+```bibtex
+@article{your_reference,
+  title={PDeepPP: A Hybrid Model for Protein Sequence Analysis},
+  author={Author Name},
+  journal={Journal Name},
+  year={2025}
+}
+```