fondress committed on
Commit ff0d718 · verified · 1 Parent(s): 0d2ebd0

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +106 -76
README.md CHANGED
@@ -1,110 +1,140 @@
- # PDeepPP: A Comprehensive Protein Language Model Hub
-
- PDeepPP is a hybrid protein language model designed to predict post-translational modification (PTM) sites, analyze biologically relevant features, and support a wide range of protein sequence analysis tasks. This repository serves as the central hub for accessing and exploring various specialized PDeepPP models, each fine-tuned for specific tasks, such as PTM site prediction, bioactivity analysis, and more.
-
- ## Overview
-
- PDeepPP integrates state-of-the-art transformer-based self-attention mechanisms with convolutional neural networks (CNNs) to capture both global and local features in protein sequences. By leveraging pretrained embeddings from `ESM` and incorporating modular architecture components, PDeepPP offers a robust framework for protein sequence analysis.
-
- This repository contains links to multiple task-specific PDeepPP models. These models are pre-trained or fine-tuned on publicly available datasets and are hosted on Hugging Face for easy access.
-
- ---
-
- ## Key Features
-
- - **Flexible Architecture**: Combines self-attention and convolutional operations for robust feature extraction.
- - **Task-Specific Models**: Includes pre-trained models for PTM prediction, bioactivity classification, and more.
- - **Dataset Support**: Models are validated on datasets such as PTM and BPS, ensuring performance on real-world tasks.
- - **Extensibility**: Users can fine-tune the models on custom datasets for new tasks.
-
- ---
- ## Available Models
-
- ### General Models
- - [PDeepPP Main](https://huggingface.co/fondress/PDeepPP)
-
- ### Task-Specific Models
-
- #### Post-Translational Modifications (PTMs)
- - [PDeepPP Phosphorylation (Serine)](https://huggingface.co/fondress/PDeepPP_Phosphoserine)
- - [PDeepPP Phosphorylation (Tyrosine)](https://huggingface.co/fondress/PDeepPP_Phosphorylation-Y)
- - [PDeepPP Glycosylation (N-linked)](https://huggingface.co/fondress/PDeepPP_N-linked-glycosylation-N)
- - [PDeepPP Glycosylation (O-linked)](https://huggingface.co/fondress/PDeepPP_O-linked-glycosylation)
- - [PDeepPP Methylation (Lysine)](https://huggingface.co/fondress/PDeepPP_Methylation-K)
- - [PDeepPP Methylation (Arginine)](https://huggingface.co/fondress/PDeepPP_Methylation-R)
- - [PDeepPP SUMOylation](https://huggingface.co/fondress/PDeepPP_SUMOylation)
- - [PDeepPP Ubiquitin](https://huggingface.co/fondress/PDeepPP_Ubiquitin)
-
- #### Bioactivity Prediction
- - [PDeepPP ACE](https://huggingface.co/fondress/PDeepPP_ACE)
- - [PDeepPP BBP](https://huggingface.co/fondress/PDeepPP_BBP)
- - [PDeepPP DPPIV](https://huggingface.co/fondress/PDeepPP_DPPIV)
- - [PDeepPP Toxicity](https://huggingface.co/fondress/PDeepPP_Toxicity)
- - [PDeepPP Antimalarial](https://huggingface.co/fondress/PDeepPP_Antimalarial-main)
- - [PDeepPP Anticancer](https://huggingface.co/fondress/PDeepPP_Anticancer-main)
- - [PDeepPP Antiviral](https://huggingface.co/fondress/PDeepPP_Antiviral)
- - [PDeepPP Antioxidant](https://huggingface.co/fondress/PDeepPP_Antioxidant)
- - [PDeepPP Antibacterial](https://huggingface.co/fondress/PDeepPP_Antibacterial)
- - [PDeepPP Antifungal](https://huggingface.co/fondress/PDeepPP_Antifungal)
- - [PDeepPP Bitter](https://huggingface.co/fondress/PDeepPP_bitter)
- - [PDeepPP Umami](https://huggingface.co/fondress/PDeepPP_umami)
- - [PDeepPP Quorum](https://huggingface.co/fondress/PDeepPP_Quorum)
- - [PDeepPP TTCA](https://huggingface.co/fondress/PDeepPP_TTCA)
- ---
-
- ## Model Architecture
-
- PDeepPP is built on a hybrid architecture that includes:
-
- - **Self-Attention Global Features**: Captures long-range dependencies in protein sequences.
- - **TransConv1d Module**: Combines transformer layers with convolutional layers for local feature extraction.
- - **PosCNN Module**: Incorporates position-aware convolutional operations to enhance sequence representation.
-
- ---
-
- ## How to Use
-
- To use any of the models, you need to install the required dependencies, such as `torch` and `transformers`:
-
  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  pip install transformers
  ```
- Here’s a quick example of how to load and use a model:
-
  ```python
- from transformers import AutoModel, AutoTokenizer
-
- # Load the model
  model_name = "fondress/PDeepPP_ACE"
- model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
-
- # Example input
- protein_sequence = "VELYP"
- # Preprocess the sequence (refer to specific model documentation for preprocessing steps)
-
- # Forward pass
- outputs = model(input_ids=processed_input)
- logits = outputs.logits
  ```
-
- ## Training and Customization
-
- You can fine-tune PDeepPP for custom tasks using your own datasets. The model supports:
-
- - **Custom PTM types**: Extend the model to predict additional post-translational modifications.
- - **Sequence classification tasks**: Adapt the model to classify protein sequences based on custom labels.
- - **Feature extraction for downstream analyses**: Use PDeepPP to generate embeddings for tasks like clustering or similarity calculation.
-
- Refer to the `PDeepPPConfig` class in the source repository for details on available hyperparameters and customization options.
-
- ---
- ## Citation
- If you use any of the PDeepPP models in your research, please cite the associated paper or repository:
-
  ```
- @article{your_reference,
  title={`PDeepPP`: A Hybrid Model for Protein Sequence Analysis},
  author={Author Name},
  journal={Journal Name},
+ ---
+ tags:
+ - protein language model
+ pipeline_tag: text-classification
+ ---
 
+ # PDeepPP model
+
+ `PDeepPP` is a hybrid protein language model designed to predict post-translational modification (PTM) sites and extract biologically relevant features from protein sequences. By leveraging pretrained embeddings from `ESM` and incorporating both transformer and convolutional neural network (CNN) architectures, `PDeepPP` provides a robust framework for analyzing protein sequences in various contexts.
+
+ ## Model description
+
+ `PDeepPP` is a flexible architecture that integrates transformer-based self-attention with convolutional operations to capture both global and local sequence features. The model consists of:
+
+ 1. A **Self-Attention Global Features module** for capturing long-range dependencies.
+ 2. A **TransConv1d module**, combining transformer and convolutional layers.
+ 3. A **PosCNN module**, which applies position-aware convolutional operations for feature extraction.
+
+ The model is trained with a loss function that combines classification loss with additional regularization terms to enhance generalization and interpretability. It is compatible with Hugging Face's `transformers` library, allowing seamless integration with other tools and workflows.
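The three modules above can be sketched as a single hybrid block. This is a minimal illustration under assumed layer sizes (`dim=128`, 4 attention heads), not the released `PDeepPP` implementation:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: module structure and sizes are assumptions,
# showing how attention-based global features and convolutional local
# features can be fused, as in the attention + TransConv1d/PosCNN design.
class HybridBlock(nn.Module):
    def __init__(self, dim=128, heads=4, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, seq_len, dim)
        global_feats, _ = self.attn(x, x, x)  # long-range dependencies
        # Conv1d expects (batch, dim, seq_len), so transpose around the conv
        local_feats = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local motifs
        return self.norm(x + global_feats + local_feats)  # residual fusion

x = torch.randn(2, 33, 128)  # 2 sequences of length 33
out = HybridBlock()(x)
print(out.shape)  # torch.Size([2, 33, 128])
```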
 
+ ## Intended uses
+
+ `PDeepPP` was developed and validated on PTM and BPS datasets, but its applications are not limited to these tasks. Leveraging its flexible architecture and robust feature extraction capabilities, `PDeepPP` can be applied to a wide range of protein sequence analysis tasks. Specifically, the model has been validated on the following datasets:
+
+ 1. **PTM datasets**: Used for predicting post-translational modification (PTM) sites (e.g., phosphorylation), focusing on serine (S), threonine (T), and tyrosine (Y) residues.
+ 2. **BPS datasets**: Used for analyzing biologically active regions of protein sequences (Biologically Active Protein Sequences, BPS) to support downstream analyses.
+
+ Although the model was trained and validated on PTM and BPS datasets, `PDeepPP`'s architecture enables users to extend its capabilities to other protein sequence analysis tasks, such as embedding generation, sequence classification, or task-specific analyses.
+
+ ---
+ ### Key features
+
+ - **Dataset support**: `PDeepPP` is trained on PTM and BPS datasets, demonstrating its effectiveness at identifying specific sequence features (e.g., post-translational modification sites) and extracting biologically relevant regions.
+ - **Task flexibility**: The model is not limited to PTM and BPS tasks. Users can adapt `PDeepPP` to other protein sequence-based tasks by customizing the input data and task objectives.
+ - **PTM mode**: Focuses on sequences centered on specific residues (S, T, Y) to analyze post-translational modification activity.
+ - **BPS mode**: Analyzes overlapping or non-overlapping subsequences of a protein to extract biologically meaningful features.
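To make the two modes concrete, here is a small sketch of the windowing idea behind them. The helper functions below are purely illustrative assumptions, not part of the released `PDeepPPProcessor` API:

```python
# Illustrative sketch only: these helpers are NOT part of the released
# PDeepPP code; they demonstrate the windowing idea behind the two modes.

def ptm_windows(seq, residues=("S", "T", "Y"), flank=16, pad_char="X"):
    """PTM mode: one fixed-length window centered on each candidate residue."""
    padded = pad_char * flank + seq + pad_char * flank
    windows = []
    for i, aa in enumerate(seq):
        if aa in residues:
            center = i + flank  # position of the residue in the padded string
            windows.append(padded[center - flank : center + flank + 1])
    return windows

def bps_windows(seq, length=33, stride=33, pad_char="X"):
    """BPS mode: overlapping (stride < length) or non-overlapping subsequences."""
    windows = []
    for start in range(0, len(seq), stride):
        chunk = seq[start : start + length]
        windows.append(chunk.ljust(length, pad_char))  # pad the final chunk
    return windows

print(ptm_windows("AKSPTGY", flank=2))  # ['AKSPT', 'SPTGY', 'TGYXX']
print(bps_windows("ESHINQKWVCK", length=6, stride=3))
```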
 
+ ## How to use
+
+ To use `PDeepPP`, install the required dependencies, including `torch` and `transformers`:
+
  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  pip install transformers
  ```
+ Before proceeding, make sure the `DataProcessor` and `Pretraining` files are in the same directory as the example script.
+ Here is an example of how to use `PDeepPP` to process protein sequences and obtain predictions:
+
  ```python
+ import torch
+ import esm
+ from DataProcessor_pdeeppp import PDeepPPProcessor
+ from Pretraining_pdeeppp import PretrainingPDeepPP
+ from transformers import AutoModel
+
+ # Global parameter settings
+ device = torch.device("cpu")
+ pad_char = "X"  # Padding character
+ target_length = 33  # Target length for sequence padding
+ mode = "BPS"  # Mode setting (only configured in example.py)
+ esm_ratio = 1  # Ratio for ESM embeddings
+
+ # Initialize the PDeepPPProcessor
+ processor = PDeepPPProcessor(pad_char=pad_char, target_length=target_length)
+
+ # Example protein sequences (test sequences)
+ protein_sequences = ["VELYP", "YPLDL", "ESHINQKWVCK"]
+
+ # Preprocess the sequences
+ inputs = processor(sequences=protein_sequences, mode=mode, return_tensors="pt")  # Dynamic mode parameter
+ processed_sequences = inputs["raw_sequences"]
+
+ # Load the ESM model
+ esm_model, esm_alphabet = esm.pretrained.esm2_t33_650M_UR50D()
+ esm_model = esm_model.to(device)
+ esm_model.eval()
+
+ # Initialize the PretrainingPDeepPP module
+ pretrainer = PretrainingPDeepPP(
+     embedding_dim=1280,
+     target_length=target_length,
+     esm_ratio=esm_ratio,
+     device=device
+ )
+
+ # Extract the vocabulary and ensure the padding character 'X' is included
+ vocab = set("".join(protein_sequences))
+ vocab.add(pad_char)  # Add the padding character
+
+ # Generate pretrained features using the PretrainingPDeepPP module
+ pretrained_features = pretrainer.create_embeddings(
+     processed_sequences, vocab, esm_model, esm_alphabet
+ )
+
+ # Ensure pretrained features are on the same device
+ inputs["input_embeds"] = pretrained_features.to(device)
+
+ # Load the PDeepPP model
  model_name = "fondress/PDeepPP_ACE"
+ model = AutoModel.from_pretrained(model_name, trust_remote_code=True)  # Directly load the model
+
+ # Perform prediction
+ model.eval()
+ outputs = model(input_embeds=inputs["input_embeds"])  # Use pretrained features as model input
+ logits = outputs["logits"]
+
+ # Compute probabilities and generate predictions
+ # (assumes one logit per sequence; sigmoid maps it to a probability in [0, 1])
+ probabilities = torch.sigmoid(logits).view(-1)
+ predicted_labels = (probabilities >= 0.5).long()
+
+ # Print the prediction results for each sequence
+ print("\nPrediction Results:")
+ for i, seq in enumerate(processed_sequences):
+     print(f"Sequence: {seq}")
+     print(f"Probability: {probabilities[i].item():.4f}")
+     print(f"Predicted Label: {predicted_labels[i].item()}")
+     print("-" * 50)
  ```
+ ## Training and customization
+
+ `PDeepPP` supports fine-tuning on custom datasets. The model uses a configuration class (`PDeepPPConfig`) to specify hyperparameters such as:
+
+ - **Number of transformer layers**
+ - **Hidden layer size**
+ - **Dropout rate**
+ - **PTM type** and other task-specific parameters
+
+ Refer to `PDeepPPConfig` for details.
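As a purely illustrative sketch, a configuration of this shape might look as follows; the field names below are hypothetical stand-ins, not the actual `PDeepPPConfig` attributes, which should be checked in the source repository:

```python
from dataclasses import dataclass, asdict

# Hypothetical sketch: these field names are illustrative only, not the
# real PDeepPPConfig attributes; consult the model repository for those.
@dataclass
class ExampleConfig:
    num_transformer_layers: int = 6      # number of transformer layers
    hidden_size: int = 1280              # hidden layer size
    dropout: float = 0.1                 # dropout rate
    ptm_type: str = "Phosphoserine"      # task-specific parameter

# Override just the hyperparameters you want to change
config = ExampleConfig(dropout=0.2)
print(asdict(config))
```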
 
+ ## Citation
+ If you use `PDeepPP` in your research, please cite the associated paper or repository:
+
  ```
+ @article{your_reference,
  title={`PDeepPP`: A Hybrid Model for Protein Sequence Analysis},
  author={Author Name},
  journal={Journal Name},