fondress committed
Commit f7b4fb6 · verified · 1 Parent(s): 168643f

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +76 -106
README.md CHANGED
@@ -1,140 +1,110 @@
- ---
- tags:
- - protein language model
- pipeline_tag: text-classification
- ---

- # PDeepPP model

- `PDeepPP` is a hybrid protein language model designed to predict post-translational modification (PTM) sites and extract biologically relevant features from protein sequences. By leveraging pretrained embeddings from `ESM` and incorporating both transformer and convolutional neural network (CNN) architectures, `PDeepPP` provides a robust framework for analyzing protein sequences in various contexts.

- ## Model description

- `PDeepPP` is a flexible model architecture that integrates the power of transformer-based self-attention mechanisms with convolutional operations for capturing local and global sequence features. The model consists of:

- 1. A **Self-Attention Global Features module** for capturing long-range dependencies.
- 2. A **TransConv1d module**, combining transformers and convolutional layers.
- 3. A **PosCNN module**, which applies position-aware convolutional operations for feature extraction.

- The model is trained with a loss function that combines classification loss and additional regularization terms to enhance generalization and interpretability. It is compatible with Hugging Face's `transformers` library, allowing seamless integration with other tools and workflows.

- ## Intended uses

- `PDeepPP` was developed and validated using PTM and BPS datasets, but its applications are not limited to these specific tasks. Leveraging its flexible architecture and robust feature extraction capabilities, `PDeepPP` can be applied to a wide range of protein sequence-related analysis tasks. Specifically, the model has been validated on the following datasets:

- 1. **PTM datasets**: Used for predicting post-translational modification (PTM) sites (e.g., phosphorylation), focusing on serine (S), threonine (T), and tyrosine (Y) residues.
- 2. **BPS datasets**: Used for analyzing biologically active regions of protein sequences (Biologically Active Protein Sequences, BPS) to support downstream analyses.

- Although the model was trained and validated on PTM and BPS datasets, `PDeepPP`’s architecture enables users to generalize and extend its capabilities to other protein sequence analysis tasks, such as embedding generation, sequence classification, or task-specific analyses.

- ---

- ### Key features

- - **Dataset support**: `PDeepPP` is trained on PTM and BPS datasets, demonstrating its effectiveness in identifying specific sequence features (e.g., post-translational modification sites) and extracting biologically relevant regions.
- - **Task flexibility**: The model is not limited to PTM and BPS tasks. Users can adapt `PDeepPP` to other protein sequence-based tasks by customizing input data and task objectives.
- - **PTM mode**: Focuses on sequences centered around specific residues (S, T, Y) to analyze post-translational modification activity.
- - **BPS mode**: Analyzes overlapping or non-overlapping subsequences of a protein to extract biologically meaningful features.

- ## How to use

- To use `PDeepPP`, you need to install the required dependencies, including `torch` and `transformers`:

  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  pip install transformers
  ```
- Before proceeding, you need to ensure that the `DataProcessor` and `Pretraining` files are in the same directory as the `example` file.
- Here is an example of how to use PDeepPP to process protein sequences and obtain predictions:

  ```python
- import torch
- import esm
- from DataProcessor_pdeeppp import PDeepPPProcessor
- from Pretraining_pdeeppp import PretrainingPDeepPP
- from transformers import AutoModel
-
- # Global parameter settings
- device = torch.device("cpu")
- pad_char = "X" # Padding character
- target_length = 33 # Target length for sequence padding
- mode = "BPS" # Mode setting (only configured in example.py)
- esm_ratio = 1 # Ratio for ESM embeddings
-
- # Initialize the PDeepPPProcessor
- processor = PDeepPPProcessor(pad_char=pad_char, target_length=target_length)
-
- # Example protein sequences (test sequences)
- protein_sequences = ["VELYP", "YPLDL", "ESHINQKWVCK"]
-
- # Preprocess the sequences
- inputs = processor(sequences=protein_sequences, mode=mode, return_tensors="pt") # Dynamic mode parameter
- processed_sequences = inputs["raw_sequences"]
-
- # Load the ESM model
- esm_model, esm_alphabet = esm.pretrained.esm2_t33_650M_UR50D()
- esm_model = esm_model.to(device)
- esm_model.eval()
-
- # Initialize the PretrainingPDeepPP module
- pretrainer = PretrainingPDeepPP(
-     embedding_dim=1280,
-     target_length=target_length,
-     esm_ratio=esm_ratio,
-     device=device
- )
-
- # Extract the vocabulary and ensure the padding character 'X' is included
- vocab = set("".join(protein_sequences))
- vocab.add(pad_char) # Add the padding character
-
- # Generate pretrained features using the PretrainingPDeepPP module
- pretrained_features = pretrainer.create_embeddings(
-     processed_sequences, vocab, esm_model, esm_alphabet
- )
-
- # Ensure pretrained features are on the same device
- inputs["input_embeds"] = pretrained_features.to(device)
-
- # Load the PDeepPP model
  model_name = "fondress/PDeepPP_ACE"
- model = AutoModel.from_pretrained(model_name, trust_remote_code=True) # Directly load the model
-
- # Perform prediction
- model.eval()
- outputs = model(input_embeds=inputs["input_embeds"]) # Use pretrained features as model input
- logits = outputs["logits"]
-
- # Compute probability distributions and generate predictions
- softmax = torch.nn.Softmax(dim=-1) # Apply softmax on the last dimension
- probabilities = softmax(logits)
- predicted_labels = (probabilities >= 0.5).long()
-
- # Print the prediction results for each sequence
- print("\nPrediction Results:")
- for i, seq in enumerate(processed_sequences):
-     print(f"Sequence: {seq}")
-     print(f"Probability: {probabilities[i].item():.4f}")
-     print(f"Predicted Label: {predicted_labels[i].item()}")
-     print("-" * 50)
  ```

- ## Training and customization

- `PDeepPP` supports fine-tuning on custom datasets. The model uses a configuration class (`PDeepPPConfig`) to specify hyperparameters such as:

- - **Number of transformer layers**
- - **Hidden layer size**
- - **Dropout rate**
- - **PTM type** and other task-specific parameters

- Refer to `PDeepPPConfig` for details.

- ## Citation
- If you use `PDeepPP` in your research, please cite the associated paper or repository:

  ```
- @article{your_reference,
  title={`PDeepPP`: A Hybrid Model for Protein Sequence Analysis},
  author={Author Name},
  journal={Journal Name},

+ # PDeepPP: A Comprehensive Protein Language Model Hub

+ PDeepPP is a hybrid protein language model designed to predict post-translational modification (PTM) sites, analyze biologically relevant features, and support a wide range of protein sequence analysis tasks. This repository serves as the central hub for accessing and exploring the various specialized PDeepPP models, each fine-tuned for a specific task such as PTM site prediction or bioactivity analysis.

+ ## Overview

+ PDeepPP integrates state-of-the-art transformer-based self-attention mechanisms with convolutional neural networks (CNNs) to capture both global and local features in protein sequences. By leveraging pretrained embeddings from `ESM` and incorporating modular architecture components, PDeepPP offers a robust framework for protein sequence analysis.

+ This repository contains links to multiple task-specific PDeepPP models. These models are pre-trained or fine-tuned on publicly available datasets and are hosted on Hugging Face for easy access.

+ ---

+ ## Key Features

+ - **Flexible Architecture**: Combines self-attention and convolutional operations for robust feature extraction.
+ - **Task-Specific Models**: Includes pre-trained models for PTM prediction, bioactivity classification, and more.
+ - **Dataset Support**: Models are validated on datasets such as PTM and BPS, demonstrating performance on real-world tasks.
+ - **Extensibility**: Users can fine-tune the models on custom datasets for new tasks.

+ ---

+ ## Available Models
+
+ ### General Models
+ - [PDeepPP Main](https://huggingface.co/fondress/PDeepPP)
+
+ ### Task-Specific Models
+
+ #### Post-Translational Modifications (PTMs)
+ - [PDeepPP Phosphorylation (Serine)](https://huggingface.co/fondress/PDeepPP_Phosphoserine)
+ - [PDeepPP Phosphorylation (Tyrosine)](https://huggingface.co/fondress/PDeepPP_Phosphorylation-Y)
+ - [PDeepPP Glycosylation (N-linked)](https://huggingface.co/fondress/PDeepPP_N-linked-glycosylation-N)
+ - [PDeepPP Glycosylation (O-linked)](https://huggingface.co/fondress/PDeepPP_O-linked-glycosylation)
+ - [PDeepPP Methylation (Lysine)](https://huggingface.co/fondress/PDeepPP_Methylation-K)
+ - [PDeepPP Methylation (Arginine)](https://huggingface.co/fondress/PDeepPP_Methylation-R)
+ - [PDeepPP SUMOylation](https://huggingface.co/fondress/PDeepPP_SUMOylation)
+ - [PDeepPP Ubiquitin](https://huggingface.co/fondress/PDeepPP_Ubiquitin)
+
+ #### Bioactivity Prediction
+ - [PDeepPP ACE](https://huggingface.co/fondress/PDeepPP_ACE)
+ - [PDeepPP BBP](https://huggingface.co/fondress/PDeepPP_BBP)
+ - [PDeepPP DPPIV](https://huggingface.co/fondress/PDeepPP_DPPIV)
+ - [PDeepPP Toxicity](https://huggingface.co/fondress/PDeepPP_Toxicity)
+ - [PDeepPP Antimalarial](https://huggingface.co/fondress/PDeepPP_Antimalarial-main)
+ - [PDeepPP Anticancer](https://huggingface.co/fondress/PDeepPP_Anticancer-main)
+ - [PDeepPP Antiviral](https://huggingface.co/fondress/PDeepPP_Antiviral)
+ - [PDeepPP Antioxidant](https://huggingface.co/fondress/PDeepPP_Antioxidant)
+ - [PDeepPP Antibacterial](https://huggingface.co/fondress/PDeepPP_Antibacterial)
+ - [PDeepPP Antifungal](https://huggingface.co/fondress/PDeepPP_Antifungal)
+ - [PDeepPP Bitter](https://huggingface.co/fondress/PDeepPP_bitter)
+ - [PDeepPP Umami](https://huggingface.co/fondress/PDeepPP_umami)
+ - [PDeepPP Quorum](https://huggingface.co/fondress/PDeepPP_Quorum)
+ - [PDeepPP TTCA](https://huggingface.co/fondress/PDeepPP_TTCA)
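+
+ All of the checkpoints above are loaded the same way; only the repository ID changes (`trust_remote_code=True` is needed, as in the usage example below). A minimal sketch using a few of the repository IDs listed above:
+
+ ```python
+ from transformers import AutoModel
+
+ # Swap in any repository ID from the "Available Models" list above.
+ for repo_id in ["fondress/PDeepPP_ACE", "fondress/PDeepPP_Toxicity", "fondress/PDeepPP_SUMOylation"]:
+     model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
+     print(repo_id, "->", model.__class__.__name__)
+ ```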

+ ---

+ ## Model Architecture

+ PDeepPP is built on a hybrid architecture that includes:

+ - **Self-Attention Global Features**: Captures long-range dependencies in protein sequences.
+ - **TransConv1d Module**: Combines transformer layers with convolutional layers for local feature extraction.
+ - **PosCNN Module**: Incorporates position-aware convolutional operations to enhance sequence representation.
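+
+ The released implementation ships with the checkpoints above; purely as an illustration of how these pieces fit together (the layer sizes, fusion by addition, and mean pooling below are assumptions, not the actual PDeepPP code), a global self-attention branch and a position-aware convolutional branch can be combined like this:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class HybridSketch(nn.Module):
+     """Illustrative only: global self-attention + position-aware convolution."""
+     def __init__(self, embed_dim=1280, num_heads=8, conv_channels=256, seq_len=33):
+         super().__init__()
+         # Global branch: multi-head self-attention over the residues
+         self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
+         # Local branch: learned positional embedding followed by 1D convolutions
+         self.pos = nn.Parameter(torch.zeros(1, seq_len, embed_dim))
+         self.conv = nn.Sequential(
+             nn.Conv1d(embed_dim, conv_channels, kernel_size=3, padding=1),
+             nn.ReLU(),
+             nn.Conv1d(conv_channels, embed_dim, kernel_size=3, padding=1),
+         )
+         self.classifier = nn.Linear(embed_dim, 2)
+
+     def forward(self, x):                      # x: (batch, seq_len, embed_dim)
+         global_feats, _ = self.attn(x, x, x)   # long-range dependencies
+         local_feats = self.conv((x + self.pos).transpose(1, 2)).transpose(1, 2)
+         fused = global_feats + local_feats     # fuse global and local features
+         return self.classifier(fused.mean(dim=1))  # sequence-level logits
+
+ print(HybridSketch()(torch.randn(2, 33, 1280)).shape)  # torch.Size([2, 2])
+ ```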

+ ---

+ ## How to Use

+ To use any of the models, you need to install the required dependencies, such as `torch` and `transformers`:

  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  pip install transformers
  ```
+ Here’s a quick example of how to load and use a model. The `DataProcessor_pdeeppp` and `Pretraining_pdeeppp` helper files from the source repository must be in the same directory as your script, and the `esm` package (PyPI: `fair-esm`) is used to generate the input embeddings:

  ```python
+ import torch
+ import esm
+ from transformers import AutoModel
+ from DataProcessor_pdeeppp import PDeepPPProcessor
+ from Pretraining_pdeeppp import PretrainingPDeepPP
+
+ device = torch.device("cpu")
+
+ # Preprocess an example sequence (padded with "X" to the model's target length)
+ processor = PDeepPPProcessor(pad_char="X", target_length=33)
+ inputs = processor(sequences=["VELYP"], mode="BPS", return_tensors="pt")
+
+ # Generate pretrained ESM embeddings used as the model input
+ esm_model, esm_alphabet = esm.pretrained.esm2_t33_650M_UR50D()
+ esm_model = esm_model.to(device).eval()
+ pretrainer = PretrainingPDeepPP(embedding_dim=1280, target_length=33, esm_ratio=1, device=device)
+ vocab = set("".join(inputs["raw_sequences"])) | {"X"}
+ input_embeds = pretrainer.create_embeddings(inputs["raw_sequences"], vocab, esm_model, esm_alphabet)
+
+ # Load the task-specific model and run a forward pass
  model_name = "fondress/PDeepPP_ACE"
+ model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
+ model.eval()
+ outputs = model(input_embeds=input_embeds.to(device))
+ logits = outputs["logits"]
  ```
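+
+ The returned logits can then be turned into probabilities and class labels; a minimal follow-up, assuming a binary classification head as in the task-specific models above:
+
+ ```python
+ probabilities = torch.softmax(logits, dim=-1)    # per-class probabilities
+ predicted_labels = probabilities.argmax(dim=-1)  # 0 = negative, 1 = positive (assumed label order)
+ print(probabilities, predicted_labels)
+ ```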

+ ## Training and Customization

+ You can fine-tune PDeepPP for custom tasks using your own datasets. The model supports:

+ - **Custom PTM types**: Extend the model to predict additional post-translational modifications.
+ - **Sequence classification tasks**: Adapt the model to classify protein sequences based on custom labels.
+ - **Feature extraction for downstream analyses**: Use PDeepPP to generate embeddings for tasks like clustering or similarity calculation.

+ Refer to the `PDeepPPConfig` class in the source repository for details on available hyperparameters and customization options.
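+
+ As a rough illustration of what fine-tuning on a custom dataset can look like (not the authors' training recipe: the optimizer, learning rate, epoch count, dummy data, and the assumed logits shape of `(batch, num_classes)` are all placeholders), the embeddings would come from the preprocessing pipeline shown in "How to Use":
+
+ ```python
+ import torch
+ from torch import nn, optim
+ from transformers import AutoModel
+
+ model = AutoModel.from_pretrained("fondress/PDeepPP_ACE", trust_remote_code=True)
+ optimizer = optim.AdamW(model.parameters(), lr=1e-4)  # assumed hyperparameters
+ criterion = nn.CrossEntropyLoss()                     # assumes logits of shape (batch, num_classes)
+
+ # Dummy stand-ins for your own preprocessed data: ESM-style embeddings and binary labels
+ input_embeds = torch.randn(8, 33, 1280)
+ labels = torch.randint(0, 2, (8,))
+
+ model.train()
+ for epoch in range(3):
+     optimizer.zero_grad()
+     logits = model(input_embeds=input_embeds)["logits"]
+     loss = criterion(logits, labels)
+     loss.backward()
+     optimizer.step()
+     print(f"epoch {epoch}: loss {loss.item():.4f}")
+ ```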

+ ---
+ ## Citation
+ If you use any of the PDeepPP models in your research, please cite the associated paper or repository:

  ```
+ @article{your_reference,
  title={`PDeepPP`: A Hybrid Model for Protein Sequence Analysis},
  author={Author Name},
  journal={Journal Name},