---
tags:
- protein language model
pipeline_tag: text-classification
---
# PDeepPP model
`PDeepPP` is a hybrid protein language model designed to predict post-translational modification (PTM) sites and extract biologically relevant features from protein sequences. By leveraging pretrained embeddings from `ESM` and incorporating both transformer and convolutional neural network (CNN) architectures, `PDeepPP` provides a robust framework for analyzing protein sequences in various contexts.
## Model description
`PDeepPP` is a flexible model architecture that combines transformer-based self-attention with convolutional operations to capture both global and local sequence features. The model consists of:
1. A **Self-Attention Global Features module** for capturing long-range dependencies.
2. A **TransConv1d module**, combining transformers and convolutional layers.
3. A **PosCNN module**, which applies position-aware convolutional operations for feature extraction.
The model is trained with a loss function that combines classification loss and additional regularization terms to enhance generalization and interpretability. It is compatible with Hugging Face's `transformers` library, allowing seamless integration with other tools and workflows.
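To make this design concrete, below is a minimal PyTorch sketch of such a hybrid architecture. It is an illustration under assumptions, not the released implementation: the layer counts, hidden sizes, positional scheme, and pooling/fusion choices are all placeholders.

```python
import torch
import torch.nn as nn

class PDeepPPSketch(nn.Module):
    """Illustrative hybrid of the three modules described above (not the released code)."""

    def __init__(self, embed_dim=1280, hidden_dim=256, num_heads=8,
                 seq_len=33, num_classes=2):
        super().__init__()
        # Self-Attention Global Features: long-range dependencies
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.global_attn = nn.TransformerEncoder(layer, num_layers=2)
        # PosCNN-style: learned positional signal feeding a conv stack
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len, embed_dim))
        # TransConv1d-style: 1D convolutions over the sequence axis
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
        )
        self.classifier = nn.Linear(embed_dim + hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim) pretrained ESM embeddings
        g = self.global_attn(x).mean(dim=1)                    # global branch, pooled
        c = self.conv((x + self.pos_embed[:, : x.size(1)])     # local branch, pooled
                      .transpose(1, 2)).mean(dim=2)
        return self.classifier(torch.cat([g, c], dim=-1))
```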
## Intended uses
`PDeepPP` was developed and validated using PTM and BPS datasets, but its applications are not limited to these specific tasks. Leveraging its flexible architecture and robust feature extraction capabilities, `PDeepPP` can be applied to a wide range of protein sequence-related analysis tasks. Specifically, the model has been validated on the following datasets:
1. **PTM datasets**: Used for predicting post-translational modification (PTM) sites (e.g., phosphorylation), focusing on serine (S), threonine (T), and tyrosine (Y) residues.
2. **BPS datasets**: Used for analyzing biologically active regions of protein sequences (Biologically Active Protein Sequences, BPS) to support downstream analyses.
Although the model was trained and validated on PTM and BPS datasets, `PDeepPP`’s architecture enables users to generalize and extend its capabilities to other protein sequence analysis tasks, such as embedding generation, sequence classification, or task-specific analyses.
---
### Key features
- **Dataset support**: `PDeepPP` is trained on PTM and BPS datasets, demonstrating its effectiveness in identifying specific sequence features (e.g., post-translational modification sites) and extracting biologically relevant regions.
- **Task flexibility**: The model is not limited to PTM and BPS tasks. Users can adapt `PDeepPP` to other protein sequence-based tasks by customizing input data and task objectives.
- **PTM mode**: Focuses on sequences centered around specific residues (S, T, Y) to analyze post-translational modification activity.
- **BPS mode**: Analyzes overlapping or non-overlapping subsequences of a protein to extract biologically meaningful features. A sketch of how inputs for both modes could be prepared follows this list.
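The following sketch makes the two input modes concrete. `ptm_windows` and `bps_chunks` are hypothetical helpers written for illustration; they are not part of the released package.

```python
def ptm_windows(sequence, residues="STY", window=33, pad_char="X"):
    """PTM mode: fixed-length windows centered on candidate S/T/Y residues,
    padded with 'X' at the termini (hypothetical helper)."""
    half = window // 2
    padded = pad_char * half + sequence + pad_char * half
    for i, aa in enumerate(sequence):
        if aa in residues:
            yield i, padded[i : i + window]

def bps_chunks(sequence, size=33, stride=16):
    """BPS mode: overlapping subsequences of a protein (hypothetical helper)."""
    return [sequence[i : i + size]
            for i in range(0, max(len(sequence) - size + 1, 1), stride)]

for pos, win in ptm_windows("MKSPTLYAD"):
    print(pos, win)
```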
## How to use
To use `PDeepPP`, you need to install the required dependencies, including `torch` and `transformers`:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers
```
Before proceeding, make sure that `DataProcessor_pdeeppp.py` and `Pretraining_pdeeppp.py` are in the same directory as your example script, since the script imports from both.
Here is an example of how to use `PDeepPP` to process protein sequences and obtain predictions:
```python
import torch
import esm
from DataProcessor_pdeeppp import PDeepPPProcessor
from Pretraining_pdeeppp import PretrainingPDeepPP
from transformers import AutoModel
# Global parameter settings
device = torch.device("cpu")
pad_char = "X" # Padding character
target_length = 33 # Target length for sequence padding
mode = "PTMS" # Mode setting (only configured in example.py)
esm_ratio = 0.95 # Ratio for ESM embeddings
# Load the PDeepPP model
model_name = "fondress/PDeepPP_N-linked-glycosylation-N"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True) # Directly load the model
# Initialize the PDeepPPProcessor
processor = PDeepPPProcessor(pad_char=pad_char, target_length=target_length)
# Example protein sequences (test sequences)
protein_sequences = ["VELYP", "YPLDL", "ESHINQKWVCK"]
# Preprocess the sequences
inputs = processor(sequences=protein_sequences, mode=mode, return_tensors="pt") # Dynamic mode parameter
processed_sequences = inputs["raw_sequences"]
# Load the ESM model
esm_model, esm_alphabet = esm.pretrained.esm2_t33_650M_UR50D()
esm_model = esm_model.to(device)
esm_model.eval()
# Initialize the PretrainingPDeepPP module
pretrainer = PretrainingPDeepPP(
    embedding_dim=1280,
    target_length=target_length,
    esm_ratio=esm_ratio,
    device=device,
)
# Extract the vocabulary and ensure the padding character 'X' is included
vocab = set("".join(protein_sequences))
vocab.add(pad_char) # Add the padding character
# Generate pretrained features using the PretrainingPDeepPP module
pretrained_features = pretrainer.create_embeddings(
    processed_sequences, vocab, esm_model, esm_alphabet
)
# Ensure pretrained features are on the same device
inputs["input_embeds"] = pretrained_features.to(device)
# Perform prediction
model.eval()
outputs = model(input_embeds=inputs["input_embeds"]) # Use pretrained features as model input
logits = outputs["logits"]
# Compute per-sequence probabilities and generate predictions.
# NOTE: this assumes the checkpoint emits a single logit per sequence, so a
# sigmoid + 0.5 threshold applies; if your checkpoint emits two-class logits,
# use softmax over the last dimension and take the positive-class column.
probabilities = torch.sigmoid(logits).squeeze(-1)
predicted_labels = (probabilities >= 0.5).long()
# Print the prediction results for each sequence
print("\nPrediction Results:")
for i, seq in enumerate(processed_sequences):
    print(f"Sequence: {seq}")
    print(f"Probability: {probabilities[i].item():.4f}")
    print(f"Predicted Label: {predicted_labels[i].item()}")
    print("-" * 50)
```
## Training and customization
`PDeepPP` supports fine-tuning on custom datasets. The model uses a configuration class (`PDeepPPConfig`) to specify hyperparameters such as:
- **Number of transformer layers**
- **Hidden layer size**
- **Dropout rate**
- **PTM type** and other task-specific parameters
Refer to `PDeepPPConfig` for details.
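As a hedged sketch of that workflow using the standard `transformers` config API (the overridden attribute name below is an assumption for illustration; inspect the printed config for the fields your checkpoint actually defines):

```python
from transformers import AutoConfig, AutoModel

model_name = "fondress/PDeepPP_N-linked-glycosylation-N"

# Load the checkpoint's PDeepPPConfig and inspect its hyperparameters
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
print(config)  # lists the fields actually defined by this checkpoint

# Override a hyperparameter before (re)loading the model for fine-tuning.
# NOTE: "dropout_rate" is an assumed field name used for illustration only.
config.dropout_rate = 0.2
model = AutoModel.from_pretrained(model_name, config=config, trust_remote_code=True)
```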
## Citation
If you use `PDeepPP` in your research, please cite the associated paper or repository:
```
@article{your_reference,
  title={PDeepPP: A Hybrid Model for Protein Sequence Analysis},
  author={Author Name},
  journal={Journal Name},
  year={2025}
}
``` |