fondress committed on
Commit ff0d718 · verified · 1 Parent(s): 0d2ebd0

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +106 -76
README.md CHANGED
@@ -1,110 +1,140 @@
- # PDeepPP: A Comprehensive Protein Language Model Hub
-
- PDeepPP is a hybrid protein language model designed to predict post-translational modification (PTM) sites, analyze biologically relevant features, and support a wide range of protein sequence analysis tasks. This repository serves as the central hub for accessing and exploring various specialized PDeepPP models, each fine-tuned for specific tasks, such as PTM site prediction, bioactivity analysis, and more.
-
- ## Overview
-
- PDeepPP integrates state-of-the-art transformer-based self-attention mechanisms with convolutional neural networks (CNNs) to capture both global and local features in protein sequences. By leveraging pretrained embeddings from `ESM` and incorporating modular architecture components, PDeepPP offers a robust framework for protein sequence analysis.
-
- This repository contains links to multiple task-specific PDeepPP models. These models are pre-trained or fine-tuned on publicly available datasets and are hosted on Hugging Face for easy access.
-
- ---
-
- ## Key Features
-
- - **Flexible Architecture**: Combines self-attention and convolutional operations for robust feature extraction.
- - **Task-Specific Models**: Includes pre-trained models for PTM prediction, bioactivity classification, and more.
- - **Dataset Support**: Models are validated on datasets such as PTM and BPS, ensuring performance on real-world tasks.
- - **Extensibility**: Users can fine-tune the models on custom datasets for new tasks.
-
- ---
- ## Available Models
-
- ### General Models
- - [PDeepPP Main](https://huggingface.co/fondress/PDeepPP)
-
- ### Task-Specific Models
-
- #### Post-Translational Modifications (PTMs)
- - [PDeepPP Phosphorylation (Serine)](https://huggingface.co/fondress/PDeepPP_Phosphoserine)
- - [PDeepPP Phosphorylation (Tyrosine)](https://huggingface.co/fondress/PDeepPP_Phosphorylation-Y)
- - [PDeepPP Glycosylation (N-linked)](https://huggingface.co/fondress/PDeepPP_N-linked-glycosylation-N)
- - [PDeepPP Glycosylation (O-linked)](https://huggingface.co/fondress/PDeepPP_O-linked-glycosylation)
- - [PDeepPP Methylation (Lysine)](https://huggingface.co/fondress/PDeepPP_Methylation-K)
- - [PDeepPP Methylation (Arginine)](https://huggingface.co/fondress/PDeepPP_Methylation-R)
- - [PDeepPP SUMOylation](https://huggingface.co/fondress/PDeepPP_SUMOylation)
- - [PDeepPP Ubiquitin](https://huggingface.co/fondress/PDeepPP_Ubiquitin)
-
- #### Bioactivity Prediction
- - [PDeepPP ACE](https://huggingface.co/fondress/PDeepPP_ACE)
- - [PDeepPP BBP](https://huggingface.co/fondress/PDeepPP_BBP)
- - [PDeepPP DPPIV](https://huggingface.co/fondress/PDeepPP_DPPIV)
- - [PDeepPP Toxicity](https://huggingface.co/fondress/PDeepPP_Toxicity)
- - [PDeepPP Antimalarial](https://huggingface.co/fondress/PDeepPP_Antimalarial-main)
- - [PDeepPP Anticancer](https://huggingface.co/fondress/PDeepPP_Anticancer-main)
- - [PDeepPP Antiviral](https://huggingface.co/fondress/PDeepPP_Antiviral)
- - [PDeepPP Antioxidant](https://huggingface.co/fondress/PDeepPP_Antioxidant)
- - [PDeepPP Antibacterial](https://huggingface.co/fondress/PDeepPP_Antibacterial)
- - [PDeepPP Antifungal](https://huggingface.co/fondress/PDeepPP_Antifungal)
- - [PDeepPP Bitter](https://huggingface.co/fondress/PDeepPP_bitter)
- - [PDeepPP Umami](https://huggingface.co/fondress/PDeepPP_umami)
- - [PDeepPP Quorum](https://huggingface.co/fondress/PDeepPP_Quorum)
- - [PDeepPP TTCA](https://huggingface.co/fondress/PDeepPP_TTCA)
- ---
-
- ## Model Architecture
-
- PDeepPP is built on a hybrid architecture that includes:
-
- - **Self-Attention Global Features**: Captures long-range dependencies in protein sequences.
- - **TransConv1d Module**: Combines transformer layers with convolutional layers for local feature extraction.
- - **PosCNN Module**: Incorporates position-aware convolutional operations to enhance sequence representation.
-
- ---
-
- ## How to Use
-
- To use any of the models, you need to install the required dependencies, such as `torch` and `transformers`:
-
  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  pip install transformers
  ```
- Here’s a quick example of how to load and use a model:
-
  ```python
- from transformers import AutoModel, AutoTokenizer
-
- # Load the model
  model_name = "fondress/PDeepPP_ACE"
- model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
-
- # Example input
- protein_sequence = "VELYP"
- # Preprocess the sequence (refer to specific model documentation for preprocessing steps)
-
- # Forward pass
- outputs = model(input_ids=processed_input)
- logits = outputs.logits
  ```
-
- ## Training and Customization
-
- You can fine-tune PDeepPP for custom tasks using your own datasets. The model supports:
-
- - **Custom PTM types**: Extend the model to predict additional post-translational modifications.
- - **Sequence classification tasks**: Adapt the model to classify protein sequences based on custom labels.
- - **Feature extraction for downstream analyses**: Use PDeepPP to generate embeddings for tasks like clustering or similarity calculation.
-
- Refer to the `PDeepPPConfig` class in the source repository for details on available hyperparameters and customization options.
-
- ---
- ## Citation
- If you use any of the PDeepPP models in your research, please cite the associated paper or repository:
-
  ```
- @article{your_reference,
  title={`PDeepPP`: A Hybrid Model for Protein Sequence Analysis},
  author={Author Name},
  journal={Journal Name},
+ ---
+ tags:
+ - protein language model
+ pipeline_tag: text-classification
+ ---
 
+ # PDeepPP model
+
+ `PDeepPP` is a hybrid protein language model designed to predict post-translational modification (PTM) sites and extract biologically relevant features from protein sequences. By leveraging pretrained embeddings from `ESM` and incorporating both transformer and convolutional neural network (CNN) architectures, `PDeepPP` provides a robust framework for analyzing protein sequences in various contexts.
+
+ ## Model description
+
+ `PDeepPP` is a flexible architecture that integrates transformer-based self-attention with convolutional operations to capture both global and local sequence features. The model consists of:
+
+ 1. A **Self-Attention Global Features module** for capturing long-range dependencies.
+ 2. A **TransConv1d module**, combining transformer and convolutional layers.
+ 3. A **PosCNN module**, which applies position-aware convolutional operations for feature extraction.
+
+ The model is trained with a loss function that combines classification loss with additional regularization terms to enhance generalization and interpretability. It is compatible with Hugging Face's `transformers` library, allowing seamless integration with other tools and workflows.
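The three modules above can be sketched as a single hybrid block. This is a minimal illustration under assumed layer sizes (`dim=128`, 4 attention heads), not the released `PDeepPP` implementation:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: module structure and sizes are assumptions,
# showing how attention-based global features and convolutional local
# features can be fused, as in the attention + TransConv1d/PosCNN design.
class HybridBlock(nn.Module):
    def __init__(self, dim=128, heads=4, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, seq_len, dim)
        global_feats, _ = self.attn(x, x, x)  # long-range dependencies
        # Conv1d expects (batch, dim, seq_len), so transpose around the conv
        local_feats = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local motifs
        return self.norm(x + global_feats + local_feats)  # residual fusion

x = torch.randn(2, 33, 128)  # 2 sequences of length 33
out = HybridBlock()(x)
print(out.shape)  # torch.Size([2, 33, 128])
```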
 
+ ## Intended uses
+
+ `PDeepPP` was developed and validated on PTM and BPS datasets, but its applications are not limited to these tasks. Leveraging its flexible architecture and robust feature extraction capabilities, `PDeepPP` can be applied to a wide range of protein sequence analysis tasks. Specifically, the model has been validated on the following datasets:
+
+ 1. **PTM datasets**: Used for predicting post-translational modification (PTM) sites (e.g., phosphorylation), focusing on serine (S), threonine (T), and tyrosine (Y) residues.
+ 2. **BPS datasets**: Used for analyzing biologically active regions of protein sequences (Biologically Active Protein Sequences, BPS) to support downstream analyses.
+
+ Although the model was trained and validated on PTM and BPS datasets, `PDeepPP`'s architecture enables users to extend its capabilities to other protein sequence analysis tasks, such as embedding generation, sequence classification, or task-specific analyses.
+
+ ---
+ ### Key features
+
+ - **Dataset support**: `PDeepPP` is trained on PTM and BPS datasets, demonstrating its effectiveness at identifying specific sequence features (e.g., post-translational modification sites) and extracting biologically relevant regions.
+ - **Task flexibility**: The model is not limited to PTM and BPS tasks. Users can adapt `PDeepPP` to other protein sequence-based tasks by customizing the input data and task objectives.
+ - **PTM mode**: Focuses on sequences centered on specific residues (S, T, Y) to analyze post-translational modification activity.
+ - **BPS mode**: Analyzes overlapping or non-overlapping subsequences of a protein to extract biologically meaningful features.
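To make the two modes concrete, here is a small sketch of the windowing idea behind them. The helper functions below are purely illustrative assumptions, not part of the released `PDeepPPProcessor` API:

```python
# Illustrative sketch only: these helpers are NOT part of the released
# PDeepPP code; they demonstrate the windowing idea behind the two modes.

def ptm_windows(seq, residues=("S", "T", "Y"), flank=16, pad_char="X"):
    """PTM mode: one fixed-length window centered on each candidate residue."""
    padded = pad_char * flank + seq + pad_char * flank
    windows = []
    for i, aa in enumerate(seq):
        if aa in residues:
            center = i + flank  # position of the residue in the padded string
            windows.append(padded[center - flank : center + flank + 1])
    return windows

def bps_windows(seq, length=33, stride=33, pad_char="X"):
    """BPS mode: overlapping (stride < length) or non-overlapping subsequences."""
    windows = []
    for start in range(0, len(seq), stride):
        chunk = seq[start : start + length]
        windows.append(chunk.ljust(length, pad_char))  # pad the final chunk
    return windows

print(ptm_windows("AKSPTGY", flank=2))  # ['AKSPT', 'SPTGY', 'TGYXX']
print(bps_windows("ESHINQKWVCK", length=6, stride=3))
```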
 
+ ## How to use
+
+ To use `PDeepPP`, install the required dependencies, including `torch` and `transformers`:
+
  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  pip install transformers
  ```
+ Before proceeding, make sure the `DataProcessor` and `Pretraining` files are in the same directory as the example script.
+ Here is an example of how to use `PDeepPP` to process protein sequences and obtain predictions:
+
  ```python
+ import torch
+ import esm
+ from DataProcessor_pdeeppp import PDeepPPProcessor
+ from Pretraining_pdeeppp import PretrainingPDeepPP
+ from transformers import AutoModel
+
+ # Global parameter settings
+ device = torch.device("cpu")
+ pad_char = "X"  # Padding character
+ target_length = 33  # Target length for sequence padding
+ mode = "BPS"  # Mode setting (only configured in example.py)
+ esm_ratio = 1  # Ratio for ESM embeddings
+
+ # Initialize the PDeepPPProcessor
+ processor = PDeepPPProcessor(pad_char=pad_char, target_length=target_length)
+
+ # Example protein sequences (test sequences)
+ protein_sequences = ["VELYP", "YPLDL", "ESHINQKWVCK"]
+
+ # Preprocess the sequences
+ inputs = processor(sequences=protein_sequences, mode=mode, return_tensors="pt")  # Dynamic mode parameter
+ processed_sequences = inputs["raw_sequences"]
+
+ # Load the ESM model
+ esm_model, esm_alphabet = esm.pretrained.esm2_t33_650M_UR50D()
+ esm_model = esm_model.to(device)
+ esm_model.eval()
+
+ # Initialize the PretrainingPDeepPP module
+ pretrainer = PretrainingPDeepPP(
+     embedding_dim=1280,
+     target_length=target_length,
+     esm_ratio=esm_ratio,
+     device=device
+ )
+
+ # Extract the vocabulary and ensure the padding character 'X' is included
+ vocab = set("".join(protein_sequences))
+ vocab.add(pad_char)  # Add the padding character
+
+ # Generate pretrained features using the PretrainingPDeepPP module
+ pretrained_features = pretrainer.create_embeddings(
+     processed_sequences, vocab, esm_model, esm_alphabet
+ )
+
+ # Ensure pretrained features are on the same device
+ inputs["input_embeds"] = pretrained_features.to(device)
+
+ # Load the PDeepPP model
  model_name = "fondress/PDeepPP_ACE"
+ model = AutoModel.from_pretrained(model_name, trust_remote_code=True)  # Directly load the model
+
+ # Perform prediction
+ model.eval()
+ outputs = model(input_embeds=inputs["input_embeds"])  # Use pretrained features as model input
+ logits = outputs["logits"]
+
+ # Compute probabilities and generate predictions
+ # (assumes one logit per sequence; sigmoid maps it to a probability in [0, 1])
+ probabilities = torch.sigmoid(logits).view(-1)
+ predicted_labels = (probabilities >= 0.5).long()
+
+ # Print the prediction results for each sequence
+ print("\nPrediction Results:")
+ for i, seq in enumerate(processed_sequences):
+     print(f"Sequence: {seq}")
+     print(f"Probability: {probabilities[i].item():.4f}")
+     print(f"Predicted Label: {predicted_labels[i].item()}")
+     print("-" * 50)
  ```
+ ## Training and customization
+
+ `PDeepPP` supports fine-tuning on custom datasets. The model uses a configuration class (`PDeepPPConfig`) to specify hyperparameters such as:
+
+ - **Number of transformer layers**
+ - **Hidden layer size**
+ - **Dropout rate**
+ - **PTM type** and other task-specific parameters
+
+ Refer to `PDeepPPConfig` for details.
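As a purely illustrative sketch, a configuration of this shape might look as follows; the field names below are hypothetical stand-ins, not the actual `PDeepPPConfig` attributes, which should be checked in the source repository:

```python
from dataclasses import dataclass, asdict

# Hypothetical sketch: these field names are illustrative only, not the
# real PDeepPPConfig attributes; consult the model repository for those.
@dataclass
class ExampleConfig:
    num_transformer_layers: int = 6      # number of transformer layers
    hidden_size: int = 1280              # hidden layer size
    dropout: float = 0.1                 # dropout rate
    ptm_type: str = "Phosphoserine"      # task-specific parameter

# Override just the hyperparameters you want to change
config = ExampleConfig(dropout=0.2)
print(asdict(config))
```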
 
+ ## Citation
+ If you use `PDeepPP` in your research, please cite the associated paper or repository:
+
  ```
+ @article{your_reference,
  title={`PDeepPP`: A Hybrid Model for Protein Sequence Analysis},
  author={Author Name},
  journal={Journal Name},