fondress committed
Commit f7b4fb6 · verified · 1 Parent(s): 168643f

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +76 -106
README.md CHANGED
@@ -1,140 +1,110 @@
- ---
- tags:
- - protein language model
- pipeline_tag: text-classification
- ---

- # PDeepPP model

- `PDeepPP` is a hybrid protein language model designed to predict post-translational modification (PTM) sites and extract biologically relevant features from protein sequences. By leveraging pretrained embeddings from `ESM` and incorporating both transformer and convolutional neural network (CNN) architectures, `PDeepPP` provides a robust framework for analyzing protein sequences in various contexts.

- ## Model description

- `PDeepPP` is a flexible model architecture that integrates the power of transformer-based self-attention mechanisms with convolutional operations for capturing local and global sequence features. The model consists of:

- 1. A **Self-Attention Global Features module** for capturing long-range dependencies.
- 2. A **TransConv1d module**, combining transformers and convolutional layers.
- 3. A **PosCNN module**, which applies position-aware convolutional operations for feature extraction.

- The model is trained with a loss function that combines classification loss and additional regularization terms to enhance generalization and interpretability. It is compatible with Hugging Face's `transformers` library, allowing seamless integration with other tools and workflows.

- ## Intended uses

- `PDeepPP` was developed and validated using PTM and BPS datasets, but its applications are not limited to these specific tasks. Leveraging its flexible architecture and robust feature extraction capabilities, `PDeepPP` can be applied to a wide range of protein sequence-related analysis tasks. Specifically, the model has been validated on the following datasets:

- 1. **PTM datasets**: Used for predicting post-translational modification (PTM) sites (e.g., phosphorylation), focusing on serine (S), threonine (T), and tyrosine (Y) residues.
- 2. **BPS datasets**: Used for analyzing biologically active regions of protein sequences (Biologically Active Protein Sequences, BPS) to support downstream analyses.

- Although the model was trained and validated on PTM and BPS datasets, `PDeepPP`’s architecture enables users to generalize and extend its capabilities to other protein sequence analysis tasks, such as embedding generation, sequence classification, or task-specific analyses.

- ---

- ### Key features

- - **Dataset support**: `PDeepPP` is trained on PTM and BPS datasets, demonstrating its effectiveness in identifying specific sequence features (e.g., post-translational modification sites) and extracting biologically relevant regions.
- - **Task flexibility**: The model is not limited to PTM and BPS tasks. Users can adapt `PDeepPP` to other protein sequence-based tasks by customizing input data and task objectives.
- - **PTM mode**: Focuses on sequences centered around specific residues (S, T, Y) to analyze post-translational modification activity.
- - **BPS mode**: Analyzes overlapping or non-overlapping subsequences of a protein to extract biologically meaningful features.

- ## How to use

- To use `PDeepPP`, you need to install the required dependencies, including `torch` and `transformers`:

  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  pip install transformers
  ```
- Before proceeding, you need to ensure that the `DataProcessor` and `Pretraining` files are in the same directory as the `example` file.
- Here is an example of how to use PDeepPP to process protein sequences and obtain predictions:

  ```python
- import torch
- import esm
- from DataProcessor_pdeeppp import PDeepPPProcessor
- from Pretraining_pdeeppp import PretrainingPDeepPP
- from transformers import AutoModel
-
- # Global parameter settings
- device = torch.device("cpu")
- pad_char = "X" # Padding character
- target_length = 33 # Target length for sequence padding
- mode = "BPS" # Mode setting (only configured in example.py)
- esm_ratio = 1 # Ratio for ESM embeddings
-
- # Initialize the PDeepPPProcessor
- processor = PDeepPPProcessor(pad_char=pad_char, target_length=target_length)
-
- # Example protein sequences (test sequences)
- protein_sequences = ["VELYP", "YPLDL", "ESHINQKWVCK"]
-
- # Preprocess the sequences
- inputs = processor(sequences=protein_sequences, mode=mode, return_tensors="pt") # Dynamic mode parameter
- processed_sequences = inputs["raw_sequences"]
-
- # Load the ESM model
- esm_model, esm_alphabet = esm.pretrained.esm2_t33_650M_UR50D()
- esm_model = esm_model.to(device)
- esm_model.eval()
-
- # Initialize the PretrainingPDeepPP module
- pretrainer = PretrainingPDeepPP(
-     embedding_dim=1280,
-     target_length=target_length,
-     esm_ratio=esm_ratio,
-     device=device
- )
-
- # Extract the vocabulary and ensure the padding character 'X' is included
- vocab = set("".join(protein_sequences))
- vocab.add(pad_char) # Add the padding character
-
- # Generate pretrained features using the PretrainingPDeepPP module
- pretrained_features = pretrainer.create_embeddings(
-     processed_sequences, vocab, esm_model, esm_alphabet
- )
-
- # Ensure pretrained features are on the same device
- inputs["input_embeds"] = pretrained_features.to(device)
-
- # Load the PDeepPP model
  model_name = "fondress/PDeepPP_ACE"
- model = AutoModel.from_pretrained(model_name, trust_remote_code=True) # Directly load the model
-
- # Perform prediction
- model.eval()
- outputs = model(input_embeds=inputs["input_embeds"]) # Use pretrained features as model input
- logits = outputs["logits"]
-
- # Compute probability distributions and generate predictions
- softmax = torch.nn.Softmax(dim=-1) # Apply softmax on the last dimension
- probabilities = softmax(logits)
- predicted_labels = (probabilities >= 0.5).long()
-
- # Print the prediction results for each sequence
- print("\nPrediction Results:")
- for i, seq in enumerate(processed_sequences):
-     print(f"Sequence: {seq}")
-     print(f"Probability: {probabilities[i].item():.4f}")
-     print(f"Predicted Label: {predicted_labels[i].item()}")
-     print("-" * 50)
  ```

- ## Training and customization

- `PDeepPP` supports fine-tuning on custom datasets. The model uses a configuration class (`PDeepPPConfig`) to specify hyperparameters such as:

- - **Number of transformer layers**
- - **Hidden layer size**
- - **Dropout rate**
- - **PTM type** and other task-specific parameters

- Refer to `PDeepPPConfig` for details.

- ## Citation
- If you use `PDeepPP` in your research, please cite the associated paper or repository:

  ```
- @article{your_reference,
  title={`PDeepPP`: A Hybrid Model for Protein Sequence Analysis},
  author={Author Name},
  journal={Journal Name},

+ # PDeepPP: A Comprehensive Protein Language Model Hub

+ PDeepPP is a hybrid protein language model designed to predict post-translational modification (PTM) sites, analyze biologically relevant features, and support a wide range of protein sequence analysis tasks. This repository serves as the central hub for accessing and exploring the various specialized PDeepPP models, each fine-tuned for a specific task such as PTM site prediction or bioactivity analysis.

+ ## Overview

+ PDeepPP integrates state-of-the-art transformer-based self-attention mechanisms with convolutional neural networks (CNNs) to capture both global and local features in protein sequences. By leveraging pretrained embeddings from `ESM` and incorporating modular architecture components, PDeepPP offers a robust framework for protein sequence analysis.

+ This repository contains links to multiple task-specific PDeepPP models. These models are pre-trained or fine-tuned on publicly available datasets and are hosted on Hugging Face for easy access.

+ ---

+ ## Key Features

+ - **Flexible Architecture**: Combines self-attention and convolutional operations for robust feature extraction.
+ - **Task-Specific Models**: Includes pre-trained models for PTM prediction, bioactivity classification, and more.
+ - **Dataset Support**: Models are validated on datasets such as PTM and BPS, demonstrating performance on real-world tasks.
+ - **Extensibility**: Users can fine-tune the models on custom datasets for new tasks.

+ ---

+ ## Available Models
+
+ ### General Models
+ - [PDeepPP Main](https://huggingface.co/fondress/PDeepPP)
+
+ ### Task-Specific Models
+
+ #### Post-Translational Modifications (PTMs)
+ - [PDeepPP Phosphorylation (Serine)](https://huggingface.co/fondress/PDeepPP_Phosphoserine)
+ - [PDeepPP Phosphorylation (Tyrosine)](https://huggingface.co/fondress/PDeepPP_Phosphorylation-Y)
+ - [PDeepPP Glycosylation (N-linked)](https://huggingface.co/fondress/PDeepPP_N-linked-glycosylation-N)
+ - [PDeepPP Glycosylation (O-linked)](https://huggingface.co/fondress/PDeepPP_O-linked-glycosylation)
+ - [PDeepPP Methylation (Lysine)](https://huggingface.co/fondress/PDeepPP_Methylation-K)
+ - [PDeepPP Methylation (Arginine)](https://huggingface.co/fondress/PDeepPP_Methylation-R)
+ - [PDeepPP SUMOylation](https://huggingface.co/fondress/PDeepPP_SUMOylation)
+ - [PDeepPP Ubiquitin](https://huggingface.co/fondress/PDeepPP_Ubiquitin)
+
+ #### Bioactivity Prediction
+ - [PDeepPP ACE](https://huggingface.co/fondress/PDeepPP_ACE)
+ - [PDeepPP BBP](https://huggingface.co/fondress/PDeepPP_BBP)
+ - [PDeepPP DPPIV](https://huggingface.co/fondress/PDeepPP_DPPIV)
+ - [PDeepPP Toxicity](https://huggingface.co/fondress/PDeepPP_Toxicity)
+ - [PDeepPP Antimalarial](https://huggingface.co/fondress/PDeepPP_Antimalarial-main)
+ - [PDeepPP Anticancer](https://huggingface.co/fondress/PDeepPP_Anticancer-main)
+ - [PDeepPP Antiviral](https://huggingface.co/fondress/PDeepPP_Antiviral)
+ - [PDeepPP Antioxidant](https://huggingface.co/fondress/PDeepPP_Antioxidant)
+ - [PDeepPP Antibacterial](https://huggingface.co/fondress/PDeepPP_Antibacterial)
+ - [PDeepPP Antifungal](https://huggingface.co/fondress/PDeepPP_Antifungal)
+ - [PDeepPP Bitter](https://huggingface.co/fondress/PDeepPP_bitter)
+ - [PDeepPP Umami](https://huggingface.co/fondress/PDeepPP_umami)
+ - [PDeepPP Quorum](https://huggingface.co/fondress/PDeepPP_Quorum)
+ - [PDeepPP TTCA](https://huggingface.co/fondress/PDeepPP_TTCA)
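+
+ All of the checkpoints above are loaded the same way; only the repository ID changes (`trust_remote_code=True` is needed, as in the usage example below). A minimal sketch using a few of the repository IDs listed above:
+
+ ```python
+ from transformers import AutoModel
+
+ # Swap in any repository ID from the "Available Models" list above.
+ for repo_id in ["fondress/PDeepPP_ACE", "fondress/PDeepPP_Toxicity", "fondress/PDeepPP_SUMOylation"]:
+     model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
+     print(repo_id, "->", model.__class__.__name__)
+ ```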

+ ---

+ ## Model Architecture

+ PDeepPP is built on a hybrid architecture that includes:

+ - **Self-Attention Global Features**: Captures long-range dependencies in protein sequences.
+ - **TransConv1d Module**: Combines transformer layers with convolutional layers for local feature extraction.
+ - **PosCNN Module**: Incorporates position-aware convolutional operations to enhance sequence representation.
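+
+ The released implementation ships with the checkpoints above; purely as an illustration of how these pieces fit together (the layer sizes, fusion by addition, and mean pooling below are assumptions, not the actual PDeepPP code), a global self-attention branch and a position-aware convolutional branch can be combined like this:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class HybridSketch(nn.Module):
+     """Illustrative only: global self-attention + position-aware convolution."""
+     def __init__(self, embed_dim=1280, num_heads=8, conv_channels=256, seq_len=33):
+         super().__init__()
+         # Global branch: multi-head self-attention over the residues
+         self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
+         # Local branch: learned positional embedding followed by 1D convolutions
+         self.pos = nn.Parameter(torch.zeros(1, seq_len, embed_dim))
+         self.conv = nn.Sequential(
+             nn.Conv1d(embed_dim, conv_channels, kernel_size=3, padding=1),
+             nn.ReLU(),
+             nn.Conv1d(conv_channels, embed_dim, kernel_size=3, padding=1),
+         )
+         self.classifier = nn.Linear(embed_dim, 2)
+
+     def forward(self, x):                      # x: (batch, seq_len, embed_dim)
+         global_feats, _ = self.attn(x, x, x)   # long-range dependencies
+         local_feats = self.conv((x + self.pos).transpose(1, 2)).transpose(1, 2)
+         fused = global_feats + local_feats     # fuse global and local features
+         return self.classifier(fused.mean(dim=1))  # sequence-level logits
+
+ print(HybridSketch()(torch.randn(2, 33, 1280)).shape)  # torch.Size([2, 2])
+ ```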

+ ---

+ ## How to Use

+ To use any of the models, you need to install the required dependencies, such as `torch` and `transformers`:

  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  pip install transformers
  ```
+ Here’s a quick example of how to load and use a model. The `DataProcessor_pdeeppp` and `Pretraining_pdeeppp` helper files from the source repository must be in the same directory as your script, and the `esm` package (PyPI: `fair-esm`) is used to generate the input embeddings:

  ```python
+ import torch
+ import esm
+ from transformers import AutoModel
+ from DataProcessor_pdeeppp import PDeepPPProcessor
+ from Pretraining_pdeeppp import PretrainingPDeepPP
+
+ device = torch.device("cpu")
+
+ # Preprocess an example sequence (padded with "X" to the model's target length)
+ processor = PDeepPPProcessor(pad_char="X", target_length=33)
+ inputs = processor(sequences=["VELYP"], mode="BPS", return_tensors="pt")
+
+ # Generate pretrained ESM embeddings used as the model input
+ esm_model, esm_alphabet = esm.pretrained.esm2_t33_650M_UR50D()
+ esm_model = esm_model.to(device).eval()
+ pretrainer = PretrainingPDeepPP(embedding_dim=1280, target_length=33, esm_ratio=1, device=device)
+ vocab = set("".join(inputs["raw_sequences"])) | {"X"}
+ input_embeds = pretrainer.create_embeddings(inputs["raw_sequences"], vocab, esm_model, esm_alphabet)
+
+ # Load the task-specific model and run a forward pass
  model_name = "fondress/PDeepPP_ACE"
+ model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
+ model.eval()
+ outputs = model(input_embeds=input_embeds.to(device))
+ logits = outputs["logits"]
  ```
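+
+ The returned logits can then be turned into probabilities and class labels; a minimal follow-up, assuming a binary classification head as in the task-specific models above:
+
+ ```python
+ probabilities = torch.softmax(logits, dim=-1)    # per-class probabilities
+ predicted_labels = probabilities.argmax(dim=-1)  # 0 = negative, 1 = positive (assumed label order)
+ print(probabilities, predicted_labels)
+ ```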

+ ## Training and Customization

+ You can fine-tune PDeepPP for custom tasks using your own datasets. The model supports:

+ - **Custom PTM types**: Extend the model to predict additional post-translational modifications.
+ - **Sequence classification tasks**: Adapt the model to classify protein sequences based on custom labels.
+ - **Feature extraction for downstream analyses**: Use PDeepPP to generate embeddings for tasks like clustering or similarity calculation.

+ Refer to the `PDeepPPConfig` class in the source repository for details on available hyperparameters and customization options.
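+
+ As a rough illustration of what fine-tuning on a custom dataset can look like (not the authors' training recipe: the optimizer, learning rate, epoch count, dummy data, and the assumed logits shape of `(batch, num_classes)` are all placeholders), the embeddings would come from the preprocessing pipeline shown in "How to Use":
+
+ ```python
+ import torch
+ from torch import nn, optim
+ from transformers import AutoModel
+
+ model = AutoModel.from_pretrained("fondress/PDeepPP_ACE", trust_remote_code=True)
+ optimizer = optim.AdamW(model.parameters(), lr=1e-4)  # assumed hyperparameters
+ criterion = nn.CrossEntropyLoss()                     # assumes logits of shape (batch, num_classes)
+
+ # Dummy stand-ins for your own preprocessed data: ESM-style embeddings and binary labels
+ input_embeds = torch.randn(8, 33, 1280)
+ labels = torch.randint(0, 2, (8,))
+
+ model.train()
+ for epoch in range(3):
+     optimizer.zero_grad()
+     logits = model(input_embeds=input_embeds)["logits"]
+     loss = criterion(logits, labels)
+     loss.backward()
+     optimizer.step()
+     print(f"epoch {epoch}: loss {loss.item():.4f}")
+ ```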

+ ---
+ ## Citation
+ If you use any of the PDeepPP models in your research, please cite the associated paper or repository:

  ```
+ @article{your_reference,
  title={`PDeepPP`: A Hybrid Model for Protein Sequence Analysis},
  author={Author Name},
  journal={Journal Name},