kouroshSA committed on
Commit
e683105
·
verified ·
1 Parent(s): df2cc2b

Add YAML model-card front-matter to README

  1. README.md +74 -40
README.md CHANGED
@@ -1,30 +1,54 @@
1
- # ppiDCE
2
-
3
- A dual cross-encoder for binary protein-protein interaction (PPI) classification, built on ESM-1b ([Rives et al., 2021](https://doi.org/10.1073/pnas.2016239118)).
4
-
5
- ![ppiDCE Architecture](assets/ppiDCE.png)
6
 
7
  ## Overview
8
 
9
- ppiDCE repurposes ESM-1b -- a single-sequence masked language model with no native PPI capability -- for protein-protein interaction prediction by exploiting its tokenizer's sentence-pair encoding mode. Both protein sequences are concatenated into a single input as `[CLS] Seq_A [SEP] Seq_B [EOS]`, enabling full bidirectional cross-attention between the two sequences at every transformer layer. The `[CLS]` token representation from the final layer captures joint inter-protein features and is passed through a dropout + linear classification head to produce binary interaction predictions with softmax probabilities.
10
 
11
- The model was developed for the *Prochlorococcus marinus* MED4 interactome, where it serves as one component of a tri-model consensus framework (alongside [ppiGPLM](https://github.com/kouroshSA/ppiGPLM) and [ppiBTEP](https://github.com/kouroshSA/ppiBTEP)) for computational PPI screening.
12
 
13
  ## Architecture
14
 
15
  | Parameter | Value |
16
  |-----------|-------|
17
  | Foundation | ESM-1b (facebook/esm1b_t33_650M_UR50S) |
18
- | Strategy | Cross-encoding (sentence-pair) |
19
- | Layers | 12 (configurable) |
20
- | Classification | [CLS] -> Dropout(0.1) -> Linear -> 2 |
21
  | Max sequence length | 1,024 tokens |
22
- | Optimizer | AdamW (lr = 2 x 10^-5) |
23
  | Loss | Cross-Entropy |
24
 
25
- ### Cross-Encoding vs Single-Sequence
26
 
27
- Unlike the original ESM-1b which processes one protein at a time, ppiDCE feeds both proteins as a single concatenated input. This enables inter-protein residue-residue attention at every transformer layer -- the most expressive strategy for modeling pairwise interactions, at the cost of O((n+m)^2) attention complexity.
28
 
29
  ## Installation
30
 
@@ -38,8 +62,8 @@ Unlike the original ESM-1b which processes one protein at a time, ppiDCE feeds b
38
 
39
  ```bash
40
  # Clone the repository
41
- git clone https://github.com/kouroshSA/ppiDCE.git
42
- cd ppiDCE
43
 
44
  # Create a conda environment
45
  conda create -n esm python=3.10
@@ -50,12 +74,12 @@ pip install -r requirements.txt
50
  ## Repository Structure
51
 
52
  ```
53
- ppiDCE/
54
- |-- train_ppiDCE.py # Training script
55
- |-- inference_ppiDCE.py # Batch inference script
56
  |-- roc_analysis_color_threshold_F1e.py # ROC curve analysis with F1 optimization
57
  |-- assets/
58
- | +-- ppiDCE.png # ASCII workflow diagram
59
  |-- requirements.txt
60
  |-- LICENSE
61
  +-- README.md
@@ -65,10 +89,10 @@ ppiDCE/
65
 
66
  ### Data Format
67
 
68
- Training and inference use CSV files with columns: `protein1_seq, protein2_seq, label`
69
 
70
- - `protein1_seq`, `protein2_seq`: Amino acid sequences
71
- - `label`: `0` (non-interacting) or `1` (interacting)
72
 
73
  For inference-only input, only the first two columns are required.
74
 
@@ -76,15 +100,15 @@ For inference-only input, only the first two columns are required.
76
 
77
  ```bash
78
  # Train from scratch with 12 layers
79
- python train_ppiDCE.py \
80
  --train_file train.csv \
81
  --val_file val.csv \
82
  --model_config facebook/esm1b_t33_650M_UR50S \
83
- --from_scratch \
84
  --num_layers 12 \
85
- --epochs 10 \
 
86
  --batch_size 2 \
87
- --learning_rate 2e-5 \
88
  --max_length 1024 \
89
  --output_dir ./out \
90
  --device cuda
@@ -92,21 +116,20 @@ python train_ppiDCE.py \
92
 
93
  #### Key training options
94
 
95
- - `--from_scratch`: Initialize the ESM backbone with random weights instead of
96
- loading pretrained ESM-1b. Useful when you suspect single-sequence
97
- pretraining priors are inappropriate for your task.
98
- - `--num_layers N`: Set total transformer layers when training from scratch
99
- - `--freeze_layers N`: Freeze bottom N layers during fine-tuning
100
- - `--add_layers N`: Append extra transformer layers on top
101
  - `--checkpoint path.pth`: Resume from a saved checkpoint
102
- - `--suppress_warnings`: Suppress tokenizer truncation warnings
103
 
104
  ### Inference
105
 
106
  ```bash
107
- python inference_ppiDCE.py \
108
- --model_path out/ppiDCE_epoch8.pth \
109
  --model_config facebook/esm1b_t33_650M_UR50S \
 
110
  --input_file test_pairs.csv \
111
  --output_file predictions.csv \
112
  --batch_size 4 \
@@ -114,7 +137,18 @@ python inference_ppiDCE.py \
114
  --device cuda
115
  ```
116
 
117
- Output CSV columns: `seq1, seq2, pred_label, prob_0, prob_1`
118
 
119
  ### ROC Analysis
120
 
@@ -130,11 +164,11 @@ The input CSV should have two columns: PRS (positive) and RRS (random/negative)
130
 
131
  ## Architecture Diagram
132
 
133
- The ASCII workflow diagram (`assets/ppiDCE.png`) covers:
134
- - **A.** Cross-encoding input strategy
135
- - **B.** Model architecture (ESM-1b backbone + classification head)
136
  - **C.** Training pipeline
137
- - **D.** Inference pipeline
138
 
139
  > Note: the diagram shows Softmax in the classification head for clarity, but
140
  > the implementation returns raw logits — softmax is applied implicitly by
 
1
+ ---
2
+ license: mit
3
+ library_name: pytorch
4
+ base_model: facebook/esm1b_t33_650M_UR50S
5
+ tags:
6
+ - protein-protein-interaction
7
+ - ppi
8
+ - protein-language-model
9
+ - esm
10
+ - esm-1b
11
+ - siamese
12
+ - bioinformatics
13
+ - biology
14
+ pipeline_tag: feature-extraction
15
+ ---
16
+
17
+ # ppiBTEP
18
+
19
+ A Siamese (twin-branch) protein-protein interaction classifier built on ESM-1b ([Rives et al., 2021](https://doi.org/10.1073/pnas.2016239118)). Also designated SiameseBTPE (BERT-Twin Protein Encoder).
20
+
21
+ ![ppiBTEP Architecture](assets/ppiBTEP.png)
22
 
23
  ## Overview
24
 
25
+ ppiBTEP processes each protein independently through a shared ESM-1b encoder -- no cross-sequence attention is used between the two proteins. Each branch extracts the `[CLS]` token embedding from the final transformer layer, the two embeddings are concatenated, and a dropout + linear classification head produces binary interaction predictions with softmax probabilities.
26
+
27
+ Unlike the cross-encoding approach (see [ppiDCE](https://github.com/kouroshSA/ppiDCE)), ppiBTEP must capture interaction-predictive features entirely from each protein's own sequence context. This makes it faster per pair and allows protein representations to be precomputed and reused, at the cost of not modeling direct inter-protein residue dependencies.
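
The reuse of per-protein representations described above can be sketched in plain Python. This is a minimal illustration, not the repository's code: `build_pair_features` and `toy_encode` are hypothetical names, and `toy_encode` stands in for the ESM-1b `[CLS]` embedding so the example runs without a GPU or model weights.

```python
from typing import Callable, Dict, List, Tuple

Embedding = List[float]


def build_pair_features(
    pairs: List[Tuple[str, str]],
    encode: Callable[[str], Embedding],
) -> Tuple[List[Embedding], int]:
    """Encode each unique sequence once, then concatenate embeddings per pair."""
    cache: Dict[str, Embedding] = {}
    encoder_calls = 0
    features: List[Embedding] = []
    for seq_a, seq_b in pairs:
        for seq in (seq_a, seq_b):
            if seq not in cache:
                cache[seq] = encode(seq)  # the expensive step in a real model
                encoder_calls += 1
        # Concat [CLS_A, CLS_B], as in the classification head input
        features.append(cache[seq_a] + cache[seq_b])
    return features, encoder_calls


def toy_encode(seq: str) -> Embedding:
    # Hypothetical stand-in for a real encoder: two hand-made features
    return [float(len(seq)), float(seq.count("A"))]


pairs = [("MKTA", "GAVL"), ("MKTA", "PLAA"), ("GAVL", "PLAA")]
features, encoder_calls = build_pair_features(pairs, toy_encode)
print(encoder_calls)  # 3 encoder calls for 3 pairs, instead of 6
```

With a cross-encoder, every pair needs its own forward pass, so no such cache is possible.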
28
 
29
+ The model was developed for the *Prochlorococcus marinus* MED4 interactome, where it serves as one component of a tri-model consensus framework (alongside [ppiGPLM](https://github.com/kouroshSA/ppiGPLM) and [ppiDCE](https://github.com/kouroshSA/ppiDCE)) for computational PPI screening.
30
 
31
  ## Architecture
32
 
33
  | Parameter | Value |
34
  |-----------|-------|
35
  | Foundation | ESM-1b (facebook/esm1b_t33_650M_UR50S) |
36
+ | Strategy | Siamese / twin-branch |
37
+ | Layers | 12 default; 6, 8, 12, 16, or 18 selectable via --num_layers |
38
+ | Classification | Concat [CLS_A, CLS_B] -> Dropout(0.1) -> Linear -> 2 |
39
  | Max sequence length | 1,024 tokens |
40
+ | Optimizer | AdamW (lr = 1 x 10^-5) |
41
  | Loss | Cross-Entropy |
42
 
43
+ ### Siamese vs Cross-Encoder
44
 
45
+ | | ppiDCE (Cross-Encoder) | ppiBTEP (Siamese) |
46
+ |---|---|---|
47
+ | Input | `[CLS] Seq_A [SEP] Seq_B` (joint) | `[CLS] Seq_A` and `[CLS] Seq_B` (separate) |
48
+ | Cross-attention | Full bidirectional at every layer | None |
49
+ | Classification | Single [CLS] -> Linear | Concat [CLS_A, CLS_B] -> Linear |
50
+ | Complexity | O((n+m)^2) | O(n^2) + O(m^2) |
51
+ | Speed | Slower (joint encoding) | Faster (independent, reusable) |
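
As a rough worked example of the Complexity row, the snippet below counts query-key pairs per attention layer for both strategies. It is a back-of-the-envelope sketch: the function names are illustrative, and special tokens, attention heads, and layer count are ignored.

```python
def attention_pairs_cross(n: int, m: int) -> int:
    """Query-key pairs when both proteins share one input of length n + m."""
    return (n + m) ** 2


def attention_pairs_siamese(n: int, m: int) -> int:
    """Query-key pairs when each protein is encoded on its own."""
    return n ** 2 + m ** 2


n, m = 400, 600  # example residue counts, chosen for illustration
cross = attention_pairs_cross(n, m)
siamese = attention_pairs_siamese(n, m)
extra = cross - siamese  # equals 2 * n * m, the inter-protein terms

print(cross, siamese, extra)  # 1000000 520000 480000
```

The difference `2 * n * m` is exactly the inter-protein attention that the Siamese model skips.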
52
 
53
  ## Installation
54
 
 
62
 
63
  ```bash
64
  # Clone the repository
65
+ git clone https://github.com/kouroshSA/ppiBTEP.git
66
+ cd ppiBTEP
67
 
68
  # Create a conda environment
69
  conda create -n esm python=3.10
 
74
  ## Repository Structure
75
 
76
  ```
77
+ ppiBTEP/
78
+ |-- train_ppiBTPE3b.py # Training script
79
+ |-- inference_ppiBTPE_2GPU.py # Batch inference script (multi-GPU)
80
  |-- roc_analysis_color_threshold_F1e.py # ROC curve analysis with F1 optimization
81
  |-- assets/
82
+ | +-- ppiBTEP.png # ASCII workflow diagram
83
  |-- requirements.txt
84
  |-- LICENSE
85
  +-- README.md
 
89
 
90
  ### Data Format
91
 
92
+ Training and inference use CSV files with columns: `seq1, seq2, label`
93
 
94
+ - `seq1`, `seq2`: Amino acid sequences
95
+ - `label`: `0` or `enemies` (non-interacting), `1` or `friends` (interacting)
96
 
97
  For inference-only input, only the first two columns are required.
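
A small loader for this format might look like the sketch below. It assumes the CSV has a header row with the column names listed above and treats label text case-insensitively; both are assumptions, and `read_pairs` is a hypothetical helper rather than a function from the training script.

```python
import csv
import io

# Both spellings described in the README map to the same integer class
LABEL_MAP = {"0": 0, "enemies": 0, "1": 1, "friends": 1}


def read_pairs(handle):
    """Return (seq1, seq2, label) tuples; label is None when the column is absent."""
    rows = []
    for row in csv.DictReader(handle, skipinitialspace=True):
        raw = (row.get("label") or "").strip().lower()
        label = LABEL_MAP[raw] if raw else None  # unknown labels raise KeyError
        rows.append((row["seq1"].strip(), row["seq2"].strip(), label))
    return rows


# Made-up sequences, for illustration only
sample = io.StringIO(
    "seq1,seq2,label\n"
    "MKTAYIAK,GAVLIPFW,friends\n"
    "MSTNPKPQ,MKTAYIAK,0\n"
)
pairs = read_pairs(sample)
print(pairs)
```

Raising on an unrecognised label is a deliberate choice here: a silent default would hide typos in the training data.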
98
 
 
100
 
101
  ```bash
102
  # Train from scratch with 12 layers
103
+ python train_ppiBTPE3b.py \
104
  --train_file train.csv \
105
  --val_file val.csv \
106
  --model_config facebook/esm1b_t33_650M_UR50S \
 
107
  --num_layers 12 \
108
+ --freeze_layers 0 \
109
+ --epochs 20 \
110
  --batch_size 2 \
111
+ --learning_rate 1e-5 \
112
  --max_length 1024 \
113
  --output_dir ./out \
114
  --device cuda
 
116
 
117
  #### Key training options
118
 
119
+ - `--num_layers N`: Total transformer layers (6, 8, 12, 16, or 18)
120
+ - `--freeze_layers N`: Freeze bottom N layers (use 0 for training from scratch)
121
  - `--checkpoint path.pth`: Resume from a saved checkpoint
122
+ - `--model_config`: ESM model config (default: `facebook/esm1b_t33_650M_UR50S`)
123
+
124
+ **Important:** When training from scratch, use `--freeze_layers 0` to ensure all layers (including embeddings) remain trainable. The default is 20, which would freeze most layers.
125
 
126
  ### Inference
127
 
128
  ```bash
129
+ python inference_ppiBTPE_2GPU.py \
130
+ --model_path out/ppiBTPE_epoch_17.pth \
131
  --model_config facebook/esm1b_t33_650M_UR50S \
132
+ --num_layers 12 \
133
  --input_file test_pairs.csv \
134
  --output_file predictions.csv \
135
  --batch_size 4 \
 
137
  --device cuda
138
  ```
139
 
140
+ Multi-GPU inference:
141
+ ```bash
142
+ python inference_ppiBTPE_2GPU.py \
143
+ --model_path out/ppiBTPE_final.pth \
144
+ --model_config facebook/esm1b_t33_650M_UR50S \
145
+ --num_layers 12 \
146
+ --input_file test_pairs.csv \
147
+ --output_file predictions.csv \
148
+ --device cuda:0,1
149
+ ```
150
+
151
+ Output CSV columns: `seq1, seq2, Prediction, Probability_Friends, Probability_Enemies`
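
To post-process such a file, one option is to rank pairs by `Probability_Friends`. The sketch below assumes a header row with the column names above; `top_candidates`, the 0.9 threshold, and the sample rows (including the text in the `Prediction` column) are hypothetical and not real model output.

```python
import csv
import io


def top_candidates(handle, threshold=0.9):
    """Return (seq1, seq2, probability) for confident pairs, highest first."""
    hits = []
    for row in csv.DictReader(handle, skipinitialspace=True):
        p_friends = float(row["Probability_Friends"])
        if p_friends >= threshold:
            hits.append((row["seq1"], row["seq2"], p_friends))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Made-up rows, for illustration only
sample = io.StringIO(
    "seq1,seq2,Prediction,Probability_Friends,Probability_Enemies\n"
    "MKTA,GAVL,friends,0.97,0.03\n"
    "MKTA,PLAA,enemies,0.12,0.88\n"
    "GAVL,PLAA,friends,0.91,0.09\n"
)
hits = top_candidates(sample)
print(hits)
```

Only the probability column is used for filtering, so the sketch does not depend on how the `Prediction` column is encoded.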
152
 
153
  ### ROC Analysis
154
 
 
164
 
165
  ## Architecture Diagram
166
 
167
+ The ASCII workflow diagram (`assets/ppiBTEP.png`) covers:
168
+ - **A.** Siamese input strategy (independent per-protein encoding)
169
+ - **B.** Model architecture (twin ESM-1b branches + concat classification head)
170
  - **C.** Training pipeline
171
+ - **D.** Inference pipeline (multi-GPU)
172
 
173
  > Note: the diagram shows Softmax in the classification head for clarity, but
174
  > the implementation returns raw logits — softmax is applied implicitly by