Add YAML model-card front-matter to README
README.md
CHANGED
|
@@ -1,30 +1,54 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
|
| 7 |
## Overview
|
| 8 |
|
| 9 |
-
|
| 10 |
|
| 11 |
-
The model was developed for the *Prochlorococcus marinus* MED4 interactome, where it serves as one component of a tri-model consensus framework (alongside [ppiGPLM](https://github.com/kouroshSA/ppiGPLM) and [
|
| 12 |
|
| 13 |
## Architecture
|
| 14 |
|
| 15 |
| Parameter | Value |
|
| 16 |
|-----------|-------|
|
| 17 |
| Foundation | ESM-1b (facebook/esm1b_t33_650M_UR50S) |
|
| 18 |
-
| Strategy |
|
| 19 |
-
| Layers | 12
|
| 20 |
-
| Classification | [
|
| 21 |
| Max sequence length | 1,024 tokens |
|
| 22 |
-
| Optimizer | AdamW (lr =
|
| 23 |
| Loss | Cross-Entropy |
|
| 24 |
|
| 25 |
-
###
|
| 26 |
|
| 27 |
-
|
| 28 |
|
| 29 |
## Installation
|
| 30 |
|
|
@@ -38,8 +62,8 @@ Unlike the original ESM-1b which processes one protein at a time, ppiDCE feeds b
|
|
| 38 |
|
| 39 |
```bash
|
| 40 |
# Clone the repository
|
| 41 |
-
git clone https://github.com/kouroshSA/
|
| 42 |
-
cd
|
| 43 |
|
| 44 |
# Create a conda environment
|
| 45 |
conda create -n esm python=3.10
|
|
@@ -50,12 +74,12 @@ pip install -r requirements.txt
|
|
| 50 |
## Repository Structure
|
| 51 |
|
| 52 |
```
|
| 53 |
-
|
| 54 |
-
|--
|
| 55 |
-
|--
|
| 56 |
|-- roc_analysis_color_threshold_F1e.py # ROC curve analysis with F1 optimization
|
| 57 |
|-- assets/
|
| 58 |
-
| +--
|
| 59 |
|-- requirements.txt
|
| 60 |
|-- LICENSE
|
| 61 |
+-- README.md
|
|
@@ -65,10 +89,10 @@ ppiDCE/
|
|
| 65 |
|
| 66 |
### Data Format
|
| 67 |
|
| 68 |
-
Training and inference use CSV files with columns: `
|
| 69 |
|
| 70 |
-
- `
|
| 71 |
-
- `label`: `0` (non-interacting) or `
|
| 72 |
|
| 73 |
For inference-only input, only the first two columns are required.
|
| 74 |
|
|
@@ -76,15 +100,15 @@ For inference-only input, only the first two columns are required.
|
|
| 76 |
|
| 77 |
```bash
|
| 78 |
# Train from scratch with 12 layers
|
| 79 |
-
python
|
| 80 |
--train_file train.csv \
|
| 81 |
--val_file val.csv \
|
| 82 |
--model_config facebook/esm1b_t33_650M_UR50S \
|
| 83 |
-
--from_scratch \
|
| 84 |
--num_layers 12 \
|
| 85 |
-
--
|
|
|
|
| 86 |
--batch_size 2 \
|
| 87 |
-
--learning_rate
|
| 88 |
--max_length 1024 \
|
| 89 |
--output_dir ./out \
|
| 90 |
--device cuda
|
|
@@ -92,21 +116,20 @@ python train_ppiDCE.py \
|
|
| 92 |
|
| 93 |
#### Key training options
|
| 94 |
|
| 95 |
-
- `--
|
| 96 |
-
|
| 97 |
-
pretraining priors are inappropriate for your task.
|
| 98 |
-
- `--num_layers N`: Set total transformer layers when training from scratch
|
| 99 |
-
- `--freeze_layers N`: Freeze bottom N layers during fine-tuning
|
| 100 |
-
- `--add_layers N`: Append extra transformer layers on top
|
| 101 |
- `--checkpoint path.pth`: Resume from a saved checkpoint
|
| 102 |
-
- `--
|
| 103 |
|
| 104 |
### Inference
|
| 105 |
|
| 106 |
```bash
|
| 107 |
-
python
|
| 108 |
-
--model_path out/
|
| 109 |
--model_config facebook/esm1b_t33_650M_UR50S \
|
|
|
|
| 110 |
--input_file test_pairs.csv \
|
| 111 |
--output_file predictions.csv \
|
| 112 |
--batch_size 4 \
|
|
@@ -114,7 +137,18 @@ python inference_ppiDCE.py \
|
|
| 114 |
--device cuda
|
| 115 |
```
|
| 116 |
|
| 117 |
-
|
| 118 |
|
| 119 |
### ROC Analysis
|
| 120 |
|
|
@@ -130,11 +164,11 @@ The input CSV should have two columns: PRS (positive) and RRS (random/negative)
|
|
| 130 |
|
| 131 |
## Architecture Diagram
|
| 132 |
|
| 133 |
-
The ASCII workflow diagram (`assets/
|
| 134 |
-
- **A.**
|
| 135 |
-
- **B.** Model architecture (ESM-1b
|
| 136 |
- **C.** Training pipeline
|
| 137 |
-
- **D.** Inference pipeline
|
| 138 |
|
| 139 |
> Note: the diagram shows Softmax in the classification head for clarity, but
|
| 140 |
> the implementation returns raw logits — softmax is applied implicitly by
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
library_name: pytorch
|
| 4 |
+
base_model: facebook/esm1b_t33_650M_UR50S
|
| 5 |
+
tags:
|
| 6 |
+
- protein-protein-interaction
|
| 7 |
+
- ppi
|
| 8 |
+
- protein-language-model
|
| 9 |
+
- esm
|
| 10 |
+
- esm-1b
|
| 11 |
+
- siamese
|
| 12 |
+
- bioinformatics
|
| 13 |
+
- biology
|
| 14 |
+
pipeline_tag: feature-extraction
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
# ppiBTEP
|
| 18 |
+
|
| 19 |
+
A Siamese (twin-branch) protein-protein interaction classifier built on ESM-1b ([Rives et al., 2021](https://doi.org/10.1073/pnas.2016239118)). Also designated SiameseBTPE (BERT-Twin Protein Encoder).
|
| 20 |
+
|
| 21 |
+

|
| 22 |
|
| 23 |
## Overview
|
| 24 |
|
| 25 |
+
ppiBTEP passes each protein independently through a shared ESM-1b encoder; no attention is computed across the two sequences. Each branch takes the `[CLS]` token embedding from the final transformer layer, the two embeddings are concatenated, and a dropout + linear classification head produces a binary interaction prediction with softmax probabilities.
|
| 26 |
+
|
| 27 |
+
Unlike the cross-encoding approach (see [ppiDCE](https://github.com/kouroshSA/ppiDCE)), ppiBTEP must capture interaction-predictive features entirely from each protein's own sequence context. This makes it faster per pair and allows protein representations to be precomputed and reused, at the cost of not modeling direct inter-protein residue dependencies.
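
The reuse of per-protein representations mentioned above can be sketched as follows. This is a minimal illustration, not the repository's code: `embed` is a hypothetical stand-in for one encoder forward pass, and its toy two-number output is not a real ESM-1b embedding. The point is only that each distinct sequence is encoded once, however many pairs it appears in.

```python
from functools import lru_cache

encoder_calls = 0  # counts how often the stand-in encoder actually runs


@lru_cache(maxsize=None)
def embed(sequence: str) -> tuple[float, ...]:
    """Hypothetical stand-in for one encoder pass returning a [CLS] vector."""
    global encoder_calls
    encoder_calls += 1
    # Toy 2-d "embedding" (length, alanine fraction); NOT a real model output.
    return (float(len(sequence)), sequence.count("A") / len(sequence))


def pair_features(seq_a: str, seq_b: str) -> tuple[float, ...]:
    """Concatenate the two per-protein vectors, as a Siamese head expects."""
    return embed(seq_a) + embed(seq_b)


proteins = ["MKTAYIAK", "MAAALLG", "MSSPQ"]
pairs = [(a, b) for a in proteins for b in proteins if a != b]  # ordered pairs
features = [pair_features(a, b) for a, b in pairs]

print(len(pairs), "pairs scored with", encoder_calls, "encoder calls")
```

With three proteins, the six ordered pairs need only three encoder calls; a cross-encoder would need one joint pass per pair.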
|
| 28 |
|
| 29 |
+
The model was developed for the *Prochlorococcus marinus* MED4 interactome, where it serves as one component of a tri-model consensus framework (alongside [ppiGPLM](https://github.com/kouroshSA/ppiGPLM) and [ppiDCE](https://github.com/kouroshSA/ppiDCE)) for computational PPI screening.
|
| 30 |
|
| 31 |
## Architecture
|
| 32 |
|
| 33 |
| Parameter | Value |
|
| 34 |
|-----------|-------|
|
| 35 |
| Foundation | ESM-1b (facebook/esm1b_t33_650M_UR50S) |
|
| 36 |
+
| Strategy | Siamese / twin-branch |
|
| 37 |
+
| Layers | 12 by default; 6, 8, 12, 16, or 18 selectable via `--num_layers` |
|
| 38 |
+
| Classification | Concat [CLS_A, CLS_B] -> Dropout(0.1) -> Linear -> 2 |
|
| 39 |
| Max sequence length | 1,024 tokens |
|
| 40 |
+
| Optimizer | AdamW (lr = 1e-5) |
|
| 41 |
| Loss | Cross-Entropy |
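
The classification row of the table (concatenate, linear, two outputs) can be traced by hand in plain Python. This is a sketch under stated assumptions: the weights are random placeholders, the hidden size is shrunk to 4 for readability (ESM-1b uses 1280), dropout is omitted because it is inactive at inference, and the assignment of output index 0 or 1 to a class is not specified here. The actual implementation uses PyTorch.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pair(cls_a, cls_b, weights, bias):
    """Concat [CLS_A, CLS_B] -> Linear -> 2 logits (dropout skipped at inference)."""
    joint = cls_a + cls_b  # list concatenation, length 2 * hidden
    return [sum(w * x for w, x in zip(row, joint)) + b
            for row, b in zip(weights, bias)]


hidden = 4  # toy size for illustration only
rng = random.Random(0)
cls_a = [rng.uniform(-1, 1) for _ in range(hidden)]
cls_b = [rng.uniform(-1, 1) for _ in range(hidden)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(2 * hidden)] for _ in range(2)]
bias = [0.0, 0.0]

logits = classify_pair(cls_a, cls_b, weights, bias)
probs = softmax(logits)
print("logits:", logits)
print("probabilities:", probs)
```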
|
| 42 |
|
| 43 |
+
### Siamese vs Cross-Encoder
|
| 44 |
|
| 45 |
+
| | ppiDCE (Cross-Encoder) | ppiBTEP (Siamese) |
|
| 46 |
+
|---|---|---|
|
| 47 |
+
| Input | `[CLS] Seq_A [SEP] Seq_B` (joint) | `[CLS] Seq_A` and `[CLS] Seq_B` (separate) |
|
| 48 |
+
| Cross-attention | Full bidirectional at every layer | None |
|
| 49 |
+
| Classification | Single [CLS] -> Linear | Concat [CLS_A, CLS_B] -> Linear |
|
| 50 |
+
| Complexity | O((n+m)^2) | O(n^2) + O(m^2) |
|
| 51 |
+
| Speed | Slower (joint encoding) | Faster (independent, reusable) |
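
To make the complexity row concrete, the snippet below counts attention score entries per layer for one pair, ignoring constant factors, heads, and special tokens. The sequence lengths are arbitrary example values, not figures from this project.

```python
def cross_encoder_cost(n: int, m: int) -> int:
    """Attention entries when both sequences share one input: (n + m)^2."""
    return (n + m) ** 2


def siamese_cost(n: int, m: int) -> int:
    """Attention entries when each sequence is encoded alone: n^2 + m^2."""
    return n ** 2 + m ** 2


n, m = 300, 500  # example residue counts
joint = cross_encoder_cost(n, m)
separate = siamese_cost(n, m)

# The gap is exactly the cross terms 2 * n * m that the Siamese model skips.
print(joint, separate, joint - separate)  # 640000 340000 300000
```

Those skipped cross terms are the inter-protein attention entries, which is why the Siamese model is cheaper but cannot model residue-level dependencies between the two proteins.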
|
| 52 |
|
| 53 |
## Installation
|
| 54 |
|
|
|
|
| 62 |
|
| 63 |
```bash
|
| 64 |
# Clone the repository
|
| 65 |
+
git clone https://github.com/kouroshSA/ppiBTEP.git
|
| 66 |
+
cd ppiBTEP
|
| 67 |
|
| 68 |
# Create a conda environment
|
| 69 |
conda create -n esm python=3.10
|
|
|
|
| 74 |
## Repository Structure
|
| 75 |
|
| 76 |
```
|
| 77 |
+
ppiBTEP/
|
| 78 |
+
|-- train_ppiBTPE3b.py # Training script
|
| 79 |
+
|-- inference_ppiBTPE_2GPU.py # Batch inference script (multi-GPU)
|
| 80 |
|-- roc_analysis_color_threshold_F1e.py # ROC curve analysis with F1 optimization
|
| 81 |
|-- assets/
|
| 82 |
+
| +-- ppiBTEP.png # ASCII workflow diagram
|
| 83 |
|-- requirements.txt
|
| 84 |
|-- LICENSE
|
| 85 |
+-- README.md
|
|
|
|
| 89 |
|
| 90 |
### Data Format
|
| 91 |
|
| 92 |
+
Training and inference use CSV files with the columns `seq1`, `seq2`, and `label`:
|
| 93 |
|
| 94 |
+
- `seq1`, `seq2`: Amino acid sequences
|
| 95 |
+
- `label`: `0` or `enemies` (non-interacting), `1` or `friends` (interacting)
|
| 96 |
|
| 97 |
For inference-only input, only the first two columns are required.
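
A minimal sketch of reading this format with the standard library, assuming the file has a header row with exactly these column names. `read_pairs` and `LABEL_MAP` are hypothetical helpers, not part of the repository, and the case-insensitive label matching is an assumption; the training script's own parser may differ.

```python
import csv
import io

# Both label spellings described above map to the same integer class.
LABEL_MAP = {"0": 0, "enemies": 0, "1": 1, "friends": 1}


def read_pairs(handle):
    """Yield (seq1, seq2, label); label is None when the column is absent."""
    for row in csv.DictReader(handle):
        raw = row.get("label")
        label = LABEL_MAP[raw.strip().lower()] if raw not in (None, "") else None
        yield row["seq1"], row["seq2"], label


# Made-up sequences, used only to exercise the parser.
sample = io.StringIO(
    "seq1,seq2,label\n"
    "MKTAYIAK,MAAALLG,friends\n"
    "MSSPQ,MKTAYIAK,0\n"
)
records = list(read_pairs(sample))
print(records)
```

An unrecognized label raises `KeyError`, which surfaces malformed rows early rather than silently mislabeling them.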
|
| 98 |
|
|
|
|
| 100 |
|
| 101 |
```bash
|
| 102 |
# Train from scratch with 12 layers
|
| 103 |
+
python train_ppiBTPE3b.py \
|
| 104 |
--train_file train.csv \
|
| 105 |
--val_file val.csv \
|
| 106 |
--model_config facebook/esm1b_t33_650M_UR50S \
|
|
|
|
| 107 |
--num_layers 12 \
|
| 108 |
+
--freeze_layers 0 \
|
| 109 |
+
--epochs 20 \
|
| 110 |
--batch_size 2 \
|
| 111 |
+
--learning_rate 1e-5 \
|
| 112 |
--max_length 1024 \
|
| 113 |
--output_dir ./out \
|
| 114 |
--device cuda
|
|
|
|
| 116 |
|
| 117 |
#### Key training options
|
| 118 |
|
| 119 |
+
- `--num_layers N`: Total transformer layers (6, 8, 12, 16, or 18)
|
| 120 |
+
- `--freeze_layers N`: Freeze the bottom N layers (use 0 when training from scratch)
|
| 121 |
- `--checkpoint path.pth`: Resume from a saved checkpoint
|
| 122 |
+
- `--model_config`: ESM model config (default: `facebook/esm1b_t33_650M_UR50S`)
|
| 123 |
+
|
| 124 |
+
**Important:** When training from scratch, use `--freeze_layers 0` to ensure all layers (including embeddings) remain trainable. The default is 20, which would freeze most layers.
|
| 125 |
|
| 126 |
### Inference
|
| 127 |
|
| 128 |
```bash
|
| 129 |
+
python inference_ppiBTPE_2GPU.py \
|
| 130 |
+
--model_path out/ppiBTPE_epoch_17.pth \
|
| 131 |
--model_config facebook/esm1b_t33_650M_UR50S \
|
| 132 |
+
--num_layers 12 \
|
| 133 |
--input_file test_pairs.csv \
|
| 134 |
--output_file predictions.csv \
|
| 135 |
--batch_size 4 \
|
|
|
|
| 137 |
--device cuda
|
| 138 |
```
|
| 139 |
|
| 140 |
+
Multi-GPU inference:
|
| 141 |
+
```bash
|
| 142 |
+
python inference_ppiBTPE_2GPU.py \
|
| 143 |
+
--model_path out/ppiBTPE_final.pth \
|
| 144 |
+
--model_config facebook/esm1b_t33_650M_UR50S \
|
| 145 |
+
--num_layers 12 \
|
| 146 |
+
--input_file test_pairs.csv \
|
| 147 |
+
--output_file predictions.csv \
|
| 148 |
+
--device cuda:0,1
|
| 149 |
+
```
|
| 150 |
+
|
| 151 |
+
Output CSV columns: `seq1, seq2, Prediction, Probability_Friends, Probability_Enemies`
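
As an illustration of consuming this output, the sketch below keeps pairs whose `Probability_Friends` meets a cutoff. `confident_interactions` is a hypothetical helper, the 0.9 threshold is an arbitrary example value, and the sample rows (including the text in the `Prediction` column) are invented for the demonstration.

```python
import csv
import io


def confident_interactions(handle, threshold=0.9):
    """Return (seq1, seq2, probability) rows with Probability_Friends >= threshold."""
    hits = []
    for row in csv.DictReader(handle):
        p_friends = float(row["Probability_Friends"])
        if p_friends >= threshold:
            hits.append((row["seq1"], row["seq2"], p_friends))
    # Highest-confidence pairs first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Invented rows in the documented column layout.
sample = io.StringIO(
    "seq1,seq2,Prediction,Probability_Friends,Probability_Enemies\n"
    "MKTAYIAK,MAAALLG,friends,0.97,0.03\n"
    "MSSPQ,MKTAYIAK,enemies,0.12,0.88\n"
    "MSSPQ,MAAALLG,friends,0.91,0.09\n"
)
hits = confident_interactions(sample)
print(hits)
```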
|
| 152 |
|
| 153 |
### ROC Analysis
|
| 154 |
|
|
|
|
| 164 |
|
| 165 |
## Architecture Diagram
|
| 166 |
|
| 167 |
+
The ASCII workflow diagram (`assets/ppiBTEP.png`) covers:
|
| 168 |
+
- **A.** Siamese input strategy (independent per-protein encoding)
|
| 169 |
+
- **B.** Model architecture (twin ESM-1b branches + concat classification head)
|
| 170 |
- **C.** Training pipeline
|
| 171 |
+
- **D.** Inference pipeline (multi-GPU)
|
| 172 |
|
| 173 |
> Note: the diagram shows Softmax in the classification head for clarity, but
|
| 174 |
> the implementation returns raw logits — softmax is applied implicitly by
|