Clarify model is ESM-1b-inspired (trained from scratch, modified)
README.md
CHANGED
@@ -1,14 +1,13 @@
 ---
 license: mit
 library_name: pytorch
-base_model: facebook/esm1b_t33_650M_UR50S
 tags:
 - protein-protein-interaction
 - ppi
 - protein-language-model
-- esm
-- esm-1b
+- esm-architecture
 - siamese
+- trained-from-scratch
 - bioinformatics
 - biology
 pipeline_tag: feature-extraction
@@ -16,13 +15,13 @@ pipeline_tag: feature-extraction
 
 # ppiBTEP
 
-A Siamese (twin-branch) protein-protein interaction classifier
+A Siamese (twin-branch) protein-protein interaction classifier inspired by the ESM-1b transformer architecture ([Rives et al., 2021](https://doi.org/10.1073/pnas.2016239118)), but **substantially modified and trained from scratch** rather than fine-tuned from the released ESM-1b checkpoint. Also designated SiameseBTPE (BERT-Twin Protein Encoder).
 
 
 
 ## Overview
 
-ppiBTEP processes each protein independently through a shared ESM-1b encoder -- no cross-sequence attention is used between the two proteins. Each branch extracts the `[CLS]` token embedding from the final transformer layer, the two embeddings are concatenated, and a dropout + linear classification head produces binary interaction predictions with softmax probabilities.
+ppiBTEP processes each protein independently through a shared ESM-1b-style transformer encoder -- no cross-sequence attention is used between the two proteins. Each branch extracts the `[CLS]` token embedding from the final transformer layer, the two embeddings are concatenated, and a dropout + linear classification head produces binary interaction predictions with softmax probabilities.
 
 Unlike the cross-encoding approach (see [ppiDCE](https://github.com/kouroshSA/ppiDCE)), ppiBTEP must capture interaction-predictive features entirely from each protein's own sequence context. This makes it faster per pair and allows protein representations to be precomputed and reused, at the cost of not modeling direct inter-protein residue dependencies.
 
@@ -32,7 +31,7 @@ The model was developed for the *Prochlorococcus marinus* MED4 interactome, wher
 
 | Parameter | Value |
 |-----------|-------|
-| Foundation | ESM-1b (
+| Foundation | ESM-1b-inspired transformer (Rives et al., 2021) -- substantially modified, trained from scratch |
 | Strategy | Siamese / twin-branch |
 | Layers | 12 default; 6, 8, 12, 16, or 18 selectable via --num_layers |
 | Classification | Concat [CLS_A, CLS_B] -> Dropout(0.1) -> Linear -> 2 |
@@ -166,7 +165,7 @@ The input CSV should have two columns: PRS (positive) and RRS (random/negative)
 
 The ASCII workflow diagram (`assets/ppiBTEP.png`) covers:
 - **A.** Siamese input strategy (independent per-protein encoding)
-- **B.** Model architecture (twin ESM-1b branches + concat classification head)
+- **B.** Model architecture (twin ESM-1b-style branches + concat classification head)
 - **C.** Training pipeline
 - **D.** Inference pipeline (multi-GPU)
 
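The README's Overview text says each protein passes through a shared encoder independently, the two embeddings are concatenated, and a linear head with softmax produces two class probabilities, so per-protein representations can be precomputed and reused. A minimal pure-Python sketch of that data flow follows. It is not the ppiBTEP implementation: `toy_encode`, `TwinBranchScorer`, the weights, and the example sequences are all hypothetical stand-ins (the real branch is a transformer returning a `[CLS]` embedding), and dropout is omitted because it is inactive at inference time.

```python
import math
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_encode(sequence):
    """Hypothetical stand-in for one encoder branch.

    Returns amino-acid frequencies instead of a transformer [CLS] embedding.
    """
    total = max(len(sequence), 1)
    return [sequence.count(aa) / total for aa in AMINO_ACIDS]


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


class TwinBranchScorer:
    """Shared encoder + concat + linear head, with per-protein caching."""

    def __init__(self, weights, bias):
        self.weights = weights      # 2 rows, each of length 2 * embedding_dim
        self.bias = bias            # 2 values, one per class
        self._cache = {}            # sequence -> embedding
        self.encoder_calls = 0      # counts real encoder evaluations

    def embed(self, sequence):
        # Each protein is encoded on its own, so the result can be reused
        # for every pair that protein takes part in.
        if sequence not in self._cache:
            self.encoder_calls += 1
            self._cache[sequence] = toy_encode(sequence)
        return self._cache[sequence]

    def predict(self, seq_a, seq_b):
        # Concatenate the two branch embeddings, then apply the linear head.
        features = self.embed(seq_a) + self.embed(seq_b)
        logits = [
            sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        return softmax(logits)


dim = len(AMINO_ACIDS)
# Arbitrary illustrative weights; a trained head would supply real values.
weights = [
    [0.05 * (i % 3) for i in range(2 * dim)],
    [-0.05 * (i % 2) for i in range(2 * dim)],
]
bias = [0.0, 0.0]

scorer = TwinBranchScorer(weights, bias)
proteins = ["MKTAYIAK", "MSDNELLK", "MAHHHGGS"]  # made-up sequences
results = {pair: scorer.predict(*pair) for pair in combinations(proteins, 2)}

for pair, probs in results.items():
    print(pair, [round(p, 4) for p in probs])
print("encoder calls:", scorer.encoder_calls)
```

Three proteins form three pairs, which means six branch evaluations, but the cache means the encoder runs only once per distinct protein. A cross-encoder that attends over both sequences jointly cannot reuse work this way, which is the trade-off the README describes.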