---
license: mit
library_name: pytorch
tags:
- protein-protein-interaction
- ppi
- protein-language-model
- esm-architecture
- cross-encoder
- trained-from-scratch
- bioinformatics
- biology
pipeline_tag: feature-extraction
---
# ppiDCE
A dual cross-encoder for binary protein-protein interaction (PPI) classification, inspired by the ESM-1b transformer architecture ([Rives et al., 2021](https://doi.org/10.1073/pnas.2016239118)) but **substantially modified and trained from scratch** rather than fine-tuned from the released ESM-1b checkpoint.

## Overview
ppiDCE adapts the ESM-1b transformer architecture, a single-sequence masked language model with no native PPI capability, to protein-protein interaction prediction. It does so by exploiting the tokenizer's sentence-pair encoding mode and training the modified model from scratch on PPI data. Both protein sequences are concatenated into a single input as `[CLS] Seq_A [SEP] Seq_B [EOS]`, so self-attention spans both sequences and every residue can attend to every residue of the partner protein at every transformer layer. The final-layer `[CLS]` representation captures joint inter-protein features and is passed through dropout and a linear classification head to produce two class logits, which softmax converts into interaction probabilities.
The model was developed for the *Prochlorococcus marinus* MED4 interactome, where it serves as one component of a tri-model consensus framework (alongside [ppiGPLM](https://github.com/kouroshSA/ppiGPLM) and [ppiBTEP](https://github.com/kouroshSA/ppiBTEP)) for computational PPI screening.
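
The pair layout described in this overview can be illustrated with a toy sketch. It is not the real ESM tokenizer: the helper names `build_pair_tokens` and `fits_in_context` are hypothetical, one token per residue is assumed, and the special-token strings simply mirror the notation used above.

```python
# Toy illustration of the sentence-pair layout; NOT the real ESM tokenizer.
SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[EOS]")


def build_pair_tokens(seq_a: str, seq_b: str) -> list[str]:
    """Return the layout [CLS] Seq_A [SEP] Seq_B [EOS], one token per residue."""
    return ["[CLS]", *seq_a, "[SEP]", *seq_b, "[EOS]"]


def fits_in_context(seq_a: str, seq_b: str, max_length: int = 1024) -> bool:
    """True if both sequences plus the three special tokens fit in max_length."""
    return len(seq_a) + len(seq_b) + len(SPECIAL_TOKENS) <= max_length


tokens = build_pair_tokens("MKLR", "MSE")
print(tokens)
print(fits_in_context("M" * 600, "M" * 500))  # 600 + 500 + 3 tokens vs. 1024
```

Under these assumptions, a pair fits the 1,024-token window only when the two sequences together contain at most 1,021 residues.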
## Architecture
| Parameter | Value |
|-----------|-------|
| Foundation | ESM-1b-inspired transformer (Rives et al., 2021) -- substantially modified, trained from scratch |
| Strategy | Cross-encoding (sentence-pair) |
| Layers | 12 (configurable) |
| Classification | [CLS] -> Dropout(0.1) -> Linear -> 2 |
| Max sequence length | 1,024 tokens |
| Optimizer | AdamW (lr = 2 x 10^-5) |
| Loss | Cross-Entropy |
### Cross-Encoding vs Single-Sequence
Unlike the original ESM-1b, which processes one protein at a time, ppiDCE feeds both proteins as a single concatenated input. This enables inter-protein residue-residue attention at every transformer layer, which is the most expressive strategy for modeling pairwise interactions. The cost is O((n+m)^2) attention complexity, where n and m are the lengths of the two sequences.
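
To make the quadratic cost concrete, the sketch below counts query-key pairs per attention head and layer for two sequences of lengths `n` and `m`. The function name `attention_pairs` is hypothetical, and the assumption of three special tokens follows the `[CLS] Seq_A [SEP] Seq_B [EOS]` layout; the "separate" figure ignores special tokens for simplicity.

```python
def attention_pairs(n: int, m: int, special_tokens: int = 3) -> dict[str, int]:
    """Count query-key pairs per attention head and layer."""
    joint = (n + m + special_tokens) ** 2  # one concatenated cross-encoder input
    separate = n ** 2 + m ** 2             # two independent single-sequence passes
    return {"joint": joint, "separate": separate, "extra": joint - separate}


costs = attention_pairs(300, 400)
print(costs)
```

The difference is dominated by the `2 * n * m` cross term, which is exactly the inter-protein attention that a single-sequence model never computes.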
## Installation
### Prerequisites
- Python 3.10+
- CUDA-capable GPU (recommended)
- conda (recommended) or pip
### Setup
```bash
# Clone the repository
git clone https://github.com/kouroshSA/ppiDCE.git
cd ppiDCE
# Create a conda environment
conda create -n esm python=3.10
conda activate esm
pip install -r requirements.txt
```
## Repository Structure
```
ppiDCE/
|-- train_ppiDCE.py                        # Training script
|-- inference_ppiDCE.py                    # Batch inference script
|-- roc_analysis_color_threshold_F1e.py    # ROC curve analysis with F1 optimization
|-- assets/
|   +-- ppiDCE.png                         # ASCII workflow diagram
|-- requirements.txt
|-- LICENSE
+-- README.md
```
## Usage
### Data Format
Training and inference use CSV files with columns: `protein1_seq, protein2_seq, label`
- `protein1_seq`, `protein2_seq`: Amino acid sequences
- `label`: `0` (non-interacting) or `1` (interacting)
For inference-only input, only the first two columns are required.
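
Before training, it can help to sanity-check a CSV against the format above. The following is a minimal sketch using only the standard library; `validate_rows` is a hypothetical helper, not part of this repository, and restricting sequences to the 20 standard amino-acid letters is an assumption (the training script may accept other symbols such as `X`).

```python
import csv
import io

# Assumption: only the 20 standard amino acids are allowed.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
REQUIRED_COLUMNS = {"protein1_seq", "protein2_seq", "label"}


def validate_rows(handle) -> list[str]:
    """Return human-readable problems found in a training CSV."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column in ("protein1_seq", "protein2_seq"):
            seq = (row.get(column) or "").strip().upper()
            bad = set(seq) - VALID_RESIDUES
            if not seq:
                problems.append(f"row {row_number}: {column} is empty")
            elif bad:
                problems.append(f"row {row_number}: unexpected residues in {column}: {sorted(bad)}")
        label = (row.get("label") or "").strip()
        if label not in {"0", "1"}:
            problems.append(f"row {row_number}: label must be 0 or 1, got {label!r}")
    return problems


# In-memory stand-in for open("train.csv", newline=""); sequences are made up.
example = io.StringIO("protein1_seq,protein2_seq,label\nMKLR,MSEDF,1\nMQAG,MTRXL,2\n")
for problem in validate_rows(example):
    print(problem)
```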
### Training
```bash
# Train from scratch with 12 layers
python train_ppiDCE.py \
    --train_file train.csv \
    --val_file val.csv \
    --model_config facebook/esm1b_t33_650M_UR50S \
    --from_scratch \
    --num_layers 12 \
    --epochs 10 \
    --batch_size 2 \
    --learning_rate 2e-5 \
    --max_length 1024 \
    --output_dir ./out \
    --device cuda
```
#### Key training options
- `--from_scratch`: Initialize the ESM backbone with random weights instead of
loading pretrained ESM-1b. Useful when you suspect single-sequence
pretraining priors are inappropriate for your task.
- `--num_layers N`: Set total transformer layers when training from scratch
- `--freeze_layers N`: Freeze bottom N layers during fine-tuning
- `--add_layers N`: Append extra transformer layers on top
- `--checkpoint path.pth`: Resume from a saved checkpoint
- `--suppress_warnings`: Suppress tokenizer truncation warnings
### Quick start: fetch the checkpoint from Hugging Face
The released MED4 checkpoint (`checkpoints/ppiDCE_epoch8.pth`, 12-layer)
lives on this Hugging Face repo. Pull it without cloning the GitHub mirror:
```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="kouroshSA/ppiDCE",
    filename="checkpoints/ppiDCE_epoch8.pth",
)
print(ckpt_path)  # pass this string to --model_path
```
`inference_ppiDCE.py` takes the checkpoint path as a direct `--model_path`
argument, so no rename or specific directory layout is required — point
it straight at the file you just downloaded.
### Input file format
The inference script expects a CSV with two columns of plain amino-acid
sequences (one protein pair per row — no delimiter tokens, no length
markers, no chevrons):
```
seq1,seq2
MKLR...QSH,MSEDF...VKN
MQAG...PIA,MTRRL...EEP
```
A ready-made example is shipped with the repo:
[`MED4-PPIs-low-confidence_ppiTEPM_prompts.csv`](MED4-PPIs-low-confidence_ppiTEPM_prompts.csv).
The labeled PRS/RRS reference sets (`MED4_PRS_100.csv`, `MED4_RRS_100.csv`)
include a third label column, which the inference script ignores — only
the first two columns are read.
### Inference
```bash
python inference_ppiDCE.py \
    --model_path checkpoints/ppiDCE_epoch8.pth \
    --model_config facebook/esm1b_t33_650M_UR50S \
    --input_file MED4-PPIs-low-confidence_ppiTEPM_prompts.csv \
    --output_file predictions.csv \
    --batch_size 4 \
    --max_length 1024 \
    --device cuda
```
Output CSV columns: `seq1, seq2, pred_label, prob_0, prob_1`
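
As a rough sketch of downstream use, the snippet below ranks predicted pairs by `prob_1` and keeps those above a cut-off. It assumes the header names match the column list above exactly (no extra spaces); `top_interactions` is a hypothetical helper, the 0.9 cut-off is arbitrary, and the example rows are made up.

```python
import csv
import io


def top_interactions(handle, min_prob: float = 0.9) -> list[dict]:
    """Return rows with prob_1 >= min_prob, highest probability first."""
    rows = [row for row in csv.DictReader(handle) if float(row["prob_1"]) >= min_prob]
    return sorted(rows, key=lambda row: float(row["prob_1"]), reverse=True)


# In-memory stand-in for open("predictions.csv", newline=""); values are made up.
example = io.StringIO(
    "seq1,seq2,pred_label,prob_0,prob_1\n"
    "MKLR,MSEDF,1,0.05,0.95\n"
    "MQAG,MTRRL,0,0.80,0.20\n"
    "MAAA,MCCC,1,0.01,0.99\n"
)
for row in top_interactions(example):
    print(row["seq1"], row["seq2"], row["prob_1"])
```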
### ROC Analysis
Evaluate model predictions using ROC curve analysis with threshold-colored visualization and F1 optimization:
```bash
python roc_analysis_color_threshold_F1e.py \
    --input_csv probabilities.csv \
    --output_file roc_curve.png
```
The input CSV should have two columns of predicted interaction probabilities: one for the PRS (positive reference set) and one for the RRS (random reference set, treated as negatives).
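
For intuition about what F1 optimization over a threshold involves, here is a minimal pure-Python sketch. It is not the code in `roc_analysis_color_threshold_F1e.py`, whose exact procedure may differ; `best_f1_threshold` is a hypothetical helper and the scores are made up. PRS scores are treated as positives and RRS scores as negatives.

```python
def best_f1_threshold(prs: list[float], rrs: list[float]) -> tuple[float, float]:
    """Return (threshold, F1) maximizing F1 when score >= threshold means 'interacting'."""
    best = (1.0, 0.0)
    for threshold in sorted(set(prs) | set(rrs)):
        tp = sum(score >= threshold for score in prs)  # PRS pairs called positive
        fp = sum(score >= threshold for score in rrs)  # RRS pairs called positive
        fn = len(prs) - tp                             # PRS pairs missed
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best[1]:
            best = (threshold, f1)
    return best


prs_scores = [0.95, 0.80, 0.70, 0.40]  # made-up PRS probabilities
rrs_scores = [0.60, 0.30, 0.20, 0.10]  # made-up RRS probabilities
threshold, f1 = best_f1_threshold(prs_scores, rrs_scores)
print(threshold, round(f1, 3))
```

Only observed scores are tried as candidate thresholds, since F1 can change only at those values.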
## Architecture Diagram
The ASCII workflow diagram (`assets/ppiDCE.png`) covers:
- **A.** Cross-encoding input strategy
- **B.** Model architecture (ESM-1b-style backbone + classification head)
- **C.** Training pipeline
- **D.** Inference pipeline
> Note: the diagram shows Softmax in the classification head for clarity, but
> the implementation returns raw logits — softmax is applied implicitly by
> CrossEntropyLoss during training and explicitly during inference.
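
The logit-to-probability step mentioned in the note can be sketched without any deep-learning dependency. This is only an illustration of the arithmetic, assuming a two-element logit vector ordered as class 0 then class 1; the variable names mirror the inference output columns and the logit values are made up.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits = [-1.2, 2.3]                # made-up raw head outputs for one protein pair
prob_0, prob_1 = softmax(logits)
pred_label = int(prob_1 > prob_0)   # argmax over the two classes
print(pred_label, round(prob_0, 4), round(prob_1, 4))
```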
## Citation
If you use this software, please cite:
```
Daakour, S. et al. (2026).
```
## License
This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.