	---
	license: mit
	library_name: pytorch
	tags:
	- protein-protein-interaction
	- ppi
	- protein-language-model
	- esm-architecture
	- cross-encoder
	- trained-from-scratch
	- bioinformatics
	- biology
	pipeline_tag: feature-extraction
	---

	# ppiDCE

	A dual cross-encoder for binary protein-protein interaction (PPI) classification, inspired by the ESM-1b transformer architecture ([Rives et al., 2021](https://doi.org/10.1073/pnas.2016239118)) but substantially modified and trained from scratch rather than fine-tuned from the released ESM-1b checkpoint.

	![ppiDCE Architecture](assets/ppiDCE.png)

	## Overview

	ppiDCE adapts the ESM-1b transformer architecture -- a single-sequence masked language model with no native PPI capability -- for protein-protein interaction prediction by exploiting the tokenizer's sentence-pair encoding mode and training the modified model from scratch on PPI data. Both protein sequences are concatenated into a single input as `[CLS] Seq_A [SEP] Seq_B [EOS]`, so bidirectional self-attention spans both proteins and every residue can attend to every residue of its partner at every transformer layer. The final-layer `[CLS]` representation captures joint inter-protein features and is passed through a dropout + linear classification head; the resulting two logits are converted to class probabilities with softmax.
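
The pair layout described above can be sketched in plain Python. This is an illustration only: `build_pair_tokens` is a hypothetical helper, it assumes one token per residue, and it uses the bracketed token names from this README rather than the tokenizer's actual vocabulary strings.

```python
def build_pair_tokens(seq_a: str, seq_b: str) -> list[str]:
    """Lay out two sequences as [CLS] Seq_A [SEP] Seq_B [EOS] (hypothetical helper)."""
    # One token per residue is an assumption made for illustration.
    return ["[CLS]", *seq_a, "[SEP]", *seq_b, "[EOS]"]


tokens = build_pair_tokens("MKLR", "MSEDF")
print(tokens)
# Joint length = len(seq_a) + len(seq_b) + 3 special tokens.
print(len(tokens))
```

Under these assumptions, the two sequences share a single token budget, so a long partner leaves less room for the other protein.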

	The model was developed for the *Prochlorococcus marinus* MED4 interactome, where it serves as one component of a tri-model consensus framework (alongside [ppiGPLM](https://github.com/kouroshSA/ppiGPLM) and [ppiBTEP](https://github.com/kouroshSA/ppiBTEP)) for computational PPI screening.

	## Architecture

	| Parameter | Value |
	|-----------|-------|
	| Foundation | ESM-1b-inspired transformer (Rives et al., 2021) -- substantially modified, trained from scratch |
	| Strategy | Cross-encoding (sentence-pair) |
	| Layers | 12 (configurable) |
	| Classification | [CLS] -> Dropout(0.1) -> Linear -> 2 |
	| Max sequence length | 1,024 tokens |
	| Optimizer | AdamW (lr = 2 × 10⁻⁵) |
	| Loss | Cross-Entropy |

	### Cross-Encoding vs Single-Sequence

	Unlike the original ESM-1b which processes one protein at a time, ppiDCE feeds both proteins as a single concatenated input. This enables inter-protein residue-residue attention at every transformer layer -- the most expressive strategy for modeling pairwise interactions, at the cost of O((n+m)^2) attention complexity.
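
As a rough illustration of that cost, the snippet below counts attention score entries per head and per layer for a joint input versus two separately encoded sequences. The function name and the example lengths are illustrative, and special tokens are ignored.

```python
def attention_entries(n: int, m: int) -> dict[str, int]:
    """Count pairwise attention scores per head per layer (special tokens ignored)."""
    joint = (n + m) ** 2          # one concatenated input of length n + m
    separate = n ** 2 + m ** 2    # each protein encoded on its own
    return {
        "joint": joint,
        "separate": separate,
        # The extra 2*n*m entries are exactly the inter-protein residue pairs.
        "cross_terms": joint - separate,
    }


costs = attention_entries(300, 400)
print(costs)  # {'joint': 490000, 'separate': 250000, 'cross_terms': 240000}
```

The `cross_terms` entry is what the cross-encoding strategy pays for: it is the part of the attention matrix that a single-sequence encoder never computes.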

	## Installation

	### Prerequisites

	- Python 3.10+
	- CUDA-capable GPU (recommended)
	- conda (recommended) or pip

	### Setup

	```bash
	# Clone the repository
	git clone https://github.com/kouroshSA/ppiDCE.git
	cd ppiDCE

	# Create a conda environment
	conda create -n esm python=3.10
	conda activate esm
	pip install -r requirements.txt
	```

	## Repository Structure

	```
	ppiDCE/
	|-- train_ppiDCE.py                       # Training script
	|-- inference_ppiDCE.py                   # Batch inference script
	|-- roc_analysis_color_threshold_F1e.py   # ROC curve analysis with F1 optimization
	|-- assets/
	|   +-- ppiDCE.png                        # ASCII workflow diagram
	|-- requirements.txt
	|-- LICENSE
	+-- README.md
	```

	## Usage

	### Data Format

	Training and inference use CSV files with the columns `protein1_seq`, `protein2_seq`, and `label`:

	- `protein1_seq`, `protein2_seq`: Amino acid sequences
	- `label`: `0` (non-interacting) or `1` (interacting)

	For inference-only input, only the first two columns are required.
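
A minimal sketch of checking a labeled file before training, assuming a header row with the three column names listed above. The helper `check_rows` and the two example rows are made up for illustration and are not part of the repository.

```python
import csv
import io

# Two made-up rows, only to show the expected layout.
example_csv = """protein1_seq,protein2_seq,label
MKLRT,MSEDF,1
MQAGP,MTRRL,0
"""


def check_rows(handle) -> list[dict[str, str]]:
    """Read rows and fail early on empty sequences or labels other than 0/1."""
    rows = list(csv.DictReader(handle))
    for i, row in enumerate(rows, start=1):
        if not row["protein1_seq"] or not row["protein2_seq"]:
            raise ValueError(f"row {i}: empty sequence")
        if row["label"] not in {"0", "1"}:
            raise ValueError(f"row {i}: label must be 0 or 1, got {row['label']!r}")
    return rows


rows = check_rows(io.StringIO(example_csv))
print(len(rows), "valid rows")
```

To check a real file, pass an open file handle instead of the `io.StringIO` wrapper, for example `check_rows(open("train.csv", newline=""))`.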

	### Training

	```bash
	# Train from scratch with 12 layers
	python train_ppiDCE.py \
	    --train_file train.csv \
	    --val_file val.csv \
	    --model_config facebook/esm1b_t33_650M_UR50S \
	    --from_scratch \
	    --num_layers 12 \
	    --epochs 10 \
	    --batch_size 2 \
	    --learning_rate 2e-5 \
	    --max_length 1024 \
	    --output_dir ./out \
	    --device cuda
	```

	#### Key training options

	- `--from_scratch`: Initialize the ESM backbone with random weights instead of
	loading pretrained ESM-1b. Useful when you suspect single-sequence
	pretraining priors are inappropriate for your task.
	- `--num_layers N`: Set total transformer layers when training from scratch
	- `--freeze_layers N`: Freeze bottom N layers during fine-tuning
	- `--add_layers N`: Append extra transformer layers on top
	- `--checkpoint path.pth`: Resume from a saved checkpoint
	- `--suppress_warnings`: Suppress tokenizer truncation warnings

	### Quick start: fetch the checkpoint from Hugging Face

	The released MED4 checkpoint (`checkpoints/ppiDCE_epoch8.pth`, 12-layer)
	lives on this Hugging Face repo. Pull it without cloning the GitHub mirror:

	```python
	from huggingface_hub import hf_hub_download

	ckpt_path = hf_hub_download(
	    repo_id="kouroshSA/ppiDCE",
	    filename="checkpoints/ppiDCE_epoch8.pth",
	)
	print(ckpt_path)  # pass this string to --model_path
	```

	`inference_ppiDCE.py` takes the checkpoint path as a direct `--model_path`
	argument, so no rename or specific directory layout is required — point
	it straight at the file you just downloaded.

	### Input file format

	The inference script expects a CSV with two columns of plain amino-acid
	sequences (one protein pair per row — no delimiter tokens, no length
	markers, no chevrons):

	```
	seq1,seq2
	MKLR...QSH,MSEDF...VKN
	MQAG...PIA,MTRRL...EEP
	```

	A ready-made example is shipped with the repo:
	[`MED4-PPIs-low-confidence_ppiTEPM_prompts.csv`](MED4-PPIs-low-confidence_ppiTEPM_prompts.csv).
	The labeled PRS/RRS reference sets (`MED4_PRS_100.csv`, `MED4_RRS_100.csv`)
	include a third label column, which the inference script ignores — only
	the first two columns are read.
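
Because both sequences share one 1,024-token window, it can help to flag long pairs before running inference. The sketch below assumes a header row, one token per residue, and three special tokens per pair; how the script itself handles over-long inputs is not covered here. `find_long_pairs` is a hypothetical helper, not part of the repository.

```python
import csv
import io

MAX_LENGTH = 1024      # matches --max_length in the commands in this README
SPECIAL_TOKENS = 3     # assumed: [CLS], [SEP], [EOS]


def find_long_pairs(handle, max_length: int = MAX_LENGTH) -> list[int]:
    """Return 1-based data-row numbers whose joint length exceeds max_length."""
    reader = csv.reader(handle)
    next(reader)  # skip the assumed header row
    too_long = []
    for row_number, row in enumerate(reader, start=1):
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        seq1, seq2 = row[0], row[1]  # any further columns are ignored
        if len(seq1) + len(seq2) + SPECIAL_TOKENS > max_length:
            too_long.append(row_number)
    return too_long


# Synthetic input: one short pair, one long pair with an extra label column.
demo_text = "seq1,seq2\n" + "A" * 10 + "," + "C" * 10 + "\n" + "A" * 600 + "," + "C" * 600 + ",1\n"
print(find_long_pairs(io.StringIO(demo_text)))
```

Pairs reported by this check are candidates for tokenizer truncation, so their predictions deserve extra scrutiny.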

	### Inference

	```bash
	python inference_ppiDCE.py \
	    --model_path checkpoints/ppiDCE_epoch8.pth \
	    --model_config facebook/esm1b_t33_650M_UR50S \
	    --input_file MED4-PPIs-low-confidence_ppiTEPM_prompts.csv \
	    --output_file predictions.csv \
	    --batch_size 4 \
	    --max_length 1024 \
	    --device cuda
	```

	Output CSV columns: `seq1, seq2, pred_label, prob_0, prob_1`
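
One way to post-process the output is to keep the pairs with a high interaction probability and rank them. The sketch below assumes the file has a header row with the column names listed above; the rows, the helper `top_pairs`, and the 0.5 cutoff are made up for illustration.

```python
import csv
import io

# Made-up rows in the documented column order, for illustration only.
predictions_csv = """seq1,seq2,pred_label,prob_0,prob_1
MKLRT,MSEDF,1,0.10,0.90
MQAGP,MTRRL,0,0.80,0.20
MSTNP,MKKLL,1,0.35,0.65
"""


def top_pairs(handle, min_prob: float = 0.5) -> list[tuple[str, str, float]]:
    """Return (seq1, seq2, prob_1) with prob_1 >= min_prob, highest first."""
    hits = [
        (row["seq1"], row["seq2"], float(row["prob_1"]))
        for row in csv.DictReader(handle)
        if float(row["prob_1"]) >= min_prob
    ]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


for seq1, seq2, prob in top_pairs(io.StringIO(predictions_csv)):
    print(seq1, seq2, prob)
```

A cutoff chosen from the ROC analysis described in this README is likely a better choice than a fixed 0.5.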

	### ROC Analysis

	Evaluate model predictions using ROC curve analysis with threshold-colored visualization and F1 optimization:

	```bash
	python roc_analysis_color_threshold_F1e.py \
	    --input_csv probabilities.csv \
	    --output_file roc_curve.png
	```

	The input CSV should have two columns of probability values: PRS (positive reference set) and RRS (random reference set, treated as negatives).
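
The idea behind F1-based threshold selection can be sketched as follows. This is not the script's implementation: `best_f1_threshold` is a hypothetical helper, the scores are made up, and a pair is assumed to be called positive when its probability is at or above the threshold.

```python
def best_f1_threshold(prs: list[float], rrs: list[float]) -> tuple[float, float]:
    """Return (threshold, F1) maximizing F1, predicting positive when score >= threshold."""
    best = (0.0, 0.0)
    for threshold in sorted(set(prs) | set(rrs)):
        tp = sum(score >= threshold for score in prs)   # positives kept
        fp = sum(score >= threshold for score in rrs)   # negatives wrongly kept
        fn = len(prs) - tp                              # positives missed
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best[1]:
            best = (threshold, f1)
    return best


# Made-up scores, for illustration only.
prs_scores = [0.9, 0.8, 0.6, 0.3]
rrs_scores = [0.7, 0.2, 0.1, 0.05]
print(best_f1_threshold(prs_scores, rrs_scores))
```

Only observed scores are tried as candidate thresholds, since F1 can change only at those values.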

	## Architecture Diagram

	The ASCII workflow diagram (`assets/ppiDCE.png`) covers:
	- A. Cross-encoding input strategy
	- B. Model architecture (ESM-1b-style backbone + classification head)
	- C. Training pipeline
	- D. Inference pipeline

	> Note: the diagram shows Softmax in the classification head for clarity, but
	> the implementation returns raw logits — softmax is applied implicitly by
	> CrossEntropyLoss during training and explicitly during inference.
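
The inference-time step mentioned in the note, turning two raw logits into `prob_0` and `prob_1`, can be sketched in plain Python. The logit values are made up, and deriving `pred_label` as the more probable class is an assumption about the script rather than something this README states.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits = [-1.2, 2.3]                 # made-up raw outputs for one pair
prob_0, prob_1 = softmax(logits)
pred_label = int(prob_1 > prob_0)    # assumed: pick the more probable class
print(pred_label, round(prob_0, 3), round(prob_1, 3))
```

Because softmax preserves the ordering of the logits, the predicted class is the same whether it is taken from the logits or from the probabilities.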

	## Citation

	If you use this software, please cite:

	```
	Daakour, S. et al. (2026).
	```

	## License

	This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.