@@ -1,30 +1,54 @@
-# ppiDCE
-A dual cross-encoder for binary protein-protein interaction (PPI) classification, built on ESM-1b ([Rives et al., 2021](https://doi.org/10.1073/pnas.2016239118)).
-![ppiDCE Architecture](assets/ppiDCE.png)
 ## Overview
-ppiDCE repurposes ESM-1b -- a single-sequence masked language model with no native PPI capability -- for protein-protein interaction prediction by exploiting its tokenizer's sentence-pair encoding mode. Both protein sequences are concatenated into a single input as `[CLS] Seq_A [SEP] Seq_B [EOS]`, enabling full bidirectional cross-attention between the two sequences at every transformer layer. The `[CLS]` token representation from the final layer captures joint inter-protein features and is passed through a dropout + linear classification head to produce binary interaction predictions with softmax probabilities.
-The model was developed for the *Prochlorococcus marinus* MED4 interactome, where it serves as one component of a tri-model consensus framework (alongside [ppiGPLM](https://github.com/kouroshSA/ppiGPLM) and [ppiBTEP](https://github.com/kouroshSA/ppiBTEP)) for computational PPI screening.
 ## Architecture
 | Parameter | Value |
 |-----------|-------|
 | Foundation | ESM-1b (facebook/esm1b_t33_650M_UR50S) |
-| Strategy | Cross-encoding (sentence-pair) |
-| Layers | 12 (configurable) |
-| Classification | [CLS] -> Dropout(0.1) -> Linear -> 2 |
 | Max sequence length | 1,024 tokens |
-| Optimizer | AdamW (lr = 2 x 10^-5) |
 | Loss | Cross-Entropy |
-### Cross-Encoding vs Single-Sequence
-Unlike the original ESM-1b which processes one protein at a time, ppiDCE feeds both proteins as a single concatenated input. This enables inter-protein residue-residue attention at every transformer layer -- the most expressive strategy for modeling pairwise interactions, at the cost of O((n+m)^2) attention complexity.
 ## Installation
@@ -38,8 +62,8 @@ Unlike the original ESM-1b which processes one protein at a time, ppiDCE feeds b
 ```bash
 # Clone the repository
-git clone https://github.com/kouroshSA/ppiDCE.git
-cd ppiDCE
 # Create a conda environment
 conda create -n esm python=3.10
@@ -50,12 +74,12 @@ pip install -r requirements.txt
 ## Repository Structure
 ```
-ppiDCE/
-|-- train_ppiDCE.py                    # Training script
-|-- inference_ppiDCE.py                # Batch inference script
 |-- roc_analysis_color_threshold_F1e.py  # ROC curve analysis with F1 optimization
 |-- assets/
-|   +-- ppiDCE.png                     # ASCII workflow diagram
 |-- requirements.txt
 |-- LICENSE
 +-- README.md
@@ -65,10 +89,10 @@ ppiDCE/
 ### Data Format
-Training and inference use CSV files with columns: `protein1_seq, protein2_seq, label`
-- `protein1_seq`, `protein2_seq`: Amino acid sequences
-- `label`: `0` (non-interacting) or `1` (interacting)
 For inference-only input, only the first two columns are required.
@@ -76,15 +100,15 @@ For inference-only input, only the first two columns are required.
 ```bash
 # Train from scratch with 12 layers
-python train_ppiDCE.py \
     --train_file train.csv \
     --val_file val.csv \
     --model_config facebook/esm1b_t33_650M_UR50S \
-    --from_scratch \
     --num_layers 12 \
-    --epochs 10 \
     --batch_size 2 \
-    --learning_rate 2e-5 \
     --max_length 1024 \
     --output_dir ./out \
     --device cuda
@@ -92,21 +116,20 @@ python train_ppiDCE.py \
 #### Key training options
-- `--from_scratch`: Initialize the ESM backbone with random weights instead of
-  loading pretrained ESM-1b. Useful when you suspect single-sequence
-  pretraining priors are inappropriate for your task.
-- `--num_layers N`: Set total transformer layers when training from scratch
-- `--freeze_layers N`: Freeze bottom N layers during fine-tuning
-- `--add_layers N`: Append extra transformer layers on top
 - `--checkpoint path.pth`: Resume from a saved checkpoint
-- `--suppress_warnings`: Suppress tokenizer truncation warnings
 ### Inference
 ```bash
-python inference_ppiDCE.py \
-    --model_path out/ppiDCE_epoch8.pth \
     --model_config facebook/esm1b_t33_650M_UR50S \
     --input_file test_pairs.csv \
     --output_file predictions.csv \
     --batch_size 4 \
@@ -114,7 +137,18 @@ python inference_ppiDCE.py \
     --device cuda
 ```
-Output CSV columns: `seq1, seq2, pred_label, prob_0, prob_1`
 ### ROC Analysis
@@ -130,11 +164,11 @@ The input CSV should have two columns: PRS (positive) and RRS (random/negative)
 ## Architecture Diagram
-The ASCII workflow diagram (`assets/ppiDCE.png`) covers:
-- **A.** Cross-encoding input strategy
-- **B.** Model architecture (ESM-1b backbone + classification head)
 - **C.** Training pipeline
-- **D.** Inference pipeline
 > Note: the diagram shows Softmax in the classification head for clarity, but
 > the implementation returns raw logits — softmax is applied implicitly by

+---
+license: mit
+library_name: pytorch
+base_model: facebook/esm1b_t33_650M_UR50S
+tags:
+  - protein-protein-interaction
+  - ppi
+  - protein-language-model
+  - esm
+  - esm-1b
+  - siamese
+  - bioinformatics
+  - biology
+pipeline_tag: feature-extraction
+---
+# ppiBTEP
+A Siamese (twin-branch) protein-protein interaction classifier built on ESM-1b ([Rives et al., 2021](https://doi.org/10.1073/pnas.2016239118)). Also designated SiameseBTPE (BERT-Twin Protein Encoder).
+![ppiBTEP Architecture](assets/ppiBTEP.png)
 ## Overview
+ppiBTEP processes each protein independently through a shared ESM-1b encoder -- no cross-sequence attention is used between the two proteins. Each branch extracts the `[CLS]` token embedding from the final transformer layer, the two embeddings are concatenated, and a dropout + linear classification head produces binary interaction predictions with softmax probabilities.
+Unlike the cross-encoding approach (see [ppiDCE](https://github.com/kouroshSA/ppiDCE)), ppiBTEP must capture interaction-predictive features entirely from each protein's own sequence context. This makes it faster per pair and allows protein representations to be precomputed and reused, at the cost of not modeling direct inter-protein residue dependencies.
+The model was developed for the *Prochlorococcus marinus* MED4 interactome, where it serves as one component of a tri-model consensus framework (alongside [ppiGPLM](https://github.com/kouroshSA/ppiGPLM) and [ppiDCE](https://github.com/kouroshSA/ppiDCE)) for computational PPI screening.
 ## Architecture
 | Parameter | Value |
 |-----------|-------|
 | Foundation | ESM-1b (facebook/esm1b_t33_650M_UR50S) |
+| Strategy | Siamese / twin-branch |
+| Layers | 12 by default; 6, 8, 12, 16, or 18 selectable via `--num_layers` |
+| Classification | Concat [CLS_A, CLS_B] -> Dropout(0.1) -> Linear -> 2 |
 | Max sequence length | 1,024 tokens |
+| Optimizer | AdamW (lr = 1 x 10^-5) |
 | Loss | Cross-Entropy |
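
The classification path in the table can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's training code: `SiameseHead` and `ToyEncoder` are hypothetical names, and `ToyEncoder` is a small stand-in for the ESM-1b backbone so the example runs without downloading weights. The one point it is meant to show is that a single encoder module serves both branches, and that the two position-0 (`[CLS]`) embeddings are concatenated before dropout and the linear layer.

```python
import torch
import torch.nn as nn


class SiameseHead(nn.Module):
    """Concat [CLS_A, CLS_B] -> Dropout -> Linear -> 2 logits."""

    def __init__(self, encoder: nn.Module, hidden_size: int, dropout: float = 0.1):
        super().__init__()
        self.encoder = encoder  # one module reused for both branches (shared weights)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(2 * hidden_size, 2)

    def forward(self, tokens_a, tokens_b):
        # Each protein is encoded on its own; position 0 plays the role of [CLS].
        cls_a = self.encoder(tokens_a)[:, 0, :]
        cls_b = self.encoder(tokens_b)[:, 0, :]
        pooled = torch.cat([cls_a, cls_b], dim=-1)
        return self.classifier(self.dropout(pooled))  # raw logits


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for the ESM-1b backbone (embedding + tiny transformer)."""

    def __init__(self, vocab_size: int = 33, hidden_size: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        layer = nn.TransformerEncoderLayer(
            hidden_size, nhead=4, dim_feedforward=64, batch_first=True
        )
        self.layers = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):
        return self.layers(self.embed(tokens))  # (batch, length, hidden)


torch.manual_seed(0)
model = SiameseHead(ToyEncoder(), hidden_size=32).eval()
tokens_a = torch.randint(0, 33, (4, 50))  # 4 random "proteins" of length 50
tokens_b = torch.randint(0, 33, (4, 70))  # their partners, length 70
with torch.no_grad():
    logits = model(tokens_a, tokens_b)
    probs = torch.softmax(logits, dim=-1)  # probabilities for reporting
print(logits.shape, probs.sum(dim=-1))
```

Because the two branches never see each other's tokens, the sequences in a pair may have different lengths without any joint padding.
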
+### Siamese vs Cross-Encoder
+| | ppiDCE (Cross-Encoder) | ppiBTEP (Siamese) |
+|---|---|---|
+| Input | `[CLS] Seq_A [SEP] Seq_B` (joint) | `[CLS] Seq_A` and `[CLS] Seq_B` (separate) |
+| Cross-attention | Full bidirectional at every layer | None |
+| Classification | Single [CLS] -> Linear | Concat [CLS_A, CLS_B] -> Linear |
+| Complexity | O((n+m)^2) | O(n^2) + O(m^2) |
+| Speed | Slower (joint encoding) | Faster (independent, reusable) |
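
To make the complexity and speed rows concrete, the following back-of-the-envelope sketch counts attention score entries per layer and encoder passes for an all-vs-all screen. The function names are hypothetical, special tokens are ignored, and the pass count assumes Siamese embeddings are cached and reused across pairs (the per-pair linear head is not counted).

```python
def cross_encoder_cost(n: int, m: int) -> int:
    """Attention score entries for one joint pass over n + m residues."""
    return (n + m) ** 2


def siamese_cost(n: int, m: int) -> int:
    """Attention score entries for two independent passes."""
    return n ** 2 + m ** 2


def encoder_passes(num_proteins: int) -> tuple[int, int]:
    """Encoder passes needed to score every unordered pair.

    Returns (cross_encoder_passes, siamese_passes), assuming the Siamese
    model encodes each protein once and reuses the embedding.
    """
    pairs = num_proteins * (num_proteins - 1) // 2
    return pairs, num_proteins


n, m = 300, 500
print(cross_encoder_cost(n, m))  # 640000
print(siamese_cost(n, m))        # 340000
print(encoder_passes(2000))      # (1999000, 2000)
```

The gap between the two attention costs is exactly `2 * n * m`, which is the block of inter-protein attention entries the Siamese model gives up.
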
 ## Installation
 ```bash
 # Clone the repository
+git clone https://github.com/kouroshSA/ppiBTEP.git
+cd ppiBTEP
 # Create a conda environment
 conda create -n esm python=3.10
 ## Repository Structure
 ```
+ppiBTEP/
+|-- train_ppiBTPE3b.py                   # Training script
+|-- inference_ppiBTPE_2GPU.py            # Batch inference script (multi-GPU)
 |-- roc_analysis_color_threshold_F1e.py  # ROC curve analysis with F1 optimization
 |-- assets/
+|   +-- ppiBTEP.png                      # ASCII workflow diagram
 |-- requirements.txt
 |-- LICENSE
 +-- README.md
 ### Data Format
+Training and inference use CSV files with columns: `seq1, seq2, label`
+- `seq1`, `seq2`: Amino acid sequences
+- `label`: `0` or `enemies` (non-interacting), `1` or `friends` (interacting)
 For inference-only input, only the first two columns are required.
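
As an illustration of the format described above, the sketch below reads such a CSV and maps both label spellings to integers. It is not taken from the training script: `read_pairs` and `LABEL_MAP` are hypothetical names, a header row is assumed, and the case-insensitive matching is a convenience added here. The sequences are made up.

```python
import csv
import io

# Assumed mapping, following the label description above.
LABEL_MAP = {"0": 0, "enemies": 0, "1": 1, "friends": 1}


def read_pairs(handle):
    """Yield (seq1, seq2, label); label is None when the column is absent or empty."""
    for row in csv.DictReader(handle):
        raw = (row.get("label") or "").strip().lower()
        label = LABEL_MAP[raw] if raw else None  # unknown labels raise KeyError
        yield row["seq1"].strip(), row["seq2"].strip(), label


example = io.StringIO(
    "seq1,seq2,label\n"
    "MKTAYIAK,MSLLTEVE,friends\n"
    "MKTAYIAK,MGSSHHHH,0\n"
)
pairs = list(read_pairs(example))
print(pairs)
```
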
 ```bash
 # Train from scratch with 12 layers
+python train_ppiBTPE3b.py \
     --train_file train.csv \
     --val_file val.csv \
     --model_config facebook/esm1b_t33_650M_UR50S \
     --num_layers 12 \
+    --freeze_layers 0 \
+    --epochs 20 \
     --batch_size 2 \
+    --learning_rate 1e-5 \
     --max_length 1024 \
     --output_dir ./out \
     --device cuda
 #### Key training options
+- `--num_layers N`: Total transformer layers (6, 8, 12, 16, or 18)
+- `--freeze_layers N`: Freeze bottom N layers (use 0 for training from scratch)
 - `--checkpoint path.pth`: Resume from a saved checkpoint
+- `--model_config`: ESM model config (default: `facebook/esm1b_t33_650M_UR50S`)
+**Important:** When training from scratch, use `--freeze_layers 0` to ensure all layers (including embeddings) remain trainable. The default is 20, which would freeze most layers.
 ### Inference
 ```bash
+python inference_ppiBTPE_2GPU.py \
+    --model_path out/ppiBTPE_epoch_17.pth \
     --model_config facebook/esm1b_t33_650M_UR50S \
+    --num_layers 12 \
     --input_file test_pairs.csv \
     --output_file predictions.csv \
     --batch_size 4 \
     --device cuda
 ```
+Multi-GPU inference:
+```bash
+python inference_ppiBTPE_2GPU.py \
+    --model_path out/ppiBTPE_final.pth \
+    --model_config facebook/esm1b_t33_650M_UR50S \
+    --num_layers 12 \
+    --input_file test_pairs.csv \
+    --output_file predictions.csv \
+    --device cuda:0,1
+```
+Output CSV columns: `seq1, seq2, Prediction, Probability_Friends, Probability_Enemies`
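
One way to post-process a predictions file with these columns is sketched below. The helper name `confident_interactions`, the 0.9 cutoff, and the example rows (including the values shown in the `Prediction` column) are assumptions for illustration only; a real threshold would normally be chosen from a ROC analysis rather than fixed by hand.

```python
import csv
import io


def confident_interactions(handle, threshold=0.9):
    """Return (seq1, seq2, probability) rows with Probability_Friends >= threshold."""
    hits = []
    for row in csv.DictReader(handle):
        p_friends = float(row["Probability_Friends"])
        if p_friends >= threshold:
            hits.append((row["seq1"], row["seq2"], p_friends))
    # Highest-confidence pairs first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Made-up rows; real files come from the inference script.
example = io.StringIO(
    "seq1,seq2,Prediction,Probability_Friends,Probability_Enemies\n"
    "MKTAYIAK,MSLLTEVE,friends,0.97,0.03\n"
    "MKTAYIAK,MGSSHHHH,enemies,0.12,0.88\n"
    "MSLLTEVE,MGSSHHHH,friends,0.91,0.09\n"
)
hits = confident_interactions(example, threshold=0.9)
print(hits)
```
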
 ### ROC Analysis
 ## Architecture Diagram
+The ASCII workflow diagram (`assets/ppiBTEP.png`) covers:
+- **A.** Siamese input strategy (independent per-protein encoding)
+- **B.** Model architecture (twin ESM-1b branches + concat classification head)
 - **C.** Training pipeline
+- **D.** Inference pipeline (multi-GPU)
 > Note: the diagram shows Softmax in the classification head for clarity, but
 > the implementation returns raw logits — softmax is applied implicitly by