---
license: mit
library_name: pytorch
tags:
- protein-protein-interaction
- ppi
- protein-language-model
- gpt-2
- nanogpt
- character-level
- trained-from-scratch
- bioinformatics
- biology
pipeline_tag: text-generation
---
# ppiGPLM
A GPT-2 small protein language model trained from scratch on protein-pair prompts and used for binary protein-protein interaction (PPI) classification via next-token prediction. The implementation is based on [nanoGPT](https://github.com/karpathy/nanoGPT) by Andrej Karpathy, with character-level tokenization over amino acids.

## Overview
ppiGPLM uses a GPT-2 small architecture (12 layers, 12 attention heads, 768 embedding dimensions) with character-level tokenization to predict whether two proteins interact. Rather than using a separate classification head, ppiGPLM frames PPI prediction as next-token prediction: given a structured prompt encoding a protein pair, the model predicts a binary label (`0` or `1`) as the next token. Softmax probabilities over the label tokens provide continuous interaction scores.
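
As a rough illustration of the scoring idea, the sketch below turns two next-token logits into an interaction score. It assumes the softmax is taken over only the two label tokens (`0` and `1`); the actual script may instead read both values from a softmax over the full vocabulary. The function name and the example logits are hypothetical.

```python
import math


def label_probabilities(logit_0: float, logit_1: float) -> dict:
    """Softmax restricted to the two label tokens "0" and "1" (an assumption)."""
    m = max(logit_0, logit_1)  # subtract the max for numerical stability
    e0 = math.exp(logit_0 - m)
    e1 = math.exp(logit_1 - m)
    total = e0 + e1
    return {"0": e0 / total, "1": e1 / total}


# Made-up logits for the position right after the prompt.
probs = label_probabilities(logit_0=-1.2, logit_1=0.8)
interaction_score = probs["1"]  # continuous score in [0, 1]
```
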
The model was developed for the *Prochlorococcus marinus* MED4 interactome, where it serves as one component of a tri-model consensus framework for computational PPI screening.
## Architecture
| Parameter | Value |
|-----------|-------|
| Architecture | GPT-2 small |
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Context length | 4,096 tokens |
| Tokenization | Character-level (one token per amino acid) |
| Dropout | 0.2 |
| Optimizer | AdamW (lr = 5e-4, beta2 = 0.99) |
| Training iterations | 8,000 |
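
For a sense of scale, the table values give a back-of-the-envelope weight count. This is only an estimate under the standard GPT-2 block layout (attention QKV plus output projection, MLP with 4x expansion); it ignores biases, layer norms, and the token embedding, whose size depends on the vocabulary. `ModelShape` is a hypothetical name used for this sketch only.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    """Values copied from the architecture table above."""
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    block_size: int = 4096


shape = ModelShape()
head_dim = shape.n_embd // shape.n_head

# Per block: attention uses 4 * n_embd^2 weights (QKV + output projection),
# the MLP uses 8 * n_embd^2 (n_embd -> 4*n_embd -> n_embd).
block_weights = 12 * shape.n_embd ** 2
transformer_weights = shape.n_layer * block_weights

# Learned position embeddings: one n_embd vector per context position.
position_weights = shape.block_size * shape.n_embd

print(head_dim, transformer_weights, position_weights)
```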
## Installation
### Prerequisites
- Python 3.8+
- CUDA-capable GPU (recommended) or CPU
- conda (recommended) or pip
### Setup
```bash
# Clone the repository
git clone https://github.com/kouroshSA/ppiGPLM.git
cd ppiGPLM
# Create a conda environment
conda create -n gpt python=3.10
conda activate gpt
pip install -r requirements.txt
```
## Repository Structure
```
ppiGPLM/
|-- model.py                                      # GPT model definition
|-- train_.py                                     # Training loop
|-- sample_fasta3.3_softmax_error_handling3e.py   # Batch inference script
|-- LES-wrapper.py                                # Learning Efficiency Score evaluation wrapper
|-- LES-wrapper.md                                # LES-wrapper documentation
|-- roc_analysis_color_threshold_F1e.py           # ROC curve analysis
|-- configurator.py                               # Configuration utility
|-- config/
|   |-- train_par_gpt2-s_scratch.py               # Training config (GPT-2 small, from scratch)
|   +-- finetune_label3.py                        # Fine-tuning config
|-- data/
|   +-- MED4_char/                                # MED4 PPI dataset
|       |-- prepare.py                            # Character-level tokenizer
|       +-- meta.pkl                              # Vocabulary (stoi/itos mappings)
|-- assets/
|   |-- ppiGPLM.png                               # ASCII workflow diagram
|   |-- tri_model_consensus.svg                   # Tri-model consensus framework (SVG)
|   +-- tri_model_consensus.png                   # Tri-model consensus framework (PNG)
|-- requirements.txt
|-- LICENSE
+-- README.md
```
## Usage
### Quick start: fetch the checkpoint from Hugging Face
The released MED4 checkpoint (`checkpoints/ppiGPLM_ckpt_7e.pt`, epoch ≈ 71)
lives on this Hugging Face repo. To pull it without cloning the GitHub
mirror:
```python
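from huggingface_hub import hf_hub_download

# Download the released checkpoint into the local Hugging Face cache
ckpt_path = hf_hub_download(
    repo_id="kouroshSA/ppiGPLM",
    filename="checkpoints/ppiGPLM_ckpt_7e.pt",
)
# Download the character vocabulary that matches the checkpoint
meta_path = hf_hub_download(
    repo_id="kouroshSA/ppiGPLM",
    filename="data/MED4_char/meta.pkl",
)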
```
`meta.pkl` carries the character vocabulary (`stoi`/`itos`) the inference
script needs to encode protein sequences.
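
To make the role of `stoi`/`itos` concrete, here is a minimal sketch of character-level encoding and decoding. It uses a toy 20-residue vocabulary as a stand-in for the real file, so the integer IDs shown here are an assumption and will not match the released `meta.pkl`; `encode` and `decode` are illustrative helpers, not functions from the repository.

```python
import pickle

# Toy stand-in for meta.pkl (the real file is written by data/MED4_char/prepare.py).
alphabet = sorted("ACDEFGHIKLMNPQRSTVWY")
toy_meta = {
    "stoi": {ch: i for i, ch in enumerate(alphabet)},  # character -> token ID
    "itos": {i: ch for i, ch in enumerate(alphabet)},  # token ID -> character
}

# Round-trip through pickle; with the real file you would instead use
# pickle.load(open(meta_path, "rb")).
meta = pickle.loads(pickle.dumps(toy_meta))
stoi, itos = meta["stoi"], meta["itos"]


def encode(text):
    """Map each character to its integer token ID."""
    return [stoi[ch] for ch in text]


def decode(ids):
    """Map token IDs back to characters."""
    return "".join(itos[i] for i in ids)


ids = encode("MKV")
```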
#### Wiring the checkpoint into the inference script
`sample_fasta3.3_softmax_error_handling3e.py` loads from
`<model_dir>/ckpt.pt`, where `model_dir = 'out'` is set inline near the
top of the script and the trailing `ckpt.pt` filename is **hardcoded**.
Two ways to make the downloaded file work:
**Option A (place the file where the defaults expect it):**
```bash
mkdir -p out
cp /path/to/ppiGPLM_ckpt_7e.pt out/ckpt.pt # note the required rename
```
Then run the inference command below as-is.
**Option B (override `model_dir` via the poor-man's configurator,
`configurator.py`):**
```bash
mkdir -p my_ckpts
cp /path/to/ppiGPLM_ckpt_7e.pt my_ckpts/ckpt.pt # still needs to be ckpt.pt
python sample_fasta3.3_softmax_error_handling3e.py \
    --input_file MED4-PPIs-low-confidence_ppiGPLM_prompts.csv \
    --output_dir ppi_results \
    --output_prefix my_predictions \
    --model_dir=my_ckpts
```
Either way, the on-disk filename must be `ckpt.pt`. You could instead edit
the script (change the `model_dir` default near the top, or the literal
`'ckpt.pt'` further down), but the rename above is simpler.
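
The override mechanism can be pictured with the simplified sketch below. It is not the repository's `configurator.py` (which may behave differently); it only illustrates the general `--key=value` idea of replacing a default that already exists in the script. `apply_overrides` is a hypothetical helper.

```python
from ast import literal_eval


def apply_overrides(config: dict, argv: list) -> dict:
    """Return a copy of config with --key=value arguments applied."""
    out = dict(config)
    for arg in argv:
        if not (arg.startswith("--") and "=" in arg):
            continue  # not a --key=value override; this sketch ignores it
        key, raw = arg[2:].split("=", 1)
        if key not in out:
            raise KeyError(f"unknown config key: {key}")
        try:
            value = literal_eval(raw)  # numbers, booleans, quoted strings
        except (ValueError, SyntaxError):
            value = raw  # bare strings such as my_ckpts
        out[key] = value
    return out


defaults = {"model_dir": "out"}
config = apply_overrides(defaults, ["--model_dir=my_ckpts"])
```
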
The character vocabulary (`meta.pkl`) is read from
`data/<dataset>/meta.pkl`, where `<dataset>` comes from
`checkpoint['config']['dataset']` (`MED4_char` for this checkpoint). Make
sure that path exists: either keep the `data/MED4_char/` directory from
the GitHub clone, or copy the file at the downloaded `meta_path` there.
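
For those who prefer to do the Option A staging from Python, a minimal sketch follows. `stage_files` is a hypothetical helper, not part of the repository; it simply copies the two downloaded files to the locations described above, applying the `ckpt.pt` rename.

```python
import shutil
from pathlib import Path


def stage_files(ckpt_path, meta_path, root=".", model_dir="out", dataset="MED4_char"):
    """Copy the checkpoint and vocabulary to the paths the script reads from."""
    root = Path(root)
    ckpt_dst = root / model_dir / "ckpt.pt"           # filename is fixed by the script
    meta_dst = root / "data" / dataset / "meta.pkl"   # data/<dataset>/meta.pkl
    for src, dst in ((ckpt_path, ckpt_dst), (meta_path, meta_dst)):
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
    return ckpt_dst, meta_dst
```

Called as `stage_files(ckpt_path, meta_path)` from the repository root, it should leave `out/ckpt.pt` and `data/MED4_char/meta.pkl` in place.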
### Input file format
Each line of `--input_file` is one structured prompt (one protein pair),
not a free-form FASTA record. The schema is:
```
<ps1>,SEQ_A,<ps2>,SEQ_B,<l1>,LEN_A,<l2>,LEN_B,<l3>,<
```
- `<ps1>`, `<ps2>`: protein-sequence delimiter tokens
- `<l1>`, `<l2>`, `<l3>`: length-field delimiter tokens
- The trailing `,<` is **the cue**: it tells the model the next token to
generate is the classification label (`1` = interacting, `0` = not).
Don't omit it.
A ready-made example is shipped with the repo:
[`MED4-PPIs-low-confidence_ppiGPLM_prompts.csv`](MED4-PPIs-low-confidence_ppiGPLM_prompts.csv).
Inspect or copy its format when building your own input file.
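
The schema above can be assembled programmatically. The sketch below assumes that `LEN_A` and `LEN_B` are the residue counts of the two sequences written as plain integers; check this against the shipped example file before relying on it. `build_prompt` is a hypothetical helper.

```python
def build_prompt(seq_a: str, seq_b: str) -> str:
    """Build one prompt line; the trailing "<" is the label cue."""
    fields = [
        "<ps1>", seq_a,
        "<ps2>", seq_b,
        "<l1>", str(len(seq_a)),  # assumed: length = number of residues
        "<l2>", str(len(seq_b)),
        "<l3>", "<",
    ]
    return ",".join(fields)


# Toy sequences, far shorter than real proteins.
line = build_prompt("MKV", "AC")
```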
### Batch Inference
Run inference on a file of prompts:
```bash
python sample_fasta3.3_softmax_error_handling3e.py \
    --input_file MED4-PPIs-low-confidence_ppiGPLM_prompts.csv \
    --output_dir ppi_results \
    --output_prefix my_predictions
```
This produces:
- `*_classifications.txt`: Full model output in FASTA-like format
- `*_probabilities.csv`: Per-pair probabilities for class 1 and class 0
#### About the inference script
`sample_fasta3.3_softmax_error_handling3e.py` is derived from Karpathy's
nanoGPT `sample.py`: it reuses the same `GPTConfig`/`GPT` classes from
`model.py`, the `init_from = 'resume'` checkpoint-loading idiom, and the
`_orig_mod.` prefix strip for `torch.compile`-wrapped state dicts. It is
**not** a drop-in copy, however. The modifications make it a batch
classifier rather than a generic sampler:
- batch input: one prompt per line read from `--input_file`, processed
sequentially with no interactive loop;
- classifier-style output: per-prompt softmax probabilities of the next
token being `"1"` vs `"0"`, written to `*_probabilities.csv` alongside
the conventional `generate()` output dump in `*_classifications.txt`;
- robustness against real-world inputs: automatic block-size detection
(`checkpoint['model_args']['block_size']` or
`model.config.n_positions`), head-clipping when a prompt exceeds the
context window so the trailing `<` label-cue token survives, and
out-of-vocabulary character replacement (defaults to `A`).
The lineage is **GPT-2 → nanoGPT → ppiGPLM's batch-classifier sampler**.
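
Two of the robustness steps listed above, out-of-vocabulary replacement and head-clipping, can be sketched as follows. This is an illustration of the idea only: the function names are hypothetical, and the real script may keep fewer than `block_size` tokens (for example, to leave room for the generated label).

```python
def replace_unknown(text, vocab, fallback="A"):
    """Replace characters missing from the vocabulary with a fallback residue."""
    return "".join(ch if ch in vocab else fallback for ch in text)


def clip_head(token_ids, block_size):
    """Drop tokens from the front so the end of the prompt (the cue) survives."""
    if len(token_ids) > block_size:
        return token_ids[-block_size:]
    return token_ids


cleaned = replace_unknown("MKXV", vocab=set("MKVA"))  # "X" is not in this toy vocab
clipped = clip_head(list(range(10)), block_size=4)    # keeps the last 4 token IDs
```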
### Training
#### Prepare data
```bash
python data/MED4_char/prepare.py
```
This creates `train.bin`, `val.bin`, and `meta.pkl` from the input training data.
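
For orientation, the sketch below shows what a character-level preparation step of this kind typically does, following the pattern of nanoGPT's character-level `prepare.py`. It assumes a 90/10 train/validation split and 16-bit token IDs; the repository's `prepare.py` may use different choices. It returns bytes instead of writing files so that it runs anywhere.

```python
import pickle
from array import array


def prepare(text, train_fraction=0.9):
    """Build a character vocabulary and split encoded text into train/val."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    ids = [stoi[ch] for ch in text]
    split = int(len(ids) * train_fraction)
    train_bytes = array("H", ids[:split]).tobytes()  # would be written to train.bin
    val_bytes = array("H", ids[split:]).tobytes()    # would be written to val.bin
    meta_bytes = pickle.dumps(                       # would be written to meta.pkl
        {"vocab_size": len(chars), "stoi": stoi, "itos": itos}
    )
    return train_bytes, val_bytes, meta_bytes


train_bytes, val_bytes, meta_bytes = prepare("MKVACMKVAC")  # toy input text
```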
#### Train the model
```bash
# Single GPU
python train_.py config/train_par_gpt2-s_scratch.py
# Multi-GPU (2 GPUs)
torchrun --standalone --nproc_per_node=2 train_.py config/train_par_gpt2-s_scratch.py
```
### Learning Efficiency Score (LES) Evaluation
The LES-wrapper automates evaluation across multiple training checkpoints, computing ROC-AUC, F1, and optimal threshold at each checkpoint and deriving integrated Learning Efficiency Scores:
```bash
python LES-wrapper.py \
    --checkpoint_dir out \
    --prs_file PRS.txt \
    --rrs_file RRS.txt \
    --output_dir LES_results \
    --vanilla
```
See [LES-wrapper.md](LES-wrapper.md) for full documentation.
### Standalone ROC Analysis
```bash
python roc_analysis_color_threshold_F1e.py \
    --prs_file ppi_results/PRS_probabilities.csv \
    --rrs_file ppi_results/RRS_probabilities.csv
```
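
The quantities involved can be illustrated with a small pure-Python sketch. It assumes that the PRS file holds scores for positive (interacting) pairs and the RRS file scores for negative or random pairs, and that the "optimal" threshold is the one maximizing F1; neither assumption is confirmed here, and the function names are hypothetical. The scores are invented for the example.

```python
def roc_auc(pos, neg):
    """Probability that a positive pair outscores a negative one (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_f1_threshold(pos, neg):
    """Scan every observed score as a threshold; return (best F1, threshold)."""
    best = (0.0, None)
    for t in sorted(set(pos) | set(neg)):
        tp = sum(p >= t for p in pos)  # positives called interacting
        fp = sum(n >= t for n in neg)  # negatives called interacting
        fn = len(pos) - tp
        f1 = 2 * tp / (2 * tp + fp + fn)
        if f1 > best[0]:
            best = (f1, t)
    return best


prs_scores = [0.9, 0.8, 0.4]  # made-up class-1 probabilities for positive pairs
rrs_scores = [0.7, 0.3, 0.2]  # made-up class-1 probabilities for negative pairs
auc = roc_auc(prs_scores, rrs_scores)
best_f1, threshold = best_f1_threshold(prs_scores, rrs_scores)
```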
## Architecture Diagrams
The ASCII workflow diagram (`assets/ppiGPLM.png`) covers:
- **A.** Prompt-based input strategy (character-level tokenization)
- **B.** Model architecture (GPT-2 small, causal self-attention)
- **C.** Training pipeline
- **D.** Inference pipeline with LES evaluation
> Note: the diagram lists "Flash Attention". This path is taken automatically
> when running on PyTorch ≥ 2.0; older versions fall back to the manual
> scaled-dot-product implementation. Numerical results are equivalent.
See `assets/tri_model_consensus.svg` for the tri-model consensus framework with [ppiDCE](https://github.com/kouroshSA/ppiDCE) and [ppiBTEP](https://github.com/kouroshSA/ppiBTEP).
## Citation
If you use this software, please cite:
```
Daakour, S. et al. (2026).
```
This software is built on nanoGPT:
```
Karpathy, A. (2022). nanoGPT. https://github.com/karpathy/nanoGPT
```
## License
This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.
The original nanoGPT framework is by Andrej Karpathy (MIT License, 2022). Modifications and additions for protein-protein interaction prediction are by Kourosh Salehi-Ashtiani (MIT License, 2026).