---
license: mit
library_name: pytorch
tags:
  - protein-protein-interaction
  - ppi
  - protein-language-model
  - gpt-2
  - nanogpt
  - character-level
  - trained-from-scratch
  - bioinformatics
  - biology
pipeline_tag: text-generation
---

# ppiGPLM

A GPT-2 small protein language model trained from scratch on protein-pair prompts and used for binary protein-protein interaction (PPI) classification via next-token prediction. The implementation is based on [nanoGPT](https://github.com/karpathy/nanoGPT) by Andrej Karpathy, with character-level tokenization over amino acids.

![ppiGPLM](assets/ppiGPLM.png)

## Overview

ppiGPLM uses a GPT-2 small architecture (12 layers, 12 attention heads, 768 embedding dimensions) with character-level tokenization to predict whether two proteins interact. Rather than using a separate classification head, ppiGPLM frames PPI prediction as next-token prediction: given a structured prompt encoding a protein pair, the model predicts a binary label (`0` or `1`) as the next token. Softmax probabilities over the label tokens provide continuous interaction scores.
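
The scoring step can be sketched in a few lines. This is a minimal illustration, not the code shipped in the repository. The function name `label_probabilities`, the toy logits, and the token ids are all hypothetical. It also assumes the score is read from a softmax over the full next-token distribution; the actual script may normalize differently.

```python
import math

def label_probabilities(logits, id_zero, id_one):
    """Softmax over the next-token logits, then read off the two label tokens."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return exps[id_zero] / total, exps[id_one] / total

# Toy next-token logits over a hypothetical 5-token vocabulary;
# ids 3 and 4 stand in for the characters "0" and "1".
logits = [0.1, -1.2, 0.3, 1.0, 2.5]
p0, p1 = label_probabilities(logits, id_zero=3, id_one=4)
print(f"P(0)={p0:.3f}  P(1)={p1:.3f}")
```

The probability assigned to `1` can then be used directly as a continuous interaction score and thresholded later.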

The model was developed for the *Prochlorococcus marinus* MED4 interactome, where it serves as one component of a tri-model consensus framework for computational PPI screening.

## Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | GPT-2 small |
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Context length | 4,096 tokens |
| Tokenization | Character-level (one token per amino acid) |
| Dropout | 0.2 |
| Optimizer | AdamW (lr = 5e-4, beta2 = 0.99) |
| Training iterations | 8,000 |

## Installation

### Prerequisites

- Python 3.8+
- CUDA-capable GPU (recommended) or CPU
- conda (recommended) or pip

### Setup

```bash
# Clone the repository
git clone https://github.com/kouroshSA/ppiGPLM.git
cd ppiGPLM

# Create a conda environment
conda create -n gpt python=3.10
conda activate gpt
pip install -r requirements.txt
```

## Repository Structure

```
ppiGPLM/
|-- model.py                          # GPT model definition
|-- train_.py                         # Training loop
|-- sample_fasta3.3_softmax_error_handling3e.py  # Batch inference script
|-- LES-wrapper.py                    # Learning Efficiency Score evaluation wrapper
|-- LES-wrapper.md                    # LES-wrapper documentation
|-- roc_analysis_color_threshold_F1e.py  # ROC curve analysis
|-- configurator.py                   # Configuration utility
|-- config/
|   |-- train_par_gpt2-s_scratch.py   # Training config (GPT-2 small, from scratch)
|   +-- finetune_label3.py            # Fine-tuning config
|-- data/
|   +-- MED4_char/                    # MED4 PPI dataset
|       |-- prepare.py                # Character-level tokenizer
|       +-- meta.pkl                  # Vocabulary (stoi/itos mappings)
|-- assets/
|   |-- ppiGPLM.png                  # ASCII workflow diagram
|   |-- tri_model_consensus.svg      # Tri-model consensus framework (SVG)
|   +-- tri_model_consensus.png      # Tri-model consensus framework (PNG)
|-- requirements.txt
|-- LICENSE
+-- README.md
```

## Usage

### Quick start: fetch the checkpoint from Hugging Face

The released MED4 checkpoint (`checkpoints/ppiGPLM_ckpt_7e.pt`, epoch ≈ 71)
lives on this Hugging Face repo. To pull it without cloning the GitHub
mirror:

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="kouroshSA/ppiGPLM",
    filename="checkpoints/ppiGPLM_ckpt_7e.pt",
)
meta_path = hf_hub_download(
    repo_id="kouroshSA/ppiGPLM",
    filename="data/MED4_char/meta.pkl",
)
```

`meta.pkl` carries the character vocabulary (`stoi`/`itos`) the inference
script needs to encode protein sequences.
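
As a rough sketch of how such a vocabulary file is typically consumed, the snippet below builds `encode`/`decode` helpers from the `stoi`/`itos` tables. The helper name `load_vocab` is hypothetical, the dictionary keys are assumed to follow the nanoGPT convention, and the toy alphabet is a stand-in so the example runs without the real file. Only unpickle files from sources you trust.

```python
import os
import pickle
import tempfile

def load_vocab(meta_file):
    """Return (encode, decode) built from the stoi/itos tables in meta.pkl."""
    with open(meta_file, "rb") as f:
        meta = pickle.load(f)
    stoi, itos = meta["stoi"], meta["itos"]  # key names assumed (nanoGPT style)
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)
    return encode, decode

# Toy stand-in for the real file so the sketch runs anywhere; with the
# download above you would pass meta_path instead.
chars = sorted(set("ACDEFGHIKLMNPQRSTVWY0123456789<>,psl"))
toy_meta = {
    "stoi": {c: i for i, c in enumerate(chars)},
    "itos": {i: c for i, c in enumerate(chars)},
}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "meta.pkl")
    with open(toy_path, "wb") as f:
        pickle.dump(toy_meta, f)
    encode, decode = load_vocab(toy_path)

ids = encode("<ps1>,MKV")
print(ids, decode(ids))
```

Because tokenization is character-level, a delimiter such as `<ps1>` is encoded as five separate tokens in this sketch.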

#### Wiring the checkpoint into the inference script

`sample_fasta3.3_softmax_error_handling3e.py` loads from
`<model_dir>/ckpt.pt`, where `model_dir = 'out'` is set inline near the
top of the script and the trailing `ckpt.pt` filename is **hardcoded**.
Two ways to make the downloaded file work:

**Option A. Place the file where the defaults expect it:**

```bash
mkdir -p out
cp /path/to/ppiGPLM_ckpt_7e.pt out/ckpt.pt   # note the required rename
```

then run the inference command below as-is.

**Option B. Override `model_dir` via the poor man's configurator
(`configurator.py`):**

```bash
mkdir -p my_ckpts
cp /path/to/ppiGPLM_ckpt_7e.pt my_ckpts/ckpt.pt   # still needs to be ckpt.pt
python sample_fasta3.3_softmax_error_handling3e.py \
    --input_file MED4-PPIs-low-confidence_ppiGPLM_prompts.csv \
    --output_dir ppi_results \
    --output_prefix my_predictions \
    --model_dir=my_ckpts
```

Either way, the on-disk filename must be `ckpt.pt`; editing it out of the
script is also possible (change the `model_dir` default near the top, or
the literal `'ckpt.pt'` further down) but the rename above is simpler.

The character vocabulary (`meta.pkl`) is read from
`data/<dataset>/meta.pkl`, where `<dataset>` comes from
`checkpoint['config']['dataset']` (`MED4_char` for this checkpoint). Make
sure that path exists: either keep the `data/MED4_char/` directory from
the GitHub clone, or copy the file at the downloaded `meta_path` there.

### Input file format

Each line of `--input_file` is one structured prompt (one protein pair),
not a free-form FASTA record. The schema is:

```
<ps1>,SEQ_A,<ps2>,SEQ_B,<l1>,LEN_A,<l2>,LEN_B,<l3>,<
```

- `<ps1>`, `<ps2>`: protein-sequence delimiter tokens
- `<l1>`, `<l2>`, `<l3>`: length-field delimiter tokens
- The trailing `,<` is **the cue**: it tells the model the next token to
  generate is the classification label (`1` = interacting, `0` = not).
  Don't omit it.

A ready-made example is shipped with the repo:
[`MED4-PPIs-low-confidence_ppiGPLM_prompts.csv`](MED4-PPIs-low-confidence_ppiGPLM_prompts.csv).
Inspect or copy its format when building your own input file.
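
A small helper for producing lines in this schema might look like the sketch below. The function name `build_prompt` and the example sequences are hypothetical, and it assumes `LEN_A`/`LEN_B` are the residue counts written as plain decimal integers; check the shipped example file before relying on that.

```python
def build_prompt(seq_a: str, seq_b: str) -> str:
    """Format one protein pair as a prompt line ending in the label cue."""
    seq_a, seq_b = seq_a.strip(), seq_b.strip()
    fields = [
        "<ps1>", seq_a,
        "<ps2>", seq_b,
        "<l1>", str(len(seq_a)),   # assumed: length = number of residues
        "<l2>", str(len(seq_b)),
        "<l3>", "<",               # trailing "<" is the label cue
    ]
    return ",".join(fields)

line = build_prompt("MKVLA", "MSTNPKPQR")
print(line)
# <ps1>,MKVLA,<ps2>,MSTNPKPQR,<l1>,5,<l2>,9,<l3>,<
```

Writing one such line per protein pair gives a file that matches the layout described above.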

### Batch Inference

Run inference on a file of prompts:

```bash
python sample_fasta3.3_softmax_error_handling3e.py \
    --input_file MED4-PPIs-low-confidence_ppiGPLM_prompts.csv \
    --output_dir ppi_results \
    --output_prefix my_predictions
```

This produces:
- `*_classifications.txt`: Full model output in FASTA-like format
- `*_probabilities.csv`: Per-pair probabilities for class 1 and class 0

#### About the inference script

`sample_fasta3.3_softmax_error_handling3e.py` is derived from Karpathy's
nanoGPT `sample.py`: it reuses the same `GPTConfig`/`GPT` classes from
`model.py`, the `init_from = 'resume'` checkpoint-loading idiom, and the
`_orig_mod.` prefix strip for `torch.compile`-wrapped state dicts. It is
**not** a drop-in copy, however. The modifications make it a batch
classifier rather than a generic sampler:

- batch input: one prompt per line read from `--input_file`, processed
  sequentially with no interactive loop;
- classifier-style output: per-prompt softmax probabilities of the next
  token being `"1"` vs `"0"`, written to `*_probabilities.csv` alongside
  the conventional `generate()` output dump in `*_classifications.txt`;
- robustness against real-world inputs: automatic block-size detection
  (`checkpoint['model_args']['block_size']` or
  `model.config.n_positions`), head-clipping when a prompt exceeds the
  context window so the trailing `<` label-cue token survives, and
  out-of-vocabulary character replacement (defaults to `A`).
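
The last bullet can be illustrated with a short sketch. It is not the script's actual code: `prepare_prompt`, the toy vocabulary, and the tiny `block_size` are hypothetical, chosen only to show why clipping from the head keeps the label cue intact.

```python
def prepare_prompt(prompt, stoi, block_size, oov_char="A"):
    """Replace unknown characters, then keep only the last block_size tokens."""
    cleaned = [c if c in stoi else oov_char for c in prompt]
    ids = [stoi[c] for c in cleaned]
    # Head-clipping: drop tokens from the front so the trailing "<" cue survives.
    return ids[-block_size:]

# Hypothetical toy vocabulary and a tiny context window for illustration.
stoi = {c: i for i, c in enumerate("<>,ps12l3AKMV0456789")}
itos = {i: c for c, i in stoi.items()}
prompt = "<ps1>,MKXV,<ps2>,MA,<l1>,4,<l2>,2,<l3>,<"   # "X" is out of vocabulary
ids = prepare_prompt(prompt, stoi, block_size=16)
print("".join(itos[i] for i in ids))
```

Clipping from the tail instead would remove the `<` cue, and the model would no longer be asked for a label.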

The lineage is **GPT-2 → nanoGPT → ppiGPLM's batch-classifier sampler**.

### Training

#### Prepare data

```bash
python data/MED4_char/prepare.py
```

This creates `train.bin`, `val.bin`, and `meta.pkl` from the input training data.

#### Train the model

```bash
# Single GPU
python train_.py config/train_par_gpt2-s_scratch.py

# Multi-GPU (2 GPUs)
torchrun --standalone --nproc_per_node=2 train_.py config/train_par_gpt2-s_scratch.py
```

### Learning Efficiency Score (LES) Evaluation

The LES-wrapper automates evaluation across multiple training checkpoints, computing ROC-AUC, F1, and optimal threshold at each checkpoint and deriving integrated Learning Efficiency Scores:

```bash
python LES-wrapper.py \
    --checkpoint_dir out \
    --prs_file PRS.txt \
    --rrs_file RRS.txt \
    --output_dir LES_results \
    --vanilla
```

See [LES-wrapper.md](LES-wrapper.md) for full documentation.
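
For orientation, the per-checkpoint metrics named above can be computed from two lists of scores. The sketch below is not taken from `LES-wrapper.py`. It assumes PRS scores are treated as positives and RRS scores as negatives, uses made-up toy scores, and does not attempt to reproduce the LES integration itself.

```python
def roc_auc(pos, neg):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_f1(pos, neg):
    """Scan every observed score as a threshold; return (best F1, threshold)."""
    best = (0.0, None)
    for t in sorted(set(pos + neg)):
        tp = sum(p >= t for p in pos)   # positives called interacting
        fp = sum(n >= t for n in neg)   # negatives called interacting
        fn = len(pos) - tp
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best[0]:
            best = (f1, t)
    return best

prs_scores = [0.91, 0.78, 0.66, 0.40]   # toy P(class 1) for positive pairs
rrs_scores = [0.55, 0.30, 0.22, 0.10]   # toy P(class 1) for negative pairs
auc = roc_auc(prs_scores, rrs_scores)
f1, threshold = best_f1(prs_scores, rrs_scores)
print(f"AUC={auc:.3f}  best F1={f1:.3f} at threshold {threshold}")
```

The pairwise AUC is quadratic in the number of pairs, which is fine for a sanity check but slow for large screens; a library routine is a better fit there.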

### Standalone ROC Analysis

```bash
python roc_analysis_color_threshold_F1e.py \
    --prs_file ppi_results/PRS_probabilities.csv \
    --rrs_file ppi_results/RRS_probabilities.csv
```

## Architecture Diagrams

The ASCII workflow diagram (`assets/ppiGPLM.png`) covers:
- **A.** Prompt-based input strategy (character-level tokenization)
- **B.** Model architecture (GPT-2 small, causal self-attention)
- **C.** Training pipeline
- **D.** Inference pipeline with LES evaluation

> Note: the diagram lists "Flash Attention". This path is taken automatically
> when running on PyTorch ≥ 2.0; older versions fall back to the manual
> scaled-dot-product implementation. Numerical results are equivalent.

See `assets/tri_model_consensus.svg` for the tri-model consensus framework with [ppiDCE](https://github.com/kouroshSA/ppiDCE) and [ppiBTEP](https://github.com/kouroshSA/ppiBTEP).

## Citation

If you use this software, please cite:

```
Daakour, S. et al. (2026).
```

This software is built on nanoGPT:

```
Karpathy, A. (2022). nanoGPT. https://github.com/karpathy/nanoGPT
```

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.

The original nanoGPT framework is by Andrej Karpathy (MIT License, 2022). Modifications and additions for protein-protein interaction prediction are by Kourosh Salehi-Ashtiani (MIT License, 2026).