---
license: mit
library_name: pytorch
tags:
  - protein-protein-interaction
  - ppi
  - protein-language-model
  - gpt-2
  - nanogpt
  - character-level
  - trained-from-scratch
  - bioinformatics
  - biology
pipeline_tag: text-generation
---

# ppiGPLM

A GPT-2 small protein language model trained from scratch on protein-pair prompts and used for binary protein-protein interaction (PPI) classification via next-token prediction. The implementation is based on [nanoGPT](https://github.com/karpathy/nanoGPT) by Andrej Karpathy, with character-level tokenization over amino acids.

![ppiGPLM](assets/ppiGPLM.png)

## Overview

ppiGPLM uses a GPT-2 small architecture (12 layers, 12 attention heads, 768-dimensional embeddings) with character-level tokenization to predict whether two proteins interact. Rather than using a separate classification head, ppiGPLM frames PPI prediction as next-token prediction: given a structured prompt encoding a protein pair, the model predicts a binary label (`0` or `1`) as the next token. The softmax probabilities of the two label tokens serve as continuous interaction scores.
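
The scoring step can be sketched in plain Python. This is a minimal illustration, assuming the softmax is taken over the full vocabulary and the probability of the `1` token is read off as the interaction score; the toy vocabulary and logit values are made up, and the repository's inference script may compute the score differently.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interaction_score(next_token_logits, stoi):
    """Return (P("1"), P("0")) for the token that follows the prompt."""
    probs = softmax(next_token_logits)
    return probs[stoi["1"]], probs[stoi["0"]]

# Hypothetical 4-symbol vocabulary and logits, for illustration only.
toy_stoi = {"0": 0, "1": 1, "A": 2, "<": 3}
toy_logits = [1.0, 3.0, -2.0, -1.0]

p1, p0 = interaction_score(toy_logits, toy_stoi)
predicted_label = 1 if p1 >= p0 else 0
```

Because the two probabilities come from a softmax over the whole vocabulary, they need not sum to 1; renormalizing over the two label tokens is an alternative design.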

The model was developed for the *Prochlorococcus marinus* MED4 interactome, where it serves as one component of a tri-model consensus framework for computational PPI screening.

## Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | GPT-2 small |
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Context length | 4,096 tokens |
| Tokenization | Character-level (one token per amino acid) |
| Dropout | 0.2 |
| Optimizer | AdamW (lr = 5e-4, beta2 = 0.99) |
| Training iterations | 8,000 |
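
For orientation, the table maps onto nanoGPT-style configuration variables roughly as shown below. The variable names follow upstream nanoGPT's `train.py`; whether `config/train_par_gpt2-s_scratch.py` sets exactly these names and values is an assumption, so treat the file in the repository as authoritative.

```python
# Sketch of a nanoGPT-style config matching the table above (assumed layout).
n_layer = 12          # transformer blocks
n_head = 12           # attention heads per block
n_embd = 768          # embedding dimension
block_size = 4096     # context length in tokens
dropout = 0.2
learning_rate = 5e-4  # AdamW learning rate
beta2 = 0.99          # AdamW second-moment decay
max_iters = 8000      # training iterations

# Each attention head works on an equal slice of the embedding.
head_dim = n_embd // n_head
```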

## Installation

### Prerequisites

- Python 3.8+
- CUDA-capable GPU (recommended) or CPU
- conda (recommended) or pip

### Setup

```bash
# Clone the repository
git clone https://github.com/kouroshSA/ppiGPLM.git
cd ppiGPLM

# Create a conda environment
conda create -n gpt python=3.10
conda activate gpt
pip install -r requirements.txt
```

## Repository Structure

```
ppiGPLM/
|-- model.py                          # GPT model definition
|-- train_.py                         # Training loop
|-- sample_fasta3.3_softmax_error_handling3e.py  # Batch inference script
|-- LES-wrapper.py                    # Learning Efficiency Score evaluation wrapper
|-- LES-wrapper.md                    # LES-wrapper documentation
|-- roc_analysis_color_threshold_F1e.py  # ROC curve analysis
|-- configurator.py                   # Configuration utility
|-- config/
|   |-- train_par_gpt2-s_scratch.py   # Training config (GPT-2 small, from scratch)
|   +-- finetune_label3.py            # Fine-tuning config
|-- data/
|   +-- MED4_char/                    # MED4 PPI dataset
|       |-- prepare.py                # Character-level tokenizer
|       +-- meta.pkl                  # Vocabulary (stoi/itos mappings)
|-- assets/
|   |-- ppiGPLM.png                  # ASCII workflow diagram
|   |-- tri_model_consensus.svg      # Tri-model consensus framework (SVG)
|   +-- tri_model_consensus.png      # Tri-model consensus framework (PNG)
|-- requirements.txt
|-- LICENSE
+-- README.md
```

## Usage

### Quick start: fetch the checkpoint from Hugging Face

The released MED4 checkpoint (`checkpoints/ppiGPLM_ckpt_7e.pt`, epoch ≈ 71)
is hosted in this Hugging Face repository. To download it without cloning the
GitHub mirror:

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="kouroshSA/ppiGPLM",
    filename="checkpoints/ppiGPLM_ckpt_7e.pt",
)
meta_path = hf_hub_download(
    repo_id="kouroshSA/ppiGPLM",
    filename="data/MED4_char/meta.pkl",
)
```

`meta.pkl` carries the character vocabulary (`stoi`/`itos`) the inference
script needs to encode protein sequences.
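
As a rough illustration of how such a vocabulary is used, the sketch below loads `stoi`/`itos` from a nanoGPT-style `meta.pkl` and encodes a sequence. The helper names (`load_vocab`, `encode`, `decode`) and the toy vocabulary are hypothetical, and the assumption that the file is a pickled dict with `stoi` and `itos` keys comes from upstream nanoGPT's character-level example rather than from this repository.

```python
import os
import pickle
import tempfile

def load_vocab(meta_file):
    """Load the stoi/itos mappings from a nanoGPT-style meta.pkl."""
    with open(meta_file, "rb") as f:
        meta = pickle.load(f)
    return meta["stoi"], meta["itos"]

def encode(text, stoi):
    """Map each character to its integer token id."""
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    """Map token ids back to a string."""
    return "".join(itos[i] for i in ids)

# Demo with a toy vocabulary written to a temporary meta.pkl.
chars = sorted(set("ACDEFGHIKLMNPQRSTVWY"))
toy_meta = {
    "stoi": {c: i for i, c in enumerate(chars)},
    "itos": {i: c for i, c in enumerate(chars)},
}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "meta.pkl")
    with open(toy_path, "wb") as f:
        pickle.dump(toy_meta, f)
    stoi, itos = load_vocab(toy_path)

ids = encode("MKV", stoi)
```

With the real file, `load_vocab(meta_path)` would replace the toy demo. Pickle files can execute code on load, so only open `meta.pkl` from a source you trust.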

#### Wiring the checkpoint into the inference script

`sample_fasta3.3_softmax_error_handling3e.py` loads from
`<model_dir>/ckpt.pt`, where `model_dir = 'out'` is set inline near the
top of the script and the trailing `ckpt.pt` filename is **hardcoded**.
Two ways to make the downloaded file work:

**Option A — place the file where the defaults expect it:**

```bash
mkdir -p out
cp /path/to/ppiGPLM_ckpt_7e.pt out/ckpt.pt   # note the required rename
```

then run the inference command below as-is.

**Option B — override `model_dir` via the poor-man's configurator
(`configurator.py`):**

```bash
mkdir -p my_ckpts
cp /path/to/ppiGPLM_ckpt_7e.pt my_ckpts/ckpt.pt   # still needs to be ckpt.pt
python sample_fasta3.3_softmax_error_handling3e.py \
    --input_file MED4-PPIs-low-confidence_ppiGPLM_prompts.csv \
    --output_dir ppi_results \
    --output_prefix my_predictions \
    --model_dir=my_ckpts
```

Either way, the on-disk filename must be `ckpt.pt`. You can also edit the
script instead (change the `model_dir` default near the top, or the literal
`'ckpt.pt'` further down), but renaming the file is simpler.

The character vocabulary (`meta.pkl`) is read from
`data/<dataset>/meta.pkl`, where `<dataset>` comes from
`checkpoint['config']['dataset']` (`MED4_char` for this checkpoint). Make
sure that path exists — either keep the `data/MED4_char/` directory from
the GitHub clone, or place the downloaded `meta_path` there.
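
The file placement described in this section can also be scripted. The helper below is a hypothetical convenience, not part of the repository: `stage_files` and its `root` parameter are illustrative names, and the destination paths simply mirror the layout stated above (`<model_dir>/ckpt.pt` and `data/<dataset>/meta.pkl`).

```python
import os
import shutil

def stage_files(ckpt_src, meta_src, model_dir="out", dataset="MED4_char", root="."):
    """Copy the checkpoint and vocabulary to the paths the script reads."""
    ckpt_dst = os.path.join(root, model_dir, "ckpt.pt")       # required rename
    meta_dst = os.path.join(root, "data", dataset, "meta.pkl")
    for src, dst in ((ckpt_src, ckpt_dst), (meta_src, meta_dst)):
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copyfile(src, dst)
    return ckpt_dst, meta_dst
```

Run from the repository root, `stage_files(ckpt_path, meta_path)` would place both downloaded files where Option A expects them.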

### Input file format

Each line of `--input_file` is one structured prompt (one protein pair),
not a free-form FASTA record. The schema is:

```
<ps1>,SEQ_A,<ps2>,SEQ_B,<l1>,LEN_A,<l2>,LEN_B,<l3>,<
```

- `<ps1>`, `<ps2>`: protein-sequence delimiter tokens
- `<l1>`, `<l2>`, `<l3>`: length-field delimiter tokens
- The trailing `,<` is **the cue**: it tells the model the next token to
  generate is the classification label (`1` = interacting, `0` = not).
  Don't omit it.

A ready-made example is shipped with the repo:
[`MED4-PPIs-low-confidence_ppiGPLM_prompts.csv`](MED4-PPIs-low-confidence_ppiGPLM_prompts.csv)
— inspect or copy its format when building your own input file.
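
A prompt line in this schema could be assembled as in the sketch below. `build_prompt` is a hypothetical helper, and it assumes `LEN_A` and `LEN_B` are the residue counts written as plain decimal integers; check the shipped example file before relying on that.

```python
def build_prompt(seq_a, seq_b):
    """Assemble one prompt line for a protein pair, ending in the ',<' cue."""
    fields = [
        "<ps1>", seq_a,
        "<ps2>", seq_b,
        "<l1>", str(len(seq_a)),   # assumed: length = number of residues
        "<l2>", str(len(seq_b)),
        "<l3>", "<",               # trailing '<' cues the label token
    ]
    return ",".join(fields)

example = build_prompt("MKV", "ACDEF")
```

Writing one such string per line produces a file in the format `--input_file` expects.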

### Batch Inference

Run inference on a file of prompts:

```bash
python sample_fasta3.3_softmax_error_handling3e.py \
    --input_file MED4-PPIs-low-confidence_ppiGPLM_prompts.csv \
    --output_dir ppi_results \
    --output_prefix my_predictions
```

This produces:
- `*_classifications.txt`: Full model output in FASTA-like format
- `*_probabilities.csv`: Per-pair probabilities for class 1 and class 0

#### About the inference script

`sample_fasta3.3_softmax_error_handling3e.py` is derived from Karpathy's
nanoGPT `sample.py` — it reuses the same `GPTConfig`/`GPT` classes from
`model.py`, the `init_from = 'resume'` checkpoint-loading idiom, and the
`_orig_mod.` prefix strip for `torch.compile`-wrapped state dicts. It is
**not** a drop-in copy, however. The modifications make it a batch
classifier rather than a generic sampler:

- batch input: one prompt per line read from `--input_file`, processed
  sequentially with no interactive loop;
- classifier-style output: per-prompt softmax probabilities of the next
  token being `"1"` vs `"0"`, written to `*_probabilities.csv` alongside
  the conventional `generate()` output dump in `*_classifications.txt`;
- robustness against real-world inputs: automatic block-size detection
  (`checkpoint['model_args']['block_size']` or
  `model.config.n_positions`), head-clipping when a prompt exceeds the
  context window so the trailing `<` label-cue token survives, and
  out-of-vocabulary character replacement (defaults to `A`).
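
The last two safeguards can be illustrated with a short sketch. The function names and the toy vocabulary are hypothetical, and clipping to exactly `block_size` tokens is an assumption; the script's own logic may differ in detail.

```python
def sanitize(prompt, stoi, fallback="A"):
    """Replace characters missing from the vocabulary with `fallback`."""
    return "".join(ch if ch in stoi else fallback for ch in prompt)

def clip_head(token_ids, block_size):
    """Drop tokens from the front so that at most `block_size` remain."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if len(token_ids) <= block_size:
        return list(token_ids)
    return list(token_ids[-block_size:])

# Toy vocabulary for illustration only.
toy_stoi = {ch: i for i, ch in enumerate("AKM,<")}

clean = sanitize("MKX,<", toy_stoi)           # 'X' is not in the toy vocabulary
ids = [toy_stoi[ch] for ch in clean]
clipped = clip_head(ids, block_size=3)        # keeps the last three tokens
```

Clipping from the front rather than the back is what preserves the final `<` cue, at the cost of losing the start of the first sequence.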

The lineage is **GPT-2 → nanoGPT → ppiGPLM's batch-classifier sampler**.

### Training

#### Prepare data

```bash
python data/MED4_char/prepare.py
```

This creates `train.bin`, `val.bin`, and `meta.pkl` from the input training data.

#### Train the model

```bash
# Single GPU
python train_.py config/train_par_gpt2-s_scratch.py

# Multi-GPU (2 GPUs)
torchrun --standalone --nproc_per_node=2 train_.py config/train_par_gpt2-s_scratch.py
```

### Learning Efficiency Score (LES) Evaluation

`LES-wrapper.py` automates evaluation across multiple training checkpoints: it computes ROC-AUC, F1, and the optimal threshold at each checkpoint and derives integrated Learning Efficiency Scores:

```bash
python LES-wrapper.py \
    --checkpoint_dir out \
    --prs_file PRS.txt \
    --rrs_file RRS.txt \
    --output_dir LES_results \
    --vanilla
```

See [LES-wrapper.md](LES-wrapper.md) for full documentation.

### Standalone ROC Analysis

```bash
python roc_analysis_color_threshold_F1e.py \
    --prs_file ppi_results/PRS_probabilities.csv \
    --rrs_file ppi_results/RRS_probabilities.csv
```
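
For readers who want to see what these metrics involve, the sketch below computes ROC-AUC and an F1-maximizing threshold from two lists of scores. It is an illustration only: it assumes the PRS scores belong to positive pairs and the RRS scores to presumed negatives, it works on plain Python lists rather than the CSV files (whose column layout is not described here), and choosing the threshold by maximum F1 is an assumption about what "optimal" means.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the chance a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def best_f1_threshold(pos_scores, neg_scores):
    """Scan observed scores as thresholds; return (threshold, F1) with top F1."""
    best = (None, -1.0)
    for t in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= t for s in pos_scores)
        fp = sum(s >= t for s in neg_scores)
        fn = len(pos_scores) - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best[1]:
            best = (t, f1)
    return best

# Made-up scores for illustration.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.1]
auc = roc_auc(pos, neg)
threshold, f1 = best_f1_threshold(pos, neg)
```

The pairwise AUC loop is quadratic in the number of scores, which is fine for small reference sets but slow for large ones.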

## Architecture Diagrams

The ASCII workflow diagram (`assets/ppiGPLM.png`) covers:
- **A.** Prompt-based input strategy (character-level tokenization)
- **B.** Model architecture (GPT-2 small, causal self-attention)
- **C.** Training pipeline
- **D.** Inference pipeline with LES evaluation

> Note: the diagram lists "Flash Attention". This path is taken automatically
> when running on PyTorch ≥ 2.0; older versions fall back to the manual
> scaled-dot-product implementation. Both paths compute the same attention,
> so results agree up to floating-point rounding.

See `assets/tri_model_consensus.svg` for the tri-model consensus framework with [ppiDCE](https://github.com/kouroshSA/ppiDCE) and [ppiBTEP](https://github.com/kouroshSA/ppiBTEP).

## Citation

If you use this software, please cite:

```
Daakour, S. et al. (2026).
```

This software is built on nanoGPT:

```
Karpathy, A. (2022). nanoGPT. https://github.com/karpathy/nanoGPT
```

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.

The original nanoGPT framework is by Andrej Karpathy (MIT License, 2022). Modifications and additions for protein-protein interaction prediction are by Kourosh Salehi-Ashtiani (MIT License, 2026).