ppiGPLM
A GPT-2 small protein language model trained from scratch on protein-pair prompts and used for binary protein-protein interaction (PPI) classification via next-token prediction. The implementation is based on nanoGPT by Andrej Karpathy, with character-level tokenization over amino acids.
Overview
ppiGPLM uses a GPT-2 small architecture (12 layers, 12 attention heads, 768 embedding dimensions) with character-level tokenization to predict whether two proteins interact. Rather than using a separate classification head, ppiGPLM frames PPI prediction as next-token prediction: given a structured prompt encoding a protein pair, the model predicts a binary label (0 or 1) as the next token. Softmax probabilities over the label tokens provide continuous interaction scores.
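The scoring step described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function names, the toy vocabulary, and the toy logits are hypothetical, and whether the released script reports full-vocabulary probabilities or renormalizes over the two label tokens is an assumption to check against the script itself.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_scores(next_token_logits, stoi):
    """Return (P('1'), P('0')) for the next token, read from the full-vocabulary softmax."""
    probs = softmax(next_token_logits)
    return probs[stoi["1"]], probs[stoi["0"]]

# Toy vocabulary and logits, for illustration only (not the MED4 vocabulary).
toy_stoi = {"0": 0, "1": 1, "A": 2}
p1, p0 = label_scores([0.0, 2.0, -1.0], toy_stoi)

# Optional renormalization over the two label tokens (an assumption, see above).
interaction_score = p1 / (p1 + p0)
```

Reading both label probabilities from one forward pass is what lets a generative model double as a binary classifier with a continuous score.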
The model was developed for the Prochlorococcus marinus MED4 interactome, where it serves as one component of a tri-model consensus framework for computational PPI screening.
Architecture
| Parameter | Value |
|---|---|
| Architecture | GPT-2 small |
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Context length | 4,096 tokens |
| Tokenization | Character-level (one token per amino acid) |
| Dropout | 0.2 |
| Optimizer | AdamW (lr = 5e-4, beta2 = 0.99) |
| Training iterations | 8,000 |
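To get a feel for the model size these settings imply, the parameter count of a GPT-2-style network can be estimated as below. This is a sketch under stated assumptions: it assumes bias terms are enabled and that the token embedding is tied to the output projection (as in nanoGPT's default model), and the function name is hypothetical. The vocabulary size is not stated in this README, so it is left as an argument.

```python
def gpt2_param_count(vocab_size, block_size, n_layer=12, n_head=12, n_embd=768):
    """Count trainable parameters of a GPT-2-style model with biases and tied embeddings."""
    assert n_embd % n_head == 0  # heads only split n_embd; they add no parameters
    token_emb = vocab_size * n_embd   # shared with the output projection
    pos_emb = block_size * n_embd
    layer_norm = 2 * n_embd           # weight + bias
    # Attention: fused QKV projection plus output projection.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expansion to 4 * n_embd and projection back.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attn + mlp
    return token_emb + pos_emb + n_layer * block + layer_norm  # final LayerNorm last
```

With the original GPT-2 settings (vocabulary 50,257, context 1,024) this formula gives 124,439,808 parameters. For ppiGPLM, the 12 transformer blocks contribute about 85 M parameters and the 4,096-position embedding about 3.1 M; the token embedding is comparatively small because the vocabulary is character-level.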
Installation
Prerequisites
- Python 3.8+
- CUDA-capable GPU (recommended) or CPU
- conda (recommended) or pip
Setup
# Clone the repository
git clone https://github.com/kouroshSA/ppiGPLM.git
cd ppiGPLM
# Create a conda environment
conda create -n gpt python=3.10
conda activate gpt
pip install -r requirements.txt
Repository Structure
ppiGPLM/
|-- model.py # GPT model definition
|-- train_.py # Training loop
|-- sample_fasta3.3_softmax_error_handling3e.py # Batch inference script
|-- LES-wrapper.py # Learning Efficiency Score evaluation wrapper
|-- LES-wrapper.md # LES-wrapper documentation
|-- roc_analysis_color_threshold_F1e.py # ROC curve analysis
|-- configurator.py # Configuration utility
|-- config/
| |-- train_par_gpt2-s_scratch.py # Training config (GPT-2 small, from scratch)
| +-- finetune_label3.py # Fine-tuning config
|-- data/
| +-- MED4_char/ # MED4 PPI dataset
| |-- prepare.py # Character-level tokenizer
| +-- meta.pkl # Vocabulary (stoi/itos mappings)
|-- assets/
| |-- ppiGPLM.png # ASCII workflow diagram
| |-- tri_model_consensus.svg # Tri-model consensus framework (SVG)
| +-- tri_model_consensus.png # Tri-model consensus framework (PNG)
|-- requirements.txt
|-- LICENSE
+-- README.md
Usage
Quick start: fetch the checkpoint from Hugging Face
The released MED4 checkpoint (checkpoints/ppiGPLM_ckpt_7e.pt, epoch ≈ 71)
lives on this Hugging Face repo. To pull it without cloning the GitHub
mirror:
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(
repo_id="kouroshSA/ppiGPLM",
filename="checkpoints/ppiGPLM_ckpt_7e.pt",
)
meta_path = hf_hub_download(
repo_id="kouroshSA/ppiGPLM",
filename="data/MED4_char/meta.pkl",
)
meta.pkl carries the character vocabulary (stoi/itos) the inference
script needs to encode protein sequences.
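Once meta.pkl is on disk, the vocabulary can be loaded and used as sketched below. This assumes the file is a pickled dictionary with stoi and itos keys, as in nanoGPT's character-level datasets; the helper names are hypothetical and the inference script may organize this differently.

```python
import pickle

def load_vocab(meta_path):
    """Load the stoi/itos mappings from a nanoGPT-style meta.pkl."""
    with open(meta_path, "rb") as f:
        meta = pickle.load(f)
    return meta["stoi"], meta["itos"]

def encode(text, stoi):
    """Map each character to its integer id (raises KeyError on unknown characters)."""
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    """Map integer ids back to a string."""
    return "".join(itos[i] for i in ids)
```

For example, stoi, itos = load_vocab(meta_path) followed by encode("MKT", stoi) yields one integer per residue, provided every character is in the vocabulary.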
Wiring the checkpoint into the inference script
sample_fasta3.3_softmax_error_handling3e.py loads from
<model_dir>/ckpt.pt, where model_dir = 'out' is set inline near the
top of the script and the trailing ckpt.pt filename is hardcoded.
Two ways to make the downloaded file work:
Option A. Place the file where the defaults expect it:
mkdir -p out
cp /path/to/ppiGPLM_ckpt_7e.pt out/ckpt.pt # note the required rename
then run the inference command below as-is.
Option B. Override model_dir via the poor-man's configurator
(configurator.py):
mkdir -p my_ckpts
cp /path/to/ppiGPLM_ckpt_7e.pt my_ckpts/ckpt.pt # still needs to be ckpt.pt
python sample_fasta3.3_softmax_error_handling3e.py \
--input_file MED4-PPIs-low-confidence_ppiGPLM_prompts.csv \
--output_dir ppi_results \
--output_prefix my_predictions \
--model_dir=my_ckpts
Either way, the on-disk filename must be ckpt.pt. Editing the script is
also possible (change the model_dir default near the top, or the literal
'ckpt.pt' further down), but the rename above is simpler.
The character vocabulary (meta.pkl) is read from
data/<dataset>/meta.pkl, where <dataset> comes from
checkpoint['config']['dataset'] (MED4_char for this checkpoint). Make
sure that path exists: either keep the data/MED4_char/ directory from
the GitHub clone, or copy the file downloaded to meta_path into it.
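The file placement described in this section can also be done from Python. The helper below is a hypothetical convenience, not part of the repository: it copies a downloaded checkpoint to <model_dir>/ckpt.pt and the vocabulary to data/<dataset>/meta.pkl, using the default locations stated above. The root argument is an added assumption so the layout can be created somewhere other than the current directory.

```python
import os
import shutil

def stage_for_inference(ckpt_src, meta_src, dataset="MED4_char",
                        model_dir="out", root="."):
    """Copy downloaded files to the paths the inference script is described as reading."""
    ckpt_dst = os.path.join(root, model_dir, "ckpt.pt")  # filename must be ckpt.pt
    meta_dst = os.path.join(root, "data", dataset, "meta.pkl")
    for src, dst in ((ckpt_src, ckpt_dst), (meta_src, meta_dst)):
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copyfile(src, dst)
    return ckpt_dst, meta_dst
```

Calling stage_for_inference(ckpt_path, meta_path) with the two paths returned by hf_hub_download reproduces Option A; passing model_dir="my_ckpts" matches Option B.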
Input file format
Each line of --input_file is one structured prompt (one protein pair),
not a free-form FASTA record. The schema is:
<ps1>,SEQ_A,<ps2>,SEQ_B,<l1>,LEN_A,<l2>,LEN_B,<l3>,<
- <ps1>, <ps2>: protein-sequence delimiter tokens
- <l1>, <l2>, <l3>: length-field delimiter tokens
- The trailing ,< is the cue: it tells the model the next token to generate is the classification label (1 = interacting, 0 = not). Don't omit it.
A ready-made example is shipped with the repo:
MED4-PPIs-low-confidence_ppiGPLM_prompts.csv.
Inspect or copy its format when building your own input file.
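A prompt line following the schema above could be assembled as sketched here. Two things are assumptions rather than facts from the repository: that LEN_A and LEN_B are the residue counts of the two sequences written as decimal integers, and the helper names themselves. Compare the output against the shipped example file before relying on it.

```python
def build_prompt(seq_a, seq_b):
    """Format one protein pair as a single prompt line ending in the label cue."""
    fields = [
        "<ps1>", seq_a,
        "<ps2>", seq_b,
        "<l1>", str(len(seq_a)),  # assumed: length of sequence A in residues
        "<l2>", str(len(seq_b)),  # assumed: length of sequence B in residues
        "<l3>", "<",              # trailing "<" cues the model to emit the label
    ]
    return ",".join(fields)

def write_prompts(pairs, path):
    """Write one prompt per line for an iterable of (seq_a, seq_b) tuples."""
    with open(path, "w") as f:
        for seq_a, seq_b in pairs:
            f.write(build_prompt(seq_a, seq_b) + "\n")
```

For instance, build_prompt("MKT", "AC") returns <ps1>,MKT,<ps2>,AC,<l1>,3,<l2>,2,<l3>,< as a single line.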
Batch Inference
Run inference on a file of prompts:
python sample_fasta3.3_softmax_error_handling3e.py \
--input_file MED4-PPIs-low-confidence_ppiGPLM_prompts.csv \
--output_dir ppi_results \
--output_prefix my_predictions
This produces:
- *_classifications.txt: Full model output in FASTA-like format
- *_probabilities.csv: Per-pair probabilities for class 1 and class 0
About the inference script
sample_fasta3.3_softmax_error_handling3e.py is derived from Karpathy's
nanoGPT sample.py: it reuses the same GPTConfig/GPT classes from
model.py, the init_from = 'resume' checkpoint-loading idiom, and the
_orig_mod. prefix strip for torch.compile-wrapped state dicts. It is
not a drop-in copy, however. The modifications make it a batch
classifier rather than a generic sampler:
- batch input: one prompt per line read from --input_file, processed sequentially with no interactive loop;
- classifier-style output: per-prompt softmax probabilities of the next token being "1" vs "0", written to *_probabilities.csv alongside the conventional generate() output dump in *_classifications.txt;
- robustness against real-world inputs: automatic block-size detection (checkpoint['model_args']['block_size'] or model.config.n_positions), head-clipping when a prompt exceeds the context window so the trailing < label-cue token survives, and out-of-vocabulary character replacement (defaults to A).
The lineage is GPT-2 → nanoGPT → ppiGPLM's batch-classifier sampler.
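Two of the robustness measures listed above, out-of-vocabulary replacement and head-clipping, can be illustrated with the short sketch below. The function names are hypothetical and the script's actual implementation may differ; the sketch only shows why clipping from the head keeps the trailing label cue intact.

```python
def encode_with_fallback(text, stoi, fallback="A"):
    """Encode characters, replacing any out-of-vocabulary character with the fallback."""
    fallback_id = stoi[fallback]
    return [stoi.get(ch, fallback_id) for ch in text]

def clip_head(token_ids, block_size):
    """Drop tokens from the start so at most block_size remain; the tail is preserved."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if len(token_ids) <= block_size:
        return list(token_ids)
    return list(token_ids[-block_size:])
```

Because the prompt ends with the < cue, trimming from the front loses part of the first sequence but still leaves the model positioned to predict the label as its next token.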
Training
Prepare data
python data/MED4_char/prepare.py
This creates train.bin, val.bin, and meta.pkl from the input training data.
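For orientation, a character-level preparation step in the nanoGPT style looks roughly like the sketch below. It is not the repository's prepare.py: the function name, the split at prompt-line boundaries, and the default validation fraction are assumptions. nanoGPT's own prepare scripts write NumPy uint16 arrays; the standard-library array module is swapped in here to keep the sketch dependency-free.

```python
import os
import pickle
from array import array

def prepare_char_dataset(lines, out_dir, val_fraction=0.1):
    """Build a character vocabulary and write train.bin, val.bin, and meta.pkl."""
    chars = sorted(set("".join(lines)))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    # Split on whole lines so no prompt is cut in half (an assumption).
    split = int(len(lines) * (1 - val_fraction))
    parts = (("train.bin", lines[:split]), ("val.bin", lines[split:]))
    for name, chunk in parts:
        ids = array("H", (stoi[ch] for ch in "".join(chunk)))  # unsigned 16-bit ids
        with open(os.path.join(out_dir, name), "wb") as f:
            ids.tofile(f)
    with open(os.path.join(out_dir, "meta.pkl"), "wb") as f:
        pickle.dump({"vocab_size": len(chars), "stoi": stoi, "itos": itos}, f)
    return len(chars)
```

The returned vocabulary size is the value the model's embedding table must match.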
Train the model
# Single GPU
python train_.py config/train_par_gpt2-s_scratch.py
# Multi-GPU (2 GPUs)
torchrun --standalone --nproc_per_node=2 train_.py config/train_par_gpt2-s_scratch.py
Learning Efficiency Score (LES) Evaluation
The LES-wrapper automates evaluation across multiple training checkpoints, computing ROC-AUC, F1, and optimal threshold at each checkpoint and deriving integrated Learning Efficiency Scores:
python LES-wrapper.py \
--checkpoint_dir out \
--prs_file PRS.txt \
--rrs_file RRS.txt \
--output_dir LES_results \
--vanilla
See LES-wrapper.md for full documentation.
Standalone ROC Analysis
python roc_analysis_color_threshold_F1e.py \
--prs_file ppi_results/PRS_probabilities.csv \
--rrs_file ppi_results/RRS_probabilities.csv
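The metrics named in this section can be computed from two lists of class-1 probabilities as sketched below. The sketch assumes that PRS scores come from known interacting pairs and RRS scores from reference non-interacting pairs; that reading of the file names, the function names, and the pairwise AUC formulation are assumptions, and the repository scripts may compute these values differently. The Learning Efficiency Score itself is not reproduced here.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outranks a random negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))

def best_f1_threshold(pos_scores, neg_scores):
    """Return (best F1, threshold), calling a pair positive when score >= threshold."""
    best_f1, best_t = 0.0, None
    for t in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= t for s in pos_scores)
        fp = sum(s >= t for s in neg_scores)
        fn = len(pos_scores) - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_f1, best_t
```

The pairwise AUC is quadratic in the number of pairs, which is fine for a quick check; a rank-based implementation is preferable for large reference sets.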
Architecture Diagrams
The ASCII workflow diagram (assets/ppiGPLM.png) covers:
- A. Prompt-based input strategy (character-level tokenization)
- B. Model architecture (GPT-2 small, causal self-attention)
- C. Training pipeline
- D. Inference pipeline with LES evaluation
Note: the diagram lists "Flash Attention". This path is taken automatically when running on PyTorch ≥ 2.0; older versions fall back to the manual scaled-dot-product implementation. Numerical results are equivalent.
See assets/tri_model_consensus.svg for the tri-model consensus framework with ppiDCE and ppiBTEP.
Citation
If you use this software, please cite:
Daakour, S. et al. (2026).
This software is built on nanoGPT:
Karpathy, A. (2022). nanoGPT. https://github.com/karpathy/nanoGPT
License
This project is licensed under the MIT License. See LICENSE for details.
The original nanoGPT framework is by Andrej Karpathy (MIT License, 2022). Modifications and additions for protein-protein interaction prediction are by Kourosh Salehi-Ashtiani (MIT License, 2026).
