license: mit
library_name: pytorch
tags:
  - protein-protein-interaction
  - ppi
  - protein-language-model
  - gpt-2
  - nanogpt
  - character-level
  - trained-from-scratch
  - bioinformatics
  - biology
pipeline_tag: text-generation

ppiGPLM

A GPT-2 small protein language model trained from scratch on protein-pair prompts and used for binary protein-protein interaction (PPI) classification via next-token prediction. The implementation is based on nanoGPT by Andrej Karpathy, with character-level tokenization over amino acids.

Overview

ppiGPLM uses a GPT-2 small architecture (12 layers, 12 attention heads, embedding dimension 768) with character-level tokenization to predict whether two proteins interact. Rather than using a separate classification head, ppiGPLM frames PPI prediction as next-token prediction: given a structured prompt encoding a protein pair, the model predicts a binary label (0 or 1) as the next token. The softmax probabilities of the two label tokens serve as continuous interaction scores.
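To make the scoring idea concrete, here is a minimal sketch of turning next-token logits into label probabilities. The toy vocabulary, the logit values, and the function name `label_probabilities` are illustrative assumptions, not taken from the repository; the optional renormalisation over the two labels is likewise only one possible convention.

```python
import math

def label_probabilities(logits, stoi):
    """Softmax over the full vocabulary, then read off the two label tokens."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs[stoi["1"]], probs[stoi["0"]]

# Toy vocabulary and next-token logits (illustrative values only).
toy_stoi = {"0": 0, "1": 1, "A": 2, "<": 3}
toy_logits = [1.0, 3.0, -2.0, -2.0]

p1, p0 = label_probabilities(toy_logits, toy_stoi)
score = p1 / (p1 + p0)   # optional: renormalise over the two labels only
print(f"p(1)={p1:.3f}  p(0)={p0:.3f}  score={score:.3f}")
```

With a well-trained model almost all probability mass should sit on the two label tokens, in which case the raw p(1) and the renormalised score nearly coincide.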

The model was developed for the Prochlorococcus marinus MED4 interactome, where it serves as one component of a tri-model consensus framework for computational PPI screening.

Architecture

Parameter	Value
Architecture	GPT-2 small
Layers	12
Attention heads	12
Embedding dimension	768
Context length	4,096 tokens
Tokenization	Character-level (one token per amino acid)
Dropout	0.2
Optimizer	AdamW (lr = 5e-4, beta2 = 0.99)
Training iterations	8,000
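As a rough sanity check on the table, the sketch below collects the listed hyperparameters in a small dataclass and counts the weight-matrix parameters of the transformer blocks. `ModelConfigSketch` is a hypothetical stand-in, not the repository's GPTConfig, and the 4x MLP hidden width is the usual GPT-2 convention assumed here rather than something the table states. Token embeddings are left out because the vocabulary size is not given.

```python
from dataclasses import dataclass

@dataclass
class ModelConfigSketch:          # hypothetical name, not the repo's GPTConfig
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    block_size: int = 4096        # context length in tokens
    dropout: float = 0.2

def block_weight_count(cfg):
    """Weight-matrix parameters in the blocks (biases and LayerNorm excluded)."""
    d = cfg.n_embd
    attn = d * 3 * d + d * d      # fused QKV projection + output projection
    mlp = d * 4 * d + 4 * d * d   # up- and down-projection, assuming 4x hidden width
    return cfg.n_layer * (attn + mlp)

cfg = ModelConfigSketch()
assert cfg.n_embd % cfg.n_head == 0           # heads must divide the embedding evenly
head_dim = cfg.n_embd // cfg.n_head           # 64 dimensions per head
pos_emb = cfg.block_size * cfg.n_embd         # learned position-embedding entries
print(head_dim, block_weight_count(cfg), pos_emb)
```

Under these assumptions the blocks hold about 85 million weights, and the 4,096-token position table adds roughly 3 million more.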

Installation

Prerequisites

Python 3.8+
CUDA-capable GPU (recommended) or CPU
conda (recommended) or pip

Setup

# Clone the repository
git clone https://github.com/kouroshSA/ppiGPLM.git
cd ppiGPLM

# Create a conda environment
conda create -n gpt python=3.10
conda activate gpt
pip install -r requirements.txt

Repository Structure

ppiGPLM/
|-- model.py                          # GPT model definition
|-- train_.py                         # Training loop
|-- sample_fasta3.3_softmax_error_handling3e.py  # Batch inference script
|-- LES-wrapper.py                    # Learning Efficiency Score evaluation wrapper
|-- LES-wrapper.md                    # LES-wrapper documentation
|-- roc_analysis_color_threshold_F1e.py  # ROC curve analysis
|-- configurator.py                   # Configuration utility
|-- config/
|   |-- train_par_gpt2-s_scratch.py   # Training config (GPT-2 small, from scratch)
|   +-- finetune_label3.py            # Fine-tuning config
|-- data/
|   +-- MED4_char/                    # MED4 PPI dataset
|       |-- prepare.py                # Character-level tokenizer
|       +-- meta.pkl                  # Vocabulary (stoi/itos mappings)
|-- assets/
|   |-- ppiGPLM.png                  # ASCII workflow diagram
|   |-- tri_model_consensus.svg      # Tri-model consensus framework (SVG)
|   +-- tri_model_consensus.png      # Tri-model consensus framework (PNG)
|-- requirements.txt
|-- LICENSE
+-- README.md

Usage

Quick start: fetch the checkpoint from Hugging Face

The released MED4 checkpoint (checkpoints/ppiGPLM_ckpt_7e.pt, epoch ≈ 71) is hosted in this Hugging Face repository. To download it without cloning the GitHub mirror:

from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="kouroshSA/ppiGPLM",
    filename="checkpoints/ppiGPLM_ckpt_7e.pt",
)
meta_path = hf_hub_download(
    repo_id="kouroshSA/ppiGPLM",
    filename="data/MED4_char/meta.pkl",
)

meta.pkl carries the character vocabulary (stoi/itos) the inference script needs to encode protein sequences.
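A minimal sketch of how such a vocabulary file can be used, assuming it is a pickled dict with `stoi` and `itos` entries as described above. The toy character set and the helper name `load_codec` are illustrative; the example writes its own small meta.pkl to a scratch directory so that it runs without the real file.

```python
import os
import pickle
import tempfile

# Build a toy meta.pkl (illustrative character set, not the real vocabulary).
toy_chars = sorted(set("ACDEFGHIKLMNPQRSTVWY<>,0123456789lps"))
toy_meta = {
    "stoi": {ch: i for i, ch in enumerate(toy_chars)},
    "itos": {i: ch for i, ch in enumerate(toy_chars)},
}
toy_path = os.path.join(tempfile.mkdtemp(), "meta.pkl")
with open(toy_path, "wb") as f:
    pickle.dump(toy_meta, f)

def load_codec(meta_path):
    """Return (encode, decode) functions built from a pickled stoi/itos pair."""
    with open(meta_path, "rb") as f:
        meta = pickle.load(f)
    stoi, itos = meta["stoi"], meta["itos"]

    def encode(text):
        return [stoi[ch] for ch in text]      # raises KeyError on unknown characters

    def decode(ids):
        return "".join(itos[i] for i in ids)

    return encode, decode

encode, decode = load_codec(toy_path)
ids = encode("<ps1>,MKV")
print(ids, decode(ids))
```

To try it on the real vocabulary, pass the `meta_path` returned by `hf_hub_download` to `load_codec` instead of `toy_path`.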

Wiring the checkpoint into the inference script

sample_fasta3.3_softmax_error_handling3e.py loads from <model_dir>/ckpt.pt, where model_dir = 'out' is set inline near the top of the script and the trailing ckpt.pt filename is hardcoded. Two ways to make the downloaded file work:

Option A — place the file where the defaults expect it:

mkdir -p out
cp /path/to/ppiGPLM_ckpt_7e.pt out/ckpt.pt   # note the required rename

then run the inference command below as-is.

Option B — override model_dir via the poor-man's configurator (configurator.py):

mkdir -p my_ckpts
cp /path/to/ppiGPLM_ckpt_7e.pt my_ckpts/ckpt.pt   # still needs to be ckpt.pt
python sample_fasta3.3_softmax_error_handling3e.py \
    --input_file MED4-PPIs-low-confidence_ppiGPLM_prompts.csv \
    --output_dir ppi_results \
    --output_prefix my_predictions \
    --model_dir=my_ckpts

Either way, the on-disk filename must be ckpt.pt. You can instead edit the script (change the model_dir default near the top, or the literal 'ckpt.pt' further down), but the rename above is simpler.
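If you prefer to do the copy-and-rename from Python, for instance right after `hf_hub_download`, something along these lines should work. `stage_checkpoint` is a hypothetical helper written for this README, not part of the repository, and the demonstration uses a placeholder file rather than a real checkpoint.

```python
import os
import shutil
import tempfile

def stage_checkpoint(src_path, model_dir="out"):
    """Copy a downloaded checkpoint to <model_dir>/ckpt.pt and return the new path."""
    os.makedirs(model_dir, exist_ok=True)
    dst_path = os.path.join(model_dir, "ckpt.pt")   # filename the script expects
    shutil.copyfile(src_path, dst_path)             # follows symlinks, copies contents
    return dst_path

# Demonstration with a placeholder file in a scratch directory.
work = tempfile.mkdtemp()
fake_ckpt = os.path.join(work, "ppiGPLM_ckpt_7e.pt")
with open(fake_ckpt, "wb") as f:
    f.write(b"placeholder")

staged = stage_checkpoint(fake_ckpt, os.path.join(work, "my_ckpts"))
print(staged)
```

In real use, call `stage_checkpoint(ckpt_path)` for Option A, or `stage_checkpoint(ckpt_path, "my_ckpts")` together with `--model_dir=my_ckpts` for Option B.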

The character vocabulary (meta.pkl) is read from data/<dataset>/meta.pkl, where <dataset> comes from checkpoint['config']['dataset'] (MED4_char for this checkpoint). Make sure that path exists: either keep the data/MED4_char/ directory from the GitHub clone, or copy the file downloaded to meta_path into that directory.
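The path lookup described above can be checked before running inference. The sketch below assumes only the `checkpoint['config']['dataset']` layout mentioned in the text; `expected_meta_path` is a hypothetical helper, and a plain dict stands in for the loaded checkpoint so that the example runs without PyTorch.

```python
import os

def expected_meta_path(checkpoint, data_root="data"):
    """Where the vocabulary is expected, given the dataset name stored in the checkpoint."""
    dataset = checkpoint["config"]["dataset"]
    return os.path.join(data_root, dataset, "meta.pkl")

# Stand-in for the dict that loading the real checkpoint would return.
toy_checkpoint = {"config": {"dataset": "MED4_char"}}

path = expected_meta_path(toy_checkpoint)
status = "found" if os.path.exists(path) else "missing"
print(f"{path}: {status}")
```

If the file is reported missing, create the directory and copy the downloaded meta.pkl into it before running the inference script.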

Input file format

Each line of --input_file is one structured prompt (one protein pair), not a free-form FASTA record. The schema is:

<ps1>,SEQ_A,<ps2>,SEQ_B,<l1>,LEN_A,<l2>,LEN_B,<l3>,<

<ps1>, <ps2>: protein-sequence delimiter tokens
<l1>, <l2>, <l3>: length-field delimiter tokens
The trailing ,< is the cue: it tells the model the next token to generate is the classification label (1 = interacting, 0 = not). Don't omit it.

A ready-made example is shipped with the repo: MED4-PPIs-low-confidence_ppiGPLM_prompts.csv — inspect or copy its format when building your own input file.
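For building your own input file, a prompt line can be assembled as sketched below. This assumes that LEN_A and LEN_B are the residue counts written as plain decimal integers, which the schema suggests but does not state explicitly; compare against the shipped example file before relying on it. `build_prompt` is a hypothetical helper.

```python
def build_prompt(seq_a, seq_b):
    """Assemble one prompt line following the schema above (lengths assumed decimal)."""
    return (
        f"<ps1>,{seq_a},<ps2>,{seq_b},"
        f"<l1>,{len(seq_a)},<l2>,{len(seq_b)},<l3>,<"
    )

pairs = [("MKVLA", "MSTQ"), ("MA", "MKK")]      # toy sequences, not real proteins
lines = [build_prompt(a, b) for a, b in pairs]

for line in lines:
    assert line.endswith(",<")                  # the label cue must be present
    print(line)
```

Writing `"\n".join(lines)` to a file gives one prompt per line, matching what --input_file expects.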

Batch Inference

Run inference on a file of prompts:

python sample_fasta3.3_softmax_error_handling3e.py \
    --input_file MED4-PPIs-low-confidence_ppiGPLM_prompts.csv \
    --output_dir ppi_results \
    --output_prefix my_predictions

This produces:

*_classifications.txt: Full model output in FASTA-like format
*_probabilities.csv: Per-pair probabilities for class 1 and class 0

About the inference script

sample_fasta3.3_softmax_error_handling3e.py is derived from Karpathy's nanoGPT sample.py — it reuses the same GPTConfig/GPT classes from model.py, the init_from = 'resume' checkpoint-loading idiom, and the _orig_mod. prefix strip for torch.compile-wrapped state dicts. It is not a drop-in copy, however. The modifications make it a batch classifier rather than a generic sampler:

batch input: one prompt per line read from --input_file, processed sequentially with no interactive loop;
classifier-style output: per-prompt softmax probabilities of the next token being "1" vs "0", written to *_probabilities.csv alongside the conventional generate() output dump in *_classifications.txt;
robustness against real-world inputs: automatic block-size detection (checkpoint['model_args']['block_size'] or model.config.n_positions), clipping from the head (the start of the prompt is dropped) when a prompt exceeds the context window, so that the trailing < label-cue token survives, and replacement of out-of-vocabulary characters (with A by default).
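The two input-sanitising steps in the last item can be sketched as follows. This is an illustration of the idea under simple assumptions, not the script's actual code: `replace_oov` and `clip_head` are hypothetical names and the vocabulary is a toy one.

```python
def replace_oov(text, stoi, fallback="A"):
    """Swap any character missing from the vocabulary for the fallback residue."""
    return "".join(ch if ch in stoi else fallback for ch in text)

def clip_head(token_ids, block_size):
    """Drop tokens from the start so that at most block_size remain."""
    if len(token_ids) <= block_size:
        return token_ids
    return token_ids[-block_size:]      # keeps the tail, including the final cue token

toy_stoi = {ch: i for i, ch in enumerate("AMKV,<")}
toy_itos = {i: ch for ch, i in toy_stoi.items()}

cleaned = replace_oov("MKXV,<", toy_stoi)           # 'X' is not in the toy vocabulary
ids = [toy_stoi[ch] for ch in cleaned]
clipped = clip_head(ids, block_size=4)
print(cleaned, "".join(toy_itos[i] for i in clipped))
```

Clipping from the head rather than the tail matters here because the label cue sits at the very end of the prompt; truncating the tail would remove the token the classification depends on.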

The lineage is GPT-2 → nanoGPT → ppiGPLM's batch-classifier sampler.

Training

Prepare data

python data/MED4_char/prepare.py

This creates train.bin, val.bin, and meta.pkl from the input training data.
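For orientation, a character-level preparation step in the nanoGPT style looks roughly like the sketch below. It is not the repository's prepare.py: the 90/10 split, the toy text, and the function name are assumptions, and the standard-library `array` module stands in for NumPy to keep the example dependency-free.

```python
import os
import pickle
import tempfile
from array import array

def prepare_char_dataset(text, out_dir, val_fraction=0.1):
    """Write train.bin, val.bin and meta.pkl for a character-level vocabulary."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    ids = [stoi[ch] for ch in text]
    split = int(len(ids) * (1 - val_fraction))
    for name, chunk in (("train.bin", ids[:split]), ("val.bin", ids[split:])):
        with open(os.path.join(out_dir, name), "wb") as f:
            array("H", chunk).tofile(f)          # unsigned 16-bit token ids
    with open(os.path.join(out_dir, "meta.pkl"), "wb") as f:
        pickle.dump({"vocab_size": len(chars), "stoi": stoi, "itos": itos}, f)
    return len(chars), split, len(ids) - split

# Toy text only; the real training file layout is not reproduced here.
toy_text = "<ps1>,MKV,<ps2>,MST,<l1>,3,<l2>,3,<l3>,<1\n" * 10
out_dir = tempfile.mkdtemp()
vocab_size, n_train, n_val = prepare_char_dataset(toy_text, out_dir)
print(vocab_size, n_train, n_val, sorted(os.listdir(out_dir)))
```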

Train the model

# Single GPU
python train_.py config/train_par_gpt2-s_scratch.py

# Multi-GPU (2 GPUs)
torchrun --standalone --nproc_per_node=2 train_.py config/train_par_gpt2-s_scratch.py

Learning Efficiency Score (LES) Evaluation

The LES-wrapper automates evaluation across multiple training checkpoints, computing ROC-AUC, F1, and optimal threshold at each checkpoint and deriving integrated Learning Efficiency Scores:

python LES-wrapper.py \
    --checkpoint_dir out \
    --prs_file PRS.txt \
    --rrs_file RRS.txt \
    --output_dir LES_results \
    --vanilla

See LES-wrapper.md for full documentation.
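To illustrate the per-checkpoint metrics named above, here is a dependency-free sketch of ROC-AUC and an F1-maximising threshold computed from two lists of scores. It assumes the PRS file holds scores for positive pairs and the RRS file scores for the contrasting (negative) pairs, and that "optimal threshold" means the one maximising F1; neither is confirmed here, and the integrated LES itself is documented in LES-wrapper.md rather than reproduced.

```python
def roc_auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

def best_f1_threshold(pos_scores, neg_scores):
    """Scan every observed score as a cut-off (predict 1 when score >= cut-off)."""
    best_f1, best_t = 0.0, None
    for t in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= t for s in pos_scores)
        fp = sum(s >= t for s in neg_scores)
        fn = len(pos_scores) - tp
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_f1, best_t

# Toy class-1 probabilities (illustrative values only).
prs = [0.9, 0.8, 0.6, 0.4]
rrs = [0.7, 0.3, 0.2, 0.1]

auc = roc_auc(prs, rrs)
f1, threshold = best_f1_threshold(prs, rrs)
print(f"AUC={auc:.3f}  best F1={f1:.3f} at threshold {threshold}")
```

The pairwise AUC loop is quadratic in the number of pairs, which is fine for a sketch; for large reference sets a rank-based implementation such as the one in scikit-learn is the better choice.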

Standalone ROC Analysis

python roc_analysis_color_threshold_F1e.py \
    --prs_file ppi_results/PRS_probabilities.csv \
    --rrs_file ppi_results/RRS_probabilities.csv

Architecture Diagrams

The ASCII workflow diagram (assets/ppiGPLM.png) covers:

A. Prompt-based input strategy (character-level tokenization)
B. Model architecture (GPT-2 small, causal self-attention)
C. Training pipeline
D. Inference pipeline with LES evaluation

Note: the diagram lists "Flash Attention". This path is taken automatically when running on PyTorch ≥ 2.0; older versions fall back to the manual scaled-dot-product implementation. The two paths compute the same attention output, up to floating-point rounding.

See assets/tri_model_consensus.svg for the tri-model consensus framework with ppiDCE and ppiBTEP.

Citation

If you use this software, please cite:

Daakour, S. et al. (2026).

This software is built on nanoGPT:

Karpathy, A. (2022). nanoGPT. https://github.com/karpathy/nanoGPT

License

This project is licensed under the MIT License. See LICENSE for details.

The original nanoGPT framework is by Andrej Karpathy (MIT License, 2022). Modifications and additions for protein-protein interaction prediction are by Kourosh Salehi-Ashtiani (MIT License, 2026).