
	---
	license: mit
	library_name: pytorch
	tags:
	- protein-protein-interaction
	- ppi
	- protein-language-model
	- gpt-2
	- nanogpt
	- character-level
	- trained-from-scratch
	- bioinformatics
	- biology
	pipeline_tag: text-generation
	---

	# ppiGPLM

	A GPT-2 small protein language model trained from scratch on protein-pair prompts and used for binary protein-protein interaction (PPI) classification via next-token prediction. The implementation is based on [nanoGPT](https://github.com/karpathy/nanoGPT) by Andrej Karpathy, with character-level tokenization over amino acids.

	![ppiGPLM](assets/ppiGPLM.png)

	## Overview

	ppiGPLM uses a GPT-2 small architecture (12 layers, 12 attention heads, 768 embedding dimensions) with character-level tokenization to predict whether two proteins interact. Rather than using a separate classification head, ppiGPLM frames PPI prediction as next-token prediction: given a structured prompt encoding a protein pair, the model predicts a binary label (`0` or `1`) as the next token. Softmax probabilities over the label tokens provide continuous interaction scores.
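
The label read-out described above can be illustrated with a small, dependency-free sketch. It assumes the softmax is taken over the full vocabulary and that the entries for `"1"` and `"0"` are then read off; whether the inference script normalizes over the whole vocabulary or only over the two label tokens is not stated here. The toy vocabulary and logits are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interaction_score(next_token_logits, stoi):
    """Return (P('1'), P('0')) for the token that follows the prompt."""
    probs = softmax(next_token_logits)
    return probs[stoi["1"]], probs[stoi["0"]]


# Hypothetical 4-symbol vocabulary and made-up logits, for illustration only.
toy_stoi = {"0": 0, "1": 1, "A": 2, "<": 3}
toy_logits = [1.0, 3.0, -2.0, -2.0]

p1, p0 = interaction_score(toy_logits, toy_stoi)
print(f"P(1)={p1:.3f}  P(0)={p0:.3f}")
```

In a real run the logits would come from the model's forward pass on the encoded prompt, and `stoi` from `meta.pkl`.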

	The model was developed for the Prochlorococcus marinus MED4 interactome, where it serves as one component of a tri-model consensus framework for computational PPI screening.

	## Architecture

	| Parameter | Value |
	|-----------|-------|
	| Architecture | GPT-2 small |
	| Layers | 12 |
	| Attention heads | 12 |
	| Embedding dimension | 768 |
	| Context length | 4,096 tokens |
	| Tokenization | Character-level (one token per amino acid) |
	| Dropout | 0.2 |
	| Optimizer | AdamW (lr = 5e-4, beta2 = 0.99) |
	| Training iterations | 8,000 |
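
As a rough sanity check on these numbers, the sketch below derives the per-head dimension and an approximate parameter count from the table. `ArchSummary` is a hypothetical helper, not the repository's `GPTConfig`. The count assumes a standard GPT-2 block (four `d x d` attention projections plus a 4x-wide MLP) and ignores biases, LayerNorm weights, and the token embedding, whose size depends on the vocabulary in `meta.pkl`.

```python
from dataclasses import dataclass


@dataclass
class ArchSummary:
    """Values copied from the table above; not the repo's GPTConfig."""
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    block_size: int = 4096

    @property
    def head_dim(self) -> int:
        # Each attention head works on an equal slice of the embedding.
        return self.n_embd // self.n_head

    def approx_block_params(self) -> int:
        # Per block: 4*d^2 (q, k, v, output projections) + 8*d^2 (MLP),
        # ignoring biases and LayerNorm weights.
        return self.n_layer * 12 * self.n_embd ** 2

    def position_embedding_params(self) -> int:
        # One learned vector per position in the context window.
        return self.block_size * self.n_embd


arch = ArchSummary()
print(arch.head_dim, arch.approx_block_params(), arch.position_embedding_params())
```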

	## Installation

	### Prerequisites

	- Python 3.8+
	- CUDA-capable GPU (recommended) or CPU
	- conda (recommended) or pip

	### Setup

	```bash
	# Clone the repository
	git clone https://github.com/kouroshSA/ppiGPLM.git
	cd ppiGPLM

	# Create a conda environment
	conda create -n gpt python=3.10
	conda activate gpt
	pip install -r requirements.txt
	```

	## Repository Structure

	```
	ppiGPLM/
	|-- model.py  # GPT model definition
	|-- train_.py  # Training loop
	|-- sample_fasta3.3_softmax_error_handling3e.py  # Batch inference script
	|-- LES-wrapper.py  # Learning Efficiency Score evaluation wrapper
	|-- LES-wrapper.md  # LES-wrapper documentation
	|-- roc_analysis_color_threshold_F1e.py  # ROC curve analysis
	|-- configurator.py  # Configuration utility
	|-- config/
	|   |-- train_par_gpt2-s_scratch.py  # Training config (GPT-2 small, from scratch)
	|   +-- finetune_label3.py  # Fine-tuning config
	|-- data/
	|   +-- MED4_char/  # MED4 PPI dataset
	|       |-- prepare.py  # Character-level tokenizer
	|       +-- meta.pkl  # Vocabulary (stoi/itos mappings)
	|-- assets/
	|   |-- ppiGPLM.png  # ASCII workflow diagram
	|   |-- tri_model_consensus.svg  # Tri-model consensus framework (SVG)
	|   +-- tri_model_consensus.png  # Tri-model consensus framework (PNG)
	|-- requirements.txt
	|-- LICENSE
	+-- README.md
	```

	## Usage

	### Quick start: fetch the checkpoint from Hugging Face

	The released MED4 checkpoint (`checkpoints/ppiGPLM_ckpt_7e.pt`, epoch ≈ 71)
	lives on this Hugging Face repo. To pull it without cloning the GitHub
	mirror:

	```python
	from huggingface_hub import hf_hub_download

	ckpt_path = hf_hub_download(
	    repo_id="kouroshSA/ppiGPLM",
	    filename="checkpoints/ppiGPLM_ckpt_7e.pt",
	)
	meta_path = hf_hub_download(
	    repo_id="kouroshSA/ppiGPLM",
	    filename="data/MED4_char/meta.pkl",
	)
	```

	`meta.pkl` carries the character vocabulary (`stoi`/`itos`) the inference
	script needs to encode protein sequences.
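
A minimal sketch of how such a vocabulary file is typically consumed, following the nanoGPT convention of a pickled dict with `stoi` and `itos` entries. The exact contents of the released `meta.pkl` are an assumption here, and `load_codec` is a hypothetical helper. The demo writes a toy vocabulary to a temporary file so it runs without the real download.

```python
import os
import pickle
import tempfile


def load_codec(meta_path):
    """Load stoi/itos from a meta.pkl and return (encode, decode) callables."""
    with open(meta_path, "rb") as fh:
        meta = pickle.load(fh)  # only unpickle files from a source you trust
    stoi, itos = meta["stoi"], meta["itos"]
    encode = lambda text: [stoi[ch] for ch in text]
    decode = lambda ids: "".join(itos[i] for i in ids)
    return encode, decode


# Toy vocabulary (the 20 standard amino acids), for illustration only.
toy_chars = sorted(set("ACDEFGHIKLMNPQRSTVWY"))
toy_meta = {
    "stoi": {ch: i for i, ch in enumerate(toy_chars)},
    "itos": {i: ch for i, ch in enumerate(toy_chars)},
}

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "meta.pkl")
    with open(toy_path, "wb") as fh:
        pickle.dump(toy_meta, fh)
    encode, decode = load_codec(toy_path)
    ids = encode("MKT")
    roundtrip = decode(ids)

print(ids, roundtrip)
```

With the real file, `load_codec(meta_path)` would take the path returned by `hf_hub_download`.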

	#### Wiring the checkpoint into the inference script

	`sample_fasta3.3_softmax_error_handling3e.py` loads from
	`<model_dir>/ckpt.pt`, where `model_dir = 'out'` is set inline near the
	top of the script and the trailing `ckpt.pt` filename is hardcoded.
	Two ways to make the downloaded file work:

	**Option A (place the file where the defaults expect it):**

	```bash
	mkdir -p out
	cp /path/to/ppiGPLM_ckpt_7e.pt out/ckpt.pt # note the required rename
	```

	Then run the command from the Batch Inference section as-is.

	**Option B (override `model_dir` via the poor-man's configurator, `configurator.py`):**

	```bash
	mkdir -p my_ckpts
	cp /path/to/ppiGPLM_ckpt_7e.pt my_ckpts/ckpt.pt # still needs to be ckpt.pt
	python sample_fasta3.3_softmax_error_handling3e.py \
	  --input_file MED4-PPIs-low-confidence_ppiGPLM_prompts.csv \
	  --output_dir ppi_results \
	  --output_prefix my_predictions \
	  --model_dir=my_ckpts
	```

	Either way, the on-disk filename must be `ckpt.pt`. You can instead edit
	the script (change the `model_dir` default near the top, or the literal
	`'ckpt.pt'` further down), but the rename above is simpler.

	The character vocabulary (`meta.pkl`) is read from
	`data/<dataset>/meta.pkl`, where `<dataset>` comes from
	`checkpoint['config']['dataset']` (`MED4_char` for this checkpoint). Make
	sure that path exists — either keep the `data/MED4_char/` directory from
	the GitHub clone, or place the downloaded `meta_path` there.
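
The wiring steps in this section can also be scripted. The sketch below copies a downloaded checkpoint and vocabulary into the layout described above (`<model_dir>/ckpt.pt` and `data/<dataset>/meta.pkl`). `stage_for_inference` is a hypothetical helper, not part of the repository, and the demo uses stand-in files in a temporary directory instead of the real download.

```python
import shutil
import tempfile
from pathlib import Path


def stage_for_inference(ckpt_path, meta_path, repo_root=".",
                        model_dir="out", dataset="MED4_char"):
    """Copy the downloaded files to the locations the script reads from."""
    root = Path(repo_root)
    ckpt_dst = root / model_dir / "ckpt.pt"        # filename is hardcoded
    meta_dst = root / "data" / dataset / "meta.pkl"
    for src, dst in ((Path(ckpt_path), ckpt_dst), (Path(meta_path), meta_dst)):
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Skip the copy when the file is already in place.
        if src.resolve() != dst.resolve():
            shutil.copyfile(src, dst)
    return ckpt_dst, meta_dst


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in files; in practice these are the hf_hub_download paths.
    src_ckpt = Path(tmp) / "ppiGPLM_ckpt_7e.pt"
    src_meta = Path(tmp) / "meta.pkl"
    src_ckpt.write_bytes(b"stand-in")
    src_meta.write_bytes(b"stand-in")
    ckpt_dst, meta_dst = stage_for_inference(
        src_ckpt, src_meta, repo_root=Path(tmp) / "repo"
    )
    staged_ok = ckpt_dst.is_file() and meta_dst.is_file()

print(staged_ok, ckpt_dst.name, meta_dst.parent.name)
```

Copying rather than moving leaves the Hugging Face cache intact, so a later `hf_hub_download` call does not have to fetch the files again.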

	### Input file format

	Each line of `--input_file` is one structured prompt (one protein pair),
	not a free-form FASTA record. The schema is:

	```
	<ps1>,SEQ_A,<ps2>,SEQ_B,<l1>,LEN_A,<l2>,LEN_B,<l3>,<
	```

	- `<ps1>`, `<ps2>`: protein-sequence delimiter tokens
	- `<l1>`, `<l2>`, `<l3>`: length-field delimiter tokens
	- The trailing `,<` is the cue: it tells the model the next token to
	generate is the classification label (`1` = interacting, `0` = not).
	Don't omit it.

	A ready-made example is shipped with the repo:
	[`MED4-PPIs-low-confidence_ppiGPLM_prompts.csv`](MED4-PPIs-low-confidence_ppiGPLM_prompts.csv)
	— inspect or copy its format when building your own input file.
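
For building your own input file, the schema above can be assembled programmatically. This is a minimal sketch: `build_prompt` is a hypothetical helper, and it assumes `LEN_A` and `LEN_B` are the residue counts of the two sequences, which should be checked against the shipped example file. The sequences are made-up toy strings.

```python
def build_prompt(seq_a: str, seq_b: str) -> str:
    """Assemble one prompt line that ends with the ',<' label cue."""
    fields = [
        "<ps1>", seq_a,
        "<ps2>", seq_b,
        "<l1>", str(len(seq_a)),   # assumed: length of SEQ_A in residues
        "<l2>", str(len(seq_b)),   # assumed: length of SEQ_B in residues
        "<l3>", "<",               # trailing cue for the label token
    ]
    return ",".join(fields)


pairs = [("MKT", "AG")]  # toy sequences, for illustration only
lines = [build_prompt(a, b) for a, b in pairs]
print("\n".join(lines))
```

Writing `"\n".join(lines)` to a file gives one prompt per line, the layout `--input_file` expects.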

	### Batch Inference

	Run inference on a file of prompts:

	```bash
	python sample_fasta3.3_softmax_error_handling3e.py \
	  --input_file MED4-PPIs-low-confidence_ppiGPLM_prompts.csv \
	  --output_dir ppi_results \
	  --output_prefix my_predictions
	```

	This produces:
	- `*_classifications.txt`: Full model output in FASTA-like format
	- `*_probabilities.csv`: Per-pair probabilities for class 1 and class 0

	#### About the inference script

	`sample_fasta3.3_softmax_error_handling3e.py` is derived from Karpathy's
	nanoGPT `sample.py` — it reuses the same `GPTConfig`/`GPT` classes from
	`model.py`, the `init_from = 'resume'` checkpoint-loading idiom, and the
	`_orig_mod.` prefix strip for `torch.compile`-wrapped state dicts. It is
	not a drop-in copy, however. The modifications make it a batch
	classifier rather than a generic sampler:

	- batch input: one prompt per line read from `--input_file`, processed
	sequentially with no interactive loop;
	- classifier-style output: per-prompt softmax probabilities of the next
	token being `"1"` vs `"0"`, written to `*_probabilities.csv` alongside
	the conventional `generate()` output dump in `*_classifications.txt`;
	- robustness against real-world inputs: automatic block-size detection
	(`checkpoint['model_args']['block_size']` or
	`model.config.n_positions`), head-clipping when a prompt exceeds the
	context window so the trailing `<` label-cue token survives, and
	out-of-vocabulary character replacement (defaults to `A`).
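
The last bullet can be sketched in a few lines. This illustrates the two ideas (out-of-vocabulary replacement, then clipping from the head) and is not the script's actual code. `encode_for_model` and the toy vocabulary are hypothetical.

```python
def encode_for_model(prompt, stoi, block_size, fallback="A"):
    """Replace out-of-vocabulary characters, then clip from the head."""
    ids = [stoi[ch] if ch in stoi else stoi[fallback] for ch in prompt]
    # Keep the last block_size tokens so the final '<' cue is never dropped.
    return ids[-block_size:]


# Toy vocabulary: amino acids plus the characters used by the prompt schema.
toy_chars = sorted(set("ACDEFGHIKLMNPQRSTVWY<>,0123456789psl"))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}

# 'X' is not in the toy vocabulary, and the prompt exceeds block_size=8.
clipped = encode_for_model("<ps1>,MXT,<l3>,<", toy_stoi, block_size=8)
replaced = encode_for_model("MXT", toy_stoi, block_size=8)
print(clipped, replaced)
```

Clipping from the front instead of the back is the key design choice: truncating the tail would remove the `<` cue, and the next-token prediction would no longer correspond to the label position.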

	The lineage is GPT-2 → nanoGPT → ppiGPLM's batch-classifier sampler.

	### Training

	#### Prepare data

	```bash
	python data/MED4_char/prepare.py
	```

	This creates `train.bin`, `val.bin`, and `meta.pkl` from the input training data.
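
For orientation, here is a sketch in the style of nanoGPT's character-level prepare scripts. The repository's `prepare.py` may differ in its input handling and split. The 90/10 split and the `prepare_char_dataset` name are assumptions, and the standard-library `array` module stands in for NumPy's `uint16` arrays to keep the example dependency-free.

```python
import pickle
import tempfile
from array import array
from pathlib import Path


def prepare_char_dataset(text, out_dir, train_fraction=0.9):
    """Write train.bin, val.bin, and meta.pkl for a character-level model."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    ids = [stoi[ch] for ch in text]
    split = int(len(ids) * train_fraction)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, chunk in (("train.bin", ids[:split]), ("val.bin", ids[split:])):
        with open(out / name, "wb") as fh:
            array("H", chunk).tofile(fh)  # unsigned 16-bit token ids
    with open(out / "meta.pkl", "wb") as fh:
        pickle.dump({"vocab_size": len(chars), "stoi": stoi, "itos": itos}, fh)
    return len(chars), split


toy_text = "MKTAYIAKQR\nGSHMLEDPAG\n"  # stand-in for the real training text
with tempfile.TemporaryDirectory() as tmp:
    vocab_size, split = prepare_char_dataset(toy_text, tmp)
    train_bytes = (Path(tmp) / "train.bin").stat().st_size
    val_bytes = (Path(tmp) / "val.bin").stat().st_size
    meta_written = (Path(tmp) / "meta.pkl").is_file()

print(vocab_size, split, train_bytes, val_bytes)
```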

	#### Train the model

	```bash
	# Single GPU
	python train_.py config/train_par_gpt2-s_scratch.py

	# Multi-GPU (2 GPUs)
	torchrun --standalone --nproc_per_node=2 train_.py config/train_par_gpt2-s_scratch.py
	```

	### Learning Efficiency Score (LES) Evaluation

	The LES-wrapper automates evaluation across multiple training checkpoints, computing ROC-AUC, F1, and optimal threshold at each checkpoint and deriving integrated Learning Efficiency Scores:

	```bash
	python LES-wrapper.py \
	  --checkpoint_dir out \
	  --prs_file PRS.txt \
	  --rrs_file RRS.txt \
	  --output_dir LES_results \
	  --vanilla
	```

	See [LES-wrapper.md](LES-wrapper.md) for full documentation.

	### Standalone ROC Analysis

	```bash
	python roc_analysis_color_threshold_F1e.py \
	  --prs_file ppi_results/PRS_probabilities.csv \
	  --rrs_file ppi_results/RRS_probabilities.csv
	```
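
To make the evaluation metrics concrete, the sketch below computes ROC-AUC and a best-F1 threshold from two lists of class-1 probabilities. It assumes the PRS file holds scores for positive (interacting) pairs and the RRS file scores for negative pairs, and that the "optimal threshold" is the one that maximizes F1. Neither assumption is confirmed here, and this is not the analysis script's code. The scores are made up.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the chance a positive outranks a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def best_f1_threshold(pos_scores, neg_scores):
    """Scan observed scores as thresholds; return (threshold, best F1)."""
    best = (None, -1.0)
    for t in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= t for s in pos_scores)   # positives called positive
        fp = sum(s >= t for s in neg_scores)   # negatives called positive
        fn = len(pos_scores) - tp              # positives missed
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best[1]:
            best = (t, f1)
    return best


# Made-up class-1 probabilities, for illustration only.
toy_pos = [0.9, 0.8, 0.4]
toy_neg = [0.7, 0.3, 0.2, 0.1]

auc = roc_auc(toy_pos, toy_neg)
threshold, f1 = best_f1_threshold(toy_pos, toy_neg)
print(f"AUC={auc:.3f}  threshold={threshold}  F1={f1:.3f}")
```

The pairwise AUC loop is quadratic in the number of scores. That is fine for a quick check, but a rank-based formulation is preferable for large reference sets.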

	## Architecture Diagrams

	The ASCII workflow diagram (`assets/ppiGPLM.png`) covers:
	- A. Prompt-based input strategy (character-level tokenization)
	- B. Model architecture (GPT-2 small, causal self-attention)
	- C. Training pipeline
	- D. Inference pipeline with LES evaluation

	> Note: the diagram lists "Flash Attention". This path is taken automatically
	> when running on PyTorch ≥ 2.0, while older versions fall back to the manual
	> scaled-dot-product implementation. The two paths agree up to floating-point rounding.

	See `assets/tri_model_consensus.svg` for the tri-model consensus framework with [ppiDCE](https://github.com/kouroshSA/ppiDCE) and [ppiBTEP](https://github.com/kouroshSA/ppiBTEP).

	## Citation

	If you use this software, please cite:

	```
	Daakour, S. et al. (2026).
	```

	This software is built on nanoGPT:

	```
	Karpathy, A. (2022). nanoGPT. https://github.com/karpathy/nanoGPT
	```

	## License

	This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.

	The original nanoGPT framework is by Andrej Karpathy (MIT License, 2022). Modifications and additions for protein-protein interaction prediction are by Kourosh Salehi-Ashtiani (MIT License, 2026).