---
license: mit
library_name: pytorch
tags:
- protein-protein-interaction
- ppi
- protein-language-model
- gpt-2
- nanogpt
- character-level
- trained-from-scratch
- bioinformatics
- biology
pipeline_tag: text-generation
---

# ppiGPLM

A GPT-2 small protein language model trained from scratch on protein-pair prompts and used for binary protein-protein interaction (PPI) classification via next-token prediction. The implementation is based on [nanoGPT](https://github.com/karpathy/nanoGPT) by Andrej Karpathy, with character-level tokenization over amino acids.

![ppiGPLM](assets/ppiGPLM.png)

## Overview

ppiGPLM uses a GPT-2 small architecture (12 layers, 12 attention heads, 768 embedding dimensions) with character-level tokenization to predict whether two proteins interact. Rather than using a separate classification head, ppiGPLM frames PPI prediction as next-token prediction: given a structured prompt encoding a protein pair, the model predicts a binary label (`0` or `1`) as the next token. Softmax probabilities over the label tokens provide continuous interaction scores.

The model was developed for the *Prochlorococcus marinus* MED4 interactome, where it serves as one component of a tri-model consensus framework for computational PPI screening.

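
The label-scoring idea described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the helper name `label_probabilities`, the toy vocabulary, and the toy logits are all hypothetical, and the sketch assumes the softmax is taken over the full vocabulary before the two label entries are read off (whether the inference script renormalizes over just `0` and `1` is not stated here).

```python
import math

def label_probabilities(logits, stoi, labels=("0", "1")):
    """Softmax the next-token logits and return the label-token probabilities.

    `logits` holds one float per vocabulary entry and `stoi` maps characters
    to vocabulary indices. Both are stand-ins for the real model output and
    the mapping stored in meta.pkl (assumption for illustration).
    """
    peak = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {ch: exps[stoi[ch]] / total for ch in labels}

# Toy vocabulary and logits, purely illustrative.
toy_stoi = {"0": 0, "1": 1, "A": 2, "<": 3}
toy_logits = [0.5, 2.0, -1.0, -3.0]

probs = label_probabilities(toy_logits, toy_stoi)
print(probs)  # probs["1"] serves as the continuous interaction score
```

Because the probability of `1` is a continuous value rather than a hard label, it can be thresholded or ranked downstream, which is what makes ROC-style evaluation possible.
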
## Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | GPT-2 small |
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Context length | 4,096 tokens |
| Tokenization | Character-level (one token per amino acid) |
| Dropout | 0.2 |
| Optimizer | AdamW (lr = 5e-4, beta2 = 0.99) |
| Training iterations | 8,000 |

## Installation

### Prerequisites

- Python 3.8+
- CUDA-capable GPU (recommended) or CPU
- conda (recommended) or pip

### Setup

```bash
# Clone the repository
git clone https://github.com/kouroshSA/ppiGPLM.git
cd ppiGPLM

# Create a conda environment
conda create -n gpt python=3.10
conda activate gpt
pip install -r requirements.txt
```

## Repository Structure

```
ppiGPLM/
|-- model.py                                      # GPT model definition
|-- train_.py                                     # Training loop
|-- sample_fasta3.3_softmax_error_handling3e.py   # Batch inference script
|-- LES-wrapper.py                                # Learning Efficiency Score evaluation wrapper
|-- LES-wrapper.md                                # LES-wrapper documentation
|-- roc_analysis_color_threshold_F1e.py           # ROC curve analysis
|-- configurator.py                               # Configuration utility
|-- config/
|   |-- train_par_gpt2-s_scratch.py               # Training config (GPT-2 small, from scratch)
|   +-- finetune_label3.py                        # Fine-tuning config
|-- data/
|   +-- MED4_char/                                # MED4 PPI dataset
|       |-- prepare.py                            # Character-level tokenizer
|       +-- meta.pkl                              # Vocabulary (stoi/itos mappings)
|-- assets/
|   |-- ppiGPLM.png                               # ASCII workflow diagram
|   |-- tri_model_consensus.svg                   # Tri-model consensus framework (SVG)
|   +-- tri_model_consensus.png                   # Tri-model consensus framework (PNG)
|-- requirements.txt
|-- LICENSE
+-- README.md
```

## Usage

### Quick start: fetch the checkpoint from Hugging Face

The released MED4 checkpoint (`checkpoints/ppiGPLM_ckpt_7e.pt`, epoch ≈ 71) lives on this Hugging Face repo.
To pull it without cloning the GitHub mirror:

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="kouroshSA/ppiGPLM",
    filename="checkpoints/ppiGPLM_ckpt_7e.pt",
)
meta_path = hf_hub_download(
    repo_id="kouroshSA/ppiGPLM",
    filename="data/MED4_char/meta.pkl",
)
```

`meta.pkl` carries the character vocabulary (`stoi`/`itos`) the inference script needs to encode protein sequences.

#### Wiring the checkpoint into the inference script

`sample_fasta3.3_softmax_error_handling3e.py` loads `ckpt.pt` from the directory named by `model_dir`. `model_dir = 'out'` is set inline near the top of the script, and the `ckpt.pt` filename is **hardcoded**. There are two ways to make the downloaded file work.

**Option A (place the file where the defaults expect it):**

```bash
mkdir -p out
cp /path/to/ppiGPLM_ckpt_7e.pt out/ckpt.pt   # note the required rename
```

Then run the inference command below as-is.

**Option B (override `model_dir` via the poor-man's configurator, `configurator.py`):**

```bash
mkdir -p my_ckpts
cp /path/to/ppiGPLM_ckpt_7e.pt my_ckpts/ckpt.pt   # still needs to be ckpt.pt
python sample_fasta3.3_softmax_error_handling3e.py \
    --input_file MED4-PPIs-low-confidence_ppiGPLM_prompts.csv \
    --output_dir ppi_results \
    --output_prefix my_predictions \
    --model_dir=my_ckpts
```

Either way, the on-disk filename must be `ckpt.pt`. Editing the script is also possible (change the `model_dir` default near the top, or the literal `'ckpt.pt'` further down), but the rename above is simpler.

The character vocabulary (`meta.pkl`) is read from the subdirectory of `data/` named by `checkpoint['config']['dataset']` (`MED4_char` for this checkpoint, so `data/MED4_char/meta.pkl`). Make sure that path exists: either keep the `data/MED4_char/` directory from the GitHub clone, or copy the file downloaded to `meta_path` there.

### Input file format

Each line of `--input_file` is one structured prompt (one protein pair), not a free-form FASTA record.
The schema is:

```
,SEQ_A,,SEQ_B,,LEN_A,,LEN_B,,<
```

- ``, ``: protein-sequence delimiter tokens
- ``, ``, ``: length-field delimiter tokens
- The trailing `,<` is **the cue**: it tells the model the next token to generate is the classification label (`1` = interacting, `0` = not). Don't omit it.

A ready-made example is shipped with the repo: [`MED4-PPIs-low-confidence_ppiGPLM_prompts.csv`](MED4-PPIs-low-confidence_ppiGPLM_prompts.csv). Inspect or copy its format when building your own input file.

### Batch Inference

Run inference on a file of prompts:

```bash
python sample_fasta3.3_softmax_error_handling3e.py \
    --input_file MED4-PPIs-low-confidence_ppiGPLM_prompts.csv \
    --output_dir ppi_results \
    --output_prefix my_predictions
```

This produces:

- `*_classifications.txt`: Full model output in FASTA-like format
- `*_probabilities.csv`: Per-pair probabilities for class 1 and class 0

#### About the inference script

`sample_fasta3.3_softmax_error_handling3e.py` is derived from Karpathy's nanoGPT `sample.py`: it reuses the same `GPTConfig`/`GPT` classes from `model.py`, the `init_from = 'resume'` checkpoint-loading idiom, and the `_orig_mod.` prefix strip for `torch.compile`-wrapped state dicts. It is **not** a drop-in copy, however. The modifications make it a batch classifier rather than a generic sampler:

- batch input: one prompt per line read from `--input_file`, processed sequentially with no interactive loop;
- classifier-style output: per-prompt softmax probabilities of the next token being `"1"` vs `"0"`, written to `*_probabilities.csv` alongside the conventional `generate()` output dump in `*_classifications.txt`;
- robustness against real-world inputs: automatic block-size detection (`checkpoint['model_args']['block_size']` or `model.config.n_positions`), head-clipping when a prompt exceeds the context window so the trailing `<` label-cue token survives, and out-of-vocabulary character replacement (defaults to `A`).

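
The two input-robustness steps in the last bullet (out-of-vocabulary replacement and head-clipping) can be sketched as below. This is a hedged illustration rather than the script's actual code: the function name `encode_prompt`, the toy vocabulary, and the tiny `block_size` are hypothetical, and the real script may handle details such as leaving room for generated tokens differently.

```python
def encode_prompt(prompt, stoi, block_size, oov_char="A"):
    """Encode a prompt to token ids with OOV replacement and head-clipping.

    Characters missing from `stoi` are mapped to `oov_char`. If the encoded
    prompt is longer than `block_size`, tokens are dropped from the head so
    the tail, including the trailing '<' cue, is preserved.
    """
    fallback = stoi[oov_char]
    ids = [stoi.get(ch, fallback) for ch in prompt]
    if len(ids) > block_size:
        ids = ids[-block_size:]  # keep only the last block_size tokens
    return ids

# Toy vocabulary (hypothetical); the real one comes from meta.pkl.
toy_stoi = {ch: i for i, ch in enumerate("ACDE,<")}

# 'X' is not in the vocabulary, and the prompt is longer than block_size.
ids = encode_prompt("ACXDE,<", toy_stoi, block_size=4)
print(ids)  # [2, 3, 4, 5]
```

Clipping from the head rather than the tail matters because the label cue sits at the very end of the prompt; truncating the tail would remove the token that tells the model to emit a label.
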
The lineage is **GPT-2 → nanoGPT → ppiGPLM's batch-classifier sampler**.

### Training

#### Prepare data

```bash
python data/MED4_char/prepare.py
```

This creates `train.bin`, `val.bin`, and `meta.pkl` from the input training data.

#### Train the model

```bash
# Single GPU
python train_.py config/train_par_gpt2-s_scratch.py

# Multi-GPU (2 GPUs)
torchrun --standalone --nproc_per_node=2 train_.py config/train_par_gpt2-s_scratch.py
```

### Learning Efficiency Score (LES) Evaluation

The LES-wrapper automates evaluation across multiple training checkpoints, computing ROC-AUC, F1, and optimal threshold at each checkpoint and deriving integrated Learning Efficiency Scores:

```bash
python LES-wrapper.py \
    --checkpoint_dir out \
    --prs_file PRS.txt \
    --rrs_file RRS.txt \
    --output_dir LES_results \
    --vanilla
```

See [LES-wrapper.md](LES-wrapper.md) for full documentation.

### Standalone ROC Analysis

```bash
python roc_analysis_color_threshold_F1e.py \
    --prs_file ppi_results/PRS_probabilities.csv \
    --rrs_file ppi_results/RRS_probabilities.csv
```

## Architecture Diagrams

The ASCII workflow diagram (`assets/ppiGPLM.png`) covers:

- **A.** Prompt-based input strategy (character-level tokenization)
- **B.** Model architecture (GPT-2 small, causal self-attention)
- **C.** Training pipeline
- **D.** Inference pipeline with LES evaluation

> Note: the diagram lists "Flash Attention". This path is taken automatically
> when running on PyTorch ≥ 2.0; older versions fall back to the manual
> scaled-dot-product implementation. Numerical results are equivalent.

See `assets/tri_model_consensus.svg` for the tri-model consensus framework with [ppiDCE](https://github.com/kouroshSA/ppiDCE) and [ppiBTEP](https://github.com/kouroshSA/ppiBTEP).

## Citation

If you use this software, please cite:

```
Daakour, S. et al. (2026).
```

This software is built on nanoGPT:

```
Karpathy, A. (2022). nanoGPT. https://github.com/karpathy/nanoGPT
```

## License

This project is licensed under the MIT License.
See [LICENSE](LICENSE) for details. The original nanoGPT framework is by Andrej Karpathy (MIT License, 2022). Modifications and additions for protein-protein interaction prediction are by Kourosh Salehi-Ashtiani (MIT License, 2026).