GreenGenomicsLab/Prochlorococcus_interactome_model_explainability

@@ -2,73 +2,49 @@
 language: en
 license: mit
 tags:
-  - protein-protein-interaction
-  - interpretability
   - explainability
-  - prochlorococcus
-  - cyanobacteria
   - deeplift
   - captum
   - pytorch
 library_name: pytorch
 pipeline_tag: other
 ---
-# ppiGPT — Prochlorococcus MED4 Protein-Protein Interaction Model
-## Model Description
-ppiGPT is a GPT-2 architecture trained from scratch to predict protein-protein interactions in *Prochlorococcus* MED4. The model takes two protein sequences as input and predicts whether they interact.
-| Parameter | Value |
-|-----------|-------|
-| Architecture | GPT-2 (decoder-only transformer) |
-| Layers | 12 |
-| Attention heads | 12 |
-| Embedding dimension | 768 |
-| Total parameters | ~84.98M |
-| Vocabulary | 29 tokens (20 amino acids + 9 special tokens) |
-| Max sequence length | 1024 tokens |
-| Training epochs | 3 |
-## Files
-| File | Description | Size |
-|------|-------------|------|
-| `model/out_3e/ckpt.pt` | Trained model checkpoint | 1.0 GB |
-| `model/data/meta.pkl` | Character-level tokenizer metadata | 343 B |
-| `results/deeplift_motif_analysis_results.pkl` | DeepLift attributions for all 2,168 protein pairs | 78 MB |
-| `results/integrated_gradients_random_ppi_per_token_attributions.csv` | Per-token Integrated Gradients attributions | 174 MB |
-## Training Data
-The model was trained on *Prochlorococcus* MED4 protein sequences with interaction labels derived from yeast two-hybrid (Y2H) experiments:
-- **Positive Reference Set (PRS)**: 1,084 experimentally validated Y2H interactions
-- **Random Reference Set (RRS)**: 1,084 randomly paired MED4 proteins
-## Input Format
-Protein pairs are encoded as character-level sequences:
-```
-<ps1>,PROTEIN_SEQUENCE_1,<ps2>,PROTEIN_SEQUENCE_2,<
-```
-The model predicts interaction probability via softmax over output tokens at positions 25-26.
-## Intended Use
-This model and its associated interpretability outputs accompany a manuscript on *Prochlorococcus* MED4 interactome analysis. The repository provides tools for understanding what sequence features drive the model's interaction predictions, including DeepLift attribution, Integrated Gradients, attention analysis, and alanine substitution scanning.
 ## Code Repository
-Analysis scripts and source data: [github.com/olympus-terminal/Prochlorococcus_interactome_model_explainability](https://github.com/olympus-terminal/Prochlorococcus_interactome_model_explainability)
-## Framework
-- PyTorch >= 2.0.0
-- Captum (for DeepLift and Integrated Gradients)
 ## License

 language: en
 license: mit
 tags:
   - explainability
+  - interpretability
+  - protein-protein-interaction
   - deeplift
+  - integrated-gradients
   - captum
+  - prochlorococcus
+  - cyanobacteria
   - pytorch
 library_name: pytorch
 pipeline_tag: other
 ---
+# Explainability Analysis of ppiGPT Interaction Predictions in *Prochlorococcus* MED4
+This repository hosts large result files and the model checkpoint used in the explainability analyses described in the accompanying manuscript. Analysis code and source data are in the companion GitHub repository.
+## What This Repository Contains
+### Explainability Results
+These files are the outputs of interpretability analyses applied to ppiGPT predictions:
+| File | Size | Description |
+|------|------|-------------|
+| `results/deeplift_motif_analysis_results.pkl` | 78 MB | Captum DeepLift per-residue attribution scores, motif discovery results, and position-wise statistics for all 2,168 protein pairs: 1,084 Positive Reference Set (PRS) and 1,084 Random Reference Set (RRS) pairs |
+| `results/integrated_gradients_random_ppi_per_token_attributions.csv` | 174 MB | Captum Integrated Gradients per-token attribution scores for the 1,084 RRS pairs |
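A minimal sketch of how the two result files might be inspected using only the Python standard library. The CSV column layout and the object stored in the pickle are not documented here, so the helpers below (`peek_csv` and `describe_pickle`, both hypothetical names) only report what they find instead of assuming a schema. Unpickling can execute arbitrary code, so load the `.pkl` only from a source you trust; it may also require the libraries that defined the pickled objects to be installed.

```python
import csv
import pickle
from itertools import islice


def peek_csv(path, n_rows=5):
    """Return (header, first n_rows data rows) without reading the whole file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # empty list if the file has no rows
        rows = list(islice(reader, n_rows))
    return header, rows


def describe_pickle(path):
    """Load a pickle and summarize its top-level structure only."""
    with open(path, "rb") as handle:
        obj = pickle.load(handle)
    if isinstance(obj, dict):
        return {"type": "dict", "keys": sorted(map(str, obj.keys()))}
    if isinstance(obj, (list, tuple)):
        return {"type": type(obj).__name__, "length": len(obj)}
    return {"type": type(obj).__name__}


# Example calls, assuming the files have been downloaded to these relative paths:
# header, rows = peek_csv(
#     "results/integrated_gradients_random_ppi_per_token_attributions.csv")
# summary = describe_pickle("results/deeplift_motif_analysis_results.pkl")
```

Reading only the first few rows keeps memory use small for the 174 MB CSV; a full analysis would more likely stream the file or load it with a dataframe library.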
+### ppiGPT Model Checkpoint (for reproducibility)
+The ppiGPT model was created by **Kourosh Salehi-Ashtiani** and is included here solely to enable reproduction of the explainability analyses. It is not a product of the explainability work.
+| File | Size | Description |
+|------|------|-------------|
+| `model/out_3e/ckpt.pt` | 1.0 GB | ppiGPT model checkpoint (3 epochs) |
+| `model/data/meta.pkl` | 343 B | Character-level tokenizer metadata (29-token vocabulary) |
+**ppiGPT architecture:** GPT-2 decoder-only transformer with 12 layers, 12 attention heads, an embedding dimension of 768, and ~84.98M parameters. Trained from scratch on *Prochlorococcus* MED4 protein sequences with a 29-token character-level vocabulary (20 amino acids + 9 special tokens).
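As a sanity check on the quoted size, the sketch below counts the weights of a GPT-2 style decoder from the stated hyperparameters. Several choices are assumptions, not facts taken from the checkpoint: bias-free linear and LayerNorm layers, a token embedding shared with the output head, and position embeddings left out of the reported total. The 1,024-token context length comes from the previous version of this card and only matters if position embeddings are included. The function name `count_gpt2_params` is hypothetical.

```python
def count_gpt2_params(n_layer, n_embd, vocab_size, block_size,
                      bias=False, include_position_embeddings=False):
    """Count weights of a GPT-2 style decoder with tied input/output embeddings."""
    b = 1 if bias else 0
    # Attention: fused query/key/value projection plus the output projection.
    attn = n_embd * 3 * n_embd + b * 3 * n_embd
    attn += n_embd * n_embd + b * n_embd
    # MLP: expand to 4x width, then project back to the embedding width.
    mlp = n_embd * 4 * n_embd + b * 4 * n_embd
    mlp += 4 * n_embd * n_embd + b * n_embd
    # Two LayerNorms per block (scale only, plus shift when bias is enabled).
    norms = 2 * n_embd * (1 + b)
    total = n_layer * (attn + mlp + norms)
    total += n_embd * (1 + b)        # final LayerNorm
    total += vocab_size * n_embd     # token embedding, shared with the output head
    if include_position_embeddings:
        total += block_size * n_embd
    return total


n = count_gpt2_params(n_layer=12, n_embd=768, vocab_size=29, block_size=1024)
print(f"{n:,} parameters (~{n / 1e6:.2f}M)")  # 84,976,128 parameters (~84.98M)
```

Under these assumptions the count rounds to the ~84.98M figure above. That agreement is consistent with the assumed configuration but does not confirm it; the checkpoint itself is the authority.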
 ## Code Repository
+Analysis scripts, source datasets, publication figures, and documentation:
+https://github.com/olympus-terminal/Prochlorococcus_interactome_model_explainability
 ## License