Commit ·
9fe28a9
1
Parent(s): 01e3e14
Update model card: clarify explainability scope and ppiGPT authorship
README.md
CHANGED
|
@@ -2,73 +2,49 @@
|
|
| 2 |
language: en
|
| 3 |
license: mit
|
| 4 |
tags:
|
| 5 |
-
- protein-protein-interaction
|
| 6 |
-
- interpretability
|
| 7 |
- explainability
|
| 8 |
-
-
|
| 9 |
-
-
|
| 10 |
- deeplift
|
| 11 |
- captum
|
| 12 |
- pytorch
|
| 13 |
library_name: pytorch
|
| 14 |
pipeline_tag: other
|
| 15 |
---
|
| 16 |
|
| 17 |
-
#
|
| 18 |
|
| 19 |
-
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 24 |
-
|-----------|-------|
|
| 25 |
-
| Architecture | GPT-2 (decoder-only transformer) |
|
| 26 |
-
| Layers | 12 |
|
| 27 |
-
| Attention heads | 12 |
|
| 28 |
-
| Embedding dimension | 768 |
|
| 29 |
-
| Total parameters | ~84.98M |
|
| 30 |
-
| Vocabulary | 29 tokens (20 amino acids + 9 special tokens) |
|
| 31 |
-
| Max sequence length | 1024 tokens |
|
| 32 |
-
| Training epochs | 3 |
|
| 33 |
|
| 34 |
-
|
| 35 |
|
| 36 |
-
| File |
|
| 37 |
-
|------|-------------
|
| 38 |
-
| `
|
| 39 |
-
| `
|
| 40 |
-
| `results/deeplift_motif_analysis_results.pkl` | DeepLift attributions for all 2,168 protein pairs | 78 MB |
|
| 41 |
-
| `results/integrated_gradients_random_ppi_per_token_attributions.csv` | Per-token Integrated Gradients attributions | 174 MB |
|
| 42 |
|
| 43 |
-
##
|
| 44 |
|
| 45 |
-
The model was
|
| 46 |
|
| 47 |
-
|
| 48 |
-
-
|
| 49 |
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
Protein pairs are encoded as character-level sequences:
|
| 53 |
-
|
| 54 |
-
```
|
| 55 |
-
<ps1>,PROTEIN_SEQUENCE_1,<ps2>,PROTEIN_SEQUENCE_2,<
|
| 56 |
-
```
|
| 57 |
-
|
| 58 |
-
The model predicts interaction probability via softmax over output tokens at positions 25-26.
|
| 59 |
-
|
| 60 |
-
## Intended Use
|
| 61 |
-
|
| 62 |
-
This model and its associated interpretability outputs accompany a manuscript on *Prochlorococcus* MED4 interactome analysis. The repository provides tools for understanding what sequence features drive the model's interaction predictions, including DeepLift attribution, Integrated Gradients, attention analysis, and alanine substitution scanning.
|
| 63 |
|
| 64 |
## Code Repository
|
| 65 |
|
| 66 |
-
Analysis scripts
|
| 67 |
-
|
| 68 |
-
## Framework
|
| 69 |
-
|
| 70 |
-
- PyTorch >= 2.0.0
|
| 71 |
-
- Captum (for DeepLift and Integrated Gradients)
|
| 72 |
|
| 73 |
## License
|
| 74 |
|
| 2 |
language: en
|
| 3 |
license: mit
|
| 4 |
tags:
|
| 5 |
- explainability
|
| 6 |
+
- interpretability
|
| 7 |
+
- protein-protein-interaction
|
| 8 |
- deeplift
|
| 9 |
+
- integrated-gradients
|
| 10 |
- captum
|
| 11 |
+
- prochlorococcus
|
| 12 |
+
- cyanobacteria
|
| 13 |
- pytorch
|
| 14 |
library_name: pytorch
|
| 15 |
pipeline_tag: other
|
| 16 |
---
|
| 17 |
|
| 18 |
+
# Explainability Analysis of ppiGPT Interaction Predictions in *Prochlorococcus* MED4
|
| 19 |
|
| 20 |
+
This repository hosts large result files and the model checkpoint used in the explainability analyses described in the accompanying manuscript. Analysis code and source data are in the companion GitHub repository.
|
| 21 |
|
| 22 |
+
## What This Repository Contains
|
| 23 |
|
| 24 |
+
### Explainability Results
|
| 25 |
|
| 26 |
+
These files are the outputs of interpretability analyses applied to ppiGPT predictions:
|
| 27 |
|
| 28 |
+
| File | Size | Description |
|
| 29 |
+
|------|------|-------------|
|
| 30 |
+
| `results/deeplift_motif_analysis_results.pkl` | 78 MB | Captum DeepLift per-residue attribution scores, motif discovery results, and position-wise statistics for all 2,168 protein pairs (1,084 PRS + 1,084 RRS) |
|
| 31 |
+
| `results/integrated_gradients_random_ppi_per_token_attributions.csv` | 174 MB | Captum Integrated Gradients per-token attribution scores for the 1,084 random reference set pairs |
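Since the Integrated Gradients CSV is 174 MB, it may be more convenient to stream it than to load it whole. The following is a minimal sketch using only the Python standard library. The column layout of the file is not documented in this model card, so the helper (`peek_csv`, a hypothetical name) only reports the header and a row count instead of assuming any column names.

```python
import csv


def peek_csv(path, max_rows=None):
    """Return (header, n_rows) for a CSV file, reading one row at a time.

    `max_rows` optionally stops the scan early, which is useful for a quick
    look at a very large file.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no rows
        n_rows = 0
        for _ in reader:
            n_rows += 1
            if max_rows is not None and n_rows >= max_rows:
                break
    return header, n_rows
```

For example, `peek_csv("results/integrated_gradients_random_ppi_per_token_attributions.csv", max_rows=1000)` would return the header and scan at most 1,000 data rows.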
|
| 32 |
|
| 33 |
+
### ppiGPT Model Checkpoint (for reproducibility)
|
| 34 |
|
| 35 |
+
The ppiGPT model was created by **Kourosh Salehi-Ashtiani** and is included here solely to enable reproduction of the explainability analyses. It is not a product of the explainability work.
|
| 36 |
|
| 37 |
+
| File | Size | Description |
|
| 38 |
+
|------|------|-------------|
|
| 39 |
+
| `model/out_3e/ckpt.pt` | 1.0 GB | ppiGPT model checkpoint (3 epochs) |
|
| 40 |
+
| `model/data/meta.pkl` | 343 B | Character-level tokenizer metadata (29-token vocabulary) |
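The internal structure of `meta.pkl` is not described here, so a cautious first step is to inspect it before relying on any particular key. The sketch below is one way to do that with the standard library; `describe_pickle` is a hypothetical helper name, and nothing about the file's contents is assumed beyond it being a pickle.

```python
import pickle


def describe_pickle(path):
    """Load a pickle file and summarize its top-level structure.

    Returns a mapping of top-level key -> type name for dict payloads,
    or a single "<root>" entry for any other object.
    """
    # pickle.load can execute arbitrary code, so only open files you trust.
    with open(path, "rb") as handle:
        obj = pickle.load(handle)
    if isinstance(obj, dict):
        return {str(key): type(value).__name__ for key, value in obj.items()}
    return {"<root>": type(obj).__name__}
```

Calling `describe_pickle("model/data/meta.pkl")` would list whatever top-level entries the tokenizer metadata actually holds. The same helper works for the larger DeepLift results pickle, although that file takes longer to load.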
|
| 41 |
|
| 42 |
+
**ppiGPT architecture:** GPT-2 decoder-only transformer with 12 layers, 12 attention heads, 768-dimensional embeddings, and ~84.98M parameters. Trained from scratch on *Prochlorococcus* MED4 protein sequences with a 29-token character-level vocabulary (20 amino acids + 9 special tokens).
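As a sanity check on the stated parameter count, the figure can be approximately reproduced from the layer count, embedding size, and vocabulary size. The sketch below rests on assumptions that this model card does not confirm: linear and LayerNorm layers carry no bias terms, the token embedding is shared with the output head, the MLP hidden width is four times the embedding size, and position embeddings are excluded from the count. The function name `estimate_gpt2_params` is hypothetical.

```python
def estimate_gpt2_params(n_layer, n_embd, vocab_size):
    """Rough GPT-2-style parameter count under the assumptions stated above."""
    attn = 4 * n_embd * n_embd   # fused Q/K/V projection (3*d*d) plus output projection (d*d)
    mlp = 8 * n_embd * n_embd    # d -> 4d expansion and 4d -> d contraction
    norms = 2 * n_embd           # two LayerNorm weight vectors per block (no bias assumed)
    per_block = attn + mlp + norms
    final_norm = n_embd          # final LayerNorm weight
    token_embedding = vocab_size * n_embd  # assumed tied with the output head
    return n_layer * per_block + final_norm + token_embedding


print(estimate_gpt2_params(n_layer=12, n_embd=768, vocab_size=29))  # 84976128
```

Under these assumptions the estimate is 84,976,128, which is consistent with the stated ~84.98M. A different bias or weight-tying configuration would shift the total slightly.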
|
| 43 |
|
| 44 |
## Code Repository
|
| 45 |
|
| 46 |
+
Analysis scripts, source datasets, publication figures, and documentation:
|
| 47 |
+
https://github.com/olympus-terminal/Prochlorococcus_interactome_model_explainability
|
| 48 |
|
| 49 |
## License
|
| 50 |
|