GreenGenomicsLab committed on
Commit 9fe28a9 · 1 Parent(s): 01e3e14

Update model card: clarify explainability scope and ppiGPT authorship

Files changed (1)
  1. README.md +23 -47
README.md CHANGED
@@ -2,73 +2,49 @@
  language: en
  license: mit
  tags:
- - protein-protein-interaction
- - interpretability
  - explainability
- - prochlorococcus
- - cyanobacteria
  - deeplift
  - captum
  - pytorch
  library_name: pytorch
  pipeline_tag: other
  ---

- # ppiGPT Prochlorococcus MED4 Protein-Protein Interaction Model

- ## Model Description

- ppiGPT is a GPT-2 architecture trained from scratch to predict protein-protein interactions in *Prochlorococcus* MED4. The model takes two protein sequences as input and predicts whether they interact.

- | Parameter | Value |
- |-----------|-------|
- | Architecture | GPT-2 (decoder-only transformer) |
- | Layers | 12 |
- | Attention heads | 12 |
- | Embedding dimension | 768 |
- | Total parameters | ~84.98M |
- | Vocabulary | 29 tokens (20 amino acids + 9 special tokens) |
- | Max sequence length | 1024 tokens |
- | Training epochs | 3 |

- ## Files

- | File | Description | Size |
- |------|-------------|------|
- | `model/out_3e/ckpt.pt` | Trained model checkpoint | 1.0 GB |
- | `model/data/meta.pkl` | Character-level tokenizer metadata | 343 B |
- | `results/deeplift_motif_analysis_results.pkl` | DeepLift attributions for all 2,168 protein pairs | 78 MB |
- | `results/integrated_gradients_random_ppi_per_token_attributions.csv` | Per-token Integrated Gradients attributions | 174 MB |

- ## Training Data

- The model was trained on *Prochlorococcus* MED4 protein sequences with interaction labels derived from yeast two-hybrid (Y2H) experiments:

- - **Positive Reference Set (PRS)**: 1,084 experimentally validated Y2H interactions
- - **Random Reference Set (RRS)**: 1,084 randomly paired MED4 proteins

- ## Input Format
-
- Protein pairs are encoded as character-level sequences:
-
- ```
- <ps1>,PROTEIN_SEQUENCE_1,<ps2>,PROTEIN_SEQUENCE_2,<
- ```
-
- The model predicts interaction probability via softmax over output tokens at positions 25-26.
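
To make the removed "Input Format" section concrete, here is a minimal sketch of assembling the input string for one protein pair. The function name `format_pair`, the residue check, and the assumption that every character (including those of `<ps1>` and `<ps2>`) counts as one token toward the 1024-token limit are illustrative and do not come from the model card.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues
MAX_TOKENS = 1024  # maximum sequence length stated in the old model card


def format_pair(seq1: str, seq2: str) -> str:
    """Build the '<ps1>,SEQ1,<ps2>,SEQ2,<' string for one protein pair."""
    for seq in (seq1, seq2):
        if not seq:
            raise ValueError("empty protein sequence")
        unknown = set(seq) - AMINO_ACIDS
        if unknown:
            raise ValueError(f"unsupported residues: {sorted(unknown)}")
    prompt = f"<ps1>,{seq1},<ps2>,{seq2},<"
    # Assumption: character-level tokenization, so one character is one token.
    if len(prompt) > MAX_TOKENS:
        raise ValueError(f"prompt has {len(prompt)} characters, limit is {MAX_TOKENS}")
    return prompt
```

Long protein pairs may exceed the limit under this assumption, so a real pipeline would need a documented truncation or filtering rule, which the model card does not describe.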
-
- ## Intended Use
-
- This model and its associated interpretability outputs accompany a manuscript on *Prochlorococcus* MED4 interactome analysis. The repository provides tools for understanding what sequence features drive the model's interaction predictions, including DeepLift attribution, Integrated Gradients, attention analysis, and alanine substitution scanning.

  ## Code Repository

- Analysis scripts and source data: [github.com/olympus-terminal/Prochlorococcus_interactome_model_explainability](https://github.com/olympus-terminal/Prochlorococcus_interactome_model_explainability)
-
- ## Framework
-
- - PyTorch >= 2.0.0
- - Captum (for DeepLift and Integrated Gradients)

  ## License

  language: en
  license: mit
  tags:
  - explainability
+ - interpretability
+ - protein-protein-interaction
  - deeplift
+ - integrated-gradients
  - captum
+ - prochlorococcus
+ - cyanobacteria
  - pytorch
  library_name: pytorch
  pipeline_tag: other
  ---

+ # Explainability Analysis of ppiGPT Interaction Predictions in *Prochlorococcus* MED4

+ This repository hosts large result files and the model checkpoint used in the explainability analyses described in the accompanying manuscript. Analysis code and source data are in the companion GitHub repository.

+ ## What This Repository Contains

+ ### Explainability Results

+ These files are the outputs of interpretability analyses applied to ppiGPT predictions:

+ | File | Size | Description |
+ |------|------|-------------|
+ | `results/deeplift_motif_analysis_results.pkl` | 78 MB | Captum DeepLift per-residue attribution scores, motif discovery results, and position-wise statistics for all 2,168 protein pairs (1,084 PRS + 1,084 RRS) |
+ | `results/integrated_gradients_random_ppi_per_token_attributions.csv` | 174 MB | Captum Integrated Gradients per-token attribution scores for the 1,084 random reference set pairs |
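
The README does not document the column layout of the 174 MB CSV, so a reasonable first step is to look at its header and a few rows without loading the whole file. The sketch below uses only the standard library; the helper name `peek_csv` is hypothetical, and no column names are assumed.

```python
import csv
from itertools import islice


def peek_csv(lines, n_rows=5):
    """Return (header, first n_rows data rows) from an iterable of CSV lines."""
    reader = csv.reader(lines)
    header = next(reader)                # first line is assumed to be a header row
    rows = list(islice(reader, n_rows))  # stops early, so the rest is never read
    return header, rows


# Usage with the file from the table above (path relative to the repository root):
#   with open("results/integrated_gradients_random_ppi_per_token_attributions.csv",
#             newline="") as handle:
#       header, rows = peek_csv(handle)
```

Because `csv.reader` accepts any iterable of strings, the same helper works on an open file handle or on a list of lines.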
 
 
+ ### ppiGPT Model Checkpoint (for reproducibility)

+ The ppiGPT model was created by **Kourosh Salehi-Ashtiani** and is included here solely to enable reproduction of the explainability analyses. It is not a product of the explainability work.

+ | File | Size | Description |
+ |------|------|-------------|
+ | `model/out_3e/ckpt.pt` | 1.0 GB | ppiGPT model checkpoint (3 epochs) |
+ | `model/data/meta.pkl` | 343 B | Character-level tokenizer metadata (29-token vocabulary) |

+ **ppiGPT architecture:** GPT-2 decoder-only transformer, 12 layers, 12 attention heads, 768 embedding dimensions, ~84.98M parameters. Trained from scratch on *Prochlorococcus* MED4 protein sequences with a 29-token character-level vocabulary (20 amino acids + 9 special tokens).
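
The ~84.98M parameter figure can be reproduced by hand under a few assumptions that the README does not state: no bias terms in the linear layers or LayerNorms, a 4x MLP expansion, input and output token embeddings tied, a 1024-position context (from the earlier model card), and position embeddings left out of the count. The sketch below is a back-of-the-envelope check under those assumptions, not a description of the actual training code.

```python
def count_params(n_layer=12, n_embd=768, vocab_size=29, block_size=1024,
                 include_position_embeddings=False):
    """Count weights of a bias-free GPT-2-style decoder (assumed layout)."""
    token_emb = vocab_size * n_embd   # assumed shared with the output head
    pos_emb = block_size * n_embd
    attn = n_embd * 3 * n_embd + n_embd * n_embd  # QKV projection + output projection
    mlp = 2 * n_embd * 4 * n_embd     # up-projection and down-projection
    norms = 2 * n_embd                # two LayerNorm weight vectors per block
    total = n_layer * (attn + mlp + norms) + n_embd + token_emb  # + final LayerNorm
    if include_position_embeddings:
        total += pos_emb
    return total


print(count_params())                                  # 84976128, i.e. ~84.98M
print(count_params(include_position_embeddings=True))  # 85762560
```

The number of attention heads does not enter the count, because the heads split the same 768-dimensional projections rather than adding weights.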
 
  ## Code Repository

+ Analysis scripts, source datasets, publication figures, and documentation:
+ https://github.com/olympus-terminal/Prochlorococcus_interactome_model_explainability

  ## License