GreenGenomicsLab committed on
Commit 01e3e14 · 1 Parent(s): ef7aa2c

Add model card

Files changed (1)
  1. README.md +75 -0
README.md ADDED
@@ -0,0 +1,75 @@
+ ---
+ language: en
+ license: mit
+ tags:
+ - protein-protein-interaction
+ - interpretability
+ - explainability
+ - prochlorococcus
+ - cyanobacteria
+ - deeplift
+ - captum
+ - pytorch
+ library_name: pytorch
+ pipeline_tag: other
+ ---
+
+ # ppiGPT — Prochlorococcus MED4 Protein-Protein Interaction Model
+
+ ## Model Description
+
+ ppiGPT is a GPT-2-style decoder-only transformer trained from scratch to predict protein-protein interactions in *Prochlorococcus* MED4. It takes two protein sequences as input and predicts whether they interact.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Architecture | GPT-2 (decoder-only transformer) |
+ | Layers | 12 |
+ | Attention heads | 12 |
+ | Embedding dimension | 768 |
+ | Total parameters | ~84.98M |
+ | Vocabulary | 29 tokens (20 amino acids + 9 special tokens) |
+ | Max sequence length | 1024 tokens |
+ | Training epochs | 3 |
+
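The ~84.98M figure in the table can be reproduced from the listed dimensions under a few assumptions that the model card does not state: a nanoGPT-style implementation with bias-free linear and LayerNorm layers, an output head tied to the token embedding, and learned position embeddings left out of the reported count. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def gpt2_param_count(n_layer: int, n_embd: int, vocab_size: int) -> int:
    """Count parameters of a bias-free GPT-2 stack, excluding position embeddings.

    Assumptions (not confirmed by the model card): no bias terms, LayerNorm
    weights only, output head tied to the token embedding.
    """
    attn = 3 * n_embd * n_embd + n_embd * n_embd  # fused QKV + output projection
    mlp = 2 * (4 * n_embd * n_embd)               # up- and down-projection, 4x hidden width
    layer_norms = 2 * n_embd                      # two LayerNorm weight vectors per block
    per_block = attn + mlp + layer_norms
    token_embedding = vocab_size * n_embd         # shared with the output head
    final_layer_norm = n_embd
    return n_layer * per_block + token_embedding + final_layer_norm


total = gpt2_param_count(n_layer=12, n_embd=768, vocab_size=29)
print(f"{total:,} parameters (~{total / 1e6:.2f}M)")  # 84,976,128 parameters (~84.98M)
```

The number of attention heads does not enter the count, because the heads only partition the 768-dimensional projections. With a 29-token vocabulary the embedding table is negligible, so almost all parameters sit in the 12 transformer blocks.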
+ ## Files
+
+ | File | Description | Size |
+ |------|-------------|------|
+ | `model/out_3e/ckpt.pt` | Trained model checkpoint | 1.0 GB |
+ | `model/data/meta.pkl` | Character-level tokenizer metadata | 343 B |
+ | `results/deeplift_motif_analysis_results.pkl` | DeepLift attributions for all 2,168 protein pairs | 78 MB |
+ | `results/integrated_gradients_random_ppi_per_token_attributions.csv` | Per-token Integrated Gradients attributions | 174 MB |
+
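One way the tokenizer metadata might be consumed is sketched below. The layout of `model/data/meta.pkl` is not documented in this card. The `stoi`/`itos` keys are an assumption borrowed from nanoGPT-style character-level datasets, and `load_char_tokenizer` is a hypothetical helper, not part of the repository. The demo uses a toy stand-in file rather than the real 29-token vocabulary.

```python
import pickle
import tempfile
from pathlib import Path


def load_char_tokenizer(meta_path):
    """Build encode/decode helpers from a character-level meta.pkl.

    Assumes a dict with "stoi" (char -> id) and "itos" (id -> char) entries;
    the real file may differ. Only unpickle files from a trusted source.
    """
    with Path(meta_path).open("rb") as fh:
        meta = pickle.load(fh)
    stoi, itos = meta["stoi"], meta["itos"]

    def encode(text):
        return [stoi[ch] for ch in text]

    def decode(ids):
        return "".join(itos[i] for i in ids)

    return encode, decode, len(stoi)


# Toy round-trip with a stand-in metadata file.
chars = sorted(set("<ps1>,MKT"))
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "meta.pkl"
    toy_path.write_bytes(pickle.dumps({
        "stoi": {c: i for i, c in enumerate(chars)},
        "itos": {i: c for i, c in enumerate(chars)},
    }))
    encode, decode, vocab_size = load_char_tokenizer(toy_path)

ids = encode("<ps1>,MKT")
print(vocab_size, decode(ids))
```

If the real file uses different keys, only the two dictionary lookups in `load_char_tokenizer` would need to change.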
+ ## Training Data
+
+ The model was trained on *Prochlorococcus* MED4 protein sequences with interaction labels derived from yeast two-hybrid (Y2H) experiments, giving a balanced set of 2,168 protein pairs:
+
+ - **Positive Reference Set (PRS)**: 1,084 experimentally validated Y2H interactions
+ - **Random Reference Set (RRS)**: 1,084 randomly paired MED4 proteins
+
+ ## Input Format
+
+ Protein pairs are encoded as character-level sequences:
+
+ ```
+ <ps1>,PROTEIN_SEQUENCE_1,<ps2>,PROTEIN_SEQUENCE_2,<
+ ```
+
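The template above can be filled in programmatically. The sketch below uses a hypothetical helper, `format_pair`, and assumes that the 20 amino acids in the vocabulary are the standard one-letter residues and that one character maps to one token. The trailing `<` is copied verbatim from the template, and whatever follows it is not shown in this card.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # assumed: the 20 standard residues
MAX_TOKENS = 1024                          # max sequence length from the model table


def format_pair(seq1: str, seq2: str) -> str:
    """Encode two protein sequences in the card's input template."""
    for name, seq in (("seq1", seq1), ("seq2", seq2)):
        unknown = set(seq) - AMINO_ACIDS
        if not seq or unknown:
            raise ValueError(f"{name} is empty or has non-standard residues: {sorted(unknown)}")
    # Trailing "<" reproduced exactly as it appears in the model card.
    prompt = f"<ps1>,{seq1},<ps2>,{seq2},<"
    # Assumes character-level tokenization, so len(prompt) is the token count.
    if len(prompt) > MAX_TOKENS:
        raise ValueError(f"prompt has {len(prompt)} characters, limit is {MAX_TOKENS}")
    return prompt


print(format_pair("MKT", "AVL"))  # <ps1>,MKT,<ps2>,AVL,<
```

Rejecting unknown residues early avoids a lookup failure in the tokenizer later on.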
+ The model predicts interaction probability via softmax over output tokens at positions 25-26.
+
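The sentence above does not say whether 25 and 26 are vocabulary indices or sequence positions. The sketch below takes one possible reading, that they are the vocabulary indices of two label tokens, and renormalizes the logits at just those two entries. This is an assumption, and which of the two indices stands for "interacting" is not stated in the card, so the function returns both probabilities unlabeled.

```python
import math


def two_way_softmax(logits, idx_a=25, idx_b=26):
    """Softmax restricted to two vocabulary entries.

    Assumes `logits` is the model's output vector over the 29-token
    vocabulary at the prediction step, and that indices 25 and 26 are the
    two label tokens (an interpretation, not confirmed by the card).
    """
    a, b = logits[idx_a], logits[idx_b]
    m = max(a, b)                       # subtract the max for numerical stability
    exp_a, exp_b = math.exp(a - m), math.exp(b - m)
    z = exp_a + exp_b
    return exp_a / z, exp_b / z


# Toy logits: equal scores at both label indices give an even split.
toy_logits = [0.0] * 29
toy_logits[25] = 1.0
toy_logits[26] = 1.0
print(two_way_softmax(toy_logits))  # (0.5, 0.5)
```

Restricting the softmax to two entries ignores probability mass on all other tokens, which is a common choice when a generative model is used as a binary classifier.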
+ ## Intended Use
+
+ This model and its associated interpretability outputs accompany a manuscript on *Prochlorococcus* MED4 interactome analysis. The repository provides tools for understanding what sequence features drive the model's interaction predictions, including DeepLift attribution, Integrated Gradients, attention analysis, and alanine substitution scanning.
+
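Of the techniques listed above, alanine substitution scanning is the simplest to illustrate: each residue is replaced by alanine in turn, and the change in the model's score is recorded. The sketch below is a generic version of that idea, not the repository's implementation. `score_fn` is a hypothetical stand-in for any callable that maps a sequence to an interaction score.

```python
def alanine_scan(sequence: str):
    """Yield (position, original_residue, mutant) for every non-alanine residue.

    Positions are 0-based; residues that are already alanine are skipped.
    """
    for i, residue in enumerate(sequence):
        if residue == "A":
            continue
        yield i, residue, sequence[:i] + "A" + sequence[i + 1:]


def scan_effects(sequence: str, score_fn):
    """Score change (mutant minus wild type) for each alanine substitution."""
    baseline = score_fn(sequence)
    return [(i, res, score_fn(mut) - baseline) for i, res, mut in alanine_scan(sequence)]


# Toy scorer that counts lysines, standing in for a model prediction.
print(list(alanine_scan("MKA")))                      # [(0, 'M', 'AKA'), (1, 'K', 'MAA')]
print(scan_effects("MKA", lambda s: s.count("K")))    # [(0, 'M', 0), (1, 'K', -1)]
```

A large negative effect at a position suggests the residue there contributes to a positive prediction. In a pair setting, the scan would be run over one partner while the other is held fixed.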
+ ## Code Repository
+
+ Analysis scripts and source data: [github.com/olympus-terminal/Prochlorococcus_interactome_model_explainability](https://github.com/olympus-terminal/Prochlorococcus_interactome_model_explainability)
+
+ ## Framework
+
+ - PyTorch >= 2.0.0
+ - Captum (for DeepLift and Integrated Gradients)
+
+ ## License
+
+ MIT