adibvafa committed on
Commit 7659bd0 · verified · 1 Parent(s): 39f24fa

Upload README.md with huggingface_hub

Files changed (1): README.md (+83 −0)
---
license: apache-2.0
language:
- en
tags:
- protein
- gene-ontology
- function-prediction
- biology
- bioinformatics
pipeline_tag: text-generation
datasets:
- wanglab/cafa5
---

+
16
+ # GO-GPT: Gene Ontology Prediction from Protein Sequences
17
+
18
+ GO-GPT is a decoder-only transformer model for predicting Gene Ontology (GO) terms from protein sequences. It combines ESM2 protein language model embeddings with an autoregressive decoder to generate GO term annotations across all three ontology aspects: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC).
19
+
20
+ ## Quick Start
21
+
22
+ 1. Clone the repository:
23
+ ```bash
24
+ git clone https://github.com/YOUR_ORG/gogpt
25
+ cd gogpt
26
+ ```
27
+
28
+ 2. Run the inference notebook or use Python directly:
29
+ ```python
30
+ import sys
31
+ sys.path.insert(0, "src")
32
+
33
+ from gogpt import GOGPTPredictor
34
+
35
+ # Load from HuggingFace (downloads ~4GB on first run)
36
+ predictor = GOGPTPredictor.from_pretrained("armansa1/gogpt-dev")
37
+
38
+ # Predict GO terms
39
+ predictions = predictor.predict(
40
+ sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAAALEHHHHHH",
41
+ organism="Homo sapiens"
42
+ )
43
+
44
+ print(predictions)
45
+ # {'MF': ['GO:0003674', 'GO:0005488', ...],
46
+ # 'BP': ['GO:0008150', 'GO:0008152', ...],
47
+ # 'CC': ['GO:0005575', 'GO:0110165', ...]}
48
+ ```
49
+
50
+ ## Model Architecture
51
+
52
+ | Component | Description |
53
+ |-----------|-------------|
54
+ | Protein Encoder | ESM2-3B (`facebook/esm2_t36_3B_UR50D`) |
55
+ | Decoder | 12-layer GPT with prefix causal attention |
56
+ | Embedding Dim | 900 |
57
+ | Attention Heads | 12 |
58
+ | Total Parameters | ~3.2B (3B ESM2 + 200M decoder) |
59
+
60
+ ## Supported Organisms
61
+
62
+ GO-GPT supports organism-conditioned prediction for 200 organisms plus an `<UNKNOWN>` category (201 total). See `organism_list.txt` for the full list.
63
+
64
+ Common organisms include:
65
+ - Homo sapiens
66
+ - Mus musculus
67
+ - Escherichia coli (various strains)
68
+ - Saccharomyces cerevisiae
69
+ - Arabidopsis thaliana
70
+ - Drosophila melanogaster
71
+
72
+ For organisms not in the training set, predictions will use the `<UNKNOWN>` embedding.
73
+
74
+ ## Files in This Repository
75
+
76
+ | File | Description |
77
+ |------|-------------|
78
+ | `model.ckpt` | Model weights (PyTorch Lightning checkpoint) |
79
+ | `config.yaml` | Model architecture configuration |
80
+ | `tokenizer_info.json` | Token vocabulary metadata |
81
+ | `go_tokenizer.json` | GO term to token ID mapping |
82
+ | `organism_mapper.json` | Organism name to ID mapping |
83
+ | `organism_list.txt` | Human-readable list of 201 supported organisms |
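
Because `go_tokenizer.json` maps GO terms to token IDs, decoding the model's output means inverting that mapping. A sketch under the assumption that the file is a flat `{"GO:...": token_id}` dictionary (the real schema may differ; the small dict below is a stand-in, not the actual vocabulary):

```python
# Hypothetical excerpt of go_tokenizer.json, loaded as a dict.
go_tokenizer = {"GO:0003674": 4, "GO:0005488": 5, "GO:0008150": 6}

# Invert the mapping once so generated token IDs decode back to GO terms.
id_to_term = {token_id: term for term, token_id in go_tokenizer.items()}

def decode(token_ids: list[int]) -> list[str]:
    """Map generated token IDs back to GO terms, skipping IDs not in the vocab
    (e.g. special tokens)."""
    return [id_to_term[i] for i in token_ids if i in id_to_term]

decode([4, 6, 99])  # ID 99 has no GO term and is dropped
```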