adibvafa committed on
Commit 6ddb141 · verified · 1 Parent(s): 4c85dad

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +46 -57
README.md CHANGED
@@ -10,74 +10,63 @@ tags:
10
  - bioinformatics
11
  pipeline_tag: text-generation
12
  datasets:
13
- - wanglab/cafa5
14
  ---
15
 
16
- # GO-GPT: Gene Ontology Prediction from Protein Sequences
17
 
18
- GO-GPT is a decoder-only transformer model for predicting Gene Ontology (GO) terms from protein sequences. It combines ESM2 protein language model embeddings with an autoregressive decoder to generate GO term annotations across all three ontology aspects: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC).
19
-
20
- ## Quick Start
21
-
22
- 1. Clone the repository:
23
- ```bash
24
- git clone https://github.com/bowang-lab/BioReason-Pro
25
- cd BioReason-Pro/gogpt
26
- ```
27
 
28
- 2. Run the inference notebook or use Python directly:
29
- ```python
30
- import sys
31
- sys.path.insert(0, "src")
32
 
33
- from gogpt import GOGPTPredictor
34
 
35
- # Load from HuggingFace (downloads ~4GB on first run)
36
- predictor = GOGPTPredictor.from_pretrained("wanglab/gogpt")
37
-
38
- # Predict GO terms
39
- predictions = predictor.predict(
40
- sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAAALEHHHHHH",
41
- organism="Homo sapiens"
42
- )
43
-
44
- print(predictions)
45
- # {"MF": ["GO:0003674", "GO:0005488", ...],
46
- # "BP": ["GO:0008150", "GO:0008152", ...],
47
- # "CC": ["GO:0005575", "GO:0110165", ...]}
48
- ```
49
 
50
- ## Model Architecture
51
 
52
  | Component | Description |
53
  |-----------|-------------|
54
  | Protein Encoder | ESM2-3B (`facebook/esm2_t36_3B_UR50D`) |
55
  | Decoder | 12-layer GPT with prefix causal attention |
56
- | Embedding Dim | 900 |
57
- | Attention Heads | 12 |
58
  | Total Parameters | ~3.2B (3B ESM2 + 200M decoder) |
59
 
60
- ## Supported Organisms
61
-
62
- GO-GPT supports organism-conditioned prediction for 200 organisms plus an `<UNKNOWN>` category (201 total). See `organism_list.txt` for the full list.
63
-
64
- Common organisms include:
65
- - Homo sapiens
66
- - Mus musculus
67
- - Escherichia coli (various strains)
68
- - Saccharomyces cerevisiae
69
- - Arabidopsis thaliana
70
- - Drosophila melanogaster
71
-
72
- For organisms not in the training set, predictions will use the `<UNKNOWN>` embedding.
73
-
74
- ## Files in This Repository
75
-
76
- | File | Description |
77
- |------|-------------|
78
- | `model.ckpt` | Model weights (PyTorch Lightning checkpoint) |
79
- | `config.yaml` | Model architecture configuration |
80
- | `tokenizer_info.json` | Token vocabulary metadata |
81
- | `go_tokenizer.json` | GO term to token ID mapping |
82
- | `organism_mapper.json` | Organism name to ID mapping |
83
- | `organism_list.txt` | Human-readable list of 201 supported organisms |
 
 
 
 
 
 
 
 
 
10
  - bioinformatics
11
  pipeline_tag: text-generation
12
  datasets:
13
+ - wanglab/gogpt-training-data
14
  ---
15
 
16
+ <h1 align="center">
17
+ 🧬 BioReason-Pro<br>Advancing Protein Function Prediction with<br>Multimodal Biological Reasoning
18
+ </h1>
19
 
20
+ <p align="center">
21
+ <a href="https://www.biorxiv.org/content/10.64898/2026.03.19.712954v1" target="_blank"><img src="https://img.shields.io/badge/bioRxiv-2026.03.19.712954-FF6B6B?style=for-the-badge&logo=arxiv&logoColor=white" alt="bioRxiv"></a>
22
+ <a href="https://github.com/bowang-lab/BioReason-Pro"><img src="https://img.shields.io/badge/GitHub-Code-4A90E2?style=for-the-badge&logo=github&logoColor=white" alt="GitHub"></a>
23
+ <a href="https://bioreason.net"><img src="https://img.shields.io/badge/Website-Online-00B89E?style=for-the-badge&logo=internet-explorer&logoColor=white" alt="Website"></a>
24
+ <a href="https://huggingface.co/collections/wanglab/bioreason-pro"><img src="https://img.shields.io/badge/HuggingFace-Models & Data-FFBF00?style=for-the-badge&logo=huggingface&logoColor=white" alt="HuggingFace"></a>
25
+ </p>
26
 
27
+ <br>
28
 
29
+ ## GO-GPT
30
 
31
+ GO-GPT is a decoder-only transformer model for predicting Gene Ontology (GO) terms from protein sequences. It combines ESM2 protein language model embeddings with an autoregressive decoder to generate GO term annotations across all three ontology aspects: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC).
32
 
33
+ Unlike discriminative methods, GO-GPT treats GO prediction as a sequence generation task, capturing hierarchical and cross-aspect dependencies to achieve state-of-the-art weighted F_max of 0.65–0.70.
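The weighted F_max quoted above is not defined in this diff. As a rough illustration only, assuming the CAFA-style protein-centric definition (the maximum, over decision thresholds, of the harmonic mean of averaged precision and recall, with each GO term carrying a weight such as its information content), it could be computed along these lines. The function name, the input dictionaries, and the weights are hypothetical, and the evaluation code behind the reported 0.65–0.70 may differ.

```python
from typing import Dict, Set


def weighted_fmax(
    scores: Dict[str, Dict[str, float]],  # protein -> {GO term: confidence in [0, 1]}
    truth: Dict[str, Set[str]],           # protein -> set of true GO terms
    weights: Dict[str, float],            # GO term -> weight (assumed, e.g. information content)
    steps: int = 100,
) -> float:
    """Hypothetical sketch of a weighted, protein-centric F_max."""
    best = 0.0
    for i in range(1, steps + 1):
        threshold = i / steps
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            predicted = {
                term
                for term, score in scores.get(protein, {}).items()
                if score >= threshold
            }
            hit_weight = sum(weights.get(t, 1.0) for t in predicted & true_terms)
            pred_weight = sum(weights.get(t, 1.0) for t in predicted)
            true_weight = sum(weights.get(t, 1.0) for t in true_terms)
            # Precision is averaged only over proteins with at least one prediction.
            if pred_weight > 0:
                precisions.append(hit_weight / pred_weight)
            # Recall is averaged over every protein in the benchmark.
            if true_weight > 0:
                recall_sum += hit_weight / true_weight
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Terms missing from `weights` default to a weight of 1.0, so passing an empty dictionary reduces this to the unweighted F_max.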
34
 
35
  | Component | Description |
36
  |-----------|-------------|
37
  | Protein Encoder | ESM2-3B (`facebook/esm2_t36_3B_UR50D`) |
38
  | Decoder | 12-layer GPT with prefix causal attention |
39
  | Total Parameters | ~3.2B (3B ESM2 + 200M decoder) |
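The table lists "prefix causal attention" for the decoder without spelling it out. A common reading, assumed here rather than taken from the GO-GPT code, is that prefix positions (for instance the protein embeddings) attend to each other bidirectionally, while generated GO-term tokens attend to the whole prefix and only causally to earlier generated tokens. A minimal NumPy sketch of such a mask, with a hypothetical function name:

```python
import numpy as np


def prefix_causal_mask(prefix_len: int, total_len: int) -> np.ndarray:
    """Boolean (total_len, total_len) mask; True means query i may attend to key j."""
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie between 0 and total_len")
    # Start from a standard causal (lower-triangular) mask.
    mask = np.tril(np.ones((total_len, total_len), dtype=bool))
    # Let every prefix position see the entire prefix, in both directions.
    mask[:prefix_len, :prefix_len] = True
    return mask


# Example: a 2-token prefix followed by 2 generated tokens.
example_mask = prefix_causal_mask(prefix_len=2, total_len=4)
```

With `prefix_len=0` this reduces to an ordinary causal mask, and with `prefix_len == total_len` it becomes fully bidirectional.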
40
 
41
+ **Training data:** 120,143 proteins, available at [wanglab/gogpt-training-data](https://huggingface.co/datasets/wanglab/gogpt-training-data).
42
+
43
+ **Code:** [github.com/bowang-lab/BioReason-Pro/gogpt](https://github.com/bowang-lab/BioReason-Pro/tree/main/gogpt)
44
+
45
+
46
+ ## Citation
47
+
48
+ If you find this work useful, please cite our papers:
49
+
50
+ ```bibtex
51
+ @article {Fallahpour2026.03.19.712954,
52
+ author = {Fallahpour, Adibvafa and Seyed-Ahmadi, Arman and Idehpour, Parsa and Ibrahim, Omar and Gupta, Purav and Naimer, Jack and Zhu, Kevin and Shah, Arnav and Ma, Shihao and Adduri, Abhinav and G{\"u}loglu, Talu and Liu, Nuo and Cui, Haotian and Jain, Arihant and de Castro, Max and Fallahpour, Amirfaham and Cembellin-Prieto, Antonio and Stiles, John S. and Nem{\v c}ko, Filip and Nevue, Alexander A. and Moon, Hyungseok C. and Sosnick, Lucas and Markham, Olivia and Duan, Haonan and Lee, Michelle Y. Y. and Salvador, Andrea F. M. and Maddison, Chris J. and Thaiss, Christoph A. and Ricci-Tam, Chiara and Plosky, Brian S. and Burke, Dave P. and Hsu, Patrick D. and Goodarzi, Hani and Wang, Bo},
53
+ title = {BioReason-Pro: Advancing Protein Function Prediction with Multimodal Biological Reasoning},
54
+ elocation-id = {2026.03.19.712954},
55
+ year = {2026},
56
+ doi = {10.64898/2026.03.19.712954},
57
+ publisher = {Cold Spring Harbor Laboratory},
58
+ URL = {https://www.biorxiv.org/content/early/2026/03/20/2026.03.19.712954},
59
+ eprint = {https://www.biorxiv.org/content/early/2026/03/20/2026.03.19.712954.full.pdf},
60
+ journal = {bioRxiv}
61
+ }
62
+
63
+ @misc{fallahpour2025bioreasonincentivizingmultimodalbiological,
64
+ title={BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model},
65
+ author={Adibvafa Fallahpour and Andrew Magnuson and Purav Gupta and Shihao Ma and Jack Naimer and Arnav Shah and Haonan Duan and Omar Ibrahim and Hani Goodarzi and Chris J. Maddison and Bo Wang},
66
+ year={2025},
67
+ eprint={2505.23579},
68
+ archivePrefix={arXiv},
69
+ primaryClass={cs.LG},
70
+ url={https://arxiv.org/abs/2505.23579},
71
+ }
72
+ ```