andrewdalpino committed a6605b6 · verified · 1 Parent(s): 5f699ef

Update README.md

  1. README.md +34 -13
README.md CHANGED
@@ -1,3 +1,17 @@
1
  # ESMC Protein Function Predictor
2
 
3
  An Evolutionary-scale Model (ESM) for protein function prediction from amino acid sequences using the Gene Ontology (GO). Based on the ESM Cambrian Transformer architecture, pre-trained on [UniRef](https://www.uniprot.org/help/uniref), [MGnify](https://www.ebi.ac.uk/metagenomics), and the Joint Genome Institute's database and fine-tuned on the [AmiGO Boost](https://huggingface.co/datasets/andrewdalpino/AmiGO-Boost) protein function dataset, this model predicts the GO subgraph for a particular protein sequence - giving you insight into the molecular function, biological process, and location of the activity inside the cell.
@@ -17,7 +31,7 @@ The following pretrained models are available on HuggingFace Hub.
17
  | Name | Embedding Dim. | Attn. Heads | Encoder Layers | Context Length | QAT | Total Parameters |
18
  |---|---|---|---|---|---|---|
19
  | [andrewdalpino/ESMC-300M-Protein-Function](https://huggingface.co/andrewdalpino/ESMC-300M-Protein-Function) | 960 | 15 | 30 | 2048 | None | 361M |
20
- | [andrewdalpino/ESMC-300M-QAT-Protein-Function](https://huggingface.co/andrewdalpino/ESMC-300M-QAT-Protein-Function) | 960 | 15 | 30 | 2048 | int6w | 361M |
21
  | [andrewdalpino/ESMC-600M-Protein-Function](https://huggingface.co/andrewdalpino/ESMC-600M-Protein-Function) | 1152 | 18 | 36 | 2048 | None | 644M |
22
  | [andrewdalpino/ESMC-600M-QAT-Protein-Function](https://huggingface.co/andrewdalpino/ESMC-600M-QAT-Protein-Function) | 1152 | 18 | 36 | 2048 | int8w | 644M |
23
 
@@ -34,17 +48,12 @@ Then, we'll load the model weights from HuggingFace Hub and the GO graph using `
34
  ```python
35
  import torch
36
 
37
- import obonet
38
-
39
  from esm.tokenization import EsmSequenceTokenizer
40
 
41
  from esmc_function_classifier.model import EsmcGoTermClassifier
42
 
43
 
44
- model_name = "andrewdalpino/ESMC-300M-Protein-Function"
45
-
46
- # Visit https://geneontology.org/docs/download-ontology/ to download.
47
- go_db_path = "./dataset/go-basic.obo"
48
 
49
  sequence = "MPPKGHKKTADGDFRPVNSAGNTIQAKQKYSIDDLLYPKSTIKNLAKETLPDDAIISKDALTAIQRAATLFVSYMASHGNASAEAGGRKKIT"
50
 
@@ -54,10 +63,6 @@ tokenizer = EsmSequenceTokenizer()
54
 
55
  model = EsmcGoTermClassifier.from_pretrained(model_name)
56
 
57
- graph = obonet.read_obo(go_db_path)
58
-
59
- model.load_gene_ontology(graph)
60
-
61
  out = tokenizer(sequence, max_length=2048, truncation=True)
62
 
63
  input_ids = torch.tensor(out["input_ids"], dtype=torch.int64)
@@ -67,12 +72,28 @@ go_term_probabilities = model.predict_terms(
67
  )
68
  ```
69
 
70
- You can also out put the gene-ontology (GO) `networkx` subgraph for a given sequence like in the example below.
71
 
72
  ```python
73
  subgraph, go_term_probabilities = model.predict_subgraph(
74
  input_ids, top_p=top_p
75
  )
 
 
 
 
76
  ```
77
 
78
  ### Quantized Model
@@ -90,4 +111,4 @@ The training code can be found at [https://github.com/andrewdalpino/ESMC-Functio
90
  ## References:
91
 
92
  >- T. Hayes, et al. Simulating 500 million years of evolution with a language model, 2024.
93
- >- M. Ashburner, et al. Gene Ontology: tool for the unification of biology, 2000.
 
1
+ ---
2
+ datasets:
3
+ - andrewdalpino/AmiGO-Boost
4
+ metrics:
5
+ - precision
6
+ - recall
7
+ - f1
8
+ base_model:
9
+ - EvolutionaryScale/esmc-600m-2024-12
10
+ pipeline_tag: text-classification
11
+ tags:
12
+ - gene-ontology
13
+ ---
14
+
15
  # ESMC Protein Function Predictor
16
 
17
  An Evolutionary-scale Model (ESM) for protein function prediction from amino acid sequences using the Gene Ontology (GO). Based on the ESM Cambrian Transformer architecture, pre-trained on [UniRef](https://www.uniprot.org/help/uniref), [MGnify](https://www.ebi.ac.uk/metagenomics), and the Joint Genome Institute's database and fine-tuned on the [AmiGO Boost](https://huggingface.co/datasets/andrewdalpino/AmiGO-Boost) protein function dataset, this model predicts the GO subgraph for a particular protein sequence - giving you insight into the molecular function, biological process, and location of the activity inside the cell.
 
31
  | Name | Embedding Dim. | Attn. Heads | Encoder Layers | Context Length | QAT | Total Parameters |
32
  |---|---|---|---|---|---|---|
33
  | [andrewdalpino/ESMC-300M-Protein-Function](https://huggingface.co/andrewdalpino/ESMC-300M-Protein-Function) | 960 | 15 | 30 | 2048 | None | 361M |
34
+ | [andrewdalpino/ESMC-300M-QAT-Protein-Function](https://huggingface.co/andrewdalpino/ESMC-300M-QAT-Protein-Function) | 960 | 15 | 30 | 2048 | int8w | 361M |
35
  | [andrewdalpino/ESMC-600M-Protein-Function](https://huggingface.co/andrewdalpino/ESMC-600M-Protein-Function) | 1152 | 18 | 36 | 2048 | None | 644M |
36
  | [andrewdalpino/ESMC-600M-QAT-Protein-Function](https://huggingface.co/andrewdalpino/ESMC-600M-QAT-Protein-Function) | 1152 | 18 | 36 | 2048 | int8w | 644M |
37
 
 
48
  ```python
49
  import torch
50
 
 
 
51
  from esm.tokenization import EsmSequenceTokenizer
52
 
53
  from esmc_function_classifier.model import EsmcGoTermClassifier
54
 
55
 
56
+ model_name = "andrewdalpino/ESMC-600M-Protein-Function"
 
 
 
57
 
58
  sequence = "MPPKGHKKTADGDFRPVNSAGNTIQAKQKYSIDDLLYPKSTIKNLAKETLPDDAIISKDALTAIQRAATLFVSYMASHGNASAEAGGRKKIT"
59
 
 
63
 
64
  model = EsmcGoTermClassifier.from_pretrained(model_name)
65
 
 
 
 
 
66
  out = tokenizer(sequence, max_length=2048, truncation=True)
67
 
68
  input_ids = torch.tensor(out["input_ids"], dtype=torch.int64)
 
72
  )
73
  ```
74
 
75
+ You can also output the gene-ontology (GO) `networkx` subgraph for a given sequence like in the example below. You'll need an up-to-date gene ontology database that you can import using the `obonet` package.
76
 
77
  ```python
78
+ import networkx as nx
79
+
80
+ import obonet
81
+
82
+
83
+ # Visit https://geneontology.org/docs/download-ontology/ to download.
84
+ go_db_path = "./dataset/go-basic.obo"
85
+
86
+ graph = obonet.read_obo(go_db_path)
87
+
88
+ model.load_gene_ontology(graph)
89
+
90
  subgraph, go_term_probabilities = model.predict_subgraph(
91
  input_ids, top_p=top_p
92
  )
93
+
94
+ json = nx.node_link_data(subgraph)
95
+
96
+ print(json)
97
  ```
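
A note on the added `nx.node_link_data(subgraph)` lines: the call returns a plain Python dict rather than a JSON string, so writing it to a file or sending it over the wire still needs a serialization step. The following is a minimal sketch on a hand-built toy graph, not the model's real output. The three GO terms, the `is_a` edge keys, and the child-to-parent edge direction are assumptions chosen to resemble a graph loaded with `obonet`, and the variable names `data`, `payload`, and `restored` are illustrative.

```python
import json

import networkx as nx

# Toy stand-in for the subgraph returned by predict_subgraph (hypothetical data).
# Assumption: as in graphs read with obonet, edges point from the child term
# to the parent term and are keyed by relationship type.
subgraph = nx.MultiDiGraph()
subgraph.add_node("GO:0003674", name="molecular_function")
subgraph.add_node("GO:0005488", name="binding")
subgraph.add_node("GO:0005515", name="protein binding")
subgraph.add_edge("GO:0005488", "GO:0003674", key="is_a")
subgraph.add_edge("GO:0005515", "GO:0005488", key="is_a")

# node_link_data returns a dict; naming it `data` avoids shadowing the
# standard-library `json` module.
data = nx.node_link_data(subgraph)

# Serialize to a JSON string. This works as long as every node and edge
# attribute is JSON-serializable.
payload = json.dumps(data, indent=2)

# Rebuild the graph from the decoded JSON to confirm the round trip.
restored = nx.node_link_graph(json.loads(payload))
```

Recent `networkx` releases may emit a `FutureWarning` about the default key used for the edge list in the node-link format. The round trip above uses the defaults on both sides, so it stays consistent within a single installed version.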
98
 
99
  ### Quantized Model
 
111
  ## References:
112
 
113
  >- T. Hayes, et al. Simulating 500 million years of evolution with a language model, 2024.
114
+ >- M. Ashburner, et al. Gene Ontology: tool for the unification of biology, 2000.