nuriamimbreropelegri committed
Commit d90b6ba · verified · 1 Parent(s): 13d04bd

Update README.md

Files changed (1)
  1. README.md +8 -36
README.md CHANGED
@@ -13,7 +13,7 @@ inference: false
13
  - Núria Mimbrero Pelegrí (GitHub [@nuriamimbreropelegri](https://github.com/nuriamimbreropelegri);)
14
  - Michael Heinzinger (GitHub [@mheinzinger](https://github.com/mheinzinger); Twitter [@HeinzingerM](https://twitter.com/HeinzingerM))
15
  - Noelia Ferruz (GitHub [@noeliaferruz](https://github.com/noeliaferruz); Twitter [@ferruz_noelia](https://twitter.com/ferruz_noelia); Webpage: [www.aiproteindesign.com](https://www.aiproteindesign.com) )
16
-
17
 
18
  # **REXzyme: A Translation Machine for the Generation of New-to-Nature Enzymes**
19
  **Work in Progress**
@@ -26,44 +26,18 @@ for the generation of enzymes that catalize user-defined reactions.
26
  It is possible to provide fine-grained input at the substrate level.
27
  Akin to how translation machines have learned to translate between complex language pairs with great success,
28
  often diverging in their representation at the character level (Japanese - English), we posit that an advanced architecture will
29
- be able to translate between the chemical and sequence spaces. REXzyme was trained on a set of 2480 reactions
30
- and ~32M enzyme pairs and it produces sequences that are predicted to perform their intended reactions. A second
31
- version of the model with 14k more reactions will be uploaded to this repository shortly.
32
 
33
  you will need to provide a reaction in the SMILES format
34
  (Simplified molecular-input line-entry system). A useful online server to convert from molecules to SMILES
35
  can be found here: https://cactus.nci.nih.gov/chemical/structure.
36
 
37
- After converting each of the reaction components you should combine them in the following scheme: ```ReactantA.ReactantB>AgentA>ProductA.ProductB```<br/>
38
- Additionally, one should prepend the task suffix ```r2s``` and append the eos token ```</s>```
39
- e.g. for the carbonic anhydrase reaction: ```r2sO.COO>>HCOOO.[H+]</s>```
40
-
41
- We provide this python script to convert reactants to the required reaction format, but
42
- we always recommend to draw and double-check the structures in a server like [cactus](https://cactus.nci.nih.gov/chemical/structure)
43
-
44
- ```python
45
- # left reactants (seperated by '.') seperated by a equal sign from the products (also seperated by '.')
46
- reactions = "CO2 . H2O = carbonic acid . H+"
47
- # agents (seperated by .)
48
- agent = ""
49
 
50
- # https://stackoverflow.com/questions/54930121/converting-molecule-name-to-smiles
51
- from urllib.request import urlopen
52
- from urllib.parse import quote
53
 
54
- def CIRconvert(ids):
55
-     try:
56
-         url = 'http://cactus.nci.nih.gov/chemical/structure/' + quote(ids) + '/smiles'
57
-         ans = urlopen(url).read().decode('utf8')
58
-         return ans
59
-     except:
60
-         return 'Did not work'
61
-
62
- reagent = [CIRconvert(i) for i in reactions.replace(' ','').split('=')[0].split('.') if i != ""]
63
- agent = [CIRconvert(i) for i in agent.replace(' ','').split('.') if i != ""]
64
- product = [CIRconvert(i) for i in reactions.replace(' ','').split('=')[1].split('.') if i != ""]
65
- f"r2s{'.'.join(reagent)}>{'.'.join(agent)}>{'.'.join(product)}</s>"
66
- ```
67
 
68
  We are still working in the analysis of the model for different tasks, including experimental testing.
69
  See below in this documentation information about the models' performance in different in-silico tasks and how to generate your own enzymes.
@@ -73,10 +47,8 @@ See below in this documentation information about the models' performance in dif
73
  REXzyme is based on the [Efficient T5 Large Transformer](https://huggingface.co/google/t5-efficient-large) architecture (which in turn is very similar to the current version of Google Translator)
74
  and contains 48 (24 encoder/ 24 decoder) layers with a model dimensionality of 1024, totaling 770 million parameters.
75
 
76
- REXzyme is a translation machine trained on portion the [RHEA database](https://www.rhea-db.org/) containing 31,970,152 reaction-enzyme pairs.
77
- A second dataset with >14k reactions is being trained and will be uploaded soon.
78
- The pre-training was done on pairs of SMILES and amino acid sequences, tokenized with a char-level
79
- Sentencepiece tokenizer. Note that two seperate tokenizers were used for input (./tokenizer_smiles) and labels (./tokenizer_aa).
80
 
81
  REXzyme was pre-trained with a supervised translation objective i.e., the model learned to process the continous
82
  representation of the reaction from the encoder to autoregressively (causual language modeling) produce the output.
 
13
  - Núria Mimbrero Pelegrí (GitHub [@nuriamimbreropelegri](https://github.com/nuriamimbreropelegri);)
14
  - Michael Heinzinger (GitHub [@mheinzinger](https://github.com/mheinzinger); Twitter [@HeinzingerM](https://twitter.com/HeinzingerM))
15
  - Noelia Ferruz (GitHub [@noeliaferruz](https://github.com/noeliaferruz); Twitter [@ferruz_noelia](https://twitter.com/ferruz_noelia); Webpage: [www.aiproteindesign.com](https://www.aiproteindesign.com) )
16
+ - Alex Vicente
17
 
18
  # **REXzyme: A Translation Machine for the Generation of New-to-Nature Enzymes**
19
  **Work in Progress**
 
26
  It is possible to provide fine-grained input at the substrate level.
27
  Akin to how translation machines have learned to translate between complex language pairs with great success,
28
  often diverging in their representation at the character level (Japanese - English), we posit that an advanced architecture will
29
+ be able to translate between the chemical and sequence spaces. REXzyme was trained on a set of 16,011 unique reactions
30
+ and 20,911,485 enzyme pairs and it produces sequences that are predicted to perform their intended reactions.
 
31
 
32
  you will need to provide a reaction in the SMILES format
33
  (Simplified molecular-input line-entry system). A useful online server to convert from molecules to SMILES
34
  can be found here: https://cactus.nci.nih.gov/chemical/structure.
35
 
36
+ After converting each of the reaction components you should convert them to canonical SMILEs using RDKit (https://www.rdkit.org/docs/GettingStartedInPython.html)
37
+ Finally, you should combine them in the following scheme: ```ReactantA.ReactantB>AgentA>ProductA.ProductB```
38
 
39
+ e.g. for the carbonic anhydrase reaction: ```O=C([O-])O.[H+]>>O.O=C=O```
40
 
41
 
42
  We are still working in the analysis of the model for different tasks, including experimental testing.
43
  See below in this documentation information about the models' performance in different in-silico tasks and how to generate your own enzymes.
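
The input preparation described in the updated README (canonicalize each component with RDKit, then join them as `Reactants>Agents>Products`) could be sketched roughly as follows. This is a minimal illustration, not the authors' script: the helper names `rdkit_canonical` and `build_reaction` are hypothetical, and the sketch assumes that RDKit's default `MolToSmiles` output is the canonical form the model expects. The components are kept in the order given.

```python
from typing import Callable, Iterable, Optional


def rdkit_canonical(smiles: str) -> str:
    """Hypothetical helper: canonical SMILES for one molecule via RDKit."""
    # Imported lazily so build_reaction can be used without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    # MolToSmiles returns canonical SMILES by default.
    return Chem.MolToSmiles(mol)


def build_reaction(
    reactants: Iterable[str],
    agents: Iterable[str],
    products: Iterable[str],
    canonicalize: Optional[Callable[[str], str]] = None,
) -> str:
    """Hypothetical helper: join components as 'R1.R2>A1>P1.P2'."""

    def side(components: Iterable[str]) -> str:
        # Drop empty entries and surrounding whitespace.
        cleaned = [c.strip() for c in components if c.strip()]
        if canonicalize is not None:
            cleaned = [canonicalize(c) for c in cleaned]
        return ".".join(cleaned)

    return f"{side(reactants)}>{side(agents)}>{side(products)}"


# Carbonic anhydrase example from the README; no agents, so the middle is empty.
example = build_reaction(["O=C([O-])O", "[H+]"], [], ["O", "O=C=O"])
print(example)
```

To apply RDKit canonicalization, pass `canonicalize=rdkit_canonical`. Without it, the components are joined exactly as given, which reproduces the README's example string when the inputs are already canonical.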
 
47
  REXzyme is based on the [Efficient T5 Large Transformer](https://huggingface.co/google/t5-efficient-large) architecture (which in turn is very similar to the current version of Google Translator)
48
  and contains 48 (24 encoder/ 24 decoder) layers with a model dimensionality of 1024, totaling 770 million parameters.
49
 
50
+ REXzyme is a translation machine trained on portion the [RHEA database](https://www.rhea-db.org/) containing 20,911,485 reaction-enzyme pairs.
51
+ The pre-training was done on pairs of SMILES and amino acid sequences. Note that two seperate tokenizers were used for input (./tokenizer_smiles) and labels (./tokenizer_aa).
 
 
52
 
53
  REXzyme was pre-trained with a supervised translation objective i.e., the model learned to process the continous
54
  representation of the reaction from the encoder to autoregressively (causual language modeling) produce the output.