Update README.md
- Núria Mimbrero Pelegrí (GitHub [@nuriamimbreropelegri](https://github.com/nuriamimbreropelegri))
- Michael Heinzinger (GitHub [@mheinzinger](https://github.com/mheinzinger); Twitter [@HeinzingerM](https://twitter.com/HeinzingerM))
- Noelia Ferruz (GitHub [@noeliaferruz](https://github.com/noeliaferruz); Twitter [@ferruz_noelia](https://twitter.com/ferruz_noelia); Webpage: [www.aiproteindesign.com](https://www.aiproteindesign.com))
- Alex Vicente

# **REXzyme: A Translation Machine for the Generation of New-to-Nature Enzymes**
**Work in Progress**

REXzyme generates enzymes that catalyze user-defined reactions. It is possible to provide fine-grained input at the substrate level.
Akin to how translation machines have learned to translate between complex language pairs with great success, often diverging in their representation at the character level (Japanese - English), we posit that an advanced architecture will be able to translate between the chemical and sequence spaces. REXzyme was trained on a set of 16,011 unique reactions and 20,911,485 enzyme pairs, and it produces sequences that are predicted to perform their intended reactions.

You will need to provide a reaction in the SMILES format (Simplified Molecular-Input Line-Entry System). A useful online server to convert from molecules to SMILES can be found here: https://cactus.nci.nih.gov/chemical/structure.

After converting each of the reaction components, you should convert them to canonical SMILES using RDKit (https://www.rdkit.org/docs/GettingStartedInPython.html).
Finally, you should combine them in the following scheme: ```ReactantA.ReactantB>AgentA>ProductA.ProductB```

e.g. for the carbonic anhydrase reaction: ```O=C([O-])O.[H+]>>O.O=C=O```
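Assembling the reaction string can be sketched in a few lines of Python. The `make_reaction` helper below is illustrative (it is not part of this repository) and assumes the components have already been converted to canonical SMILES, e.g. with RDKit's `Chem.CanonSmiles`:

```python
def make_reaction(reactants, products, agents=()):
    """Join canonical-SMILES components into the
    ReactantA.ReactantB>AgentA>ProductA.ProductB scheme."""
    return f"{'.'.join(reactants)}>{'.'.join(agents)}>{'.'.join(products)}"

# Carbonic anhydrase: bicarbonate + proton -> water + CO2
rxn = make_reaction(["O=C([O-])O", "[H+]"], ["O", "O=C=O"])
print(rxn)  # O=C([O-])O.[H+]>>O.O=C=O
```

When there are no agents, the middle field stays empty and the two `>` separators end up adjacent, as in the example above.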

We are still working on the analysis of the model for different tasks, including experimental testing.
See below for information about the model's performance on different in-silico tasks and how to generate your own enzymes.

REXzyme is based on the [Efficient T5 Large Transformer](https://huggingface.co/google/t5-efficient-large) architecture (which in turn is very similar to the current version of Google Translate) and contains 48 layers (24 encoder / 24 decoder) with a model dimensionality of 1024, totaling 770 million parameters.

REXzyme is a translation machine trained on a portion of the [RHEA database](https://www.rhea-db.org/) containing 20,911,485 reaction-enzyme pairs.
The pre-training was done on pairs of SMILES and amino acid sequences. Note that two separate tokenizers were used for the input (./tokenizer_smiles) and the labels (./tokenizer_aa).

REXzyme was pre-trained with a supervised translation objective, i.e., the model learns to process the continuous representation of the reaction from the encoder and autoregressively produce the output sequence (causal language modeling).
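Generating a sequence for a reaction might then look like the following sketch with the 🤗 Transformers API. The checkpoint and tokenizer paths and the generation settings here are assumptions for illustration, not confirmed by this repository:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Separate tokenizers for reaction SMILES (input) and amino acids (labels);
# the local paths below are assumptions based on the directories named above.
smiles_tok = T5Tokenizer.from_pretrained("./tokenizer_smiles")
aa_tok = T5Tokenizer.from_pretrained("./tokenizer_aa")
model = T5ForConditionalGeneration.from_pretrained(".")

reaction = "O=C([O-])O.[H+]>>O.O=C=O"  # carbonic anhydrase reaction
inputs = smiles_tok(reaction, return_tensors="pt")

# Sample an amino acid sequence autoregressively from the decoder
out = model.generate(**inputs, max_length=512, do_sample=True, top_p=0.95)
print(aa_tok.decode(out[0], skip_special_tokens=True))
```

Note that the input must be encoded with the SMILES tokenizer and the output decoded with the amino acid tokenizer, since the two vocabularies are disjoint.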