Update README.md

# **REXzyme: A Translation Machine for the Generation of New-to-Nature Enzymes**

**Work in Progress**

REXzyme (Reaction to Enzyme) (manuscript in preparation) is a translation machine, similar to Google Translate,
for the generation of enzymes that catalyze user-defined reactions.

![rexzyme-image](https://raw.githubusercontent.com/SebieF/publications-portfolio/main/assets/reaction.png)

Akin to how translation machines have learned to translate between complex language pairs with great success,
often diverging in their representation at the character level (Japanese - English), we posit that an advanced architecture will
be able to translate between the chemical and sequence spaces. REXzyme was trained on a set of 2,480 reactions
and ~32M enzyme pairs, and it produces sequences that are predicted to perform their intended reactions. A second
version of the model with 14k more reactions will be uploaded to this repository shortly.

To generate enzymes with REXzyme, you will need to provide a reaction in SMILES format
(Simplified Molecular-Input Line-Entry System). A useful online server to convert molecules to SMILES
can be found here: https://cactus.nci.nih.gov/chemical/structure.

After converting each of the reaction components, combine them in the following scheme: ```ReactantA.ReactantB>AgentA>ProductA.ProductB```<br/>
Additionally, prepend the task prefix ```r2s``` and append the eos token ```</s>```,
e.g. for the carbonic anhydrase reaction: ```r2sO.COO>>HCOOO.[H+]</s>```

We provide this Python script to convert reactants to the required reaction format, but
we always recommend drawing and double-checking the structures in a server like [cactus](https://cactus.nci.nih.gov/chemical/structure):

```python
# left reactants (separated by '.'), separated by an equal sign from the products (also separated by '.')
reactions = "CO2 . H2O = carbonic acid . H+"
# agents (separated by '.')
agent = ""
```
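The converted components are then joined into the model's input format described above. A minimal sketch, assuming the reactant, agent, and product SMILES are already available as lists (the SMILES strings here are illustrative placeholders, not validated chemistry):

```python
# Assemble the REXzyme input string from SMILES component lists.
# The SMILES below are illustrative placeholders; always double-check them.
reactants = ["O", "C(=O)=O"]    # e.g. water and carbon dioxide
agents = []                     # this reaction needs no agents
products = ["OC(=O)O", "[H+]"]  # e.g. carbonic acid and a proton

# Scheme: r2s + Reactants > Agents > Products + </s>
reaction = f"r2s{'.'.join(reactants)}>{'.'.join(agents)}>{'.'.join(products)}</s>"
print(reaction)  # r2sO.C(=O)=O>>OC(=O)O.[H+]</s>
```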

We are still working on the analysis of the model for different tasks, including experimental testing.
See below for information about the model's performance on different in-silico tasks and on how to generate your own enzymes.

## **Model description**

REXzyme is based on the [Efficient T5 Large Transformer](https://huggingface.co/google/t5-efficient-large) architecture (which in turn is very similar to the current version of Google Translate)
and contains 48 layers (24 encoder / 24 decoder) with a model dimensionality of 1024, totaling 770 million parameters.

REXzyme is a translation machine trained on a portion of the [RHEA database](https://www.rhea-db.org/) containing 31,970,152 reaction-enzyme pairs.
A second version, trained on a dataset with >14k reactions, will be uploaded soon.
The pre-training was done on pairs of SMILES and amino acid sequences, tokenized with a char-level
SentencePiece tokenizer. Note that two separate tokenizers were used for the input (./tokenizer_smiles) and the labels (./tokenizer_aa).
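To illustrate what char-level tokenization means here, the sketch below maps each character of a reaction string to an integer id. The ad-hoc vocabulary is purely illustrative and is not the model's trained SentencePiece vocabulary:

```python
# Illustrative character-level tokenization; the vocabulary is built ad hoc
# from the input and is NOT the model's trained SentencePiece vocabulary.
reaction = "r2sO.COO>>HCOOO.[H+]</s>"  # example input string from above
vocab = {ch: i for i, ch in enumerate(sorted(set(reaction)))}
ids = [vocab[ch] for ch in reaction]
# One token per character, so the id sequence is as long as the string.
assert len(ids) == len(reaction)
```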

REXzyme was pre-trained with a supervised translation objective, i.e., the model learned to process the continuous
representation of the reaction from the encoder and to autoregressively (causal language modeling) produce the output.
The output tokens (amino acids) are generated one at a time, from left to right, and the model learns to match the original enzyme sequence.
Hence, the model learns the dependencies among protein sequence features that enable a specific enzymatic reaction.
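The left-to-right objective can be sketched as next-token prediction over growing prefixes; the enzyme sequence below is a hypothetical placeholder, and the conditioning on the encoded reaction is omitted:

```python
# Sketch of the decoder's left-to-right objective: at each step, predict the
# next amino acid given the prefix generated so far (conditioning on the
# encoded reaction omitted). "MSTKL" is a hypothetical target sequence.
target = "MSTKL"
training_pairs = [(target[:i], target[i]) for i in range(len(target))]
# The first pair predicts "M" from an empty prefix; the last predicts "L"
# from the prefix "MSTK".
assert training_pairs[0] == ("", "M")
assert training_pairs[-1] == ("MSTK", "L")
```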

There are stark differences in the number of members among reaction classes.
However, since we are tokenizing the reaction SMILES on a character level,
the model has learnt dependencies among molecules and enzyme sequence features, and it can transfer learning from more to less populated
reaction classes.

## **Model Performance**

- **Dataset curation**
We converted the reactions from rxn format to SMILES strings, including only left-to-right reactions.
The enzyme sequences were truncated to a length of 1024.
Enzymes catalyzing more than one reaction appear in multiple enzyme-reaction pairs.
<br/><br/>
- **General descriptors**

| Syntax | Identity | Alignment length |
| :--- | :----: | ---: |
| **Generated** | 74.29% | 406.0 |
| **Selection (<70%)** <sup>[1]</sup> | 57.20% | 338.1 |
<br/><br/>
<sup>[1]</sup> We excluded sequences with % identities ≥ 70% and pLDDTs < 60%.

## **How to generate from REXzyme**
REXzyme can be used with the Hugging Face `transformers` Python package.