Update README.md

# **REXzyme: A Translation Machine for the Generation of New-to-Nature Enzymes**

**Work in Progress**

REXzyme (Reaction to Enzyme) (manuscript in preparation) is a translation machine, similar to Google Translate,
for the generation of enzymes that catalyze user-defined reactions.

![rexzyme-image](https://raw.githubusercontent.com/SebieF/publications-portfolio/main/assets/reaction.png)

Akin to how translation machines have learned to translate between complex language pairs with great success,
often diverging in their representation at the character level (Japanese - English), we posit that an advanced architecture will
be able to translate between the chemical and sequence spaces. REXzyme was trained on a set of 2,480 reactions
and ~32M enzyme pairs, and it produces sequences that are predicted to perform their intended reactions. A second
version of the model with 14k more reactions will be uploaded to this repository shortly.

To generate enzymes with REXzyme, you will need to provide a reaction in SMILES format
(Simplified Molecular-Input Line-Entry System). A useful online server to convert molecules to SMILES
can be found here: https://cactus.nci.nih.gov/chemical/structure.

After converting each of the reaction components, combine them in the following scheme: ```ReactantA.ReactantB>AgentA>ProductA.ProductB```<br/>
Additionally, prepend the task prefix ```r2s``` and append the eos token ```</s>```,
e.g. for the carbonic anhydrase reaction: ```r2sO.COO>>HCOOO.[H+]</s>```

We provide this Python script to convert reactants to the required reaction format, but
we always recommend drawing and double-checking the structures in a server like [cactus](https://cactus.nci.nih.gov/chemical/structure):

```python
# left reactants (separated by '.'), separated by an equal sign from the products (also separated by '.')
reactions = "CO2 . H2O = carbonic acid . H+"
# agents (separated by '.')
agent = ""
```
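The converted components are then joined into the model's input format described above. A minimal sketch, assuming the reactant, agent, and product SMILES are already available as lists (the SMILES strings here are illustrative placeholders, not validated chemistry):

```python
# Assemble the REXzyme input string from SMILES component lists.
# The SMILES below are illustrative placeholders; always double-check them.
reactants = ["O", "C(=O)=O"]    # e.g. water and carbon dioxide
agents = []                     # this reaction needs no agents
products = ["OC(=O)O", "[H+]"]  # e.g. carbonic acid and a proton

# Scheme: r2s + Reactants > Agents > Products + </s>
reaction = f"r2s{'.'.join(reactants)}>{'.'.join(agents)}>{'.'.join(products)}</s>"
print(reaction)  # r2sO.C(=O)=O>>OC(=O)O.[H+]</s>
```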

We are still working on the analysis of the model for different tasks, including experimental testing.
See below for information about the model's performance on different in-silico tasks and on how to generate your own enzymes.

## **Model description**

REXzyme is based on the [Efficient T5 Large Transformer](https://huggingface.co/google/t5-efficient-large) architecture (which in turn is very similar to the current version of Google Translate)
and contains 48 layers (24 encoder / 24 decoder) with a model dimensionality of 1024, totaling 770 million parameters.

REXzyme is a translation machine trained on a portion of the [RHEA database](https://www.rhea-db.org/) containing 31,970,152 reaction-enzyme pairs.
A second version, trained on a dataset with >14k reactions, will be uploaded soon.
The pre-training was done on pairs of SMILES and amino acid sequences, tokenized with a char-level
SentencePiece tokenizer. Note that two separate tokenizers were used for the input (./tokenizer_smiles) and the labels (./tokenizer_aa).
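To illustrate what char-level tokenization means here, the sketch below maps each character of a reaction string to an integer id. The ad-hoc vocabulary is purely illustrative and is not the model's trained SentencePiece vocabulary:

```python
# Illustrative character-level tokenization; the vocabulary is built ad hoc
# from the input and is NOT the model's trained SentencePiece vocabulary.
reaction = "r2sO.COO>>HCOOO.[H+]</s>"  # example input string from above
vocab = {ch: i for i, ch in enumerate(sorted(set(reaction)))}
ids = [vocab[ch] for ch in reaction]
# One token per character, so the id sequence is as long as the string.
assert len(ids) == len(reaction)
```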

REXzyme was pre-trained with a supervised translation objective, i.e., the model learned to process the continuous
representation of the reaction from the encoder and to autoregressively (causal language modeling) produce the output.
The output tokens (amino acids) are generated one at a time, from left to right, and the model learns to match the original enzyme sequence.
Hence, the model learns the dependencies among protein sequence features that enable a specific enzymatic reaction.
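The left-to-right objective can be sketched as next-token prediction over growing prefixes; the enzyme sequence below is a hypothetical placeholder, and the conditioning on the encoded reaction is omitted:

```python
# Sketch of the decoder's left-to-right objective: at each step, predict the
# next amino acid given the prefix generated so far (conditioning on the
# encoded reaction omitted). "MSTKL" is a hypothetical target sequence.
target = "MSTKL"
training_pairs = [(target[:i], target[i]) for i in range(len(target))]
# The first pair predicts "M" from an empty prefix; the last predicts "L"
# from the prefix "MSTK".
assert training_pairs[0] == ("", "M")
assert training_pairs[-1] == ("MSTK", "L")
```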

There are stark differences in the number of members among reaction classes.
However, since we are tokenizing the reaction SMILES on a character level,
the model has learnt dependencies among molecules and enzyme sequence features, and it can transfer learning from more to less populated
reaction classes.

## **Model Performance**

- **Dataset curation**
We converted the reactions from rxn format to SMILES strings, including only left-to-right reactions.
The enzyme sequences were truncated to a length of 1024.
Enzymes catalyzing more than one reaction appear in multiple enzyme-reaction pairs.
<br/><br/>
- **General descriptors**

| Syntax | Identity | Alignment length |
| :--- | :----: | ---: |
| **Generated** | 74.29% | 406.0 |
| **Selection (<70%)** <sup>[1]</sup> | 57.20% | 338.1 |
<br/><br/>
<sup>[1]</sup> We excluded sequences with % identities ≥ 70% and pLDDTs < 60%.

## **How to generate from REXzyme**
REXzyme can be used with the Hugging Face `transformers` Python package.