AI4PD/REXzyme

 ---
 license: apache-2.0
+pipeline_tag: translation
+tags:
+- chemistry
+- biology
 ---
+# **Contributors**
+- Sebastian Lindner (GitHub [@Bienenwolf655](https://www.google.com); Twitter @)
+- Michael Heinzinger (GitHub @mheinzinger; Twitter @)
+- Noelia Ferruz (GitHub @noeliaferruz; Twitter @ferruz_noelia; Webpage: www.aiproteindesign.com )
+# **REXyme: A Translation Machine for the Generation of New-to-Nature Enzymes**
+**Work in Progress**
+REXyme (Reaction to Enzyme) (manuscript in preparation) is a translation machine for the generation of enzymes that catalyze user-defined reactions.
+It is possible to provide fine-grained input at the substrate level.
+Just as translation machines have learned to translate with great success between complex language pairs
+that often diverge in their representation at the character level (e.g., Japanese and English), we posit that an advanced architecture will
+be able to translate between the chemical and sequence spaces. REXyme was trained on a set of xx reactions and yy enzyme pairs, and it produces
+sequences that putatively perform their intended reactions.
+To run it, you will need to provide a reaction in the SMILES format (Simplified Molecular-Input Line-Entry System),
+ which you can do online here: xxxx
+We are still working on the analysis of the model for different tasks, including experimental testing.
+See below for information about the model's performance in different in-silico tasks and how to generate your own enzymes.
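
To make the input format concrete, here is a minimal sketch of how a reaction written in SMILES can be split into its substrates and products. It assumes the common reaction SMILES layout `reactants>agents>products`, with `.` separating molecules. Whether REXyme expects exactly this layout is an assumption, and `split_reaction_smiles` is a hypothetical helper, not part of the model's code. The example reaction (hydrolysis of ethyl acetate) is illustrative only.

```python
def split_reaction_smiles(reaction):
    """Split 'reactants>agents>products' into lists of molecule SMILES.

    Hypothetical helper for illustration; assumes the standard reaction
    SMILES layout with '>' between the three parts and '.' between molecules.
    """
    parts = reaction.split(">")
    if len(parts) != 3:
        raise ValueError("expected a reaction of the form reactants>agents>products")
    # Empty parts (e.g. no agents in 'A.B>>C') become empty lists.
    reactants, agents, products = (
        [mol for mol in part.split(".") if mol] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}


# Illustrative reaction: ethyl acetate + water -> acetic acid + ethanol
example = "CC(=O)OCC.O>>CC(=O)O.CCO"
parsed = split_reaction_smiles(example)
print(parsed["reactants"])  # substrates
print(parsed["products"])   # products
```

Note that splitting on `.` also separates the disconnected components of a single species (for example, a salt), so this helper is only suitable for quick inspection of an input string.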
+## **Model description**
+REXyme is based on the [Efficient T5 Transformer](xx) architecture (which in turn is very similar to the current version of Google Translate)
+and contains xx layers
+with a model dimensionality of xx, totaling xx million parameters.
+REXyme is a translation machine trained on the xx database containing xx reaction-enzyme pairs.
+The pre-training was done on pairs of SMILES and ... (FASTA headers?).
+ZymCTRL was trained with an autoregressive objective (this is not right, check it ??), i.e., the model learns to predict a missing
+token in the encoder's input. Hence,
+the model learns the dependencies among protein sequence features that enable a specific enzymatic reaction.
+(Sebastian, check if this applies??) There are stark differences in the number of members among EC classes, and for this reason, we also tokenized the EC numbers.
+In this manner, EC numbers '2.7.1.1' and '2.7.1.2' share the first three tokens (six, including separators), and hence the model can infer that
+there are relationships between the two classes.
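
The shared-token count for the two EC numbers can be checked with a small sketch. It assumes that each EC level and each `.` separator becomes one token. The actual tokenizer used for training may differ, and `tokenize_ec` and `shared_prefix_length` are hypothetical helpers written for this illustration.

```python
import re


def tokenize_ec(ec_number):
    """Split an EC number into level tokens and '.' separator tokens.

    Assumption for illustration: one token per level and per separator.
    """
    return re.findall(r"\d+|-|\.", ec_number)


def shared_prefix_length(tokens_a, tokens_b):
    """Count how many leading tokens two token lists have in common."""
    count = 0
    for a, b in zip(tokens_a, tokens_b):
        if a != b:
            break
        count += 1
    return count


ec_a = tokenize_ec("2.7.1.1")
ec_b = tokenize_ec("2.7.1.2")

shared = shared_prefix_length(ec_a, ec_b)
shared_levels = sum(1 for tok in ec_a[:shared] if tok != ".")

print(ec_a)           # ['2', '.', '7', '.', '1', '.', '1']
print(shared)         # 6 tokens shared, including separators
print(shared_levels)  # 3 EC levels shared
```
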
+The figure below summarizes the process of training: (add figure)
+## **Model Performance**
+- explain dataset curation
+- general descriptors (ESMFold, IUPred, ...)
+- second pgp
+- MMseqs (average?)
+## **How to generate from REXyme**
+REXyme can be used with the Hugging Face `transformers` Python package.
+Detailed installation instructions can be found here: https://huggingface.co/docs/transformers/installation
+Since REXyme has been trained on a machine translation objective, users have to specify a chemical reaction in the SMILES format.
+[please seb include snippet to generate sequences]
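
Until the authors' own snippet is available, the following is a rough sketch of how a T5-style translation checkpoint is typically sampled with `transformers`. It is not the authors' method. It assumes that the checkpoint is published under the repository name shown on this page (`AI4PD/REXzyme`), that it loads through `AutoTokenizer` and `AutoModelForSeq2SeqLM`, and that a single tokenizer handles both the SMILES input and the protein output. The sampling settings (`top_k=50`, `max_length=512`) and the helper names `generate_enzymes` and `to_fasta` are illustrative choices.

```python
def generate_enzymes(reaction_smiles, model_id="AI4PD/REXzyme",
                     num_sequences=3, max_length=512):
    """Sample candidate sequences for one reaction (sketch, see assumptions above)."""
    # Imported lazily so that to_fasta() below works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Assumption: the checkpoint loads with the generic seq2seq classes.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    inputs = tokenizer(reaction_smiles, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,            # sample instead of greedy decoding
        top_k=50,                  # illustrative value, not tuned
        max_length=max_length,     # illustrative upper bound on output tokens
        num_return_sequences=num_sequences,
    )
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Some protein tokenizers emit space-separated residues; drop the spaces.
    return [seq.replace(" ", "") for seq in decoded]


def to_fasta(sequences, prefix="REXyme_candidate"):
    """Format sequences as FASTA records with numbered headers."""
    return "".join(
        f">{prefix}_{i}\n{seq}\n" for i, seq in enumerate(sequences, start=1)
    )
```

With the model downloaded, `print(to_fasta(generate_enzymes(reaction)))` would write the sampled candidates as FASTA records, where `reaction` is a reaction string in the SMILES format.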
+## **A word of caution**
+- We have not yet fully tested the model's ability to generate new-to-nature enzymes, i.e.,
+  enzymes for chemical reactions that do not occur in nature (and hence do not appear in the training set). While this is the intended objective of our work,
+  it is very much a work in progress. We will update the model and documentation shortly.