--- |
|
|
license: apache-2.0 |
|
|
pipeline_tag: translation |
|
|
tags: |
|
|
- chemistry |
|
|
- biology |
|
|
--- |
|
|
|
|
|
# **Contributors** |
|
|
|
|
|
- Sebastian Lindner (GitHub [@Bienenwolf655](https://github.com/Bienenwolf655); Twitter @)
|
|
- Michael Heinzinger (GitHub @mheinzinger; Twitter @) |
|
|
- Noelia Ferruz (GitHub @noeliaferruz; Twitter @ferruz_noelia; Webpage: www.aiproteindesign.com ) |
|
|
|
|
|
# **REXyme: A Translation Machine for the Generation of New-to-Nature Enzymes** |
|
|
**Work in Progress** |
|
|
|
|
|
REXyme (Reaction to Enzyme; manuscript in preparation) is a translation machine for the generation of enzymes that catalyze user-defined reactions.
|
|
It is possible to provide fine-grained input at the substrate level. |
|
|
Akin to how translation machines have learned to translate between complex language pairs with great success,
even when the two languages diverge in their character-level representation (e.g., Japanese and English), we posit that an advanced architecture
can translate between the chemical and sequence spaces. REXyme was trained on a set of xx reactions and yy enzyme pairs, and it produces
sequences that putatively perform their intended reactions.
|
|
|
|
|
To run it, you will need to provide a reaction in SMILES format (Simplified Molecular-Input Line-Entry System),
which you can do online here: xxxx
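For reference, a full chemical reaction is commonly written in reaction-SMILES form as `reactants>agents>products` (with `>>` when no agent is given), where the molecules on each side are joined with `.`. The helper below is a minimal illustration of this convention; the function name and the example molecules are our own, not part of the REXyme interface:

```python
def make_reaction_smiles(reactants, products, agents=()):
    """Join component SMILES into the 'reactants>agents>products' form.

    With no agents, this yields the common 'reactants>>products' shape.
    """
    return ">".join([
        ".".join(reactants),
        ".".join(agents),
        ".".join(products),
    ])


# Illustrative only: ethanol + oxidant -> acetaldehyde + water,
# written with short placeholder SMILES strings.
rxn = make_reaction_smiles(["CCO", "[O]"], ["CC=O", "O"])
print(rxn)  # CCO.[O]>>CC=O.O
```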
|
|
|
|
|
We are still working on the analysis of the model for different tasks, including experimental testing.

See below for information about the model's performance on different in-silico tasks and how to generate your own enzymes.
|
|
|
|
|
## **Model description** |
|
|
REXyme is based on the [Efficient T5 Transformer](xx) architecture (which in turn is very similar to the current version of Google Translate)
|
|
and contains xx layers |
|
|
with a model dimensionality of xx, totaling xx million parameters. |
|
|
|
|
|
REXyme is a translation machine trained on the xx database containing xx reaction-enzyme pairs. |
|
|
The pre-training was done on pairs of SMILES and ... (FASTA headers?),
|
|
|
|
|
REXyme was trained with an autoregressive objective (check this: predicting a missing token in the encoder's input would be a denoising objective), i.e., the model learns to predict missing
tokens in the encoder's input. Hence,
the model learns the dependencies among protein sequence features that enable a specific enzymatic reaction.
|
|
|
|
|
(Sebastian: check whether this applies.) There are stark differences in the number of members among EC classes, and for this reason we also tokenized the EC numbers.
In this manner, the EC numbers '2.7.1.1' and '2.7.1.2' share the first three tokens (six, including separators), and hence the model can infer
that the two classes are related.
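The shared-prefix effect of this tokenization can be sketched as follows. This is a minimal illustration of the idea only; the actual vocabulary and token boundaries used by the model are not specified in this draft:

```python
def tokenize_ec(ec):
    """Split an EC number into alternating level and separator tokens,
    e.g. '2.7.1.1' -> ['2', '.', '7', '.', '1', '.', '1']."""
    tokens = []
    for i, level in enumerate(ec.split(".")):
        if i:
            tokens.append(".")
        tokens.append(level)
    return tokens


a = tokenize_ec("2.7.1.1")
b = tokenize_ec("2.7.1.2")

# Length of the shared token prefix between the two EC numbers.
prefix = 0
for x, y in zip(a, b):
    if x != y:
        break
    prefix += 1
print(prefix)  # 6: three level tokens plus three separators
```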
|
|
|
|
|
The figure below summarizes the process of training: (add figure) |
|
|
|
|
|
## **Model Performance** |
|
|
|
|
|
- explain dataset curation |
|
|
- general descriptors (ESMFold, IUPred, ...)
|
|
- second pgp |
|
|
- mmseqs (Average?) |
|
|
|
|
|
|
|
|
## **How to generate from REXyme** |
|
|
REXyme can be used with the Hugging Face `transformers` Python package.
|
|
Detailed installation instructions can be found here: https://huggingface.co/docs/transformers/installation |
|
|
|
|
|
Since REXyme was trained with a machine-translation objective, users have to specify the chemical reaction as a SMILES string.
|
|
|
|
|
[please seb include snippet to generate sequences] |
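Until the official snippet is added, the sketch below shows what generation with `transformers` could look like for a T5-style seq2seq model. The checkpoint identifier, tokenizer class, and sampling parameters are all assumptions, not the published REXyme interface:

```python
def prepare_input(reaction_smiles):
    """Normalize the reaction SMILES before tokenization
    (here: just strip surrounding whitespace)."""
    return reaction_smiles.strip()


def generate_enzyme(reaction_smiles, model_name):
    """Translate a reaction SMILES into a candidate enzyme sequence.

    `model_name` must be the REXyme checkpoint id or local path; it is
    left as a required argument because the final id is not yet published.
    Sampling settings below are illustrative defaults, not tuned values.
    """
    # Imported lazily so this sketch stays importable without transformers.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    inputs = tokenizer(prepare_input(reaction_smiles), return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,   # enzymes are long; adjust to your targets
        do_sample=True,
        top_p=0.95,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Example call (with a hypothetical checkpoint path): `generate_enzyme("CCO.[O]>>CC=O.O", "path/to/rexyme-checkpoint")`.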
|
|
|
|
|
|
|
|
## **A word of caution** |
|
|
|
|
|
- We have not yet fully tested the ability of the model to generate new-to-nature enzymes, i.e.,
enzymes for chemical reactions that do not occur in Nature (and hence do not appear in the training set). While this is the intended objective of our work,
it is very much work in progress. We will update the model and documentation shortly.
|
|
|
|
|
|
|
|
|
|
|
|