AI4PD/REXzyme

@@ -48,7 +48,7 @@ REXzyme is based on the [Efficient T5 Large Transformer](https://huggingface.co/
 and contains 48 (24 encoder / 24 decoder) layers with a model dimensionality of 1024, totaling 770 million parameters.
 REXzyme is a translation machine trained on a portion of the [RHEA database](https://www.rhea-db.org/) containing 20,911,485 reaction-enzyme pairs.
-The pre-training was done on pairs of SMILES and amino acid sequences. Note that two separate tokenizers were used for input (./tokenizer_smiles) and labels (./tokenizer_aa).
 REXzyme was pre-trained with a supervised translation objective, i.e., the model learned to take the continuous
 representation of the reaction from the encoder and autoregressively (causal language modeling) produce the output.
@@ -66,96 +66,7 @@ reaction classes.
 We converted the reactions from RXN format to SMILES strings, including only left-to-right reactions.
 The enzyme sequences were truncated to 1024.
 Enzymes catalyzing more than one reaction appear in multiple enzyme-reaction pairs.
-<br/><br/>
-- **General descriptors**
-    | Method                | Natural     | Generated <sup>[1]</sup> |
-    | :---                  |    :----:   |          ---:            |
-    | **IUPRED3 (ordered)** | 99.9%       | 99.9%                    |
-    | **ESMFold (avg. pLDDT)** | 85.03    | 79.82                    |
-    | **FlDPnn**            | 0.0878      | 0.0929                   |
-<sup>[1]</sup> We excluded sequences with % identities ≥ 70% and pLDDTs < 60%.
-<br/><br/>
-- **Functional classification**
-<br/><br/>
-<table>
-  <tr>
-    <td><b>Method </b></td>
-    <td colspan="2"> <a href="https://google-research.github.io/proteinfer/">ProteInfer</a></td>
-    <td colspan="2"> <a href="https://www.science.org/doi/10.1126/science.adf2465">CLEAN</a></td>
-  </tr>
-  <tr>
-    <td><b>Dataset</b></td>
-    <td >Natural (%) </td>
-    <td >Generated (%) </td>
-    <td >Natural (%) </td>
-    <td >Generated (%) </td>
-  </tr>
-    <tr>
-    <td><b> EC: Level 1 </b></td>
-    <td >81</td>
-    <td >80</td>
-    <td >80</td>
-    <td >79</td>
-  </tr>
-    <tr>
-    <td><b> EC: Level 2 </b></td>
-    <td >78</td>
-    <td >77</td>
-    <td >79</td>
-    <td >78</td>
-  </tr>
-    <tr>
-    <td><b> EC: Level 3 </b></td>
-    <td >76</td>
-    <td >75</td>
-    <td >78</td>
-    <td >77</td>
-  </tr>
-    <tr>
-    <td><b> EC: Level 4 </b></td>
-    <td >62</td>
-    <td >58</td>
-    <td >70</td>
-    <td >65</td>
-  </tr>
-    <tr>
-    <td><b> No EC predicted </b></td>
-    <td >10</td>
-    <td >7</td>
-    <td >0</td>
-    <td >0</td>
-  </tr>
-    <tr>
-    <td><b> GO-Terms </b></td>
-    <td >41</td>
-    <td >39</td>
-    <td >-</td>
-    <td >-</td>
-  </tr>
-    <tr>
-    <td><b> No GO predicted </b></td>
-    <td >1</td>
-    <td >1</td>
-    <td >-</td>
-    <td >-</td>
-  </tr>
-</table>
-<br/><br/>
-- **PGP pipeline** [(see GitHub)](https://github.com/hefeda/PGP)
-    | Method      | Natural | Generated |
-    | :---        | :----   |      :--- |
-    | **Disorder**      | 11.473       | 11.467   |
-    | **DSSP3**   | L: 42%, H: 41%, E: 18% | L: 45%, H: 39%, E: 16% |
-    | **DSSP8**   | C: 25%, H: 38%, T: 10%, S: 5%, I: 0%, E: 19%, G: 2%, B: 0% | C: 29%, H: 38%, T: 10%, S: 4%, I: 0%, E: 17%, G: 3%, B: 0% |
-    | **CATH Classes**   | Mainly Beta: 6%, Alpha Beta: 78%, Mainly Alpha: 16%, Special: 0%, Few Secondary Structures: 0% |  Mainly Beta: 4%, Alpha Beta: 87%, Mainly Alpha: 9%, Special: 0%, Few Secondary Structures: 0%|
-    | **Transmembrane Prediction**  | Membrane: 9%, Soluble: 91% | Membrane: 9%, Soluble: 91%|
-    | **Conservation**      | High: 37%, Low: 33%      | High: 38%, Low: 33%  |
-    | **Localization**   | Cytop.: 66%, Nucleus: 4%, Extracellular: 6%, PM: 4%, ER: 11%, Lysosome/Vacuole: 1%, Mito.: 6%, Plastid: 1%, Golgi: 1%, Perox.: 1% | Cytop.: 85%, Nucleus: 2%, Extracellular: 6%, PM: 1%, ER: 6%, Lysosome/Vacuole: 0%, Mito.: 4%, Plastid: 0%, Golgi: 0%, Perox.: 0%|
-<br/><br/>
 ## **How to generate from REXzyme**

 and contains 48 (24 encoder / 24 decoder) layers with a model dimensionality of 1024, totaling 770 million parameters.
 REXzyme is a translation machine trained on a portion of the [RHEA database](https://www.rhea-db.org/) containing 20,911,485 reaction-enzyme pairs.
+The pre-training was done on pairs of SMILES and amino acid sequences. Note that two separate tokenizers were used for input (./tokenizer_aa-ABPE_SMILES/tokenizer_ABPE_rexzyme_offset) and labels (./tokenizer_aa).
 REXzyme was pre-trained with a supervised translation objective, i.e., the model learned to take the continuous
 representation of the reaction from the encoder and autoregressively (causal language modeling) produce the output.
 We converted the reactions from RXN format to SMILES strings, including only left-to-right reactions.
 The enzyme sequences were truncated to 1024.
 Enzymes catalyzing more than one reaction appear in multiple enzyme-reaction pairs.
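
The pairing and truncation described above could look roughly like the sketch below. It is only an illustration, not the authors' preprocessing code: the function name `build_pairs`, the toy records, and the reading of "1024" as a count of amino acid residues (rather than tokens) are all assumptions.

```python
from typing import Iterable, List, Tuple

# Assumption: the 1024 limit counts amino acid residues (one character each).
MAX_ENZYME_LENGTH = 1024


def build_pairs(
    records: Iterable[Tuple[str, List[str]]],
    max_length: int = MAX_ENZYME_LENGTH,
) -> List[Tuple[str, str]]:
    """Expand (enzyme sequence, [reaction SMILES, ...]) records into
    (reaction SMILES, truncated enzyme sequence) pairs.

    An enzyme that catalyzes several reactions yields one pair per reaction.
    """
    pairs: List[Tuple[str, str]] = []
    for sequence, reactions in records:
        truncated = sequence[:max_length]  # keep at most max_length residues
        for reaction in reactions:
            pairs.append((reaction, truncated))
    return pairs


# Hypothetical toy records for illustration; these are not taken from RHEA.
toy_records = [
    ("M" + "A" * 1500, ["CCO>>CC=O", "CC=O>>CC(=O)O"]),
    ("MKV", ["O=C=O.O>>OC(=O)O"]),
]
toy_pairs = build_pairs(toy_records)
```

In this toy run the first enzyme contributes two pairs (one per reaction) with its sequence cut to 1024 characters, and the second contributes one pair unchanged.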
 ## **How to generate from REXzyme**