Update README.md
README.md
CHANGED
|
@@ -48,7 +48,7 @@ REXzyme is based on the [Efficient T5 Large Transformer](https://huggingface.co/
|
|
| 48 |
and contains 48 (24 encoder/ 24 decoder) layers with a model dimensionality of 1024, totaling 770 million parameters.
|
| 49 |
|
| 50 |
REXzyme is a translation machine trained on a portion of the [RHEA database](https://www.rhea-db.org/) containing 20,911,485 reaction-enzyme pairs.
|
| 51 |
-
The pre-training was done on pairs of SMILES and amino acid sequences. Note that two separate tokenizers were used for input (./
|
| 52 |
|
| 53 |
REXzyme was pre-trained with a supervised translation objective, i.e., the model learned to process the continuous
|
| 54 |
representation of the reaction from the encoder to autoregressively (causal language modeling) produce the output.
|
|
@@ -66,96 +66,7 @@ reaction classes.
|
|
| 66 |
We converted the reactions from RXN format to SMILES strings, including only left-to-right reactions.
|
| 67 |
The enzyme sequences were truncated to 1024.
|
| 68 |
Enzymes catalyzing more than one reaction appear in multiple enzyme-reaction pairs.
|
| 69 |
-
|
| 70 |
-
- **General descriptors**
|
| 71 |
-
|
| 72 |
-
| Method | Natural | Generated <sup>[1]</sup> |
|
| 73 |
-
| :--- | :----: | ---: |
|
| 74 |
-
| **IUPRED3 (ordered)** | 99.9% | 99.9% |
|
| 75 |
-
| **ESMFold (avg. plddt)** | 85.03 | 79.82 |
|
| 76 |
-
| **FlDPnn** | 0.0878 | 0.0929 |
|
| 77 |
-
<sup>[1]|</sup> We excluded sequences with %identities ≥ 70% and pLDDTs < 60%.
|
| 78 |
-
<br/><br/>
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
- **Functional classification**
|
| 82 |
-
<br/><br/>
|
| 83 |
-
<table>
|
| 84 |
-
<tr>
|
| 85 |
-
<td><b>Method </b></td>
|
| 86 |
-
<td colspan="2"> <a href="https://google-research.github.io/proteinfer/">ProteInfer</a></td>
|
| 87 |
-
<td colspan="2"> <a href="https://www.science.org/doi/10.1126/science.adf2465">CLEAN</a></td>
|
| 88 |
-
</tr>
|
| 89 |
-
<tr>
|
| 90 |
-
<td><b>Dataset</b></td>
|
| 91 |
-
<td >Natural (%) </td>
|
| 92 |
-
<td >Generated (%) </td>
|
| 93 |
-
<td >Natural (%) </td>
|
| 94 |
-
<td >Generated (%) </td>
|
| 95 |
-
</tr>
|
| 96 |
-
<tr>
|
| 97 |
-
<td><b> EC: Level 1 </b></td>
|
| 98 |
-
<td >81</td>
|
| 99 |
-
<td >80</td>
|
| 100 |
-
<td >80</td>
|
| 101 |
-
<td >79</td>
|
| 102 |
-
</tr>
|
| 103 |
-
<tr>
|
| 104 |
-
<td><b> EC: Level 2 </b></td>
|
| 105 |
-
<td >78</td>
|
| 106 |
-
<td >77</td>
|
| 107 |
-
<td >79</td>
|
| 108 |
-
<td >78</td>
|
| 109 |
-
</tr>
|
| 110 |
-
<tr>
|
| 111 |
-
<td><b> EC: Level 3 </b></td>
|
| 112 |
-
<td >76</td>
|
| 113 |
-
<td >75</td>
|
| 114 |
-
<td >78</td>
|
| 115 |
-
<td >77</td>
|
| 116 |
-
</tr>
|
| 117 |
-
<tr>
|
| 118 |
-
<td><b> EC: Level 4 </b></td>
|
| 119 |
-
<td >62</td>
|
| 120 |
-
<td >58</td>
|
| 121 |
-
<td >70</td>
|
| 122 |
-
<td >65</td>
|
| 123 |
-
</tr>
|
| 124 |
-
<tr>
|
| 125 |
-
<td><b> No EC predicted </b></td>
|
| 126 |
-
<td >10</td>
|
| 127 |
-
<td >7</td>
|
| 128 |
-
<td >0</td>
|
| 129 |
-
<td >0</td>
|
| 130 |
-
</tr>
|
| 131 |
-
<tr>
|
| 132 |
-
<td><b> GO-Terms </b></td>
|
| 133 |
-
<td >41</td>
|
| 134 |
-
<td >39</td>
|
| 135 |
-
<td >-</td>
|
| 136 |
-
<td >-</td>
|
| 137 |
-
</tr>
|
| 138 |
-
<tr>
|
| 139 |
-
<td><b> No GO predicted </b></td>
|
| 140 |
-
<td >1</td>
|
| 141 |
-
<td >1</td>
|
| 142 |
-
<td >-</td>
|
| 143 |
-
<td >-</td>
|
| 144 |
-
</tr>
|
| 145 |
-
</table>
|
| 146 |
-
<br/><br/>
|
| 147 |
-
- **PGP pipeline** [(see GitHub)](https://github.com/hefeda/PGP)
|
| 148 |
-
|
| 149 |
-
| Method | Natural | Generated |
|
| 150 |
-
| :--- | :---- | :--- |
|
| 151 |
-
| **Disorder** | 11.473 | 11.467 |
|
| 152 |
-
| **DSSP3** | L: 42%, H: 41%, E:18% | L: 45%, H: 39%, E: 16%|
|
| 153 |
-
| **DSSP8** | C:25%, H:38% T:10%, S:5%, I:0%, E:19%, G:2%, B:0% | C:29%, H:38% T:10%, S:4%, I:0%, E:17%, G:3%, B:0%|
|
| 154 |
-
| **CATH Classes** | Mainly Beta: 6%, Alpha Beta: 78%, Mainly Alpha: 16%, Special: 0%, Few Secondary Structures: 0% | Mainly Beta: 4%, Alpha Beta: 87%, Mainly Alpha: 9%, Special: 0%, Few Secondary Structures: 0%|
|
| 155 |
-
| **Transmembrane Prediction** | Membrane: 9%, Soluble: 91% | Membrane: 9%, Soluble: 91%|
|
| 156 |
-
| **Conservation** | High: 37%, Low: 33% | High: 38%, Low: 33% |
|
| 157 |
-
| **Localization** | Cytop.: 66%, Nucleus: 4%, Extracellular: 6%, PM: 4%, ER: 11%, Lysosome/Vacuole: 1%, Mito.: 6%, Plastid: 1%, Golgi: 1%, Perox.: 1% | Cytop.: 85%, Nucleus: 2%, Extracellular: 6%, PM: 1%, ER: 6%, Lysosome/Vacuole: 0%, Mito.: 4%, Plastid: 0%, Golgi: 0%, Perox.: 0%|
|
| 158 |
-
<br/><br/>
|
| 159 |
|
| 160 |
|
| 161 |
## **How to generate from REXzyme**
|
|
|
|
| 48 |
and contains 48 (24 encoder/ 24 decoder) layers with a model dimensionality of 1024, totaling 770 million parameters.
|
| 49 |
|
| 50 |
REXzyme is a translation machine trained on a portion of the [RHEA database](https://www.rhea-db.org/) containing 20,911,485 reaction-enzyme pairs.
|
| 51 |
+
The pre-training was done on pairs of SMILES and amino acid sequences. Note that two separate tokenizers were used for input (./tokenizer_aa-ABPE_SMILES/tokenizer_ABPE_rexzyme_offset) and labels (./tokenizer_aa).
|
| 52 |
|
| 53 |
REXzyme was pre-trained with a supervised translation objective, i.e., the model learned to process the continuous
|
| 54 |
representation of the reaction from the encoder to autoregressively (causal language modeling) produce the output.
|
|
|
|
| 66 |
We converted the reactions from RXN format to SMILES strings, including only left-to-right reactions.
|
| 67 |
The enzyme sequences were truncated to 1024.
|
| 68 |
Enzymes catalyzing more than one reaction appear in multiple enzyme-reaction pairs.
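
The data preparation described in the three lines above (truncating enzyme sequences and emitting one pair per catalyzed reaction) could be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the function name `build_pairs`, the toy sequences, and the toy reaction SMILES are hypothetical, and the sketch assumes the limit of 1024 counts amino acid residues rather than tokens.

```python
from typing import Dict, List, Tuple

# Assumption: the README's "truncated to 1024" refers to residues (characters).
MAX_ENZYME_LEN = 1024


def build_pairs(
    enzyme_reactions: Dict[str, List[str]],
    max_len: int = MAX_ENZYME_LEN,
) -> List[Tuple[str, str]]:
    """Return (reaction SMILES, enzyme sequence) pairs.

    An enzyme that catalyzes several reactions contributes one pair
    per reaction, each with the same truncated sequence.
    """
    pairs = []
    for sequence, reactions in enzyme_reactions.items():
        truncated = sequence[:max_len]  # keep at most max_len residues
        for smiles in reactions:
            pairs.append((smiles, truncated))
    return pairs


# Hypothetical toy data: one long enzyme with two reactions, one short enzyme.
example = {
    "MKT" * 500: ["CCO>>CC=O", "CC=O>>CC(=O)O"],
    "MSLA": ["OCC>>O=CC"],
}
pairs = build_pairs(example)
```

Here the reaction is the model input and the enzyme sequence is the label, matching the translation direction stated in the README.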
|
| 69 |
+
|
| 70 |
|
| 71 |
|
| 72 |
## **How to generate from REXzyme**
|