---
license: mit
language:
- en
tags:
- enzyme
- enzyme-reaction
- reaction-retrieval
- protein-sequence
- protein-language-model
- bioinformatics
- computational-biology
- uncertainty
- mahalanobis-distance
library_name: pytorch
---
# EZHit
**EZHit** is a lightweight enzyme–reaction retrieval model for predicting potential catalytic compatibility between enzyme sequences and biochemical reactions.
Given an enzyme amino-acid sequence and a reaction SMILES, EZHit estimates whether the enzyme is likely to catalyze the reaction. The released checkpoints can be used for enzyme–reaction pair prediction, custom fine-tuning, and uncertainty-aware inference with Mahalanobis-distance-based distribution assessment.
---
## Online demo
An interactive web demo is available at:
[EZHit Hugging Face Space](https://huggingface.co/spaces/deanluo/Enzyme-Catalysis-Predictor)
The Space supports:
- enzyme–reaction pair prediction
- ensemble probability output
- ensemble uncertainty estimation
- Mahalanobis-distance-based reliability assessment
- reaction visualization
---
## Code and Colab notebook
The source code and Colab fine-tuning notebook are available at:
- GitHub repository: [ld139/EzHit](https://github.com/ld139/EzHit)
- Colab fine-tuning notebook: [Open in Colab](https://colab.research.google.com/github/ld139/EzHit/blob/main/colab/EZHit_FineTune_Colab.ipynb)
The Colab notebook allows users to fine-tune EZHit on their own enzyme–reaction datasets and export a fine-tuned checkpoint together with `train_distribution_stat.pt` for Mahalanobis-distance inference.
---
## Model variants
| Model group | File pattern | Description |
|---|---|---|
| General model | `binarycls_best_val_seed*.pt` | General enzyme–reaction compatibility model |
| Cytochrome P450 model | `ft_p450_best_seed*.pt` | Fine-tuned model for cytochrome P450-related prediction |
| Phosphatase model | `ft_phosphatase_best_seed*.pt` | Fine-tuned model for phosphatase-related prediction |
| Terpene synthase model | `ft_terpene_best_seed*.pt` | Fine-tuned model for terpene synthase-related prediction |
Each model group may contain multiple seed checkpoints for ensemble prediction.
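To build an ensemble, the seed checkpoints of one model group first have to be picked out of a file list. The sketch below does this with shell-style pattern matching on the patterns from the table above. The helper names and the example file list are hypothetical and not part of the EZHit code base; `list_group_checkpoints` additionally assumes that `huggingface_hub` is installed and that network access is available.

```python
from fnmatch import fnmatch

# File patterns copied from the model-variant table above.
GROUP_PATTERNS = {
    "general": "binarycls_best_val_seed*.pt",
    "p450": "ft_p450_best_seed*.pt",
    "phosphatase": "ft_phosphatase_best_seed*.pt",
    "terpene": "ft_terpene_best_seed*.pt",
}


def select_checkpoints(filenames, group):
    """Return the sorted files whose base name matches the group's pattern."""
    pattern = GROUP_PATTERNS[group]
    return sorted(
        name for name in filenames
        if fnmatch(name.rsplit("/", 1)[-1], pattern)
    )


def list_group_checkpoints(group, repo_id="deanluo/EzHit"):
    """Fetch the repository file list from the Hub, then filter it."""
    from huggingface_hub import list_repo_files  # imported lazily on purpose
    return select_checkpoints(list_repo_files(repo_id), group)


# Offline demonstration with a hypothetical file list.
example_files = [
    "checkpoints/binarycls_best_val_seed40.pt",
    "checkpoints/binarycls_best_val_seed41.pt",
    "checkpoints/ft_p450_best_seed40.pt",
    "uncertainty/general_train_distribution_stat.pt",
]
print(select_checkpoints(example_files, "general"))
```

Matching on the base name keeps the helper independent of the folder layout inside the repository.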
---
## Download checkpoints
Install the Hugging Face Hub client:
```bash
pip install -U huggingface_hub
```
Download a checkpoint:
```python
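from huggingface_hub import hf_hub_download

# Download one seed checkpoint of the general model into the local cache.
ckpt_path = hf_hub_download(
    repo_id="deanluo/EzHit",
    filename="checkpoints/binarycls_best_val_seed40.pt",
)
print(ckpt_path)  # local path of the cached file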
```
Download Mahalanobis statistics:
```python
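from huggingface_hub import hf_hub_download

# Download the training-distribution statistics for the general model.
stat_path = hf_hub_download(
    repo_id="deanluo/EzHit",
    filename="uncertainty/general_train_distribution_stat.pt",
)
print(stat_path)  # local path of the cached file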
```
Download all files from this repository:
```python
from huggingface_hub import snapshot_download

# Mirror the whole repository into the EzHit_checkpoints folder.
local_dir = snapshot_download(
    repo_id="deanluo/EzHit",
    local_dir="EzHit_checkpoints",
)
print(local_dir)
```
---
## Input format
EZHit takes two main inputs:
| Input | Description |
|---|---|
| Enzyme sequence | Amino-acid sequence of the enzyme |
| Reaction SMILES | Reaction in `reactants>>products` format |
Example reaction SMILES:
```text
CCO>>CC=O
```
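A quick format check before inference can catch malformed reactions early. The sketch below splits a `reactants>>products` string into its two sides and, following the usual SMILES convention, treats `.` as the separator between molecules on one side. The function name is hypothetical, and the check is purely syntactic: it does not verify that each SMILES is chemically valid.

```python
def split_reaction_smiles(rxn):
    """Split 'reactants>>products' into two lists of molecule SMILES."""
    parts = rxn.strip().split(">>")
    if len(parts) != 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected 'reactants>>products', got {rxn!r}")
    reactants, products = (side.split(".") for side in parts)
    return reactants, products


print(split_reaction_smiles("CCO>>CC=O"))  # (['CCO'], ['CC=O'])
```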
For fine-tuning, the expected CSV format is:
```csv
protein_sequence,CANO_RXN_SMILES,Label
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG,CCO>>CC=O,1
MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0
```
Required columns:
| Column | Description |
|---|---|
| `protein_sequence` | Enzyme amino-acid sequence |
| `CANO_RXN_SMILES` | Reaction SMILES in `reactants>>products` format |
| `Label` | Binary label. `1` for compatible enzyme–reaction pairs and `0` for negative pairs |
An optional `split` column can be provided; each row's value must be `train`, `val`, or `test`.
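Checking a dataset against these rules before fine-tuning can save a failed run. The following is a minimal sketch using only the standard library; `validate_csv` and its messages are hypothetical and not part of the EZHit notebook, and the reported line numbers assume that no field contains an embedded newline.

```python
import csv
import io

REQUIRED_COLUMNS = ("protein_sequence", "CANO_RXN_SMILES", "Label")
ALLOWED_SPLITS = {"train", "val", "test"}


def _cell(row, name):
    """Return a stripped cell value, treating missing cells as empty."""
    return (row.get(name) or "").strip()


def validate_csv(handle):
    """Return (line_number, message) tuples for every problem found."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in columns]
    if missing:
        return [(1, "missing columns: " + ", ".join(missing))]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not _cell(row, "protein_sequence"):
            problems.append((line_no, "empty protein_sequence"))
        if _cell(row, "CANO_RXN_SMILES").count(">>") != 1:
            problems.append((line_no, "reaction is not reactants>>products"))
        if _cell(row, "Label") not in {"0", "1"}:
            problems.append((line_no, "Label must be 0 or 1"))
        if "split" in columns and _cell(row, "split") not in ALLOWED_SPLITS:
            problems.append((line_no, "split must be train, val, or test"))
    return problems


# In-memory example: the second data row is deliberately malformed.
sample = io.StringIO(
    "protein_sequence,CANO_RXN_SMILES,Label\n"
    "MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0\n"
    "MKT,CCO>CC=O,2\n"
)
print(validate_csv(sample))
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` object.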
---
## Output interpretation
EZHit can report the following outputs:
| Output | Description |
|---|---|
| Match probability | Predicted enzyme–reaction compatibility probability |
| Ensemble uncertainty | Model-disagreement-based uncertainty estimate |
| Mahalanobis distance | Latent-space distance from the learned training distribution |
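As an illustration of how per-seed predictions might be combined, the sketch below averages the probabilities of several seed checkpoints and uses their population standard deviation as the disagreement measure. This card does not state which disagreement statistic EZHit actually uses, so the standard deviation and the function name are assumptions.

```python
from statistics import mean, pstdev


def aggregate_ensemble(seed_probs):
    """Combine per-seed probabilities for one enzyme-reaction pair."""
    if not seed_probs:
        raise ValueError("need at least one seed prediction")
    return {
        "probability": mean(seed_probs),    # ensemble match probability
        "uncertainty": pstdev(seed_probs),  # spread across seeds (assumed measure)
    }


# Hypothetical probabilities from three seed checkpoints.
print(aggregate_ensemble([0.90, 0.80, 0.85]))
```

With a single checkpoint the spread is zero by construction, so the uncertainty value is only informative when several seeds are available.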
A typical interpretation is:
| Probability | Mahalanobis distance | Interpretation |
|---|---|---|
| High | Low | High-priority candidate |
| High | High | Potentially useful but less reliable or out-of-distribution |
| Low | Low | In-distribution but predicted as incompatible |
| Low | High | Low-priority candidate |
Thresholds should be adjusted based on the model variant, dataset, and validation results.
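The interpretation table can be turned into a simple triage rule. In the sketch below, the cutoffs `prob_cut=0.5` and `dist_cut=3.0` are placeholder values chosen only for illustration, not recommended thresholds, and `triage` is a hypothetical helper.

```python
def triage(probability, distance, prob_cut=0.5, dist_cut=3.0):
    """Map a (probability, Mahalanobis distance) pair to a table row."""
    high_prob = probability >= prob_cut
    in_dist = distance <= dist_cut
    if high_prob and in_dist:
        return "high-priority candidate"
    if high_prob:
        return "potentially useful, less reliable or out-of-distribution"
    if in_dist:
        return "in-distribution, predicted incompatible"
    return "low-priority candidate"


# Hypothetical predictions: (probability, distance)
for prob, dist in [(0.92, 1.4), (0.92, 7.5), (0.10, 1.4), (0.10, 7.5)]:
    print(prob, dist, "->", triage(prob, dist))
```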
---
## Mahalanobis-distance statistics
Mahalanobis-distance inference requires a `train_distribution_stat.pt` file generated from the same model architecture and latent dimension as the checkpoint used for prediction.
The expected file contains:
```python
{
    "mean": positive_class_latent_mean,    # shape [hidden_dim]
    "inv_cov": inverse_covariance_matrix,  # shape [hidden_dim, hidden_dim]
}
```
The latent dimension of the statistics file must match the hidden dimension of the checkpoint. For example, if the model hidden dimension is 512, the expected shapes are:
```text
mean: [512]
inv_cov: [512, 512]
```
If a checkpoint is fine-tuned with a different hidden dimension, the corresponding Mahalanobis statistics must be regenerated.
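The shape requirement and the distance itself can be sketched as follows, assuming the standard definition $d = \sqrt{(z - \mu)^\top \Sigma^{-1} (z - \mu)}$ with $\mu$ stored as `mean` and $\Sigma^{-1}$ as `inv_cov`. The example uses NumPy arrays for brevity and a toy 3-dimensional latent; `mahalanobis_distance` is a hypothetical helper, and whether EZHit computes the distance in exactly this form is an assumption.

```python
import numpy as np


def mahalanobis_distance(latent, mean, inv_cov):
    """Distance of one latent vector from the stored training distribution."""
    latent, mean, inv_cov = (
        np.asarray(a, dtype=float) for a in (latent, mean, inv_cov)
    )
    if mean.ndim != 1:
        raise ValueError("mean must be a 1-D vector")
    dim = mean.shape[0]
    if latent.shape != (dim,) or inv_cov.shape != (dim, dim):
        raise ValueError(
            f"shape mismatch: latent {latent.shape}, mean {mean.shape}, "
            f"inv_cov {inv_cov.shape}"
        )
    diff = latent - mean
    # Clamp at zero to guard against tiny negative values from round-off.
    return float(np.sqrt(max(diff @ inv_cov @ diff, 0.0)))


# Toy example: identity inverse covariance reduces to Euclidean distance.
print(mahalanobis_distance([3.0, 4.0, 0.0], np.zeros(3), np.eye(3)))  # 5.0
```

The explicit shape check fails fast when a statistics file from one hidden dimension is paired with a checkpoint from another.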
For very small fine-tuning datasets, covariance estimation may be unstable. In such cases, Mahalanobis distance should be interpreted cautiously together with probability and ensemble uncertainty.
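One common way to stabilize the estimate when samples are scarce is to add a small ridge term to the covariance diagonal before inversion. The sketch below illustrates the idea with NumPy; the `ridge` value, the function name, and the random toy latents are assumptions, and nothing here is claimed to match how the EZHit notebook generates `train_distribution_stat.pt`.

```python
import numpy as np


def fit_distribution_stats(latents, ridge=1e-3):
    """Estimate mean and ridge-regularized inverse covariance of latents."""
    latents = np.asarray(latents, dtype=float)
    if latents.ndim != 2 or latents.shape[0] < 2:
        raise ValueError("need a [n_samples, dim] array with at least 2 samples")
    mean = latents.mean(axis=0)
    cov = np.atleast_2d(np.cov(latents, rowvar=False))
    # The ridge term keeps the matrix invertible even when n_samples < dim.
    cov = cov + ridge * np.eye(latents.shape[1])
    return {"mean": mean, "inv_cov": np.linalg.inv(cov)}


# Toy data: 5 samples in an 8-dimensional latent space (fewer samples than dims).
rng = np.random.default_rng(0)
stats = fit_distribution_stats(rng.normal(size=(5, 8)))
print(stats["mean"].shape, stats["inv_cov"].shape)  # (8,) (8, 8)
```

A larger `ridge` makes the distance behave more like a scaled Euclidean distance, so the value is itself something to validate on held-out data.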
---
## Fine-tuning
Users can fine-tune EZHit using the Colab notebook:
[Open EZHit fine-tuning notebook in Colab](https://colab.research.google.com/github/ld139/EzHit/blob/main/colab/EZHit_FineTune_Colab.ipynb)
The fine-tuning workflow exports:
| File | Description |
|---|---|
| `ezhit_finetuned_seed42.pt` | Fine-tuned checkpoint |
| `train_distribution_stat.pt` | Training-distribution statistics for Mahalanobis-distance inference |
| `val_predictions.csv` | Validation-set predictions |
| `test_predictions.csv` | Test-set predictions |
The fine-tuned checkpoint and `train_distribution_stat.pt` can be used for customized inference.
---
## Large-scale screening results
Large-scale enzyme–reaction screening results are provided separately as a Hugging Face Dataset repository:
`TODO: add dataset repository link`
Recommended location:
```text
https://huggingface.co/datasets/deanluo/EzHit-screening-results
```
The complete training and benchmark datasets will be archived separately on Zenodo:
`TODO: add Zenodo link`
---
## Installation for local use
Clone the code repository:
```bash
git clone https://github.com/ld139/EzHit.git
cd EzHit
```
Install dependencies:
```bash
pip install -r requirements.txt
```
The required `kan.py` implementation is already included in the GitHub repository. No separate KAN package installation is required.
---
## License
This project is released under the MIT License.
---
## Contact
For questions, please use the GitHub Issues page:
[https://github.com/ld139/EzHit/issues](https://github.com/ld139/EzHit/issues)