---
license: mit
language:
  - en
tags:
  - enzyme
  - enzyme-reaction
  - reaction-retrieval
  - protein-sequence
  - protein-language-model
  - bioinformatics
  - computational-biology
  - uncertainty
  - mahalanobis-distance
library_name: pytorch
---

# EZHit

**EZHit** is a lightweight enzyme–reaction retrieval model for predicting potential catalytic compatibility between enzyme sequences and biochemical reactions.

Given an enzyme amino-acid sequence and a reaction SMILES, EZHit estimates whether the enzyme is likely to catalyze the reaction. The released checkpoints can be used for enzyme–reaction pair prediction, custom fine-tuning, and uncertainty-aware inference with Mahalanobis-distance-based distribution assessment.

---

## Online demo

An interactive web demo is available at:

[EZHit HuggingFace Space](https://huggingface.co/spaces/deanluo/Enzyme-Catalysis-Predictor)

The Space supports:

- enzyme–reaction pair prediction
- ensemble probability output
- ensemble uncertainty estimation
- Mahalanobis-distance-based reliability assessment
- reaction visualization

---

## Code and Colab notebook

The source code and Colab fine-tuning notebook are available at:

- GitHub repository: [ld139/EzHit](https://github.com/ld139/EzHit)
- Colab fine-tuning notebook: [Open in Colab](https://colab.research.google.com/github/ld139/EzHit/blob/main/colab/EZHit_FineTune_Colab.ipynb)

The Colab notebook allows users to fine-tune EZHit on their own enzyme–reaction datasets and export a fine-tuned checkpoint together with `train_distribution_stat.pt` for Mahalanobis-distance inference.

---


## Model variants

| Model group | File pattern | Description |
|---|---|---|
| General model | `binarycls_best_val_seed*.pt` | General enzyme–reaction compatibility model |
| Cytochrome P450 model | `ft_p450_best_seed*.pt` | Fine-tuned model for cytochrome P450-related prediction |
| Phosphatase model | `ft_phosphatase_best_seed*.pt` | Fine-tuned model for phosphatase-related prediction |
| Terpene synthase model | `ft_terpene_best_seed*.pt` | Fine-tuned model for terpene synthase-related prediction |

Each model group may contain multiple seed checkpoints for ensemble prediction.
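
As a rough illustration of how the seed checkpoints of one group could be collected for an ensemble, the sketch below filters a list of repository file paths with the patterns from the table. The variant keys (`general`, `p450`, and so on) and the helper name are hypothetical labels chosen for this example, not part of the EZHit code.

```python
from fnmatch import fnmatchcase

# File patterns copied from the table above; the dictionary keys are
# hypothetical short names used only in this sketch.
VARIANT_PATTERNS = {
    "general": "binarycls_best_val_seed*.pt",
    "p450": "ft_p450_best_seed*.pt",
    "phosphatase": "ft_phosphatase_best_seed*.pt",
    "terpene": "ft_terpene_best_seed*.pt",
}


def select_seed_checkpoints(repo_files, variant):
    """Return the sorted repository paths whose file name matches a variant."""
    pattern = VARIANT_PATTERNS[variant]
    return sorted(
        path for path in repo_files
        # Match on the file name only, so any folder prefix is accepted.
        if fnmatchcase(path.rsplit("/", 1)[-1], pattern)
    )


# Example with the Hub client (needs network access):
# from huggingface_hub import hf_hub_download, list_repo_files
# files = list_repo_files("deanluo/EzHit")
# paths = [hf_hub_download("deanluo/EzHit", name)
#          for name in select_seed_checkpoints(files, "p450")]
```

Matching on the file name rather than the full path keeps the helper independent of how the repository folders are organized.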

---

## Download checkpoints

Install the HuggingFace Hub client:

```bash
pip install -U huggingface_hub
```

Download a checkpoint:

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="deanluo/EzHit",
    filename="checkpoints/binarycls_best_val_seed40.pt"
)

print(ckpt_path)
```

Download Mahalanobis statistics:

```python
from huggingface_hub import hf_hub_download

stat_path = hf_hub_download(
    repo_id="deanluo/EzHit",
    filename="uncertainty/general_train_distribution_stat.pt"
)

print(stat_path)
```

Download all files from this repository:

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="deanluo/EzHit",
    local_dir="EzHit_checkpoints"
)

print(local_dir)
```

---

## Input format

EZHit takes two main inputs:

| Input | Description |
|---|---|
| Enzyme sequence | Amino-acid sequence of the enzyme |
| Reaction SMILES | Reaction in `reactants>>products` format |

Example reaction SMILES:

```text
CCO>>CC=O
```

For fine-tuning, the expected CSV format is:

```csv
protein_sequence,CANO_RXN_SMILES,Label
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG,CCO>>CC=O,1
MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0
```

Required columns:

| Column | Description |
|---|---|
| `protein_sequence` | Enzyme amino-acid sequence |
| `CANO_RXN_SMILES` | Reaction SMILES in `reactants>>products` format |
| `Label` | Binary label. `1` for compatible enzyme–reaction pairs and `0` for negative pairs |

An optional `split` column can be provided; each row takes one of the values `train`, `val`, or `test`.
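
Before fine-tuning, it can help to check a CSV against the format described above. The following is a minimal sketch using only the standard library; the function name is hypothetical, and it assumes one record per line and that every row carries a value once a `split` column is present. The Colab notebook may apply different or stricter checks.

```python
import csv
import io

REQUIRED_COLUMNS = ("protein_sequence", "CANO_RXN_SMILES", "Label")
ALLOWED_SPLITS = {"train", "val", "test"}


def check_finetune_csv(handle):
    """Return a list of human-readable problems found in an open CSV file."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        if not (row["protein_sequence"] or "").strip():
            problems.append(f"line {line_no}: empty protein_sequence")
        sides = (row["CANO_RXN_SMILES"] or "").split(">>")
        if len(sides) != 2 or not all(s.strip() for s in sides):
            problems.append(f"line {line_no}: reaction is not reactants>>products")
        if (row["Label"] or "").strip() not in {"0", "1"}:
            problems.append(f"line {line_no}: Label must be 0 or 1")
        if "split" in header and (row["split"] or "").strip() not in ALLOWED_SPLITS:
            problems.append(f"line {line_no}: split must be train, val, or test")
    return problems


sample = io.StringIO(
    "protein_sequence,CANO_RXN_SMILES,Label\n"
    "MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0\n"
    "MKKLLPTAAAGLLLLAAQPAMA,CCO>CC=O,2\n"
)
for problem in check_finetune_csv(sample):
    print(problem)
```

The check only looks at the surface format. It does not verify that the sequence uses valid amino-acid letters or that the SMILES strings are chemically valid.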

---

## Output interpretation

EZHit can report the following outputs:

| Output | Description |
|---|---|
| Match probability | Predicted enzyme–reaction compatibility probability |
| Ensemble uncertainty | Model-disagreement-based uncertainty estimate |
| Mahalanobis distance | Latent-space distance from the learned training distribution |
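
To make the first two outputs concrete, the sketch below combines the match probabilities produced by several seed checkpoints for a single enzyme-reaction pair. It assumes the ensemble probability is the mean across seeds and that disagreement is measured as the standard deviation; the function name is hypothetical and EZHit's own definition of ensemble uncertainty may differ.

```python
from statistics import mean, pstdev


def summarize_ensemble(seed_probabilities):
    """Combine per-seed match probabilities for one enzyme-reaction pair."""
    if not seed_probabilities:
        raise ValueError("need at least one seed probability")
    return {
        "probability": mean(seed_probabilities),
        # Assumption: disagreement is the population standard deviation
        # across seeds; EZHit's own uncertainty measure may differ.
        "uncertainty": pstdev(seed_probabilities),
    }


# Three hypothetical per-seed probabilities for the same pair.
print(summarize_ensemble([0.8, 0.9, 0.7]))
```

With a single checkpoint the uncertainty is zero by construction, so a disagreement-based estimate is only informative when several seeds are available.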

A typical interpretation is:

| Probability | Mahalanobis distance | Interpretation |
|---|---|---|
| High | Low | High-priority candidate |
| High | High | Potentially useful but less reliable or out-of-distribution |
| Low | Low | In-distribution but predicted as incompatible |
| Low | High | Low-priority candidate |

Thresholds should be adjusted based on the model variant, dataset, and validation results.
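
The interpretation table can be expressed as a small decision function. This is only a sketch: the function name is hypothetical, and both thresholds are required arguments because this card does not recommend specific values.

```python
def triage(probability, distance, *, prob_threshold, dist_threshold):
    """Map one prediction onto the four rows of the interpretation table.

    Both thresholds must be chosen by the caller, for example from
    validation results for the model variant in use.
    """
    high_probability = probability >= prob_threshold
    in_distribution = distance <= dist_threshold
    if high_probability and in_distribution:
        return "High-priority candidate"
    if high_probability:
        return "Potentially useful but less reliable or out-of-distribution"
    if in_distribution:
        return "In-distribution but predicted as incompatible"
    return "Low-priority candidate"


# The threshold values below are placeholders for illustration only.
print(triage(0.92, 4.1, prob_threshold=0.5, dist_threshold=10.0))
```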

---

## Mahalanobis-distance statistics

Mahalanobis-distance inference requires a `train_distribution_stat.pt` file generated from the same model architecture and latent dimension as the checkpoint used for prediction.

The expected file contains:

```python
{
    "mean": positive_class_latent_mean,
    "inv_cov": inverse_covariance_matrix
}
```

The latent dimension of the statistics file must match the hidden dimension of the checkpoint. For example, if the model hidden dimension is 512, the expected shapes are:

```text
mean:    [512]
inv_cov: [512, 512]
```

If a checkpoint is fine-tuned with a different hidden dimension, the corresponding Mahalanobis statistics must be regenerated.

For very small fine-tuning datasets, covariance estimation may be unstable. In such cases, interpret the Mahalanobis distance cautiously, alongside the match probability and ensemble uncertainty.
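
For reference, the standard Mahalanobis distance of a latent vector `z` from a distribution with mean `mu` and inverse covariance `inv_cov` is the square root of `(z - mu)^T inv_cov (z - mu)`. The sketch below computes it with plain Python lists to stay dependency-free and includes the shape check described above. The function name is hypothetical, and whether EZHit reports the distance or its square is not stated here, so treat the square root as an assumption.

```python
import math


def mahalanobis_distance(latent, mean, inv_cov):
    """Distance of one latent vector from the stored training distribution."""
    dim = len(mean)
    if len(latent) != dim:
        raise ValueError(f"latent has {len(latent)} values, statistics expect {dim}")
    if len(inv_cov) != dim or any(len(row) != dim for row in inv_cov):
        raise ValueError(f"inv_cov must have shape [{dim}, {dim}]")

    diff = [x - m for x, m in zip(latent, mean)]
    # Quadratic form diff^T * inv_cov * diff, accumulated row by row.
    quad = sum(
        d_i * sum(c * d_j for c, d_j in zip(row, diff))
        for d_i, row in zip(diff, inv_cov)
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(quad, 0.0))


# Toy 2-dimensional example with an identity inverse covariance.
print(mahalanobis_distance([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

With PyTorch, the statistics file would typically be read with `torch.load(stat_path, map_location="cpu")`, and the same quadratic form can be written as `diff @ inv_cov @ diff` on tensors.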

---

## Fine-tuning

Users can fine-tune EZHit using the Colab notebook:

[Open EZHit fine-tuning notebook in Colab](https://colab.research.google.com/github/ld139/EzHit/blob/main/colab/EZHit_FineTune_Colab.ipynb)

The fine-tuning workflow exports:

| File | Description |
|---|---|
| `ezhit_finetuned_seed42.pt` | Fine-tuned checkpoint |
| `train_distribution_stat.pt` | Training-distribution statistics for Mahalanobis-distance inference |
| `val_predictions.csv` | Validation-set predictions |
| `test_predictions.csv` | Test-set predictions |

The fine-tuned checkpoint and `train_distribution_stat.pt` can be used for customized inference.

---

## Large-scale screening results

Large-scale enzyme–reaction screening results are provided separately as a HuggingFace Dataset repository:

`TODO: add dataset repository link`

Recommended location:

```text
https://huggingface.co/datasets/deanluo/EzHit-screening-results
```

The complete training and benchmark datasets will be archived separately on Zenodo:

`TODO: add Zenodo link`

---

## Installation for local use

Clone the code repository:

```bash
git clone https://github.com/ld139/EzHit.git
cd EzHit
```

Install dependencies:

```bash
pip install -r requirements.txt
```

The required `kan.py` implementation is already included in the GitHub repository. No separate KAN package installation is required.

---


## License

This project is released under the MIT License.

---

## Contact

For questions, please use the GitHub Issues page:

[https://github.com/ld139/EzHit/issues](https://github.com/ld139/EzHit/issues)