---
license: mit
language:
  - en
tags:
  - enzyme
  - enzyme-reaction
  - reaction-retrieval
  - protein-sequence
  - protein-language-model
  - bioinformatics
  - computational-biology
  - uncertainty
  - mahalanobis-distance
library_name: pytorch
---

# EZHit

**EZHit** is a lightweight enzyme–reaction retrieval model for predicting potential catalytic compatibility between enzyme sequences and biochemical reactions.

Given an enzyme amino-acid sequence and a reaction SMILES, EZHit estimates whether the enzyme is likely to catalyze the reaction. The released checkpoints can be used for enzyme–reaction pair prediction, custom fine-tuning, and uncertainty-aware inference with Mahalanobis-distance-based distribution assessment.

---

## Online demo

An interactive web demo is available at:

[EZHit HuggingFace Space](https://huggingface.co/spaces/deanluo/Enzyme-Catalysis-Predictor)

The Space supports:

- enzyme–reaction pair prediction
- ensemble probability output
- ensemble uncertainty estimation
- Mahalanobis-distance-based reliability assessment
- reaction visualization

---

## Code and Colab notebook

The source code and Colab fine-tuning notebook are available at:

- GitHub repository: [ld139/EzHit](https://github.com/ld139/EzHit)
- Colab fine-tuning notebook: [Open in Colab](https://colab.research.google.com/github/ld139/EzHit/blob/main/colab/EZHit_FineTune_Colab.ipynb)

The Colab notebook allows users to fine-tune EZHit on their own enzyme–reaction datasets and export a fine-tuned checkpoint together with `train_distribution_stat.pt` for Mahalanobis-distance inference.

---


## Model variants

| Model group | File pattern | Description |
|---|---|---|
| General model | `binarycls_best_val_seed*.pt` | General enzyme–reaction compatibility model |
| Cytochrome P450 model | `ft_p450_best_seed*.pt` | Fine-tuned model for cytochrome P450-related prediction |
| Phosphatase model | `ft_phosphatase_best_seed*.pt` | Fine-tuned model for phosphatase-related prediction |
| Terpene synthase model | `ft_terpene_best_seed*.pt` | Fine-tuned model for terpene synthase-related prediction |

Each model group may contain multiple seed checkpoints for ensemble prediction.
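
As a minimal sketch of how seed checkpoints could be grouped for an ensemble, the snippet below matches a list of repository file names against the patterns in the table above. The helper name `group_checkpoints` and the example file list are hypothetical (only `binarycls_best_val_seed40.pt` appears elsewhere in this card). In practice the list could come from `huggingface_hub.list_repo_files`.

```python
from fnmatch import fnmatch

# File patterns copied from the model-variant table.
PATTERNS = {
    "general": "binarycls_best_val_seed*.pt",
    "p450": "ft_p450_best_seed*.pt",
    "phosphatase": "ft_phosphatase_best_seed*.pt",
    "terpene": "ft_terpene_best_seed*.pt",
}


def group_checkpoints(filenames):
    """Map each model group to the sorted checkpoint paths matching its pattern."""
    groups = {name: [] for name in PATTERNS}
    for path in filenames:
        basename = path.rsplit("/", 1)[-1]  # ignore any directory prefix
        for name, pattern in PATTERNS.items():
            if fnmatch(basename, pattern):
                groups[name].append(path)
    return {name: sorted(paths) for name, paths in groups.items()}


# Hypothetical file list, for illustration only.
example_files = [
    "checkpoints/binarycls_best_val_seed40.pt",
    "checkpoints/binarycls_best_val_seed41.pt",
    "checkpoints/ft_p450_best_seed40.pt",
    "README.md",
]
groups = group_checkpoints(example_files)
print(groups["general"])
```

Each list returned for a group can then be downloaded and loaded one checkpoint at a time to form the ensemble.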

---

## Download checkpoints

Install the HuggingFace Hub client:

```bash
pip install -U huggingface_hub
```

Download a checkpoint:

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="deanluo/EzHit",
    filename="checkpoints/binarycls_best_val_seed40.pt"
)

print(ckpt_path)
```

Download Mahalanobis statistics:

```python
from huggingface_hub import hf_hub_download

stat_path = hf_hub_download(
    repo_id="deanluo/EzHit",
    filename="uncertainty/general_train_distribution_stat.pt"
)

print(stat_path)
```

Download all files from this repository:

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="deanluo/EzHit",
    local_dir="EzHit_checkpoints"
)

print(local_dir)
```

---

## Input format

EZHit takes two main inputs:

| Input | Description |
|---|---|
| Enzyme sequence | Amino-acid sequence of the enzyme |
| Reaction SMILES | Reaction in `reactants>>products` format |

Example reaction SMILES:

```text
CCO>>CC=O
```
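
A quick format check can catch malformed reaction strings before they reach the model. The sketch below only verifies the `reactants>>products` layout and splits each side on `.`; it does not check chemical validity, for which a cheminformatics toolkit would be needed. The function name `split_reaction_smiles` is an assumption for illustration and is not part of the EZHit code.

```python
def split_reaction_smiles(rxn):
    """Split 'reactants>>products' into two lists of molecule SMILES."""
    if rxn.count(">>") != 1:
        raise ValueError(f"expected exactly one '>>' in {rxn!r}")
    left, right = rxn.split(">>")
    if ">" in left or ">" in right:
        raise ValueError(f"unexpected extra '>' in {rxn!r}")
    # Molecules on the same side are separated by '.'.
    reactants = [s for s in left.split(".") if s]
    products = [s for s in right.split(".") if s]
    if not reactants or not products:
        raise ValueError(f"empty reactant or product side in {rxn!r}")
    return reactants, products


simple = split_reaction_smiles("CCO>>CC=O")
multi = split_reaction_smiles("CC(=O)O.CCO>>CC(=O)OCC.O")
print(simple)
print(multi)
```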

For fine-tuning, the expected CSV format is:

```csv
protein_sequence,CANO_RXN_SMILES,Label
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG,CCO>>CC=O,1
MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0
```

Required columns:

| Column | Description |
|---|---|
| `protein_sequence` | Enzyme amino-acid sequence |
| `CANO_RXN_SMILES` | Reaction SMILES in `reactants>>products` format |
| `Label` | Binary label. `1` for compatible enzyme–reaction pairs and `0` for negative pairs |

An optional `split` column can be provided; each row takes one of the values `train`, `val`, or `test`.
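
Before fine-tuning, it can help to validate a CSV against the columns described above. The following is a minimal sketch using only the standard library; the function name `validate_rows` and the error messages are assumptions, and the actual notebook may perform its own checks.

```python
import csv
import io

REQUIRED_COLUMNS = ("protein_sequence", "CANO_RXN_SMILES", "Label")
ALLOWED_SPLITS = {"train", "val", "test"}


def validate_rows(handle):
    """Return a list of (line_number, message) problems found in the CSV."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [(1, f"missing columns: {', '.join(missing)}")]
    problems = []
    for row in reader:
        line_no = reader.line_num  # physical line just read
        if not (row.get("protein_sequence") or "").strip():
            problems.append((line_no, "empty protein_sequence"))
        if ">>" not in (row.get("CANO_RXN_SMILES") or ""):
            problems.append((line_no, "reaction lacks '>>'"))
        if (row.get("Label") or "").strip() not in {"0", "1"}:
            problems.append((line_no, "Label must be 0 or 1"))
        if "split" in header and (row.get("split") or "") not in ALLOWED_SPLITS:
            problems.append((line_no, "split must be train, val, or test"))
    return problems


# Small in-memory example; the second data row is deliberately invalid.
sample = io.StringIO(
    "protein_sequence,CANO_RXN_SMILES,Label\n"
    "MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,1\n"
    "MKKLLPTAAAGLLLLAAQPAMA,CCO>CC=O,2\n"
)
problems = validate_rows(sample)
print(problems)
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `io.StringIO` object.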

---

## Output interpretation

EZHit can report the following outputs:

| Output | Description |
|---|---|
| Match probability | Predicted enzyme–reaction compatibility probability |
| Ensemble uncertainty | Model-disagreement-based uncertainty estimate |
| Mahalanobis distance | Latent-space distance from the learned training distribution |
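
To illustrate how per-seed predictions might be combined, the sketch below averages the match probabilities from several seed checkpoints and uses their population standard deviation as a disagreement measure. Using the standard deviation is an assumption made for illustration; this card does not specify the exact uncertainty formula, and the name `summarize_ensemble` is hypothetical.

```python
from statistics import fmean, pstdev


def summarize_ensemble(seed_probs):
    """Return (mean probability, spread across seeds) for one enzyme-reaction pair."""
    if not seed_probs:
        raise ValueError("need at least one seed prediction")
    if any(p < 0.0 or p > 1.0 for p in seed_probs):
        raise ValueError("probabilities must lie in [0, 1]")
    mean_prob = fmean(seed_probs)
    # Assumption: disagreement is measured as the population standard deviation.
    spread = pstdev(seed_probs)
    return mean_prob, spread


# Hypothetical probabilities from three seed checkpoints.
mean_prob, spread = summarize_ensemble([0.9, 0.8, 0.7])
print(f"mean={mean_prob:.3f} spread={spread:.3f}")
```

A small spread means the seeds agree; a large spread flags pairs where the ensemble members diverge.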

A typical interpretation is:

| Probability | Mahalanobis distance | Interpretation |
|---|---|---|
| High | Low | High-priority candidate |
| High | High | Potentially useful but less reliable or out-of-distribution |
| Low | Low | In-distribution but predicted as incompatible |
| Low | High | Low-priority candidate |

Thresholds should be adjusted based on the model variant, dataset, and validation results.
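
The interpretation table can be expressed as a small triage helper. This is a sketch only: the function name `triage` and the default thresholds (`0.5` for probability, `10.0` for distance) are placeholder assumptions, not values recommended by EZHit, and should be replaced with thresholds chosen on validation data.

```python
def triage(probability, distance, prob_threshold=0.5, dist_threshold=10.0):
    """Map a (probability, Mahalanobis distance) pair to a row of the table."""
    high_prob = probability >= prob_threshold  # placeholder threshold
    low_dist = distance <= dist_threshold      # placeholder threshold
    if high_prob and low_dist:
        return "High-priority candidate"
    if high_prob:
        return "Potentially useful but less reliable or out-of-distribution"
    if low_dist:
        return "In-distribution but predicted as incompatible"
    return "Low-priority candidate"


print(triage(0.92, 4.1))
print(triage(0.92, 35.0))
```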

---

## Mahalanobis-distance statistics

Mahalanobis-distance inference requires a `train_distribution_stat.pt` file generated from the same model architecture and latent dimension as the checkpoint used for prediction.

The expected file contains:

```python
{
    "mean": positive_class_latent_mean,
    "inv_cov": inverse_covariance_matrix
}
```

The latent dimension of the statistics file must match the hidden dimension of the checkpoint. For example, if the model hidden dimension is 512, the expected shapes are:

```text
mean:    [512]
inv_cov: [512, 512]
```

If a checkpoint is fine-tuned with a different hidden dimension, the corresponding Mahalanobis statistics must be regenerated.

For very small fine-tuning datasets, covariance estimation may be unstable. In such cases, interpret the Mahalanobis distance cautiously and alongside the predicted probability and ensemble uncertainty.
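
For reference, the Mahalanobis distance of a latent vector `z` from the stored statistics is `sqrt((z - mean)^T · inv_cov · (z - mean))`. The sketch below computes it with plain Python lists and checks the shape requirement described above. It is a dependency-free illustration with a hypothetical function name; the released statistics are stored in a `.pt` file, so real inference would operate on tensors, and whether EZHit reports the distance or its square is not stated here.

```python
import math


def mahalanobis_distance(z, mean, inv_cov):
    """Distance of latent vector z from the distribution (mean, inv_cov)."""
    n = len(mean)
    if len(z) != n:
        raise ValueError(f"latent dimension {len(z)} does not match mean {n}")
    if len(inv_cov) != n or any(len(row) != n for row in inv_cov):
        raise ValueError(f"inv_cov must have shape [{n}, {n}]")
    diff = [zi - mi for zi, mi in zip(z, mean)]
    # Quadratic form diff^T * inv_cov * diff.
    quad = sum(
        diff[i] * inv_cov[i][j] * diff[j]
        for i in range(n)
        for j in range(n)
    )
    return math.sqrt(max(quad, 0.0))  # guard against tiny negative round-off


# Toy 2-D example: variances 1 and 4, so inv_cov has 1 and 0.25 on the diagonal.
toy_mean = [0.0, 0.0]
toy_inv_cov = [[1.0, 0.0], [0.0, 0.25]]
d = mahalanobis_distance([3.0, 4.0], toy_mean, toy_inv_cov)
print(round(d, 4))
```

The shape checks mirror the requirement that the statistics file and the checkpoint share the same latent dimension.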

---

## Fine-tuning

Users can fine-tune EZHit using the Colab notebook:

[Open EZHit fine-tuning notebook in Colab](https://colab.research.google.com/github/ld139/EzHit/blob/main/colab/EZHit_FineTune_Colab.ipynb)

The fine-tuning workflow exports:

| File | Description |
|---|---|
| `ezhit_finetuned_seed42.pt` | Fine-tuned checkpoint |
| `train_distribution_stat.pt` | Training-distribution statistics for Mahalanobis-distance inference |
| `val_predictions.csv` | Validation-set predictions |
| `test_predictions.csv` | Test-set predictions |

The fine-tuned checkpoint and `train_distribution_stat.pt` can be used for customized inference.

---

## Large-scale screening results

Large-scale enzyme–reaction screening results are distributed separately as a HuggingFace Dataset repository (link to be added).

Recommended location:

```text
https://huggingface.co/datasets/deanluo/EzHit-screening-results
```

The complete training and benchmark datasets will be archived separately on Zenodo (link to be added).

---

## Installation for local use

Clone the code repository:

```bash
git clone https://github.com/ld139/EzHit.git
cd EzHit
```

Install dependencies:

```bash
pip install -r requirements.txt
```

The required `kan.py` implementation is already included in the GitHub repository. No separate KAN package installation is required.

---


## License

This project is released under the MIT License.

---

## Contact

For questions, please use the GitHub Issues page:

[https://github.com/ld139/EzHit/issues](https://github.com/ld139/EzHit/issues)