---
license: mit
language:
- en
tags:
- enzyme
- enzyme-reaction
- reaction-retrieval
- protein-sequence
- protein-language-model
- bioinformatics
- computational-biology
- uncertainty
- mahalanobis-distance
library_name: pytorch
---
# EZHit
**EZHit** is a lightweight enzyme–reaction retrieval model for predicting potential catalytic compatibility between enzyme sequences and biochemical reactions.
Given an enzyme amino-acid sequence and a reaction SMILES, EZHit estimates whether the enzyme is likely to catalyze the reaction. The released checkpoints can be used for enzyme–reaction pair prediction, custom fine-tuning, and uncertainty-aware inference with Mahalanobis-distance-based distribution assessment.
---
## Online demo
An interactive web demo is available at:
[EZHit HuggingFace Space](https://huggingface.co/spaces/deanluo/Enzyme-Catalysis-Predictor)
The Space supports:
- enzyme–reaction pair prediction
- ensemble probability output
- ensemble uncertainty estimation
- Mahalanobis-distance-based reliability assessment
- reaction visualization
---
## Code and Colab notebook
The source code and Colab fine-tuning notebook are available at:
- GitHub repository: [ld139/EzHit](https://github.com/ld139/EzHit)
- Colab fine-tuning notebook: [Open in Colab](https://colab.research.google.com/github/ld139/EzHit/blob/main/colab/EZHit_FineTune_Colab.ipynb)
The Colab notebook allows users to fine-tune EZHit on their own enzyme–reaction datasets and export a fine-tuned checkpoint together with `train_distribution_stat.pt` for Mahalanobis-distance inference.
---
## Model variants
| Model group | File pattern | Description |
|---|---|---|
| General model | `binarycls_best_val_seed*.pt` | General enzyme–reaction compatibility model |
| Cytochrome P450 model | `ft_p450_best_seed*.pt` | Fine-tuned model for cytochrome P450-related prediction |
| Phosphatase model | `ft_phosphatase_best_seed*.pt` | Fine-tuned model for phosphatase-related prediction |
| Terpene synthase model | `ft_terpene_best_seed*.pt` | Fine-tuned model for terpene synthase-related prediction |
Each model group may contain multiple seed checkpoints for ensemble prediction.
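As a rough illustration of how per-seed outputs might be combined, the sketch below averages the match probabilities from several seed checkpoints and reports their spread as a disagreement score. The function name `ensemble_summary` and the choice of standard deviation as the disagreement measure are assumptions for illustration; the released code may aggregate seeds differently.

```python
from statistics import mean, pstdev


def ensemble_summary(seed_probs):
    """Combine per-seed match probabilities for one enzyme-reaction pair.

    seed_probs: list of probabilities in [0, 1], one per seed checkpoint.
    Returns (mean probability, population standard deviation across seeds).
    """
    if not seed_probs:
        raise ValueError("need at least one per-seed probability")
    if any(not 0.0 <= p <= 1.0 for p in seed_probs):
        raise ValueError("probabilities must lie in [0, 1]")
    return mean(seed_probs), pstdev(seed_probs)


# Hypothetical probabilities from three seed checkpoints for the same pair.
prob, spread = ensemble_summary([0.90, 0.80, 0.70])
print(f"ensemble probability={prob:.3f}, disagreement={spread:.3f}")
```

A larger spread means the seed checkpoints disagree more about the pair, so the averaged probability deserves less trust.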
---
## Download checkpoints
Install the HuggingFace Hub client:
```bash
pip install -U huggingface_hub
```
Download a checkpoint:
```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="deanluo/EzHit",
    filename="checkpoints/binarycls_best_val_seed40.pt",
)
print(ckpt_path)
```
Download Mahalanobis statistics:
```python
from huggingface_hub import hf_hub_download

stat_path = hf_hub_download(
    repo_id="deanluo/EzHit",
    filename="uncertainty/general_train_distribution_stat.pt",
)
print(stat_path)
```
Download all files from this repository:
```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="deanluo/EzHit",
    local_dir="EzHit_checkpoints",
)
print(local_dir)
```
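To fetch only one model group instead of the whole repository, `snapshot_download` accepts an `allow_patterns` filter. The sketch below reuses the file patterns from the model variants table, with a leading `*` so that they match regardless of the folder a checkpoint sits in. The helper names `matches` and `download_group` and the group keys are hypothetical.

```python
from fnmatch import fnmatch

# File patterns from the model variants table, prefixed with "*" so they
# match whatever folder the checkpoints live in (the layout is not assumed).
GROUP_PATTERNS = {
    "general": "*binarycls_best_val_seed*.pt",
    "p450": "*ft_p450_best_seed*.pt",
    "phosphatase": "*ft_phosphatase_best_seed*.pt",
    "terpene": "*ft_terpene_best_seed*.pt",
}


def matches(group, repo_path):
    """Check locally whether a repository path belongs to a model group."""
    return fnmatch(repo_path, GROUP_PATTERNS[group])


def download_group(group, local_dir="EzHit_checkpoints"):
    """Download every seed checkpoint of one model group (needs network access)."""
    from huggingface_hub import snapshot_download  # imported only when called

    return snapshot_download(
        repo_id="deanluo/EzHit",
        allow_patterns=[GROUP_PATTERNS[group]],
        local_dir=local_dir,
    )


# Example call, commented out so the snippet runs offline:
# print(download_group("p450"))
```

Note that this filter fetches checkpoints only; the matching Mahalanobis statistics file still has to be downloaded separately.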
---
## Input format
EZHit takes two main inputs:
| Input | Description |
|---|---|
| Enzyme sequence | Amino-acid sequence of the enzyme |
| Reaction SMILES | Reaction in `reactants>>products` format |
Example reaction SMILES:
```text
CCO>>CC=O
```
For fine-tuning, the expected CSV format is:
```csv
protein_sequence,CANO_RXN_SMILES,Label
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG,CCO>>CC=O,1
MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0
```
Required columns:
| Column | Description |
|---|---|
| `protein_sequence` | Enzyme amino-acid sequence |
| `CANO_RXN_SMILES` | Reaction SMILES in `reactants>>products` format |
| `Label` | Binary label: `1` for compatible enzyme–reaction pairs, `0` for negative pairs |
An optional `split` column can be provided; each row must then be marked `train`, `val`, or `test`.
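Before fine-tuning, it can help to check a CSV against the format described above. The following is a minimal sketch using only the standard library. The function name `validate_csv` and the specific checks are illustrative assumptions, not part of the EZHit code base, and the notebook may apply different validation.

```python
import csv
import io

REQUIRED = ("protein_sequence", "CANO_RXN_SMILES", "Label")
ALLOWED_SPLITS = {"train", "val", "test"}


def validate_csv(handle):
    """Return a list of human-readable problems found in a fine-tuning CSV."""
    reader = csv.DictReader(handle)
    fields = reader.fieldnames or []
    missing = [c for c in REQUIRED if c not in fields]
    if missing:
        return [f"missing required column(s): {', '.join(missing)}"]
    problems = []
    for row in reader:
        where = f"line {reader.line_num}"
        if not (row["protein_sequence"] or "").strip():
            problems.append(f"{where}: empty protein_sequence")
        # Expect exactly one '>>' with non-empty reactants and products.
        sides = (row["CANO_RXN_SMILES"] or "").strip().split(">>")
        if len(sides) != 2 or not all(s.strip() for s in sides):
            problems.append(f"{where}: reaction is not reactants>>products")
        if (row["Label"] or "").strip() not in {"0", "1"}:
            problems.append(f"{where}: Label must be 0 or 1")
        if "split" in fields and (row["split"] or "").strip() not in ALLOWED_SPLITS:
            problems.append(f"{where}: split must be train, val, or test")
    return problems


sample = (
    "protein_sequence,CANO_RXN_SMILES,Label\n"
    "MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0\n"
)
print(validate_csv(io.StringIO(sample)))
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` object.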
---
## Output interpretation
EZHit can report the following outputs:
| Output | Description |
|---|---|
| Match probability | Predicted enzyme–reaction compatibility probability |
| Ensemble uncertainty | Model-disagreement-based uncertainty estimate |
| Mahalanobis distance | Latent-space distance from the learned training distribution |
A typical interpretation is:
| Probability | Mahalanobis distance | Interpretation |
|---|---|---|
| High | Low | High-priority candidate |
| High | High | Potentially useful but less reliable or out-of-distribution |
| Low | Low | In-distribution but predicted as incompatible |
| Low | High | Low-priority candidate |
Thresholds should be adjusted based on the model variant, dataset, and validation results.
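The interpretation table can be expressed as a small decision rule. In the sketch below, the function name `triage` is hypothetical and the threshold values in the example call are arbitrary placeholders rather than recommended settings; they should come from validation results for the model variant in use.

```python
def triage(probability, distance, prob_threshold, dist_threshold):
    """Map a (probability, Mahalanobis distance) pair to the table's categories."""
    high_prob = probability >= prob_threshold
    low_dist = distance <= dist_threshold
    if high_prob and low_dist:
        return "High-priority candidate"
    if high_prob:
        return "Potentially useful but less reliable or out-of-distribution"
    if low_dist:
        return "In-distribution but predicted as incompatible"
    return "Low-priority candidate"


# Placeholder thresholds chosen only for illustration.
print(triage(probability=0.92, distance=4.1, prob_threshold=0.5, dist_threshold=10.0))
```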
---
## Mahalanobis-distance statistics
Mahalanobis-distance inference requires a `train_distribution_stat.pt` file generated from the same model architecture and latent dimension as the checkpoint used for prediction.
The expected file contains:
```python
{
    "mean": positive_class_latent_mean,
    "inv_cov": inverse_covariance_matrix,
}
```
The latent dimension of the statistics file must match the hidden dimension of the checkpoint. For example, if the model hidden dimension is 512, the expected shapes are:
```text
mean: [512]
inv_cov: [512, 512]
```
If a checkpoint is fine-tuned with a different hidden dimension, the corresponding Mahalanobis statistics must be regenerated.
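A quick shape check can catch a mismatched statistics file before inference. The sketch below assumes the statistics have already been loaded into a dictionary (for example with `torch.load`) and that the hidden dimension of the checkpoint is known to the caller. `check_stats` is a hypothetical helper, and NumPy arrays stand in for the tensors here.

```python
import numpy as np


def check_stats(stats, hidden_dim):
    """Raise ValueError unless `stats` has mean [D] and inv_cov [D, D]."""
    for key in ("mean", "inv_cov"):
        if key not in stats:
            raise ValueError(f"statistics are missing the '{key}' entry")
    # `.shape` exists on both NumPy arrays and PyTorch tensors.
    mean_shape = tuple(stats["mean"].shape)
    inv_cov_shape = tuple(stats["inv_cov"].shape)
    if mean_shape != (hidden_dim,):
        raise ValueError(f"mean has shape {mean_shape}, expected ({hidden_dim},)")
    if inv_cov_shape != (hidden_dim, hidden_dim):
        raise ValueError(
            f"inv_cov has shape {inv_cov_shape}, expected ({hidden_dim}, {hidden_dim})"
        )


# Stand-in statistics with the 512-dimensional layout from the example above.
good = {"mean": np.zeros(512), "inv_cov": np.eye(512)}
check_stats(good, hidden_dim=512)  # passes silently
```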
For very small fine-tuning datasets, covariance estimation may be unstable. In such cases, interpret the Mahalanobis distance cautiously and alongside the match probability and ensemble uncertainty.
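For reference, the Mahalanobis distance of a latent vector $z$ from a distribution with mean $\mu$ and covariance $\Sigma$ is $\sqrt{(z-\mu)^\top \Sigma^{-1} (z-\mu)}$. The sketch below computes it with NumPy and shows one common way to stabilize the covariance when only a few latent vectors are available: adding a small ridge term to the diagonal before inversion. The ridge term, its default value, and the function names are assumptions for illustration; the EZHit code may estimate the statistics differently or report the squared distance.

```python
import numpy as np


def fit_stats(latents, ridge=1e-3):
    """Estimate mean and inverse covariance from an (N, D) array of latents."""
    latents = np.asarray(latents, dtype=np.float64)
    mean = latents.mean(axis=0)
    centered = latents - mean
    cov = centered.T @ centered / max(len(latents) - 1, 1)
    # Ridge on the diagonal keeps the matrix invertible even when N < D.
    cov += ridge * np.eye(cov.shape[0])
    return {"mean": mean, "inv_cov": np.linalg.inv(cov)}


def mahalanobis(z, stats):
    """Distance of one latent vector from the fitted distribution."""
    diff = np.asarray(z, dtype=np.float64) - stats["mean"]
    squared = float(diff @ stats["inv_cov"] @ diff)
    return float(np.sqrt(max(squared, 0.0)))  # guard against tiny negatives


# Hypothetical case: only 5 positive-class latents in an 8-dimensional space.
rng = np.random.default_rng(0)
latents = rng.normal(size=(5, 8))
stats = fit_stats(latents)
print(mahalanobis(latents[0], stats))
```

Without the ridge term, the sample covariance in this example would be singular, since five points cannot span eight dimensions.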
---
## Fine-tuning
Users can fine-tune EZHit using the Colab notebook:
[Open EZHit fine-tuning notebook in Colab](https://colab.research.google.com/github/ld139/EzHit/blob/main/colab/EZHit_FineTune_Colab.ipynb)
The fine-tuning workflow exports:
| File | Description |
|---|---|
| `ezhit_finetuned_seed42.pt` | Fine-tuned checkpoint |
| `train_distribution_stat.pt` | Training-distribution statistics for Mahalanobis-distance inference |
| `val_predictions.csv` | Validation-set predictions |
| `test_predictions.csv` | Test-set predictions |
The fine-tuned checkpoint and `train_distribution_stat.pt` can be used for customized inference.
---
## Large-scale screening results
Large-scale enzyme–reaction screening results are provided separately as a HuggingFace Dataset repository:
`TODO: add dataset repository link`
Recommended location:
```text
https://huggingface.co/datasets/deanluo/EzHit-screening-results
```
The complete training and benchmark datasets will be archived separately on Zenodo:
`TODO: add Zenodo link`
---
## Installation for local use
Clone the code repository:
```bash
git clone https://github.com/ld139/EzHit.git
cd EzHit
```
Install dependencies:
```bash
pip install -r requirements.txt
```
The required `kan.py` implementation is already included in the GitHub repository. No separate KAN package installation is required.
---
## License
This project is released under the MIT License.
---
## Contact
For questions, please use the GitHub Issues page:
[https://github.com/ld139/EzHit/issues](https://github.com/ld139/EzHit/issues)