EZHit

EZHit is a lightweight enzyme–reaction retrieval model for predicting potential catalytic compatibility between enzyme sequences and biochemical reactions.

Given an enzyme amino-acid sequence and a reaction SMILES, EZHit estimates whether the enzyme is likely to catalyze the reaction. The released checkpoints can be used for enzyme–reaction pair prediction, custom fine-tuning, and uncertainty-aware inference, in which a Mahalanobis distance indicates how far an input lies from the training distribution.


Online demo

An interactive web demo is available at:

EZHit HuggingFace Space

The Space supports:

  • enzyme–reaction pair prediction
  • ensemble probability output
  • ensemble uncertainty estimation
  • Mahalanobis-distance-based reliability assessment
  • reaction visualization

Code and Colab notebook

The source code and Colab fine-tuning notebook are available in the GitHub repository at https://github.com/ld139/EzHit.

The Colab notebook allows users to fine-tune EZHit on their own enzyme–reaction datasets and export a fine-tuned checkpoint together with train_distribution_stat.pt for Mahalanobis-distance inference.


Model variants

| Model group | File pattern | Description |
| --- | --- | --- |
| General model | binarycls_best_val_seed*.pt | General enzyme–reaction compatibility model |
| Cytochrome P450 model | ft_p450_best_seed*.pt | Fine-tuned model for cytochrome P450-related prediction |
| Phosphatase model | ft_phosphatase_best_seed*.pt | Fine-tuned model for phosphatase-related prediction |
| Terpene synthase model | ft_terpene_best_seed*.pt | Fine-tuned model for terpene synthase-related prediction |

Each model group may contain multiple seed checkpoints for ensemble prediction.


Download checkpoints

Install the HuggingFace Hub client:

pip install -U huggingface_hub

Download a checkpoint:

from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="deanluo/EzHit",
    filename="checkpoints/binarycls_best_val_seed40.pt"
)

print(ckpt_path)

Download Mahalanobis statistics:

from huggingface_hub import hf_hub_download

stat_path = hf_hub_download(
    repo_id="deanluo/EzHit",
    filename="uncertainty/general_train_distribution_stat.pt"
)

print(stat_path)

Download all files from this repository:

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="deanluo/EzHit",
    local_dir="EzHit_checkpoints"
)

print(local_dir)
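
To fetch every seed checkpoint of one model group for ensemble prediction, the repository file list can be filtered by the file patterns listed under Model variants. The following is a minimal sketch, not part of the EZHit code base: select_seed_checkpoints and download_group are hypothetical helper names, and the sketch assumes only the huggingface_hub functions list_repo_files and hf_hub_download.

```python
from fnmatch import fnmatch


def select_seed_checkpoints(repo_files, pattern):
    """Return repo paths whose file name matches a model-group pattern, sorted."""
    return sorted(
        path for path in repo_files
        if fnmatch(path.rsplit("/", 1)[-1], pattern)
    )


def download_group(repo_id, pattern):
    """Download every checkpoint in repo_id that matches pattern (hypothetical helper)."""
    # Imported lazily so the filtering logic above has no third-party dependency.
    from huggingface_hub import hf_hub_download, list_repo_files

    matches = select_seed_checkpoints(list_repo_files(repo_id), pattern)
    return [hf_hub_download(repo_id=repo_id, filename=name) for name in matches]


# Example call (requires network access):
# paths = download_group("deanluo/EzHit", "binarycls_best_val_seed*.pt")
```

Matching on the file name alone means the sketch does not depend on which folder a checkpoint is stored in.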

Input format

EZHit takes two main inputs:

| Input | Description |
| --- | --- |
| Enzyme sequence | Amino-acid sequence of the enzyme |
| Reaction SMILES | Reaction in reactants>>products format |

Example reaction SMILES:

CCO>>CC=O
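
A quick format check can catch malformed reaction strings before they reach the model. The sketch below is not part of EZHit: split_reaction_smiles is a hypothetical helper that only verifies the reactants>>products shape and splits each side on ".", without checking chemical validity.

```python
def split_reaction_smiles(rxn):
    """Split 'reactants>>products' into two lists of molecule SMILES.

    Hypothetical helper: checks only the overall shape, not the chemistry.
    """
    parts = rxn.strip().split(">>")
    if len(parts) != 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected 'reactants>>products', got {rxn!r}")
    # Molecules on the same side of a reaction are separated by '.'.
    reactants, products = (side.split(".") for side in parts)
    if "" in reactants or "" in products:
        raise ValueError(f"empty molecule in {rxn!r}")
    return reactants, products


print(split_reaction_smiles("CCO>>CC=O"))  # (['CCO'], ['CC=O'])
```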

For fine-tuning, the expected CSV format is:

protein_sequence,CANO_RXN_SMILES,Label
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG,CCO>>CC=O,1
MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0

Required columns:

| Column | Description |
| --- | --- |
| protein_sequence | Enzyme amino-acid sequence |
| CANO_RXN_SMILES | Reaction SMILES in reactants>>products format |
| Label | Binary label: 1 for compatible enzyme–reaction pairs, 0 for negative pairs |

An optional split column can be provided with values train, val, and test.
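
Before fine-tuning, it may help to confirm that a CSV follows the layout described above. The following sketch uses only the Python standard library. check_finetune_csv is a hypothetical helper, not part of EZHit, and it assumes the optional column is named exactly split.

```python
import csv
import io

REQUIRED = ("protein_sequence", "CANO_RXN_SMILES", "Label")
SPLITS = {"train", "val", "test"}


def check_finetune_csv(handle):
    """Return a list of problems found in a fine-tuning CSV (empty if none)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED if col not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not row["protein_sequence"]:
            problems.append(f"line {line_no}: empty protein_sequence")
        if ">>" not in (row["CANO_RXN_SMILES"] or ""):
            problems.append(f"line {line_no}: reaction lacks '>>'")
        if row["Label"] not in ("0", "1"):
            problems.append(f"line {line_no}: Label must be 0 or 1")
        # Assumption: the optional column is called 'split'.
        if "split" in header and row["split"] not in SPLITS:
            problems.append(f"line {line_no}: split must be train, val, or test")
    return problems


sample = io.StringIO(
    "protein_sequence,CANO_RXN_SMILES,Label\n"
    "MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0\n"
)
print(check_finetune_csv(sample))  # []
```

For a file on disk, pass an open handle, for example open(path, newline="").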


Output interpretation

EZHit can report the following outputs:

| Output | Description |
| --- | --- |
| Match probability | Predicted enzyme–reaction compatibility probability |
| Ensemble uncertainty | Model-disagreement-based uncertainty estimate |
| Mahalanobis distance | Latent-space distance from the learned training distribution |
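
As an illustration of how per-seed predictions could be combined, the sketch below averages the match probabilities from several seed checkpoints and uses their standard deviation as a disagreement measure. Taking the standard deviation as the uncertainty is an assumption made for this example and may differ from how EZHit computes it. summarize_ensemble is a hypothetical helper name.

```python
from statistics import fmean, pstdev


def summarize_ensemble(seed_probs):
    """Collapse per-seed match probabilities into (mean, spread).

    Assumption: spread is the population standard deviation across seeds.
    """
    if not seed_probs:
        raise ValueError("need at least one seed probability")
    return fmean(seed_probs), pstdev(seed_probs)


prob, spread = summarize_ensemble([0.90, 0.80, 0.85])
print(round(prob, 3), round(spread, 3))  # 0.85 0.041
```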

A typical interpretation is:

| Probability | Mahalanobis distance | Interpretation |
| --- | --- | --- |
| High | Low | High-priority candidate |
| High | High | Potentially useful but less reliable or out-of-distribution |
| Low | Low | In-distribution but predicted as incompatible |
| Low | High | Low-priority candidate |

Thresholds should be adjusted based on the model variant, dataset, and validation results.
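
The four-way interpretation above can be expressed as a small triage function. This is a sketch only: triage is a hypothetical helper, and the default cut-offs prob_cut and dist_cut are placeholders chosen for illustration, not recommended values.

```python
def triage(prob, distance, prob_cut=0.5, dist_cut=10.0):
    """Map (match probability, Mahalanobis distance) to an interpretation.

    prob_cut and dist_cut are placeholder thresholds; tune them on validation data.
    """
    high_prob = prob >= prob_cut
    in_distribution = distance <= dist_cut
    if high_prob and in_distribution:
        return "high-priority candidate"
    if high_prob:
        return "potentially useful, less reliable or out-of-distribution"
    if in_distribution:
        return "in-distribution, predicted incompatible"
    return "low-priority candidate"


print(triage(0.92, 4.1))  # high-priority candidate
```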


Mahalanobis-distance statistics

Mahalanobis-distance inference requires a train_distribution_stat.pt file generated from the same model architecture and latent dimension as the checkpoint used for prediction.

The expected file contains:

{
    "mean": positive_class_latent_mean,
    "inv_cov": inverse_covariance_matrix
}
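
Given these two entries, the Mahalanobis distance of a latent vector z is the square root of (z - mean)^T inv_cov (z - mean). The sketch below spells out that formula in dependency-free Python for clarity. In practice the same arithmetic would be applied to the tensors stored in the file, and how EZHit extracts the latent vector z is not shown here. mahalanobis_distance is a hypothetical helper name.

```python
import math


def mahalanobis_distance(z, mean, inv_cov):
    """Distance of latent vector z from the training distribution.

    z and mean are sequences of length d; inv_cov is a d x d nested sequence.
    """
    diff = [zi - mi for zi, mi in zip(z, mean)]
    # Row-by-row matrix-vector product inv_cov @ diff.
    projected = [sum(c * d for c, d in zip(row, diff)) for row in inv_cov]
    squared = sum(d * p for d, p in zip(diff, projected))
    # Clamp tiny negative values caused by floating-point round-off.
    return math.sqrt(max(squared, 0.0))


# Toy 2-D example: with an identity inv_cov this reduces to Euclidean distance.
print(mahalanobis_distance([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 5.0
```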

The latent dimension of the statistics file must match the hidden dimension of the checkpoint. For example, if the model hidden dimension is 512, the expected shapes are:

mean:    [512]
inv_cov: [512, 512]

If a checkpoint is fine-tuned with a different hidden dimension, the corresponding Mahalanobis statistics must be regenerated.
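
A shape check at load time can surface a mismatch between a statistics file and a checkpoint early. The following is a minimal sketch under stated assumptions: check_stat_shapes and load_stats are hypothetical helpers, the file is assumed to be a plain dictionary readable with torch.load, and the caller is assumed to know the checkpoint's hidden dimension.

```python
def check_stat_shapes(stats, hidden_dim):
    """Raise ValueError unless mean is [hidden_dim] and inv_cov is [hidden_dim, hidden_dim]."""
    expected = {"mean": (hidden_dim,), "inv_cov": (hidden_dim, hidden_dim)}
    for key, shape in expected.items():
        if key not in stats:
            raise ValueError(f"statistics file lacks '{key}'")
        found = tuple(stats[key].shape)
        if found != shape:
            raise ValueError(f"{key}: expected {shape}, found {found}")


def load_stats(path, hidden_dim):
    """Load a statistics file and validate it (assumes a dict of tensors)."""
    import torch  # imported lazily so the shape check itself has no dependencies

    stats = torch.load(path, map_location="cpu")
    check_stat_shapes(stats, hidden_dim)
    return stats


# Example call, assuming a hidden dimension of 512:
# stats = load_stats("train_distribution_stat.pt", hidden_dim=512)
```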

For very small fine-tuning datasets, covariance estimation may be unstable, particularly when there are fewer training examples than latent dimensions. In such cases, interpret the Mahalanobis distance cautiously, alongside the probability and ensemble uncertainty.


Fine-tuning

Users can fine-tune EZHit using the Colab notebook:

Open EZHit fine-tuning notebook in Colab

The fine-tuning workflow exports:

| File | Description |
| --- | --- |
| ezhit_finetuned_seed42.pt | Fine-tuned checkpoint |
| train_distribution_stat.pt | Training-distribution statistics for Mahalanobis-distance inference |
| val_predictions.csv | Validation-set predictions |
| test_predictions.csv | Test-set predictions |

The fine-tuned checkpoint and train_distribution_stat.pt can be used for customized inference.


Large-scale screening results

Large-scale enzyme–reaction screening results are provided separately as a HuggingFace Dataset repository:

TODO: add dataset repository link

Recommended location:

https://huggingface.co/datasets/deanluo/EzHit-screening-results

The complete training and benchmark datasets will be archived separately on Zenodo:

TODO: add Zenodo link


Installation for local use

Clone the code repository:

git clone https://github.com/ld139/EzHit.git
cd EzHit

Install dependencies:

pip install -r requirements.txt

The required kan.py implementation is already included in the GitHub repository. No separate KAN package installation is required.


License

This project is released under the MIT License.


Contact

For questions, please use the GitHub Issues page:

https://github.com/ld139/EzHit/issues
