license: mit
language:
- en
tags:
- enzyme
- enzyme-reaction
- reaction-retrieval
- protein-sequence
- protein-language-model
- bioinformatics
- computational-biology
- uncertainty
- mahalanobis-distance
library_name: pytorch
EZHit
EZHit is a lightweight enzyme–reaction retrieval model for predicting potential catalytic compatibility between enzyme sequences and biochemical reactions.
Given an enzyme amino-acid sequence and a reaction SMILES, EZHit estimates whether the enzyme is likely to catalyze the reaction. The released checkpoints can be used for enzyme–reaction pair prediction, custom fine-tuning, and uncertainty-aware inference with Mahalanobis-distance-based distribution assessment.
Online demo
An interactive web demo is available as a Hugging Face Space.
The Space supports:
- enzyme–reaction pair prediction
- ensemble probability output
- ensemble uncertainty estimation
- Mahalanobis-distance-based reliability assessment
- reaction visualization
Code and Colab notebook
The source code and Colab fine-tuning notebook are available at:
- GitHub repository: https://github.com/ld139/EzHit
- Colab fine-tuning notebook: Open in Colab
The Colab notebook allows users to fine-tune EZHit on their own enzyme–reaction datasets and export a fine-tuned checkpoint together with train_distribution_stat.pt for Mahalanobis-distance inference.
Model variants
| Model group | File pattern | Description |
|---|---|---|
| General model | binarycls_best_val_seed*.pt | General enzyme–reaction compatibility model |
| Cytochrome P450 model | ft_p450_best_seed*.pt | Fine-tuned model for cytochrome P450-related prediction |
| Phosphatase model | ft_phosphatase_best_seed*.pt | Fine-tuned model for phosphatase-related prediction |
| Terpene synthase model | ft_terpene_best_seed*.pt | Fine-tuned model for terpene synthase-related prediction |
Each model group may contain multiple seed checkpoints for ensemble prediction.
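To gather the seed checkpoints of one model group for an ensemble, one option is to filter a list of repository file names by the file pattern from the table. The sketch below is illustrative only: the helper `select_checkpoints` is hypothetical, and the example list is made up apart from the two file names that appear elsewhere in this card. In practice the list could come from `huggingface_hub.list_repo_files`.

```python
from fnmatch import fnmatchcase


def select_checkpoints(files, pattern):
    """Return the sorted paths whose base file name matches a glob pattern."""
    return sorted(f for f in files if fnmatchcase(f.rsplit("/", 1)[-1], pattern))


# Hypothetical file list for illustration; the seed41 and P450 entries are
# assumed names that follow the patterns in the table, not confirmed files.
example_files = [
    "checkpoints/binarycls_best_val_seed40.pt",
    "checkpoints/binarycls_best_val_seed41.pt",
    "checkpoints/ft_p450_best_seed40.pt",
    "uncertainty/general_train_distribution_stat.pt",
]

general = select_checkpoints(example_files, "binarycls_best_val_seed*.pt")
p450 = select_checkpoints(example_files, "ft_p450_best_seed*.pt")
print(general)
print(p450)
```

Matching on the base name keeps the helper independent of the directory layout inside the repository.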
Download checkpoints
Install the Hugging Face Hub client:
pip install -U huggingface_hub
Download a checkpoint:
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(
repo_id="deanluo/EzHit",
filename="checkpoints/binarycls_best_val_seed40.pt"
)
print(ckpt_path)
Download Mahalanobis statistics:
from huggingface_hub import hf_hub_download
stat_path = hf_hub_download(
repo_id="deanluo/EzHit",
filename="uncertainty/general_train_distribution_stat.pt"
)
print(stat_path)
Download all files from this repository:
from huggingface_hub import snapshot_download
local_dir = snapshot_download(
repo_id="deanluo/EzHit",
local_dir="EzHit_checkpoints"
)
print(local_dir)
Input format
EZHit takes two main inputs:
| Input | Description |
|---|---|
| Enzyme sequence | Amino-acid sequence of the enzyme |
| Reaction SMILES | Reaction in reactants>>products format |
Example reaction SMILES:
CCO>>CC=O
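Before passing a reaction to the model, a quick string-level check of the reactants>>products format can catch malformed inputs. The helper below is a minimal sketch with a hypothetical name (`split_reaction_smiles`); it relies only on the standard SMILES convention that `.` separates molecules, and it does not verify chemical validity or reproduce any canonicalization EZHit may apply.

```python
def split_reaction_smiles(rxn):
    """Split 'reactants>>products' into two lists of molecule SMILES strings."""
    if rxn.count(">>") != 1:
        raise ValueError(f"expected exactly one '>>' in {rxn!r}")
    left, right = rxn.split(">>")
    # '.' separates individual molecules on each side of the reaction.
    reactants = [s for s in left.split(".") if s]
    products = [s for s in right.split(".") if s]
    if not reactants or not products:
        raise ValueError(f"reactants and products must both be non-empty in {rxn!r}")
    return reactants, products


print(split_reaction_smiles("CCO>>CC=O"))  # (['CCO'], ['CC=O'])
```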
For fine-tuning, the expected CSV format is:
protein_sequence,CANO_RXN_SMILES,Label
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG,CCO>>CC=O,1
MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0
Required columns:
| Column | Description |
|---|---|
| protein_sequence | Enzyme amino-acid sequence |
| CANO_RXN_SMILES | Reaction SMILES in reactants>>products format |
| Label | Binary label. 1 for compatible enzyme–reaction pairs and 0 for negative pairs |
An optional split column can be provided with values train, val, and test.
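A fine-tuning CSV can be checked against these requirements before training. The following is a sketch using only the standard library; the function name `validate_rows` is hypothetical, the column name `split` is assumed from the description above, and the notebook may perform its own, different validation.

```python
import csv
import io

REQUIRED = ("protein_sequence", "CANO_RXN_SMILES", "Label")
ALLOWED_SPLITS = {"train", "val", "test"}


def validate_rows(handle):
    """Yield (line_number, message) for each problem found in the CSV."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        yield 1, f"missing columns: {missing}"
        return
    # Line numbers assume one record per line (no embedded newlines).
    for line_no, row in enumerate(reader, start=2):
        if not (row["protein_sequence"] or "").strip():
            yield line_no, "empty protein_sequence"
        if ">>" not in (row["CANO_RXN_SMILES"] or ""):
            yield line_no, "reaction is not in reactants>>products format"
        if (row["Label"] or "").strip() not in {"0", "1"}:
            yield line_no, "Label must be 0 or 1"
        if "split" in row and (row["split"] or "").strip() not in ALLOWED_SPLITS:
            yield line_no, "split must be train, val, or test"


# Second data row is deliberately wrong in three ways.
sample = """protein_sequence,CANO_RXN_SMILES,Label,split
MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0,train
MKKLLPTAAAGLLLLAAQPAMA,CCO,2,dev
"""
problems = list(validate_rows(io.StringIO(sample)))
for line_no, message in problems:
    print(f"line {line_no}: {message}")
```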
Output interpretation
EZHit can report the following outputs:
| Output | Description |
|---|---|
| Match probability | Predicted enzyme–reaction compatibility probability |
| Ensemble uncertainty | Model-disagreement-based uncertainty estimate |
| Mahalanobis distance | Latent-space distance from the learned training distribution |
A typical interpretation is:
| Probability | Mahalanobis distance | Interpretation |
|---|---|---|
| High | Low | High-priority candidate |
| High | High | Potentially useful but less reliable or out-of-distribution |
| Low | Low | In-distribution but predicted as incompatible |
| Low | High | Low-priority candidate |
Thresholds should be adjusted based on the model variant, dataset, and validation results.
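The interpretation table can be turned into a simple triage rule. The sketch below makes two assumptions that this card does not confirm: ensemble uncertainty is taken as the standard deviation of per-seed probabilities, and the cut-offs (`prob_cut=0.5`, `dist_cut=10.0`) are placeholders rather than recommended values. The per-seed probabilities are invented for illustration.

```python
from statistics import mean, pstdev


def summarize_ensemble(seed_probs):
    """Return (mean probability, spread across seeds) for one enzyme-reaction pair."""
    return mean(seed_probs), pstdev(seed_probs)


def triage(prob, distance, prob_cut=0.5, dist_cut=10.0):
    """Map a probability and a Mahalanobis distance to a row of the table."""
    high_prob = prob >= prob_cut
    low_dist = distance <= dist_cut
    if high_prob and low_dist:
        return "High-priority candidate"
    if high_prob:
        return "Potentially useful but less reliable or out-of-distribution"
    if low_dist:
        return "In-distribution but predicted as incompatible"
    return "Low-priority candidate"


seed_probs = [0.91, 0.88, 0.95]  # hypothetical outputs of three seed checkpoints
prob, spread = summarize_ensemble(seed_probs)
print(round(prob, 3), round(spread, 3), triage(prob, distance=4.2))
```

Both cut-offs are best chosen on a validation set for the specific model variant.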
Mahalanobis-distance statistics
Mahalanobis-distance inference requires a train_distribution_stat.pt file generated from the same model architecture and latent dimension as the checkpoint used for prediction.
The expected file contains:
{
"mean": positive_class_latent_mean,
"inv_cov": inverse_covariance_matrix
}
The latent dimension of the statistics file must match the hidden dimension of the checkpoint. For example, if the model hidden dimension is 512, the expected shapes are:
mean: [512]
inv_cov: [512, 512]
If a checkpoint is fine-tuned with a different hidden dimension, the corresponding Mahalanobis statistics must be regenerated.
For very small fine-tuning datasets, covariance estimation may be unstable. In such cases, Mahalanobis distance should be interpreted cautiously together with probability and ensemble uncertainty.
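To make the statistics and the distance concrete, here is a NumPy sketch of how a mean and inverse covariance could be estimated from positive-class latents and then used to score a new latent vector. It is not the repository's implementation: the function names, the ridge term added to stabilize the covariance for small datasets, and the use of the square-root form of the distance are all assumptions, and the latents are random stand-ins with dimension 8 instead of 512.

```python
import numpy as np


def fit_distribution_stats(latents, ridge=1e-3):
    """Estimate the mean and a ridge-regularized inverse covariance from [n, d] latents."""
    latents = np.asarray(latents, dtype=np.float64)
    mean = latents.mean(axis=0)
    centered = latents - mean
    cov = centered.T @ centered / max(len(latents) - 1, 1)
    # Assumed stabilizer: a small ridge keeps the covariance invertible when n is small.
    cov += ridge * np.eye(cov.shape[0])
    return {"mean": mean, "inv_cov": np.linalg.inv(cov)}


def mahalanobis(latent, stats):
    """Return sqrt((x - mean)^T inv_cov (x - mean)) after checking shapes."""
    latent = np.asarray(latent, dtype=np.float64)
    mean, inv_cov = stats["mean"], stats["inv_cov"]
    dim = mean.shape[0]
    if latent.shape != (dim,) or inv_cov.shape != (dim, dim):
        raise ValueError(f"latent dimension mismatch: expected {dim}")
    diff = latent - mean
    return float(np.sqrt(diff @ inv_cov @ diff))


rng = np.random.default_rng(0)
train_latents = rng.normal(size=(200, 8))  # stand-in for positive-class latents
stats = fit_distribution_stats(train_latents)

near = mahalanobis(stats["mean"], stats)        # a point at the training mean
far = mahalanobis(stats["mean"] + 5.0, stats)   # a point shifted away from it
print(near, far)
```

The shape check mirrors the requirement above that the statistics file and the checkpoint share the same latent dimension.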
Fine-tuning
Users can fine-tune EZHit with the Colab fine-tuning notebook listed under "Code and Colab notebook" above.
The fine-tuning workflow exports:
| File | Description |
|---|---|
| ezhit_finetuned_seed42.pt | Fine-tuned checkpoint |
| train_distribution_stat.pt | Training-distribution statistics for Mahalanobis-distance inference |
| val_predictions.csv | Validation-set predictions |
| test_predictions.csv | Test-set predictions |
The fine-tuned checkpoint and train_distribution_stat.pt can be used for customized inference.
Large-scale screening results
Large-scale enzyme–reaction screening results are provided separately as a Hugging Face Dataset repository:
TODO: add dataset repository link
Recommended location:
https://huggingface.co/datasets/deanluo/EzHit-screening-results
The complete training and benchmark datasets will be archived separately on Zenodo:
TODO: add Zenodo link
Installation for local use
Clone the code repository:
git clone https://github.com/ld139/EzHit.git
cd EzHit
Install dependencies:
pip install -r requirements.txt
The required kan.py implementation is already included in the GitHub repository. No separate KAN package installation is required.
License
This project is released under the MIT License.
Contact
For questions, please use the Issues page of the GitHub repository (ld139/EzHit).