license: mit
language:
  - en
tags:
  - enzyme
  - enzyme-reaction
  - reaction-retrieval
  - protein-sequence
  - protein-language-model
  - bioinformatics
  - computational-biology
  - uncertainty
  - mahalanobis-distance
library_name: pytorch

EZHit

EZHit is a lightweight enzyme–reaction retrieval model for predicting potential catalytic compatibility between enzyme sequences and biochemical reactions.

Given an enzyme amino-acid sequence and a reaction SMILES, EZHit estimates whether the enzyme is likely to catalyze the reaction. The released checkpoints can be used for enzyme–reaction pair prediction, custom fine-tuning, and uncertainty-aware inference with Mahalanobis-distance-based distribution assessment.

Online demo

An interactive web demo is available at:

EZHit Hugging Face Space

The Space supports:

- enzyme–reaction pair prediction
- ensemble probability output
- ensemble uncertainty estimation
- Mahalanobis-distance-based reliability assessment
- reaction visualization

Code and Colab notebook

The source code and Colab fine-tuning notebook are available at:

- GitHub repository: ld139/EzHit
- Colab fine-tuning notebook: Open in Colab

The Colab notebook allows users to fine-tune EZHit on their own enzyme–reaction datasets and export a fine-tuned checkpoint together with `train_distribution_stat.pt` for Mahalanobis-distance inference.

Model variants

| Model group | File pattern | Description |
| --- | --- | --- |
| General model | `binarycls_best_val_seed*.pt` | General enzyme–reaction compatibility model |
| Cytochrome P450 model | `ft_p450_best_seed*.pt` | Fine-tuned model for cytochrome P450-related prediction |
| Phosphatase model | `ft_phosphatase_best_seed*.pt` | Fine-tuned model for phosphatase-related prediction |
| Terpene synthase model | `ft_terpene_best_seed*.pt` | Fine-tuned model for terpene synthase-related prediction |

Each model group may contain multiple seed checkpoints for ensemble prediction.
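
To assemble an ensemble, the seed checkpoints of one model group first have to be collected. The following is a minimal sketch of that grouping step using the file patterns from the table. The group keys, the helper name `group_checkpoints`, and the example file listing (including the `seed41` file and the P450 path) are hypothetical and only illustrate the matching logic.

```python
from fnmatch import fnmatch

# File patterns copied from the model-variant table; the keys are illustrative.
CHECKPOINT_PATTERNS = {
    "general": "binarycls_best_val_seed*.pt",
    "p450": "ft_p450_best_seed*.pt",
    "phosphatase": "ft_phosphatase_best_seed*.pt",
    "terpene": "ft_terpene_best_seed*.pt",
}


def group_checkpoints(repo_files):
    """Map each model group to the sorted list of matching checkpoint paths."""
    groups = {name: [] for name in CHECKPOINT_PATTERNS}
    for path in repo_files:
        basename = path.rsplit("/", 1)[-1]  # match on the file name only
        for name, pattern in CHECKPOINT_PATTERNS.items():
            if fnmatch(basename, pattern):
                groups[name].append(path)
    return {name: sorted(paths) for name, paths in groups.items()}


# Hypothetical file listing, for illustration only.
example_files = [
    "README.md",
    "checkpoints/binarycls_best_val_seed40.pt",
    "checkpoints/binarycls_best_val_seed41.pt",
    "checkpoints/ft_p450_best_seed40.pt",
]
grouped = group_checkpoints(example_files)
```

In practice the file listing could be obtained with `huggingface_hub.list_repo_files("deanluo/EzHit")`, and each matching path can then be passed to `hf_hub_download` as the `filename` argument.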

Download checkpoints

Install the Hugging Face Hub client:

pip install -U huggingface_hub

Download a checkpoint:

from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="deanluo/EzHit",
    filename="checkpoints/binarycls_best_val_seed40.pt"
)

print(ckpt_path)

Download Mahalanobis statistics:

from huggingface_hub import hf_hub_download

stat_path = hf_hub_download(
    repo_id="deanluo/EzHit",
    filename="uncertainty/general_train_distribution_stat.pt"
)

print(stat_path)

Download all files from this repository:

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="deanluo/EzHit",
    local_dir="EzHit_checkpoints"
)

print(local_dir)

Input format

EZHit takes two main inputs:

| Input | Description |
| --- | --- |
| Enzyme sequence | Amino-acid sequence of the enzyme |
| Reaction SMILES | Reaction in `reactants>>products` format |

Example reaction SMILES:

CCO>>CC=O
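
As a quick sanity check before prediction, a reaction string can be split into its reactant and product sides. This is a minimal sketch that only checks the `reactants>>products` layout; it does not validate the chemistry of the individual SMILES. The helper name `split_reaction_smiles` is hypothetical and not part of the EZHit code.

```python
def split_reaction_smiles(rxn):
    """Split 'reactants>>products' into two lists of molecule SMILES."""
    if rxn.count(">>") != 1:
        raise ValueError(f"expected exactly one '>>' separator: {rxn!r}")
    reactants, products = (side.strip() for side in rxn.split(">>"))
    if not reactants or not products:
        raise ValueError(f"both sides of '>>' must be non-empty: {rxn!r}")
    if ">" in reactants or ">" in products:
        raise ValueError(f"unexpected extra '>' in reaction: {rxn!r}")
    # Molecules on the same side are separated by '.' in SMILES notation.
    return reactants.split("."), products.split(".")


reactants, products = split_reaction_smiles("CCO>>CC=O")
```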

For fine-tuning, the expected CSV format is:

protein_sequence,CANO_RXN_SMILES,Label
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG,CCO>>CC=O,1
MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0

Required columns:

| Column | Description |
| --- | --- |
| `protein_sequence` | Enzyme amino-acid sequence |
| `CANO_RXN_SMILES` | Reaction SMILES in `reactants>>products` format |
| `Label` | Binary label. `1` for compatible enzyme–reaction pairs and `0` for negative pairs |

An optional `split` column can be provided with values `train`, `val`, and `test`.
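
A fine-tuning CSV can be checked against these requirements before it is uploaded to the notebook. The sketch below uses only the Python standard library. The function name `validate_rows` is hypothetical, and the column name `split` for the optional column is an assumption based on the description above.

```python
import csv
import io

REQUIRED_COLUMNS = ("protein_sequence", "CANO_RXN_SMILES", "Label")
ALLOWED_SPLITS = {"train", "val", "test"}


def validate_rows(csv_text):
    """Return a list of (line_number, message) problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [(1, "missing columns: " + ", ".join(missing))]
    problems = []
    # Line 1 is the header; this numbering assumes no multi-line quoted fields.
    for line_no, row in enumerate(reader, start=2):
        if not (row["protein_sequence"] or "").strip():
            problems.append((line_no, "empty protein_sequence"))
        if (row["CANO_RXN_SMILES"] or "").count(">>") != 1:
            problems.append((line_no, "reaction is not in reactants>>products format"))
        if (row["Label"] or "").strip() not in {"0", "1"}:
            problems.append((line_no, "Label must be 0 or 1"))
        split = row.get("split")  # assumed name of the optional split column
        if split is not None and split.strip() not in ALLOWED_SPLITS:
            problems.append((line_no, "split must be train, val, or test"))
    return problems


example = (
    "protein_sequence,CANO_RXN_SMILES,Label\n"
    "MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,1\n"
    "MKKLLPTAAAGLLLLAAQPAMA,CCO>CC=O,2\n"
)
problems = validate_rows(example)
```

Here the second data row is reported twice, once for the malformed reaction and once for the invalid label.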

Output interpretation

EZHit can report the following outputs:

| Output | Description |
| --- | --- |
| Match probability | Predicted enzyme–reaction compatibility probability |
| Ensemble uncertainty | Model-disagreement-based uncertainty estimate |
| Mahalanobis distance | Latent-space distance from the learned training distribution |
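
To make the first two outputs concrete, the sketch below combines per-seed probabilities for one enzyme–reaction pair into an ensemble probability and a disagreement score. It assumes the ensemble probability is the mean across seeds and the uncertainty is the population standard deviation across seeds. This is one common choice, and the exact definition used by EZHit may differ. The function name and the example values are hypothetical.

```python
from statistics import mean, pstdev


def summarize_ensemble(seed_probabilities):
    """Return (mean probability, spread across seeds) for one enzyme-reaction pair."""
    if not seed_probabilities:
        raise ValueError("need at least one seed prediction")
    # Assumption: disagreement is measured as the population standard deviation.
    return mean(seed_probabilities), pstdev(seed_probabilities)


# Hypothetical outputs from three seed checkpoints.
probability, uncertainty = summarize_ensemble([0.90, 0.80, 0.85])
```

A larger spread means the seed checkpoints disagree more about the pair, so the averaged probability deserves less trust.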

A typical interpretation is:

| Probability | Mahalanobis distance | Interpretation |
| --- | --- | --- |
| High | Low | High-priority candidate |
| High | High | Potentially useful but less reliable or out-of-distribution |
| Low | Low | In-distribution but predicted as incompatible |
| Low | High | Low-priority candidate |

Thresholds should be adjusted based on the model variant, dataset, and validation results.
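
The interpretation table can be expressed as a small triage function. This is a sketch only: the function name `triage` is hypothetical, and the default thresholds (`0.5` for probability, `3.0` for distance) are placeholders rather than recommended values. They should be replaced with thresholds chosen on validation data for the model variant in use.

```python
def triage(probability, distance, prob_threshold=0.5, dist_threshold=3.0):
    """Map a (probability, Mahalanobis distance) pair to the table's interpretation."""
    high_prob = probability >= prob_threshold  # placeholder threshold
    low_dist = distance <= dist_threshold      # placeholder threshold
    if high_prob and low_dist:
        return "High-priority candidate"
    if high_prob:
        return "Potentially useful but less reliable or out-of-distribution"
    if low_dist:
        return "In-distribution but predicted as incompatible"
    return "Low-priority candidate"


label = triage(probability=0.92, distance=1.4)
```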

Mahalanobis-distance statistics

Mahalanobis-distance inference requires a `train_distribution_stat.pt` file generated from the same model architecture and latent dimension as the checkpoint used for prediction.

The expected file contains:

{
    "mean": positive_class_latent_mean,
    "inv_cov": inverse_covariance_matrix
}

The latent dimension of the statistics file must match the hidden dimension of the checkpoint. For example, if the model hidden dimension is 512, the expected shapes are:

mean:    [512]
inv_cov: [512, 512]

If a checkpoint is fine-tuned with a different hidden dimension, the corresponding Mahalanobis statistics must be regenerated.

For very small fine-tuning datasets, covariance estimation may be unstable. In such cases, interpret the Mahalanobis distance cautiously, alongside the predicted probability and ensemble uncertainty.
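
The instability is easiest to see when there are fewer training samples than latent dimensions, because the sample covariance is then singular and cannot be inverted. The sketch below illustrates one common remedy, adding a small ridge term to the diagonal before inversion, and then computes the distance sqrt((x - mean)^T inv_cov (x - mean)). It uses NumPy for illustration, whereas the released statistics are stored in a `.pt` file. The function names, the `ridge` value, and the random latents are assumptions, and the EZHit code may estimate the statistics differently.

```python
import numpy as np


def fit_distribution_stats(latents, ridge=1e-3):
    """Estimate the mean and a ridge-regularized inverse covariance from (n, d) latents."""
    latents = np.asarray(latents, dtype=np.float64)
    if latents.ndim != 2 or latents.shape[0] < 2:
        raise ValueError("latents must have shape (n, d) with n >= 2")
    mean = latents.mean(axis=0)
    cov = np.atleast_2d(np.cov(latents, rowvar=False))
    cov = cov + ridge * np.eye(cov.shape[0])  # ridge term keeps the matrix invertible
    return {"mean": mean, "inv_cov": np.linalg.inv(cov)}


def mahalanobis_distance(x, stats):
    """Distance of one latent vector from the fitted distribution."""
    delta = np.asarray(x, dtype=np.float64) - stats["mean"]
    return float(np.sqrt(delta @ stats["inv_cov"] @ delta))


# Hypothetical latents: 5 samples in 8 dimensions, so the raw covariance is singular.
rng = np.random.default_rng(0)
latents = rng.normal(size=(5, 8))
stats = fit_distribution_stats(latents)
distance = mahalanobis_distance(latents[0], stats)
```

With so few samples the resulting distances depend strongly on the chosen `ridge` value, which is one reason to treat them as a rough signal rather than a calibrated score.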

Fine-tuning

Users can fine-tune EZHit using the Colab notebook:

Open EZHit fine-tuning notebook in Colab

The fine-tuning workflow exports:

| File | Description |
| --- | --- |
| `ezhit_finetuned_seed42.pt` | Fine-tuned checkpoint |
| `train_distribution_stat.pt` | Training-distribution statistics for Mahalanobis-distance inference |
| `val_predictions.csv` | Validation-set predictions |
| `test_predictions.csv` | Test-set predictions |

The fine-tuned checkpoint and `train_distribution_stat.pt` can be used for customized inference.

Large-scale screening results

Large-scale enzyme–reaction screening results are provided separately as a Hugging Face Dataset repository:

TODO: add dataset repository link

Recommended location:

https://huggingface.co/datasets/deanluo/EzHit-screening-results

The complete training and benchmark datasets will be archived separately on Zenodo:

TODO: add Zenodo link

Installation for local use

Clone the code repository:

git clone https://github.com/ld139/EzHit.git
cd EzHit

Install dependencies:

pip install -r requirements.txt

The required `kan.py` implementation is included in the GitHub repository, so no separate KAN package needs to be installed.

License

This project is released under the MIT License.

Contact

For questions, please use the GitHub Issues page:

https://github.com/ld139/EzHit/issues