---
license: mit
language:
- en
tags:
- enzyme
- enzyme-reaction
- reaction-retrieval
- protein-sequence
- protein-language-model
- bioinformatics
- computational-biology
- uncertainty
- mahalanobis-distance
library_name: pytorch
---

# EZHit

**EZHit** is a lightweight enzyme–reaction retrieval model for predicting potential catalytic compatibility between enzyme sequences and biochemical reactions.

Given an enzyme amino-acid sequence and a reaction SMILES, EZHit estimates whether the enzyme is likely to catalyze the reaction.

The released checkpoints can be used for enzyme–reaction pair prediction, custom fine-tuning, and uncertainty-aware inference with Mahalanobis-distance-based distribution assessment.

---

## Online demo

An interactive web demo is available at:

[EZHit HuggingFace Space](https://huggingface.co/spaces/deanluo/Enzyme-Catalysis-Predictor)

The Space supports:

- enzyme–reaction pair prediction
- ensemble probability output
- ensemble uncertainty estimation
- Mahalanobis-distance-based reliability assessment
- reaction visualization

---

## Code and Colab notebook

The source code and Colab fine-tuning notebook are available at:

- GitHub repository: [ld139/EzHit](https://github.com/ld139/EzHit)
- Colab fine-tuning notebook: [Open in Colab](https://colab.research.google.com/github/ld139/EzHit/blob/main/colab/EZHit_FineTune_Colab.ipynb)

The Colab notebook allows users to fine-tune EZHit on their own enzyme–reaction datasets and export a fine-tuned checkpoint together with `train_distribution_stat.pt` for Mahalanobis-distance inference.
---

## Model variants

| Model group | File pattern | Description |
|---|---|---|
| General model | `binarycls_best_val_seed*.pt` | General enzyme–reaction compatibility model |
| Cytochrome P450 model | `ft_p450_best_seed*.pt` | Fine-tuned model for cytochrome P450-related prediction |
| Phosphatase model | `ft_phosphatase_best_seed*.pt` | Fine-tuned model for phosphatase-related prediction |
| Terpene synthase model | `ft_terpene_best_seed*.pt` | Fine-tuned model for terpene synthase-related prediction |

Each model group may contain multiple seed checkpoints for ensemble prediction.

---

## Download checkpoints

Install the HuggingFace Hub client:

```bash
pip install -U huggingface_hub
```

Download a checkpoint:

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="deanluo/EzHit",
    filename="checkpoints/binarycls_best_val_seed40.pt"
)
print(ckpt_path)
```

Download Mahalanobis statistics:

```python
from huggingface_hub import hf_hub_download

stat_path = hf_hub_download(
    repo_id="deanluo/EzHit",
    filename="uncertainty/general_train_distribution_stat.pt"
)
print(stat_path)
```

Download all files from this repository:

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="deanluo/EzHit",
    local_dir="EzHit_checkpoints"
)
print(local_dir)
```

---

## Input format

EZHit takes two main inputs:

| Input | Description |
|---|---|
| Enzyme sequence | Amino-acid sequence of the enzyme |
| Reaction SMILES | Reaction in `reactants>>products` format |

Example reaction SMILES:

```text
CCO>>CC=O
```

For fine-tuning, the expected CSV format is:

```csv
protein_sequence,CANO_RXN_SMILES,Label
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG,CCO>>CC=O,1
MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0
```

Required columns:

| Column | Description |
|---|---|
| `protein_sequence` | Enzyme amino-acid sequence |
| `CANO_RXN_SMILES` | Reaction SMILES in `reactants>>products` format |
| `Label` | Binary label. `1` for compatible enzyme–reaction pairs and `0` for negative pairs |

An optional `split` column can be provided with values `train`, `val`, and `test`.

---

## Output interpretation

EZHit can report the following outputs:

| Output | Description |
|---|---|
| Match probability | Predicted enzyme–reaction compatibility probability |
| Ensemble uncertainty | Model-disagreement-based uncertainty estimate |
| Mahalanobis distance | Latent-space distance from the learned training distribution |

A typical interpretation is:

| Probability | Mahalanobis distance | Interpretation |
|---|---|---|
| High | Low | High-priority candidate |
| High | High | Potentially useful but less reliable or out-of-distribution |
| Low | Low | In-distribution but predicted as incompatible |
| Low | High | Low-priority candidate |

Thresholds should be adjusted based on the model variant, dataset, and validation results.

---

## Mahalanobis-distance statistics

Mahalanobis-distance inference requires a `train_distribution_stat.pt` file generated from the same model architecture and latent dimension as the checkpoint used for prediction.

The expected file contains:

```python
{
    "mean": positive_class_latent_mean,
    "inv_cov": inverse_covariance_matrix
}
```

The latent dimension of the statistics file must match the hidden dimension of the checkpoint. For example, if the model hidden dimension is 512, the expected shapes are:

```text
mean:    [512]
inv_cov: [512, 512]
```

If a checkpoint is fine-tuned with a different hidden dimension, the corresponding Mahalanobis statistics must be regenerated.

For very small fine-tuning datasets, covariance estimation may be unstable. In such cases, Mahalanobis distance should be interpreted cautiously together with probability and ensemble uncertainty.
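
As a rough illustration of how the `mean` and `inv_cov` entries combine with a latent vector, the sketch below computes a Mahalanobis distance and maps a probability and distance pair onto the four interpretation rows from the output table. It is a dependency-free toy, not EZHit's own code: the function names `mahalanobis_distance` and `triage`, the 2-dimensional statistics, and the cutoffs `0.5` and `3.0` are all hypothetical placeholders. It also assumes the distance is reported in its square-root form, which this README does not specify.

```python
import math


def mahalanobis_distance(latent, mean, inv_cov):
    """Return sqrt((z - mean)^T * inv_cov * (z - mean)) for one latent vector."""
    dim = len(mean)
    # The statistics must match the latent dimension of the checkpoint.
    if len(latent) != dim or len(inv_cov) != dim or any(len(row) != dim for row in inv_cov):
        raise ValueError("latent, mean, and inv_cov dimensions do not match")
    diff = [z - m for z, m in zip(latent, mean)]
    # Quadratic form diff^T * inv_cov * diff, written out with plain loops.
    quad = sum(diff[i] * inv_cov[i][j] * diff[j] for i in range(dim) for j in range(dim))
    return math.sqrt(max(quad, 0.0))  # clamp tiny negative round-off


def triage(probability, distance, prob_cutoff=0.5, dist_cutoff=3.0):
    """Map a (probability, distance) pair onto the four interpretation rows.

    Both cutoffs are hypothetical; tune them on validation data.
    """
    high_prob = probability >= prob_cutoff
    in_dist = distance <= dist_cutoff
    if high_prob and in_dist:
        return "High-priority candidate"
    if high_prob:
        return "Potentially useful but less reliable or out-of-distribution"
    if in_dist:
        return "In-distribution but predicted as incompatible"
    return "Low-priority candidate"


# Toy 2-dimensional statistics standing in for a real train_distribution_stat.pt.
stats = {"mean": [0.0, 0.0], "inv_cov": [[1.0, 0.0], [0.0, 0.25]]}

near = mahalanobis_distance([0.6, 0.8], stats["mean"], stats["inv_cov"])
far = mahalanobis_distance([3.0, 8.0], stats["mean"], stats["inv_cov"])

print(triage(0.9, near))  # close to the training distribution
print(triage(0.9, far))   # far from the training distribution
```

With the real statistics file, the same quadratic form would normally be evaluated with tensor operations rather than Python loops, and the cutoffs would be chosen per model variant from validation results.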
---

## Fine-tuning

Users can fine-tune EZHit using the Colab notebook:

[Open EZHit fine-tuning notebook in Colab](https://colab.research.google.com/github/ld139/EzHit/blob/main/colab/EZHit_FineTune_Colab.ipynb)

The fine-tuning workflow exports:

| File | Description |
|---|---|
| `ezhit_finetuned_seed42.pt` | Fine-tuned checkpoint |
| `train_distribution_stat.pt` | Training-distribution statistics for Mahalanobis-distance inference |
| `val_predictions.csv` | Validation-set predictions |
| `test_predictions.csv` | Test-set predictions |

The fine-tuned checkpoint and `train_distribution_stat.pt` can be used for customized inference.

---

## Large-scale screening results

Large-scale enzyme–reaction screening results are provided separately as a HuggingFace Dataset repository:

`TODO: add dataset repository link`

Recommended location:

```text
https://huggingface.co/datasets/deanluo/EzHit-screening-results
```

The complete training and benchmark datasets will be archived separately on Zenodo:

`TODO: add Zenodo link`

---

## Installation for local use

Clone the code repository:

```bash
git clone https://github.com/ld139/EzHit.git
cd EzHit
```

Install dependencies:

```bash
pip install -r requirements.txt
```

The required `kan.py` implementation is already included in the GitHub repository. No separate KAN package installation is required.

---

## License

This project is released under the MIT License.

---

## Contact

For questions, please use the GitHub Issues page:

[https://github.com/ld139/EzHit/issues](https://github.com/ld139/EzHit/issues)