| --- |
| license: mit |
| language: |
| - en |
| tags: |
| - enzyme |
| - enzyme-reaction |
| - reaction-retrieval |
| - protein-sequence |
| - protein-language-model |
| - bioinformatics |
| - computational-biology |
| - uncertainty |
| - mahalanobis-distance |
| library_name: pytorch |
| --- |
| |
| # EZHit |
|
|
| **EZHit** is a lightweight enzyme–reaction retrieval model for predicting potential catalytic compatibility between enzyme sequences and biochemical reactions. |
|
|
| Given an enzyme amino-acid sequence and a reaction SMILES, EZHit estimates whether the enzyme is likely to catalyze the reaction. The released checkpoints can be used for enzyme–reaction pair prediction, custom fine-tuning, and uncertainty-aware inference with Mahalanobis-distance-based distribution assessment. |
|
|
| --- |
|
|
| ## Online demo |
|
|
| An interactive web demo is available at: |
|
|
| [EZHit HuggingFace Space](https://huggingface.co/spaces/deanluo/Enzyme-Catalysis-Predictor) |
|
|
| The Space supports: |
|
|
| - enzyme–reaction pair prediction |
| - ensemble probability output |
| - ensemble uncertainty estimation |
| - Mahalanobis-distance-based reliability assessment |
| - reaction visualization |
|
|
| --- |
|
|
| ## Code and Colab notebook |
|
|
| The source code and Colab fine-tuning notebook are available at: |
|
|
| - GitHub repository: [ld139/EzHit](https://github.com/ld139/EzHit) |
| - Colab fine-tuning notebook: [Open in Colab](https://colab.research.google.com/github/ld139/EzHit/blob/main/colab/EZHit_FineTune_Colab.ipynb) |
|
|
| The Colab notebook allows users to fine-tune EZHit on their own enzyme–reaction datasets and export a fine-tuned checkpoint together with `train_distribution_stat.pt` for Mahalanobis-distance inference. |
|
|
| --- |
|
|
|
|
| ## Model variants |
|
|
| | Model group | File pattern | Description | |
| |---|---|---| |
| | General model | `binarycls_best_val_seed*.pt` | General enzyme–reaction compatibility model | |
| | Cytochrome P450 model | `ft_p450_best_seed*.pt` | Fine-tuned model for cytochrome P450-related prediction | |
| | Phosphatase model | `ft_phosphatase_best_seed*.pt` | Fine-tuned model for phosphatase-related prediction | |
| | Terpene synthase model | `ft_terpene_best_seed*.pt` | Fine-tuned model for terpene synthase-related prediction | |
|
|
| Each model group may contain multiple seed checkpoints for ensemble prediction. |
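
As a rough illustration of how predictions from several seed checkpoints might be combined, the sketch below averages per-seed probabilities and uses their standard deviation as a disagreement score. The function name `ensemble_summary` and the example probabilities are hypothetical, and using the standard deviation as the uncertainty measure is an assumption; the EZHit code may aggregate seeds differently.

```python
from statistics import mean, pstdev


def ensemble_summary(seed_probabilities):
    """Combine per-seed match probabilities for one enzyme-reaction pair.

    Returns (mean probability, population standard deviation across seeds).
    """
    if not seed_probabilities:
        raise ValueError("at least one seed probability is required")
    for p in seed_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return mean(seed_probabilities), pstdev(seed_probabilities)


# Hypothetical outputs from three seed checkpoints for the same pair.
prob, spread = ensemble_summary([0.82, 0.78, 0.86])
print(f"ensemble probability={prob:.3f}, disagreement={spread:.3f}")
```

A larger spread means the seeds disagree more, which is one reason to treat the averaged probability with caution.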
|
|
| --- |
|
|
| ## Download checkpoints |
|
|
| Install the HuggingFace Hub client: |
|
|
| ```bash |
| pip install -U huggingface_hub |
| ``` |
|
|
| Download a checkpoint: |
|
|
| ```python |
| from huggingface_hub import hf_hub_download |
| |
| ckpt_path = hf_hub_download(
|     repo_id="deanluo/EzHit",
|     filename="checkpoints/binarycls_best_val_seed40.pt"
| ) |
| |
| print(ckpt_path) |
| ``` |
|
|
| Download Mahalanobis statistics: |
|
|
| ```python |
| from huggingface_hub import hf_hub_download |
| |
| stat_path = hf_hub_download(
|     repo_id="deanluo/EzHit",
|     filename="uncertainty/general_train_distribution_stat.pt"
| ) |
| |
| print(stat_path) |
| ``` |
|
|
| Download all files from this repository: |
|
|
| ```python |
| from huggingface_hub import snapshot_download |
| |
| local_dir = snapshot_download(
|     repo_id="deanluo/EzHit",
|     local_dir="EzHit_checkpoints"
| ) |
| |
| print(local_dir) |
| ``` |
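
To collect all seed checkpoints of one model group, the file patterns from the model variants table can be applied as glob patterns to a file listing. The sketch below works on a plain list of paths so that it runs offline; `select_checkpoints` is a hypothetical helper, and every entry in the example list other than `checkpoints/binarycls_best_val_seed40.pt` and `uncertainty/general_train_distribution_stat.pt` is made up for illustration.

```python
from fnmatch import fnmatch


def select_checkpoints(filenames, pattern):
    """Return the sorted repository paths whose basename matches a glob pattern."""
    return sorted(
        name for name in filenames
        if fnmatch(name.rsplit("/", 1)[-1], pattern)
    )


# Illustrative file list; in practice it could come from
# huggingface_hub.list_repo_files("deanluo/EzHit").
example_files = [
    "checkpoints/binarycls_best_val_seed40.pt",
    "checkpoints/binarycls_best_val_seed41.pt",  # hypothetical seed
    "checkpoints/ft_p450_best_seed40.pt",        # hypothetical seed
    "uncertainty/general_train_distribution_stat.pt",
]

general = select_checkpoints(example_files, "binarycls_best_val_seed*.pt")
print(general)
```

Each selected path can then be passed as `filename` to `hf_hub_download`.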
|
|
| --- |
|
|
| ## Input format |
|
|
| EZHit takes two main inputs: |
|
|
| | Input | Description | |
| |---|---| |
| | Enzyme sequence | Amino-acid sequence of the enzyme | |
| | Reaction SMILES | Reaction in `reactants>>products` format | |
|
|
| Example reaction SMILES: |
|
|
| ```text |
| CCO>>CC=O |
| ``` |
|
|
| For fine-tuning, the expected CSV format is: |
|
|
| ```csv |
| protein_sequence,CANO_RXN_SMILES,Label |
| MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG,CCO>>CC=O,1 |
| MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0 |
| ``` |
|
|
| Required columns: |
|
|
| | Column | Description | |
| |---|---| |
| | `protein_sequence` | Enzyme amino-acid sequence | |
| | `CANO_RXN_SMILES` | Reaction SMILES in `reactants>>products` format | |
| | `Label` | Binary label: `1` for compatible enzyme–reaction pairs, `0` for negative pairs | |
|
|
| An optional `split` column can be provided to assign each row to a data split; its value must be `train`, `val`, or `test`. |
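
Before fine-tuning, it can help to check a CSV against the format described above. The following is a minimal sketch of such a pre-check using only the standard library; `validate_rows` and its error messages are hypothetical and not part of the EZHit code, and the checks are deliberately shallow (for example, the reaction SMILES is only tested for the `reactants>>products` shape, not for chemical validity).

```python
import csv
import io

REQUIRED_COLUMNS = ("protein_sequence", "CANO_RXN_SMILES", "Label")
ALLOWED_SPLITS = {"train", "val", "test"}


def validate_rows(csv_text):
    """Return a list of (row_number, message) problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [(1, f"missing columns: {', '.join(missing)}")]
    problems = []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        sequence = (row["protein_sequence"] or "").strip()
        if not sequence.isalpha():
            problems.append((row_number, "sequence is empty or has non-letter characters"))
        reactants, sep, products = (row["CANO_RXN_SMILES"] or "").partition(">>")
        if not (sep and reactants.strip() and products.strip()):
            problems.append((row_number, "reaction is not in reactants>>products format"))
        if (row["Label"] or "").strip() not in {"0", "1"}:
            problems.append((row_number, "Label must be 0 or 1"))
        # The split column is optional, so it is only checked when present.
        if "split" in header and (row["split"] or "").strip() not in ALLOWED_SPLITS:
            problems.append((row_number, "split must be train, val, or test"))
    return problems


example = """protein_sequence,CANO_RXN_SMILES,Label
MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,1
MKKLLPTAAAGLLLLAAQPAMA,CCO>CC=O,2
"""
for problem in validate_rows(example):
    print(problem)
```

An empty result means that no problems were found by these particular checks; it does not guarantee that the fine-tuning notebook will accept the file.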
|
|
| --- |
|
|
| ## Output interpretation |
|
|
| EZHit can report the following outputs: |
|
|
| | Output | Description | |
| |---|---| |
| | Match probability | Predicted enzyme–reaction compatibility probability | |
| | Ensemble uncertainty | Model-disagreement-based uncertainty estimate | |
| | Mahalanobis distance | Latent-space distance from the learned training distribution | |
|
|
| A typical interpretation is: |
|
|
| | Probability | Mahalanobis distance | Interpretation | |
| |---|---|---| |
| | High | Low | High-priority candidate | |
| | High | High | Potentially useful but less reliable or out-of-distribution | |
| | Low | Low | In-distribution but predicted as incompatible | |
| | Low | High | Low-priority candidate | |
|
|
| The probability and Mahalanobis-distance thresholds that separate "high" from "low" should be chosen per model variant and dataset, based on validation results. |
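
The interpretation table can be expressed as a small triage function. This is only a sketch: `triage` is a hypothetical helper, and the threshold values in the example are placeholders chosen for illustration, not recommended settings.

```python
def triage(probability, distance, prob_threshold, dist_threshold):
    """Map one prediction onto the four interpretation categories."""
    high_prob = probability >= prob_threshold
    in_dist = distance <= dist_threshold
    if high_prob and in_dist:
        return "High-priority candidate"
    if high_prob:
        return "Potentially useful but less reliable or out-of-distribution"
    if in_dist:
        return "In-distribution but predicted as incompatible"
    return "Low-priority candidate"


# Placeholder thresholds for illustration only; choose real ones from validation data.
print(triage(0.91, 12.0, prob_threshold=0.5, dist_threshold=20.0))
print(triage(0.91, 35.0, prob_threshold=0.5, dist_threshold=20.0))
```

Keeping the thresholds as explicit arguments makes it easy to use different values for each model variant.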
|
|
| --- |
|
|
| ## Mahalanobis-distance statistics |
|
|
| Mahalanobis-distance inference requires a `train_distribution_stat.pt` file generated from the same model architecture and latent dimension as the checkpoint used for prediction. |
|
|
| The expected file contains: |
|
|
| ```python |
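| { |
|     "mean": positive_class_latent_mean,    # latent mean of the positive class, shape [hidden_dim]
|     "inv_cov": inverse_covariance_matrix   # inverse covariance matrix, shape [hidden_dim, hidden_dim]
| } |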
| ``` |
|
|
| The latent dimension of the statistics file must match the hidden dimension of the checkpoint. For example, if the model hidden dimension is 512, the expected shapes are: |
|
|
| ```text |
| mean: [512] |
| inv_cov: [512, 512] |
| ``` |
|
|
| If a checkpoint is fine-tuned with a different hidden dimension, the corresponding Mahalanobis statistics must be regenerated. |
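
For reference, the sketch below computes a Mahalanobis distance from a latent vector and the two stored statistics, with the shape check described above. It assumes the standard definition, sqrt((z - mean)^T inv_cov (z - mean)), and uses plain Python lists with toy 2-dimensional values; the function name `mahalanobis_distance` is hypothetical, and the EZHit implementation may differ (for example, by operating on tensors or reporting a squared distance).

```python
import math


def mahalanobis_distance(latent, mean, inv_cov):
    """Compute sqrt((z - mean)^T inv_cov (z - mean)) for one latent vector."""
    dim = len(mean)
    if len(latent) != dim:
        raise ValueError(f"latent has dimension {len(latent)}, statistics expect {dim}")
    if len(inv_cov) != dim or any(len(row) != dim for row in inv_cov):
        raise ValueError(f"inv_cov must have shape [{dim}, {dim}]")
    diff = [z - m for z, m in zip(latent, mean)]
    # Quadratic form diff^T * inv_cov * diff, accumulated row by row.
    squared = sum(
        diff[i] * sum(inv_cov[i][j] * diff[j] for j in range(dim))
        for i in range(dim)
    )
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors


# Toy 2-dimensional statistics; real files would hold e.g. 512 dimensions.
toy_mean = [0.0, 0.0]
toy_inv_cov = [[1.0, 0.0], [0.0, 0.25]]
print(mahalanobis_distance([3.0, 8.0], toy_mean, toy_inv_cov))
```

The explicit dimension check fails early when a statistics file is paired with a checkpoint of a different hidden dimension.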
|
|
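| For very small fine-tuning datasets, covariance estimation may be unstable: when there are fewer samples than latent dimensions, the sample covariance matrix is rank-deficient and its inverse is poorly determined. In such cases, interpret the Mahalanobis distance cautiously, together with the probability and ensemble uncertainty. |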
|
|
| --- |
|
|
| ## Fine-tuning |
|
|
| Users can fine-tune EZHit using the Colab notebook: |
|
|
| [Open EZHit fine-tuning notebook in Colab](https://colab.research.google.com/github/ld139/EzHit/blob/main/colab/EZHit_FineTune_Colab.ipynb) |
|
|
| The fine-tuning workflow exports: |
|
|
| | File | Description | |
| |---|---| |
| | `ezhit_finetuned_seed42.pt` | Fine-tuned checkpoint | |
| | `train_distribution_stat.pt` | Training-distribution statistics for Mahalanobis-distance inference | |
| | `val_predictions.csv` | Validation-set predictions | |
| | `test_predictions.csv` | Test-set predictions | |
|
|
| The fine-tuned checkpoint and `train_distribution_stat.pt` can be used for customized inference. |
|
|
| --- |
|
|
| ## Large-scale screening results |
|
|
| Large-scale enzyme–reaction screening results are provided separately as a HuggingFace Dataset repository: |
|
|
| `TODO: add dataset repository link` |
|
|
| Recommended location: |
|
|
| ```text |
| https://huggingface.co/datasets/deanluo/EzHit-screening-results |
| ``` |
|
|
| The complete training and benchmark datasets will be archived separately on Zenodo: |
|
|
| `TODO: add Zenodo link` |
|
|
| --- |
|
|
| ## Installation for local use |
|
|
| Clone the code repository: |
|
|
| ```bash |
| git clone https://github.com/ld139/EzHit.git |
| cd EzHit |
| ``` |
|
|
| Install dependencies: |
|
|
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
| The required `kan.py` implementation is already included in the GitHub repository. No separate KAN package installation is required. |
|
|
| --- |
|
|
|
|
| ## License |
|
|
| This project is released under the MIT License. |
|
|
| --- |
|
|
| ## Contact |
|
|
| For questions, please use the GitHub Issues page: |
|
|
| [https://github.com/ld139/EzHit/issues](https://github.com/ld139/EzHit/issues) |
|
|