	---
	license: mit
	language:
	- en
	tags:
	- enzyme
	- enzyme-reaction
	- reaction-retrieval
	- protein-sequence
	- protein-language-model
	- bioinformatics
	- computational-biology
	- uncertainty
	- mahalanobis-distance
	library_name: pytorch
	---

	# EZHit

	EZHit is a lightweight enzyme–reaction retrieval model for predicting potential catalytic compatibility between enzyme sequences and biochemical reactions.

	Given an enzyme amino-acid sequence and a reaction SMILES, EZHit estimates whether the enzyme is likely to catalyze the reaction. The released checkpoints can be used for enzyme–reaction pair prediction, custom fine-tuning, and uncertainty-aware inference with Mahalanobis-distance-based distribution assessment.

	---

	## Online demo

	An interactive web demo is available at:

	[EZHit HuggingFace Space](https://huggingface.co/spaces/deanluo/Enzyme-Catalysis-Predictor)

	The Space supports:

	- enzyme–reaction pair prediction
	- ensemble probability output
	- ensemble uncertainty estimation
	- Mahalanobis-distance-based reliability assessment
	- reaction visualization

	---

	## Code and Colab notebook

	The source code and Colab fine-tuning notebook are available at:

	- GitHub repository: [ld139/EzHit](https://github.com/ld139/EzHit)
	- Colab fine-tuning notebook: [Open in Colab](https://colab.research.google.com/github/ld139/EzHit/blob/main/colab/EZHit_FineTune_Colab.ipynb)

	The Colab notebook allows users to fine-tune EZHit on their own enzyme–reaction datasets and export a fine-tuned checkpoint together with `train_distribution_stat.pt` for Mahalanobis-distance inference.

	---


	## Model variants

	\| Model group \| File pattern \| Description \|
	\|---\|---\|---\|
	\| General model \| `binarycls_best_val_seed*.pt` \| General enzyme–reaction compatibility model \|
	\| Cytochrome P450 model \| `ft_p450_best_seed*.pt` \| Fine-tuned model for cytochrome P450-related prediction \|
	\| Phosphatase model \| `ft_phosphatase_best_seed*.pt` \| Fine-tuned model for phosphatase-related prediction \|
	\| Terpene synthase model \| `ft_terpene_best_seed*.pt` \| Fine-tuned model for terpene synthase-related prediction \|

	Each model group may contain multiple seed checkpoints for ensemble prediction.
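
As a minimal sketch of how predictions from several seed checkpoints might be combined, assuming each checkpoint yields one match probability per enzyme–reaction pair: the helper name `aggregate_ensemble` and the choice of mean plus standard deviation are illustrative assumptions, and the EZHit code may aggregate differently.

```python
from statistics import fmean, pstdev


def aggregate_ensemble(seed_probs):
    """Combine per-seed match probabilities for one enzyme-reaction pair.

    Hypothetical helper, not part of the EZHit code base. Returns the mean
    probability and the population standard deviation across seeds, used
    here as a simple disagreement-based uncertainty.
    """
    if not seed_probs:
        raise ValueError("need at least one seed probability")
    if any(not 0.0 <= p <= 1.0 for p in seed_probs):
        raise ValueError("probabilities must lie in [0, 1]")
    mean_prob = fmean(seed_probs)
    # With a single checkpoint the spread is 0.0, i.e. no disagreement signal
    disagreement = pstdev(seed_probs)
    return mean_prob, disagreement


# Made-up probabilities from three seed checkpoints
mean_prob, disagreement = aggregate_ensemble([0.9, 0.8, 0.7])
print(f"probability={mean_prob:.2f}, uncertainty={disagreement:.3f}")
```

A larger spread across seeds suggests the checkpoints disagree about the pair, which is one reason to treat the averaged probability with more caution.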

	---

	## Download checkpoints

	Install the HuggingFace Hub client:

	```bash
	pip install -U huggingface_hub
	```

	Download a checkpoint:

	```python
	from huggingface_hub import hf_hub_download

	ckpt_path = hf_hub_download(
	    repo_id="deanluo/EzHit",
	    filename="checkpoints/binarycls_best_val_seed40.pt",
	)

	print(ckpt_path)
	```

	Download Mahalanobis statistics:

	```python
	from huggingface_hub import hf_hub_download

	stat_path = hf_hub_download(
	    repo_id="deanluo/EzHit",
	    filename="uncertainty/general_train_distribution_stat.pt",
	)

	print(stat_path)
	```

	Download all files from this repository:

	```python
	from huggingface_hub import snapshot_download

	local_dir = snapshot_download(
	    repo_id="deanluo/EzHit",
	    local_dir="EzHit_checkpoints",
	)

	print(local_dir)
	```
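
To fetch only the seed checkpoints of one model group rather than the whole repository, `snapshot_download` accepts an `allow_patterns` filter. The sketch below is an illustration under stated assumptions: the variant keys and the helper `download_variant` are hypothetical, and the leading `*` in each pattern is added because only the general checkpoint's folder (`checkpoints/`) is known from the example above.

```python
from fnmatch import fnmatch

# File patterns from the model-variant table. The leading "*" is an
# assumption so that a pattern matches wherever the file sits in the repo.
VARIANT_PATTERNS = {
    "general": "*binarycls_best_val_seed*.pt",
    "p450": "*ft_p450_best_seed*.pt",
    "phosphatase": "*ft_phosphatase_best_seed*.pt",
    "terpene": "*ft_terpene_best_seed*.pt",
}


def download_variant(variant, local_dir="EzHit_checkpoints"):
    """Download only the seed checkpoints of one model group (hypothetical helper)."""
    # Imported lazily so the pattern table can be inspected without the package
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="deanluo/EzHit",
        allow_patterns=[VARIANT_PATTERNS[variant]],
        local_dir=local_dir,
    )


# Offline sanity check against the one checkpoint path given above
assert fnmatch("checkpoints/binarycls_best_val_seed40.pt", VARIANT_PATTERNS["general"])
```

These patterns select checkpoints only, so the Mahalanobis statistics file would still need to be downloaded separately.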

	---

	## Input format

	EZHit takes two main inputs:

	\| Input \| Description \|
	\|---\|---\|
	\| Enzyme sequence \| Amino-acid sequence of the enzyme \|
	\| Reaction SMILES \| Reaction in `reactants>>products` format \|

	Example reaction SMILES:

	```text
	CCO>>CC=O
	```
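
A quick layout check can catch malformed reaction strings before they reach the model. The following is a minimal sketch, assuming only the `reactants>>products` layout described above with `.` separating molecules on each side; `split_reaction_smiles` is a hypothetical helper and does not verify chemical validity.

```python
def split_reaction_smiles(rxn):
    """Split a `reactants>>products` string into two lists of molecule SMILES.

    Hypothetical helper: checks the layout only, not chemical validity.
    """
    parts = rxn.strip().split(">>")
    # Exactly one ">>" is expected, with non-empty text on both sides
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'reactants>>products', got {rxn!r}")
    reactants, products = (side.split(".") for side in parts)
    return reactants, products


print(split_reaction_smiles("CCO>>CC=O"))
```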

	For fine-tuning, the expected CSV format is:

	```csv
	protein_sequence,CANO_RXN_SMILES,Label
	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG,CCO>>CC=O,1
	MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0
	```

	Required columns:

	\| Column \| Description \|
	\|---\|---\|
	\| `protein_sequence` \| Enzyme amino-acid sequence \|
	\| `CANO_RXN_SMILES` \| Reaction SMILES in `reactants>>products` format \|
	\| `Label` \| Binary label. `1` for compatible enzyme–reaction pairs and `0` for negative pairs \|

	An optional `split` column can be provided; each row then takes one of the values `train`, `val`, or `test`.
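
Before uploading a dataset for fine-tuning, the column requirements above can be checked locally. This is a minimal sketch using only the standard library; `validate_finetune_csv` is a hypothetical pre-flight helper, and the Colab notebook may apply different or stricter checks.

```python
import csv
import io

REQUIRED_COLUMNS = {"protein_sequence", "CANO_RXN_SMILES", "Label"}
ALLOWED_SPLITS = {"train", "val", "test"}


def validate_finetune_csv(handle):
    """Return a list of problems found in a fine-tuning CSV (empty if none)."""
    reader = csv.DictReader(handle)
    header = set(reader.fieldnames or [])
    missing = REQUIRED_COLUMNS - header
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    for row in reader:
        line_no = reader.line_num  # source line of the row just read
        # Short rows yield None for absent cells, so normalise to ""
        cell = {k: (v or "").strip() for k, v in row.items() if k is not None}
        if not cell["protein_sequence"]:
            problems.append(f"line {line_no}: empty protein_sequence")
        if ">>" not in cell["CANO_RXN_SMILES"]:
            problems.append(f"line {line_no}: reaction lacks '>>'")
        if cell["Label"] not in {"0", "1"}:
            problems.append(f"line {line_no}: Label must be 0 or 1")
        if "split" in header and cell["split"] not in ALLOWED_SPLITS:
            problems.append(f"line {line_no}: unknown split {cell['split']!r}")
    return problems


example = io.StringIO(
    "protein_sequence,CANO_RXN_SMILES,Label\n"
    "MKKLLPTAAAGLLLLAAQPAMA,CCO>>CC=O,0\n"
)
print(validate_finetune_csv(example))
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the in-memory example.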

	---

	## Output interpretation

	EZHit can report the following outputs:

	\| Output \| Description \|
	\|---\|---\|
	\| Match probability \| Predicted enzyme–reaction compatibility probability \|
	\| Ensemble uncertainty \| Model-disagreement-based uncertainty estimate \|
	\| Mahalanobis distance \| Latent-space distance from the learned training distribution \|

	A typical interpretation is:

	\| Probability \| Mahalanobis distance \| Interpretation \|
	\|---\|---\|---\|
	\| High \| Low \| High-priority candidate \|
	\| High \| High \| Potentially useful but less reliable or out-of-distribution \|
	\| Low \| Low \| In-distribution but predicted as incompatible \|
	\| Low \| High \| Low-priority candidate \|

	The cutoffs that separate high from low values are not universal; set them per model variant and dataset using validation results.
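
The four-way interpretation above can be expressed as a small triage function. This is a sketch only: `triage` is a hypothetical helper, and both default cutoffs are placeholder values chosen for illustration, not recommendations from EZHit.

```python
def triage(probability, distance, prob_cutoff=0.5, dist_cutoff=30.0):
    """Map a (probability, Mahalanobis distance) pair to the table's four cases.

    Both cutoffs are placeholders; tune them on validation data.
    """
    high_prob = probability >= prob_cutoff
    in_distribution = distance <= dist_cutoff
    if high_prob and in_distribution:
        return "high-priority candidate"
    if high_prob:
        return "potentially useful, less reliable or out-of-distribution"
    if in_distribution:
        return "in-distribution, predicted incompatible"
    return "low-priority candidate"


# Made-up values for two candidate pairs
print(triage(0.92, 12.0))
print(triage(0.92, 85.0))
```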

	---

	## Mahalanobis-distance statistics

	Mahalanobis-distance inference requires a `train_distribution_stat.pt` file generated from the same model architecture and latent dimension as the checkpoint used for prediction.

	The expected file contains:

	```python
	{
	    "mean": positive_class_latent_mean,    # shape [hidden_dim]
	    "inv_cov": inverse_covariance_matrix,  # shape [hidden_dim, hidden_dim]
	}
	```

	The latent dimension of the statistics file must match the hidden dimension of the checkpoint. For example, if the model hidden dimension is 512, the expected shapes are:

	```text
	mean: [512]
	inv_cov: [512, 512]
	```

	If a checkpoint is fine-tuned with a different hidden dimension, the corresponding Mahalanobis statistics must be regenerated.

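	For very small fine-tuning datasets, covariance estimation may be unstable: a sample covariance estimated from fewer samples than latent dimensions is singular, so its inverse is not well defined without regularization. In such cases, interpret Mahalanobis distance cautiously and together with probability and ensemble uncertainty.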
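
For reference, the Mahalanobis distance of a latent vector `x` from the stored statistics is `sqrt((x - mean)^T inv_cov (x - mean))`. The sketch below spells out that arithmetic together with the dimension check described above. It uses plain Python lists so it runs without extra dependencies; `mahalanobis_distance` is a hypothetical helper, the toy statistics are made up, and with the released `.pt` files the same computation would presumably be done on PyTorch tensors.

```python
import math


def mahalanobis_distance(latent, mean, inv_cov):
    """Distance sqrt((x - mean)^T inv_cov (x - mean)) on plain Python lists."""
    dim = len(mean)
    if len(latent) != dim:
        raise ValueError(f"latent has {len(latent)} dims, statistics have {dim}")
    if len(inv_cov) != dim or any(len(row) != dim for row in inv_cov):
        raise ValueError(f"inv_cov must be {dim} x {dim}")
    diff = [x - m for x, m in zip(latent, mean)]
    # Quadratic form diff^T @ inv_cov @ diff, written out explicitly
    quad = sum(
        diff[i] * inv_cov[i][j] * diff[j]
        for i in range(dim)
        for j in range(dim)
    )
    return math.sqrt(max(quad, 0.0))  # clamp tiny negatives from round-off


# Toy 2-D statistics (made up): variances 4 and 1, no correlation
stats = {"mean": [0.0, 0.0], "inv_cov": [[0.25, 0.0], [0.0, 1.0]]}
print(mahalanobis_distance([2.0, 1.0], stats["mean"], stats["inv_cov"]))
```

In the toy example each coordinate lies one standard deviation from the mean, so the distance is the square root of 2.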

	---

	## Fine-tuning

	Users can fine-tune EZHit using the Colab notebook:

	[Open EZHit fine-tuning notebook in Colab](https://colab.research.google.com/github/ld139/EzHit/blob/main/colab/EZHit_FineTune_Colab.ipynb)

	The fine-tuning workflow exports:

	\| File \| Description \|
	\|---\|---\|
	\| `ezhit_finetuned_seed42.pt` \| Fine-tuned checkpoint \|
	\| `train_distribution_stat.pt` \| Training-distribution statistics for Mahalanobis-distance inference \|
	\| `val_predictions.csv` \| Validation-set predictions \|
	\| `test_predictions.csv` \| Test-set predictions \|

	The fine-tuned checkpoint and `train_distribution_stat.pt` can be used for customized inference.

	---

	## Large-scale screening results

	Large-scale enzyme–reaction screening results are provided separately as a HuggingFace Dataset repository:

	`TODO: add dataset repository link`

	Recommended location:

	```text
	https://huggingface.co/datasets/deanluo/EzHit-screening-results
	```

	The complete training and benchmark datasets will be archived separately on Zenodo:

	`TODO: add Zenodo link`

	---

	## Installation for local use

	Clone the code repository:

	```bash
	git clone https://github.com/ld139/EzHit.git
	cd EzHit
	```

	Install dependencies:

	```bash
	pip install -r requirements.txt
	```

	The required `kan.py` implementation is already included in the GitHub repository. No separate KAN package installation is required.

	---


	## License

	This project is released under the MIT License.

	---

	## Contact

	For questions, please use the GitHub Issues page:

	[https://github.com/ld139/EzHit/issues](https://github.com/ld139/EzHit/issues)