|
|
--- |
|
|
language: |
|
|
- en |
|
|
library_name: pyod |
|
|
tags: |
|
|
- protein-structure |
|
|
- anomaly-detection |
|
|
- one-class |
|
|
- autoencoder |
|
|
- protease-inhibitor |
|
|
- structural-filtering |
|
|
- biology |
|
|
- bioinformatics |
|
|
- unsupervised-learning |
|
|
datasets: |
|
|
- MEROPS |
|
|
- UniProt |
|
|
- AlphaFold |
|
|
model_name: Structural_module-protease_inhibitor |
|
|
metrics: |
|
|
- reconstruction_error |
|
|
--- |
|
|
|
|
|
# Structural_module-protease_inhibitor |
|
|
|
|
|
This model is an unsupervised, one-class deep learning autoencoder designed to filter protein structures. It identifies candidates that are structurally inconsistent with curated protease inhibitor (PI) features learned from known PI databases. |
|
|
|
|
|
Use colab userinterface for making prediction from protein structure files in .cif format. |
|
|
[](https://colab.research.google.com/drive/1JLhLpvXG4plzPtIliG_CJnYui6P8Pu1J?usp=sharing) |
|
|
|
|
|
## Model Description |
|
|
An autoencoder model trained on the structural embedding of known protease inhibitors and assigns a **reconstruction error** to inputs. High reconstruction errors indicate potential non-PIs based on structural features learned from the training distribution, allowing for the filtering of non-PI-like structures. |
|
|
|
|
|
|
|
|
* **Model type:** Fully connected Autoencoder (via PyOD) |
|
|
* **embeddings calculated using:** [RCSB embedding model](https://github.com/rcsb/rcsb-embedding-model) |
|
|
|
|
|
## Intended Uses & Limitations |
|
|
|
|
|
### Intended Use |
|
|
* Structural filtering and pre-selection of protease inhibitor–like protein structures in large-scale datasets. |
|
|
* Quality control for generated or predicted protein structures in the context of PIs. |
|
|
|
|
|
### Limitations |
|
|
* **Novel Folds:** Novel PI folds absent from the MEROPS training data may be incorrectly rejected (false positives for anomalies). |
|
|
* **Not for Clinical Use:** This model is not intended for functional annotation or clinical decision-making. |
|
|
|
|
|
## Training Data |
|
|
The model was trained on **17,889 curated protease inhibitor structures**: |
|
|
1. **Source:** MEROPS database. |
|
|
2. **Mapping:** Sequences were mapped via similarity search against taxonomy-restricted UniProt datasets (fungi, plants, bacteria). |
|
|
3. **Structures:** Corresponding 3D structures were obtained from the **AlphaFold Protein Structure Database using uniprot Ids**. |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Input Format |
|
|
The model accepts **fixed-length continuous protein structure embeddings** derived from the RCSB embedding model. |
|
|
> **Important:** Embeddings must be standardized using the provided `scaler.pkl` before inference. |
|
|
|
|
|
### Architecture |
|
|
Implemented in PyTorch via the PyOD library: |
|
|
* **Encoder:** Geometrically decreasing layers. |
|
|
* **Bottleneck:** Latent representation layer. |
|
|
* **Decoder:** Symmetric reconstruction layers. |
|
|
* **Regularization:** Batch normalization and dropout. |
|
|
|
|
|
### Training Procedure |
|
|
* **Optimizer:** Adam with weight decay. |
|
|
* **Hyperparameter Tuning:** Optimized via Bayesian optimization (Optuna, TPE sampler) on a 10% validation split (~1,789 structures). |
|
|
* **Objective:** Minimize Mean Squared Error (MSE) reconstruction loss on validation set. |
|
|
|
|
|
|