File size: 3,085 Bytes
48d872f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a638a17
 
48d872f
f26012e
a638a17
f26012e
a638a17
f26012e
73fdbb7
a638a17
f26012e
a638a17
945638b
f26012e
 
a638a17
 
f26012e
a638a17
f26012e
a638a17
 
 
f26012e
a638a17
 
 
f26012e
a638a17
 
 
 
 
f26012e
a638a17
f26012e
a638a17
 
 
f26012e
a638a17
 
 
 
 
 
f26012e
 
a638a17
 
 
f26012e
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
---
language:
- en
library_name: pyod
tags:
- protein-structure
- anomaly-detection
- one-class
- autoencoder
- protease-inhibitor
- structural-filtering
- biology
- bioinformatics
- unsupervised-learning
datasets:
- MEROPS
- UniProt
- AlphaFold
model_name: Structural_module-protease_inhibitor
metrics:
- reconstruction_error
---

# Structural_module-protease_inhibitor

This model is an unsupervised, one-class deep learning autoencoder designed to filter protein structures. It identifies candidates that are structurally inconsistent with curated protease inhibitor (PI) features learned from known PI databases.

Use colab userinterface for making prediction from protein structure files in .cif format. 
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JLhLpvXG4plzPtIliG_CJnYui6P8Pu1J?usp=sharing)

## Model Description
An autoencoder model trained on the structural embedding  of known protease inhibitors and assigns a **reconstruction error** to inputs. High reconstruction errors indicate potential non-PIs based on structural features learned from the training distribution, allowing for the filtering of non-PI-like structures.


* **Model type:** Fully connected Autoencoder (via PyOD)
* **embeddings calculated using:** [RCSB embedding model](https://github.com/rcsb/rcsb-embedding-model)

## Intended Uses & Limitations

### Intended Use
* Structural filtering and pre-selection of protease inhibitor–like protein structures in large-scale datasets.
* Quality control for generated or predicted protein structures in the context of PIs.

### Limitations
* **Novel Folds:** Novel PI folds absent from the MEROPS training data may be incorrectly rejected (false positives for anomalies).
* **Not for Clinical Use:** This model is not intended for functional annotation or clinical decision-making.

## Training Data
The model was trained on **17,889 curated protease inhibitor structures**:
1.  **Source:** MEROPS database.
2.  **Mapping:** Sequences were mapped via similarity search against taxonomy-restricted UniProt datasets (fungi, plants, bacteria).
3.  **Structures:** Corresponding 3D structures were obtained from the **AlphaFold Protein Structure Database using uniprot Ids**.

## Technical Specifications

### Input Format
The model accepts **fixed-length continuous protein structure embeddings** derived from the RCSB embedding model. 
> **Important:** Embeddings must be standardized using the provided `scaler.pkl` before inference.

### Architecture
Implemented in PyTorch via the PyOD library:
* **Encoder:** Geometrically decreasing layers.
* **Bottleneck:** Latent representation layer.
* **Decoder:** Symmetric reconstruction layers.
* **Regularization:** Batch normalization and dropout.

### Training Procedure
* **Optimizer:** Adam with weight decay.
* **Hyperparameter Tuning:** Optimized via Bayesian optimization (Optuna, TPE sampler) on a 10% validation split (~1,789 structures).
* **Objective:** Minimize Mean Squared Error (MSE) reconstruction loss on validation set.