Create README.md
README.md
ADDED
|
@@ -0,0 +1,80 @@
| 1 |
+
---
|
| 2 |
+
language:
|
| 3 |
+
- en
|
| 4 |
+
library_name: pyod
|
| 5 |
+
tags:
|
| 6 |
+
- protein-structure
|
| 7 |
+
- anomaly-detection
|
| 8 |
+
- one-class
|
| 9 |
+
- autoencoder
|
| 10 |
+
- protease-inhibitor
|
| 11 |
+
- structural-filtering
|
| 12 |
+
- biology
|
| 13 |
+
- bioinformatics
|
| 14 |
+
- unsupervised-learning
|
| 15 |
+
|
| 16 |
+
datasets:
|
| 17 |
+
- MEROPS
|
| 18 |
+
- UniProt
|
| 19 |
+
- AlphaFold
|
| 20 |
+
|
| 21 |
+
model_name: Structural_module-protease_inhibitor
|
| 22 |
+
|
| 23 |
+
model_description: |
|
| 24 |
+
Structural_module-protease_inhibitor is an unsupervised, one-class deep learning model for filtering protein
|
| 25 |
+
structures that are structurally inconsistent with protease inhibitor (PI)-like features. These features are learned from structure embeddings (RCSB embedding model, github.com/rcsb/rcsb-embedding-model) of curated entries in protease inhibitor databases. The model learns the structural embedding manifold of known protease
|
| 26 |
+
inhibitors and assigns higher reconstruction error to structurally dissimilar inputs.
|
| 27 |
+
user_interface: |
|
| 28 |
+
Easy-to-use inference interface:
|
| 29 |
+
"[Open in Colab](https://colab.research.google.com/drive/1JLhLpvXG4plzPtIliG_CJnYui6P8Pu1J?usp=sharing)"
|
| 30 |
+
training_data_description: |
|
| 31 |
+
The model was trained on 17,889 curated protease inhibitor structures from the MEROPS
|
| 32 |
+
database. MEROPS sequences were mapped via similarity search against taxonomy-
|
| 33 |
+
restricted UniProt datasets (fungi, plants, bacteria), and corresponding structures
|
| 34 |
+
were obtained from the AlphaFold Protein Structure Database and used for training the model.
|
| 35 |
+
|
| 36 |
+
input_format: |
|
| 37 |
+
Fixed-length continuous protein structure embeddings derived from three-dimensional
|
| 38 |
+
structural features (RCSB embedding model, github.com/rcsb/rcsb-embedding-model). Embeddings must be standardized using the provided scaler.pkl
|
| 39 |
+
before inference.
|
| 40 |
+
|
| 41 |
+
model_architecture: |
|
| 42 |
+
Fully connected autoencoder implemented in PyTorch via the PyOD library, featuring
|
| 43 |
+
a geometrically decreasing encoder, latent bottleneck, symmetric decoder, batch
|
| 44 |
+
normalization, dropout regularization, and mean squared reconstruction loss.
|
| 45 |
+
|
| 46 |
+
training_procedure: |
|
| 47 |
+
The model was trained using the Adam optimizer with weight decay and mini-batch
|
| 48 |
+
updates. Hyperparameters were optimized using Bayesian
|
| 49 |
+
optimization (Optuna, TPE sampler) on an independent 10% validation split
|
| 50 |
+
(~1,789 structures). The tuning objective was to minimize reconstruction error on
|
| 51 |
+
unseen but structurally valid protease inhibitor examples. The final model was
|
| 52 |
+
retrained on the full dataset using the optimal hyperparameters with fixed random
|
| 53 |
+
seeds.
|
| 54 |
+
|
| 55 |
+
outputs: |
|
| 56 |
+
The model outputs a reconstruction-based anomaly score, an outlier probability,
|
| 57 |
+
and a confidence estimate. Low reconstruction-based anomaly scores indicate structural consistency with known
|
| 58 |
+
protease inhibitor folds, while high scores indicate structural dissimilarity.
|
| 59 |
+
|
| 60 |
+
intended_use: |
|
| 61 |
+
Structural filtering and pre-selection of protease inhibitor–like protein structures
|
| 62 |
+
in large-scale datasets.
|
| 63 |
+
|
| 64 |
+
limitations: |
|
| 65 |
+
Novel PI folds absent from the training data may be incorrectly rejected.
|
| 66 |
+
|
| 67 |
+
not_intended_use: |
|
| 68 |
+
Functional annotation or clinical
|
| 69 |
+
decision-making.
|
| 70 |
+
|
| 71 |
+
reproducibility: |
|
| 72 |
+
All preprocessing parameters and training configurations are provided
|
| 73 |
+
to enable exact reproduction of results.
|
| 74 |
+
|
| 75 |
+
citation: |
|
| 76 |
+
Please cite the associated publication and acknowledge the PyOD library, the MEROPS
|
| 77 |
+
protease inhibitor database, and the AlphaFold Protein Structure Database.
|
| 78 |
+
|
| 79 |
+
|
| 80 |
+
---
|
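The inference flow described in the card (standardize an embedding, reconstruct it, score it by reconstruction error) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released pipeline: `fit_scaler`, `standardize`, `toy_reconstruct`, and `score` are hypothetical names, the two-dimensional vectors stand in for real structure embeddings, and the toy reconstruction stands in for the trained autoencoder. In actual use, the standardization parameters would come from the provided scaler.pkl and the reconstruction from the trained model.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Per-feature mean and population std (stand-in for the provided scaler)."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [pstdev(c) or 1.0 for c in cols]  # avoid division by zero
    return means, stds


def standardize(row, means, stds):
    """Apply the same per-feature scaling that was fitted on training data."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


def mse(x, x_hat):
    """Mean squared reconstruction error, used here as the anomaly score."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def toy_reconstruct(z):
    """Hypothetical placeholder for the trained autoencoder: it shrinks the
    input toward the origin, so points far from the training mean
    reconstruct poorly."""
    return [0.5 * v for v in z]


# Toy "training embeddings" (assumption: real embeddings are much longer).
train = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.0, 2.0]]
means, stds = fit_scaler(train)


def score(row):
    z = standardize(row, means, stds)
    return mse(z, toy_reconstruct(z))


# Assumed thresholding rule: flag anything scoring above the worst training example.
threshold = max(score(r) for r in train)

inlier_score = score([1.1, 1.9])    # close to the training data
outlier_score = score([5.0, -3.0])  # far from the training data

print(inlier_score <= threshold, outlier_score > threshold)
```

The thresholding rule above is only one possible choice; a percentile of the training scores would be an equally reasonable alternative.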