---
license: creativeml-openrail-m
base_model:
- Rostlab/prot_bert
---

[Open in Colab](https://colab.research.google.com/drive/1zccF8lGrF5rNQaSFPTd4wI-xvIJr-A78?usp=sharing)

ProtBERT-PI enables rapid screening of potential small secreted protease inhibitors in large-scale genomic, transcriptomic, or proteomic datasets.

The model assigns each input sequence to one of two classes:

- Positive (Potential PI): predicted to exhibit protease inhibitor activity
- Negative (Non-PI): predicted to lack protease inhibitor activity

Output includes:

- Probability of the positive class (`prob_class_1`): ranges from 0 (low likelihood) to 1 (high likelihood of PI activity)
- Confidence score: probability of the predicted class
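
The relationship between the two output fields can be illustrated with a minimal sketch. It assumes the classifier emits two raw logits ordered `[negative, positive]` and that the predicted class is the more probable one (a 0.5 cut-off); the function name `summarize_prediction` and the dictionary keys other than `prob_class_1` are hypothetical and not part of the model's actual interface.

```python
import math


def summarize_prediction(logits):
    """Convert two raw class logits into a label, prob_class_1, and confidence."""
    if len(logits) != 2:
        raise ValueError("expected exactly two logits: [negative, positive]")

    # Numerically stable softmax over the two classes.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Assumption: index 1 is the positive (Potential PI) class.
    prob_class_1 = probs[1]
    predicted = 1 if prob_class_1 >= 0.5 else 0

    return {
        "label": "Potential PI" if predicted == 1 else "Non-PI",
        "prob_class_1": prob_class_1,
        # Confidence is the probability of whichever class was predicted.
        "confidence": probs[predicted],
    }


result = summarize_prediction([-1.2, 2.3])
print(result["label"], round(result["prob_class_1"], 3), round(result["confidence"], 3))
```

For a sequence predicted as Non-PI, the confidence under these assumptions equals `1 - prob_class_1`, so a low `prob_class_1` can still come with a high confidence score.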

Model Architecture and Training

ProtBERT-PI is a fine-tuned sequence classification model built on ProtBERT (`BertForSequenceClassification`):

- Base model: Rostlab/prot_bert
- Pre-trained on large corpora of protein sequences using masked language modeling

Fine-tuning was performed on a curated dataset of known protease inhibitors and a negative set of non-inhibitor proteins.

Sequences are tokenized by inserting spaces between amino acids (the standard ProtBERT input format), so that each residue maps to a single token.

Maximum sequence length is configurable (default: 250 AA); longer sequences are truncated.
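
The preprocessing described above can be sketched as follows. This is an illustration under stated assumptions, not the model's own pipeline: the helper name `prepare_sequence` and its `max_residues` parameter are hypothetical, truncation is applied here to residues before any special tokens are added, and the mapping of the rare residues U, Z, O, and B to X follows the usage examples published with the upstream ProtBERT model.

```python
import re


def prepare_sequence(sequence, max_residues=250):
    """Format a raw protein sequence as space-separated residues."""
    # Remove whitespace (including line breaks) and normalise to upper case.
    cleaned = re.sub(r"\s+", "", sequence).upper()
    # Map rare or ambiguous residues to X, as in the upstream ProtBERT examples.
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    # Assumption: the length limit is counted in residues, not tokens.
    cleaned = cleaned[:max_residues]
    # One space between residues so each amino acid becomes its own token.
    return " ".join(cleaned)


print(prepare_sequence("mkTay iakqr"))
```

The resulting string is what would typically be passed to the tokenizer; in practice the tokenizer's own truncation setting may be used instead of slicing the string by hand.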

The fine-tuning dataset consists of:

- Positive examples: known protease inhibitors (shorter than 250 AA) from the MEROPS database
- Negative examples: non-inhibitors selected from UniProt using sequence similarity and Pfam domain analysis