---
license: creativeml-openrail-m
base_model:
- Rostlab/prot_bert
---
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zccF8lGrF5rNQaSFPTd4wI-xvIJr-A78?usp=sharing)

ProtBERT-PI enables rapid screening of potential small secreted protease inhibitors in large-scale genomic, transcriptomic, or proteomic datasets.

The model assigns each input sequence to one of two classes:

    Positive (Potential PI): Predicted to exhibit protease inhibitor activity
    Negative (Non-PI): Predicted to lack protease inhibitor activity

Output includes:

    Probability of the positive class (prob_class_1): ranges from 0 (low likelihood) to 1 (high likelihood of PI activity)
    Confidence score: probability of the predicted class
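
As a rough illustration of how the two reported values relate, the sketch below derives them from a pair of raw class scores (logits) with a softmax. It is a minimal example under stated assumptions, not the model's actual inference code: the function name `summarize_prediction` is hypothetical, index 1 is assumed to be the positive class (as the name prob_class_1 suggests), and the predicted class is assumed to be the more probable one.

```python
import math


def summarize_prediction(logits):
    """Turn two raw class logits into prob_class_1, a label, and a confidence.

    Hypothetical helper: assumes logits are ordered (negative, positive).
    """
    if len(logits) != 2:
        raise ValueError("expected exactly two logits (negative, positive)")

    # Numerically stable softmax: subtract the maximum before exponentiating.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    prob_class_0, prob_class_1 = (e / total for e in exps)

    # Assumption: the predicted class is the more probable one (0.5 cut-off).
    predicted = 1 if prob_class_1 >= 0.5 else 0

    return {
        "prob_class_1": prob_class_1,
        "label": "Positive (Potential PI)" if predicted == 1 else "Negative (Non-PI)",
        # Confidence is the probability of whichever class was predicted.
        "confidence": max(prob_class_0, prob_class_1),
    }


if __name__ == "__main__":
    print(summarize_prediction([-1.0, 2.0]))
```

For a positive prediction the confidence equals prob_class_1; for a negative prediction it equals 1 - prob_class_1, so a sequence with prob_class_1 = 0.1 is a Non-PI call with confidence 0.9.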

Model Architecture and Training

ProtBERT-PI is a fine-tuned sequence classification model built on ProtBERT (BertForSequenceClassification):

    Base model: Rostlab/prot_bert
    Pre-trained on large corpora of protein sequences using masked language modeling

Fine-tuning was performed on a curated dataset of known protease inhibitors (positive set) and non-inhibitor proteins (negative set).
Sequences are tokenized by inserting spaces between amino acids (standard for ProtBERT), so that each residue is treated as a separate token.
Maximum sequence length is configurable (default: 250 AA); longer sequences are truncated.

    Positive examples: known protease inhibitors (<250 AA) from the MEROPS database
    Negative examples: non-inhibitors selected from UniProt using sequence similarity and Pfam domain analysis
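
The preprocessing described above (space-separated residues, truncation at the maximum length) can be sketched in a few lines. This is an illustrative helper, not the code used to train ProtBERT-PI: the name `to_protbert_input` and its parameters are hypothetical, the truncation here counts residues (whether the actual limit also counts special tokens is not stated), and the replacement of the rare residues U, Z, O, and B with X follows the base model's usage notes and is only an assumption for this fine-tuned model.

```python
import re


def to_protbert_input(sequence, max_length=250, mask_rare=True):
    """Format a raw amino-acid string the way ProtBERT-style tokenizers expect.

    Hypothetical helper; max_length counts residues, not tokens.
    """
    # Drop whitespace and line breaks (e.g. from FASTA) and normalize case.
    seq = "".join(sequence.split()).upper()

    # Assumption: map rare residues to X, as suggested for the base model.
    if mask_rare:
        seq = re.sub(r"[UZOB]", "X", seq)

    # Truncate sequences that exceed the configured maximum length.
    seq = seq[:max_length]

    # Insert a space between residues so each one becomes its own token.
    return " ".join(seq)


if __name__ == "__main__":
    print(to_protbert_input("mkvlaUc"))
```

The returned string can then be passed to the tokenizer of the base model; sequences longer than `max_length` lose their C-terminal residues, which is worth keeping in mind when screening longer proteins.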
