---
license: mit
language:
- en
pipeline_tag: token-classification
library_name: transformers
tags:
- biology
- microbiology
- bert
- deep-learning
- transformers
---

# ProtBert-BFD-SS3

Pretrained model on protein sequences using a masked language modeling (MLM) objective. The model makes a per-residue (per-token) prediction of protein secondary structure in three states: `H` (helix), `E` (strand) or `C` (coil). The model was developed by Ahmed Elnaggar et al.; more information can be found in the [GitHub repository](https://github.com/agemagician/ProtTrans) and the [accompanying paper](https://ieeexplore.ieee.org/document/9477085). This repository is a fork of their [HuggingFace repository](https://huggingface.co/Rostlab/prot_bert_bfd_ss3).
The model was trained on uppercase amino acids and only accepts capital-letter amino acid sequences.
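
The three class labels are also stored in the model configuration; a quick sketch for inspecting them (the exact id-to-label mapping shown in the comment is illustrative, not guaranteed):

```python
from transformers import AutoConfig

# Load only the configuration of the token-classification model.
config = AutoConfig.from_pretrained("virtual-human-chc/prot_bert_bfd_ss3")
print(config.id2label)  # e.g. {0: 'H', 1: 'E', 2: 'C'}
```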

## Model description
Unlike the original BERT, the model was trained without auxiliary objectives such as next-sentence prediction; only the main objective, MLM, was used.

## Inference example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline
import re

pipeline = TokenClassificationPipeline(
    model=AutoModelForTokenClassification.from_pretrained("virtual-human-chc/prot_bert_bfd_ss3"),
    tokenizer=AutoTokenizer.from_pretrained("virtual-human-chc/prot_bert_bfd_ss3", skip_special_tokens=True),
    device=0  # first GPU; set device=-1 to run on CPU
)

sequences_example = ["MGAEEEDTAILYPFTISGNDRNGNFTINFKGTPNSTNNGCIGYSYNGDWEKIEWEGSCDGNGNLVVEVPMSKIPAGVTSGEIQIWWHSGDLKMTDYKALEHHHHHH",
                     "MNKYLFELPYERSEPGWTIRSYFDLMYNENRFLDAVENIVNKESYILDGIYCNFPDMNSYDESEHFEGVEFAVGYPPDEDDIVIVSEETCFEYVRLACEKYLQLHPEDTEKVNKLLSKIPSAGHHHHHH"]

# Map rare/ambiguous amino acids (U, Z, O, B) to X and space-separate the
# residues, which is the input format the ProtBert tokenizer expects.
sequences_example = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequences_example]

print(pipeline(sequences_example))
```

## Input

A list of protein sequences written as uppercase amino acid residues, e.g. `["PRTEINO"]`. Before being passed to the pipeline, rare amino acids are mapped to `X` and the residues are separated by spaces, as shown in the sketch below.
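
A minimal preprocessing sketch for the toy input above (variable names are illustrative):

```python
import re

raw = ["PRTEINO"]
# Replace rare/ambiguous amino acids (U, Z, O, B) with X, then insert spaces between residues.
prepared = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in raw]
print(prepared)  # ['P R T E I N X']
```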

## Output 

A list of dictionaries, one list per input sequence. The keys of the dictionaries are: `entity`, `score`, `index`, `word`, `start`, `end`. `entity` is the predicted secondary structure, `score` is the model's confidence in the prediction, `index` is the position of the residue in the sequence, `word` is the residue for which the prediction is made, and `start` and `end` likewise identify the position of the residue. Example for a single residue: `[[{'entity': 'C', 'score': np.float32(0.9825784), 'index': 1, 'word': 'M', 'start': 0, 'end': 1}]]`.
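
To read the predictions sequence by sequence, the per-residue labels can be collapsed into a single secondary-structure string (a small sketch reusing `pipeline` and `sequences_example` from the inference example above):

```python
predictions = pipeline(sequences_example)

for sequence, residues in zip(sequences_example, predictions):
    # Concatenate the predicted H/E/C labels in residue order.
    ss3 = "".join(residue["entity"] for residue in residues)
    print(sequence.replace(" ", ""))
    print(ss3)
```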

## Copyright

Code derived from https://github.com/agemagician/ProtTrans is licensed under the MIT License, Copyright (c) 2025 Ahmed Elnaggar. The ProtTrans pretrained models are released under the terms of the Academic Free License v3.0, Copyright (c) 2025 Ahmed Elnaggar. All other code is licensed under the MIT License, Copyright (c) 2025 Maksim Pavlov.