---
datasets:
- EvaKlimentova/knots_AF
metrics:
- accuracy
---

# Knots ProtBert-BFD AlphaFold

Fine-tuned [ProtBert-BFD](https://huggingface.co/Rostlab/prot_bert_bfd) to classify proteins as knotted vs. unknotted.

## Model Details

- **Model type:** BERT
- **Language:** proteins (amino acid sequences)
- **Finetuned from model:** [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd)

Model Sources:

- **Repository:** [CEITEC](https://github.com/ML-Bioinfo-CEITEC/pknots_experiments)
- **Paper:** TBD

## Usage

Dataset format:
```
id,sequence,label
A0A2W5F4Z7,MGGIFRVNTYYTDLEPYLQSTKLPIYGALLDGENIYELVDKSKGILVIGNESKGIRSTIQNFIQKPITIPRIGQAESLNAAVATGIIVGQLTL,1
...
```

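Before running the model it can help to check that an input file really follows the format above. The following is a minimal sketch using only the standard library. The helper name `validate_rows`, the example row, and the assumption that labels are restricted to `0` and `1` are illustrative and not part of the original pipeline.

```python
import csv
import io

# Column names taken from the header of the dataset format shown above.
REQUIRED_COLUMNS = ("id", "sequence", "label")

def validate_rows(handle):
    """Yield (id, sequence, label) tuples; raise ValueError on malformed input.

    Hypothetical helper. Assumes binary labels encoded as 0 or 1.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        seq = row["sequence"].strip()
        if not seq or not seq.isalpha() or not seq.isupper():
            raise ValueError(f"line {line_no}: sequence must be uppercase letters")
        if row["label"] not in ("0", "1"):
            raise ValueError(f"line {line_no}: label must be 0 or 1")
        yield row["id"], seq, int(row["label"])

# Made-up example row (not taken from the real dataset).
sample = io.StringIO("id,sequence,label\nexample_1,MGGIFRVNTY,1\n")
rows = list(validate_rows(sample))
```

For a real file, pass an open file handle instead of the `io.StringIO` sample, for example `with open(INPUT, newline='') as fh: rows = list(validate_rows(fh))`.
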
Load the dataset:
```
import pandas as pd
from datasets import Dataset

# INPUT is the path to a CSV file in the format shown above
df = pd.read_csv(INPUT, sep=',')
dss = Dataset.from_pandas(df)
```

Predict:
```
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained('roa7n/knots_protbertBFD_alphafold')
model = AutoModelForSequenceClassification.from_pretrained('roa7n/knots_protbertBFD_alphafold')

def tokenize_function(s):
    # ProtBert expects residues separated by spaces, e.g. "M G G I ..."
    seq_split = ' '.join(s['sequence'])
    return tokenizer(seq_split)

tokenized_dataset = dss.map(tokenize_function, num_proc=4)
tokenized_dataset.set_format('pt')

# <PATH> is the Trainer output directory; fp16=True requires a CUDA GPU
training_args = TrainingArguments(<PATH>, fp16=True, per_device_eval_batch_size=50, report_to='none')

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    tokenizer=tokenizer
)

# Softmax over the two logits; keep the probability of label 1
predictions, _, _ = trainer.predict(tokenized_dataset)
predictions = [np.exp(p[1]) / np.sum(np.exp(p), axis=0) for p in predictions]
df['preds'] = predictions
```

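The last step of the prediction snippet turns each pair of logits into a softmax probability for label 1. Exponentiating raw logits can overflow when they are large, so a variant that subtracts the maximum logit first is often preferred. The sketch below shows that variant in plain Python, followed by a hard label decision. The function names and the 0.5 threshold are assumptions for illustration and are not taken from the original code.

```python
import math

def class_one_probability(logits):
    """Softmax probability of label 1 from a pair of raw logits."""
    shift = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - shift) for x in logits]
    return exps[1] / sum(exps)

def to_label(prob, threshold=0.5):
    """Turn a probability into a 0/1 label (threshold is an assumed default)."""
    return int(prob >= threshold)

# Made-up logits; the last pair would overflow a naive math.exp(1000.0).
example_logits = [[2.0, -1.0], [-3.0, 4.0], [1000.0, 1001.0]]
probs = [class_one_probability(p) for p in example_logits]
labels = [to_label(p) for p in probs]
```

Subtracting the maximum does not change the result, because the common factor cancels between numerator and denominator.
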
## Evaluation

Metrics per protein family for this model (M1 ProtBert-BFD). The reported numbers are consistent with knotted proteins being the positive class, so TPR is the fraction of knotted proteins classified correctly and TNR is the fraction of unknotted proteins classified correctly:

| Protein family               | Dataset size | Unknotted set size | Accuracy | TPR    | TNR    |
|:----------------------------:|:------------:|:------------------:|:--------:|:------:|:------:|
| All                          | 39412        | 19718              | **0.9845** | 0.9865 | 0.9825 |
| SPOUT                        | 7371         | 550                | 0.9887 | 0.9951 | 0.9090 |
| TDD                          | 612          | 24                 | 0.9901 | 0.9965 | 0.8333 |
| DUF                          | 716          | 429                | 0.9748 | 0.9721 | 0.9766 |
| AdoMet synthase              | 1794         | 240                | 0.9899 | 0.9929 | 0.9708 |
| Carbonic anhydrase           | 1531         | 539                | 0.9588 | 0.9737 | 0.9313 |
| UCH                          | 477          | 125                | 0.9056 | 0.9602 | 0.7520 |
| ATCase/OTCase                | 3799         | 3352               | 0.9994 | 0.9977 | 0.9997 |
| ribosomal-mitochondrial      | 147          | 41                 | 0.8571 | 1.0000 | 0.4878 |
| membrane                     | 8225         | 1493               | 0.9811 | 0.9904 | 0.9390 |
| VIT                          | 14262        | 12555              | 0.9872 | 0.9420 | 0.9933 |
| biosynthesis of lantibiotics | 392          | 286                | 0.9642 | 0.9528 | 0.9685 |

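The columns of the table are related: accuracy is the average of TPR and TNR weighted by class size. The sketch below recomputes accuracy for three rows copied from the table, assuming the knotted set size is the dataset size minus the unknotted set size and that knotted is the positive class. The function name is illustrative.

```python
def expected_accuracy(total, unknotted, tpr, tnr):
    """Accuracy implied by TPR/TNR, assuming knotted = positive class."""
    knotted = total - unknotted
    return (tpr * knotted + tnr * unknotted) / total

# (dataset size, unknotted set size, TPR, TNR, reported accuracy) from the table
rows = {
    "All": (39412, 19718, 0.9865, 0.9825, 0.9845),
    "UCH": (477, 125, 0.9602, 0.7520, 0.9056),
    "ribosomal-mitochondrial": (147, 41, 1.0000, 0.4878, 0.8571),
}

for family, (n, n_unk, tpr, tnr, reported) in rows.items():
    implied = expected_accuracy(n, n_unk, tpr, tnr)
    print(f"{family}: implied {implied:.4f}, reported {reported:.4f}")
```

The same check explains why a family such as ribosomal-mitochondrial can combine a perfect TPR with a low overall accuracy: its small unknotted set has a low TNR, which pulls the weighted average down.
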
## Citation

**BibTeX:** TODO

## Model Authors

- Simecek: simecek@mail.muni.cz
- Klimentova: vae@mail.muni.cz
- Sramkova: denisa.sramkova@mail.muni.cz