	---
	datasets:
	- EvaKlimentova/knots_AF
	metrics:
	- accuracy
	---

	# Knots ProtBert-BFD AlphaFold

	A [ProtBert-BFD](https://huggingface.co/Rostlab/prot_bert_bfd) model fine-tuned to classify protein sequences as knotted or unknotted.

	## Model Details

	- Model type: BERT
	- Language: proteins (amino acid sequences)
	- Fine-tuned from model: [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd)

	Model Sources:

	- Repository: [CEITEC](https://github.com/ML-Bioinfo-CEITEC/pknots_experiments)
	- Paper: TBD

	## Usage

	Dataset format:
	```
	id,sequence,label
	A0A2W5F4Z7,MGGIFRVNTYYTDLEPYLQSTKLPIYGALLDGENIYELVDKSKGILVIGNESKGIRSTIQNFIQKPITIPRIGQAESLNAAVATGIIVGQLTL,1
	...
	```
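
As a quick sanity check before loading, a file in the format above can be validated with the standard library. This is a minimal sketch and not part of the original pipeline: the function name `validate_rows`, the two sample rows, and the assumptions that sequences are upper-case one-letter codes and that labels are written as `0` or `1` are all hypothetical.

```python
import csv
import io

# Column names taken from the dataset format shown above.
EXPECTED_COLUMNS = ["id", "sequence", "label"]


def validate_rows(csv_text):
    """Parse CSV text and return (id, sequence, label) tuples, raising on malformed rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        sequence = row["sequence"] or ""
        # Assumption: sequences are upper-case one-letter amino acid codes.
        if not (sequence.isalpha() and sequence.isupper()):
            raise ValueError(f"line {line_no}: invalid sequence {sequence!r}")
        # Assumption: labels are binary, written as 0 or 1.
        if row["label"] not in {"0", "1"}:
            raise ValueError(f"line {line_no}: invalid label {row['label']!r}")
        rows.append((row["id"], sequence, int(row["label"])))
    return rows


# Hypothetical two-row sample, not taken from the real dataset.
sample = "id,sequence,label\nEXAMPLE_1,MGGIF,1\nEXAMPLE_2,MKTAY,0\n"
rows = validate_rows(sample)
```

Raising on the first malformed row keeps the check strict; collecting errors into a list instead would be a reasonable alternative for large files.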

	Load the dataset:
	```
	import pandas as pd
	from datasets import Dataset

	# INPUT is a placeholder for the path to a CSV file in the format shown above
	df = pd.read_csv(INPUT, sep=',')
	dss = Dataset.from_pandas(df)
	```

	Predict:
	```
	import torch
	import numpy as np
	from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

	tokenizer = AutoTokenizer.from_pretrained('roa7n/knots_protbertBFD_alphafold')
	model = AutoModelForSequenceClassification.from_pretrained('roa7n/knots_protbertBFD_alphafold')

	def tokenize_function(s):
	    # The tokenizer expects residues separated by spaces, e.g. 'M G G I F'
	    seq_split = ' '.join(s['sequence'])
	    return tokenizer(seq_split)

	tokenized_dataset = dss.map(tokenize_function, num_proc=4)
	tokenized_dataset.set_format('pt')

	# <PATH> is a placeholder for the Trainer output directory
	training_args = TrainingArguments(<PATH>, fp16=True, per_device_eval_batch_size=50, report_to='none')

	trainer = Trainer(
	    model,
	    training_args,
	    train_dataset=tokenized_dataset,
	    eval_dataset=tokenized_dataset,
	    tokenizer=tokenizer
	)

	# trainer.predict returns one pair of logits per sequence
	predictions, _, _ = trainer.predict(tokenized_dataset)
	# Softmax over the two logits; keep the probability of class 1
	predictions = [np.exp(p[1]) / np.sum(np.exp(p), axis=0) for p in predictions]
	df['preds'] = predictions
	```
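
The snippet above wraps the model in two small transformations: residues are separated by spaces before tokenization, and each pair of logits is turned into a class-1 probability with a softmax. The following standard-library sketch isolates both steps. The helper names `space_residues` and `prob_class_1` and the 0.5 decision threshold are hypothetical, and subtracting the maximum logit is a numerical-stability choice added here, not something the original code does.

```python
from math import exp


def space_residues(sequence):
    """Insert a space between residues so each one becomes its own token."""
    return " ".join(sequence)


def prob_class_1(logits):
    """Softmax probability of class 1 for a pair of logits."""
    # Subtracting the maximum keeps every exponent <= 0 and avoids overflow.
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    return exps[1] / sum(exps)


spaced = space_residues("MGGIF")
p = prob_class_1([0.0, 0.0])
# Hypothetical threshold: treat probabilities of at least 0.5 as class 1.
predicted_label = int(p >= 0.5)
```

With equal logits the probability is exactly 0.5, so the threshold choice decides the label in that edge case; a different cut-off may suit applications where false positives and false negatives carry different costs.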

	## Evaluation

	Per-family metrics (TPR: true positive rate, TNR: true negative rate):

	| M1 ProtBert-BFD | Dataset size | Unknotted set size | Accuracy | TPR | TNR |
	|:----------------------------:|:------------:|:------------------:|:--------:|:------:|:------:|
	| All | 39412 | 19718 | 0.9845 | 0.9865 | 0.9825 |
	| SPOUT | 7371 | 550 | 0.9887 | 0.9951 | 0.9090 |
	| TDD | 612 | 24 | 0.9901 | 0.9965 | 0.8333 |
	| DUF | 716 | 429 | 0.9748 | 0.9721 | 0.9766 |
	| AdoMet synthase | 1794 | 240 | 0.9899 | 0.9929 | 0.9708 |
	| Carbonic anhydrase | 1531 | 539 | 0.9588 | 0.9737 | 0.9313 |
	| UCH | 477 | 125 | 0.9056 | 0.9602 | 0.7520 |
	| ATCase/OTCase | 3799 | 3352 | 0.9994 | 0.9977 | 0.9997 |
	| ribosomal-mitochondrial | 147 | 41 | 0.8571 | 1.0000 | 0.4878 |
	| membrane | 8225 | 1493 | 0.9811 | 0.9904 | 0.9390 |
	| VIT | 14262 | 12555 | 0.9872 | 0.9420 | 0.9933 |
	| biosynthesis of lantibiotics | 392 | 286 | 0.9642 | 0.9528 | 0.9685 |
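
The columns of the table are related: accuracy is the average of TPR and TNR weighted by class size. The sketch below recomputes it for two rows. It assumes that knotted proteins are the positive class and that the knotted count equals the dataset size minus the unknotted set size; the function name `accuracy_from_rates` is hypothetical.

```python
def accuracy_from_rates(total, unknotted, tpr, tnr):
    """Accuracy as the class-size-weighted average of TPR and TNR."""
    # Assumption: every protein that is not in the unknotted set is knotted.
    knotted = total - unknotted
    return (tpr * knotted + tnr * unknotted) / total


# Values copied from the "All" and "ribosomal-mitochondrial" rows of the table.
acc_all = accuracy_from_rates(39412, 19718, 0.9865, 0.9825)
acc_ribo = accuracy_from_rates(147, 41, 1.0000, 0.4878)
```

Under these assumptions both values agree with the listed accuracies (about 0.9845 and 0.8571). The second row also shows why accuracy alone can mislead on imbalanced families: a TNR below 0.5 is partly hidden because only 41 of the 147 proteins are unknotted.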


	## Citation

	BibTeX: TODO

	## Model Authors

	- Simecek: simecek@mail.muni.cz
	- Klimentova: vae@mail.muni.cz
	- Sramkova: denisa.sramkova@mail.muni.cz