---
license: mit
---
# Pseudo Genius

Pseudo Genius is a BERT-based transformer model fine-tuned to classify gene sequences as either 'normal' or 'pseudogene'. It was trained specifically on Mycobacterium leprae, chosen for its abundance of pseudogenes, but has shown consistent results on other Mycobacterium species.

## Model Description

This model was trained on a dataset extracted from Mycobacterium leprae, using DNA sequences concatenated with their respective protein sequences (separated by tabs) as inputs.

## Intended Use

The model is intended for researchers and biologists who wish to classify gene sequences quickly. While it performs well on Mycobacterium species, it has not been tested on species with a lower GC content, such as E. coli, and users should exercise caution in those cases.

## How to Use

To use the model, concatenate the DNA sequence and protein sequence of a gene, separated by a tab, and feed the result to the model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("PseudoGenius")
model = AutoModelForSequenceClassification.from_pretrained("PseudoGenius")

# Example input: DNA sequence and protein sequence, separated by a tab
sequence = "ATGCGT\tMVKVYAPASSANMSVGFDVLGAAVTPVD"

inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)

# The outputs are raw logits; apply a softmax to obtain class probabilities
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = probabilities.argmax(dim=-1).item()
```

## Limitations and Bias

The model was trained on a specific dataset with particular characteristics. It might not generalize well to organisms with different genomic properties, such as a significantly different GC content.

## Training Data

The model was trained on a dataset consisting of DNA and protein sequences from Mycobacterium leprae, with each DNA sequence concatenated to its protein sequence using a tab character.
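This tab concatenation can be sketched with a small helper (the `make_model_input` name is illustrative, not part of any released code):

```python
def make_model_input(dna_seq: str, protein_seq: str) -> str:
    """Join a DNA sequence and its protein sequence with a tab,
    matching the input format the model was trained on."""
    return f"{dna_seq}\t{protein_seq}"

# Build the same example input used in the usage snippet above
example = make_model_input("ATGCGT", "MVKVYAPASSANMSVGFDVLGAAVTPVD")
```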
41
+
42
+ Training Procedure
43
+ The model was fine-tuned on a DNA BERT (bert-base-uncased) model for 3 epochs, with a batch size of 8 and a learning rate of 2e-5.
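For reference, these hyperparameters correspond to a Hugging Face `TrainingArguments` configuration like the following (the `output_dir` path is a placeholder; the original training script is not published here):

```python
from transformers import TrainingArguments

# Hyperparameters reported above: 3 epochs, batch size 8, learning rate 2e-5
training_args = TrainingArguments(
    output_dir="pseudo-genius-finetune",  # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)
```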

## Evaluation Results

The model achieved a precision, recall, and F1 score of 1.0 on the test set, indicating that it was able to classify the gene sequences with high accuracy. However, these results should be validated with additional testing, particularly on more diverse datasets.
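For transparency, the reported metrics can be reproduced from a set of predictions with plain Python (the label convention, 1 = pseudogene and 0 = normal, is an assumption for illustration):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: one pseudogene (index 3) is missed by the classifier
p, r, f = precision_recall_f1([1, 0, 1, 1], [1, 0, 1, 0])
```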