Add example usage
README.md
CHANGED

The model has been trained for a total of 17 epochs.

The loss curve is shown:

## Example Usage

```python
import torch
from transformers import PreTrainedTokenizerFast, BertForMaskedLM

model = BertForMaskedLM.from_pretrained("LofiAmazon/BarcodeBERT-Entire-BOLD")
model.eval()

tokenizer = PreTrainedTokenizerFast.from_pretrained("LofiAmazon/BarcodeBERT-Entire-BOLD")

# The DNA sequence you want to embed.
# There should be a space after every 4 characters.
# The sequence may also contain unknown characters that are not A, C, T, or G.
# The maximum DNA sequence length (not counting spaces) is 660 characters.
dna_sequence = "AACA ATGT ATTT A-T- TTCG CCCT TGTG AATT TATT ..."

inputs = tokenizer(dna_sequence, return_tensors="pt")

# Obtain a DNA embedding, a vector of length 768 that represents this
# sequence in the model's latent space. Hidden states must be requested
# explicitly, and no_grad avoids building the autograd graph at inference.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
embedding = outputs.hidden_states[-1].mean(1).squeeze()
```
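Once you have embeddings for two sequences, a natural next step is to compare them, for example with cosine similarity. The sketch below uses random stand-in vectors purely for illustration; in practice you would substitute the 768-dimensional embeddings produced by the model as shown above.

```python
import torch
import torch.nn.functional as F

# Stand-in vectors for illustration; in practice, use the 768-dim
# embeddings computed by the model for two DNA sequences.
emb_a = torch.randn(768)
emb_b = torch.randn(768)

# Cosine similarity ranges from -1 to 1; values near 1 indicate the two
# sequences lie close together in the model's latent space.
similarity = F.cosine_similarity(emb_a, emb_b, dim=0).item()
print(f"cosine similarity: {similarity:.4f}")
```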