Update README.md
README.md CHANGED
@@ -41,13 +41,13 @@ As a masked language model this tool can be used to predict expected mutations u

 ## Training Data

-The dataset damlab/HIV_FLT was used to refine the original rostlab/Prot-bert-bfd. This dataset contains 1790 full HIV genomes from across the globe. When translated, these genomes contain approximately 3.9 million amino-acid tokens.
+The dataset [damlab/HIV_FLT](https://huggingface.co/datasets/damlab/HIV_FLT) was used to refine the original [rostlab/Prot-bert-bfd](https://huggingface.co/Rostlab/prot_bert_bfd). This dataset contains 1790 full HIV genomes from across the globe. When translated, these genomes contain approximately 3.9 million amino-acid tokens.

 ## Training Procedure

 ### Preprocessing

-As with the rostlab/Prot-bert-bfd model, the rare amino acids U, Z, O, and B were converted to X and spaces were added between each amino acid. All strings were concatenated and chunked into 256 token chunks for training. A random 20% of chunks were held for validation.
+As with the [rostlab/Prot-bert-bfd](https://huggingface.co/Rostlab/prot_bert_bfd) model, the rare amino acids U, Z, O, and B were converted to X and spaces were added between each amino acid. All strings were concatenated and chunked into 256 token chunks for training. A random 20% of chunks were held for validation.

 ### Training

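
The preprocessing described in the diff above could be sketched in Python roughly as follows. This is a minimal illustration, not the authors' actual script. The function name `preprocess`, its parameters, the fixed random seed, and the decision to keep a shorter final chunk are all assumptions. The README states only the U/Z/O/B-to-X mapping, the space separation, the 256-token chunks, and the random 20% validation split.

```python
import random
import re


def preprocess(sequences, chunk_size=256, val_fraction=0.2, seed=0):
    """Hypothetical helper mirroring the preprocessing the README describes."""
    # Map the rare amino acids U, Z, O and B to X.
    cleaned = [re.sub(r"[UZOB]", "X", seq) for seq in sequences]

    # Concatenate all sequences, treating each amino acid as one token.
    tokens = list("".join(cleaned))

    # Cut the token stream into fixed-size chunks and separate the amino
    # acids with spaces. The last chunk may be shorter (an assumption).
    chunks = [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]

    # Hold out a random fraction of the chunks for validation.
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    rng.shuffle(chunks)
    n_val = int(len(chunks) * val_fraction)
    return chunks[n_val:], chunks[:n_val]
```

Calling `train_chunks, val_chunks = preprocess(protein_strings)` on a list of translated protein strings (here `protein_strings` is a placeholder name) would return two lists of space-separated chunks. Whether the original pipeline shuffled before splitting or dropped a short trailing chunk is not stated in the README.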