Update README.md
README.md
CHANGED
|
@@ -1,3 +1,56 @@
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
-
---
|
|
|
|
| 1 |
+
# Model Card for [HIV_BERT]
|
| 2 |
+
|
| 3 |
+
## Table of Contents
|
| 4 |
+
- [Table of Contents](#table-of-contents)
|
| 5 |
+
- [Summary](#summary)
|
| 6 |
+
- [Model Description](#model-description)
|
| 7 |
+
- [Intended Uses & Limitations](#intended-uses--limitations)
|
| 8 |
+
- [How to Use](#how-to-use)
|
| 9 |
+
- [Training Data](#training-data)
|
| 10 |
+
- [Training Procedure](#training-procedure)
|
| 11 |
+
- [Preprocessing](#preprocessing)
|
| 12 |
+
- [Training](#training)
|
| 13 |
+
- [Evaluation Results](#evaluation-results)
|
| 14 |
+
- [BibTeX Entry and Citation Info](#bibtex-entry-and-citation-info)
|
| 15 |
+
|
| 16 |
+
## Summary
|
| 17 |
+
|
| 18 |
+
[The HIV-BERT model was trained as a refinement of the ProtBert-BFD model (https://huggingface.co/Rostlab/prot_bert_bfd) for HIV-centric tasks. It was refined with whole viral genomes from the Los Alamos HIV Sequence Database (https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html). This additional pretraining matters for HIV-related tasks because the original BFD database contains few viral proteins, which makes ProtBert-BFD a sub-optimal starting point for transfer learning on HIV. This model and other related HIV prediction tasks have been published (link).]
|
| 19 |
+
|
| 20 |
+
## Model Description
|
| 21 |
+
|
| 22 |
+
[Like the original ProtBert-BFD model, this model encodes each amino acid as an individual token. It was trained using Masked Language Modeling, in which a random subset of tokens is masked and the model is trained to predict them. Training used the damlab/HIV_FLT dataset with 256 amino acid chunks and a 15% mask rate.]
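
The masking step described above can be illustrated with a minimal, dependency-free sketch. It is a simplification under stated assumptions: the function name `mask_tokens` and the example peptide are hypothetical, and every selected position is replaced with `[MASK]`, whereas standard BERT-style collators also leave some selected tokens unchanged or swap in random tokens.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Mask a random subset of tokens.

    Returns (masked_tokens, labels). labels holds the original token at
    each masked position and None everywhere else.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Always mask at least one token, never more than the sequence length.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


# One token per amino acid, as in the tokenization described above.
# The peptide is an arbitrary illustrative string, not real training data.
tokens = list("MGARASVLSGGELDRWEKIRLRPGGKKKYK")
masked, labels = mask_tokens(tokens, mask_rate=0.15, seed=0)
print(" ".join(masked))
```

The `labels` list is what a training loop would score predictions against; unmasked positions carry no loss.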
|
| 23 |
+
|
| 24 |
+
## Intended Uses & Limitations
|
| 25 |
+
|
| 26 |
+
[As a masked language model, this tool can be used to predict expected mutations using a masking approach. This could help identify highly mutated sequences or sequencing artifacts, among other uses. As a BERT model, it can also serve as the base for transfer learning, for example when developing HIV-specific classification tasks.]
|
| 27 |
+
|
| 28 |
+
## How to Use
|
| 29 |
+
|
| 30 |
+
[Code snippet of AutoModelForMaskedLM prediction of V3 amino acids.]
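
Until the snippet is filled in, here is a hedged sketch of masked-residue prediction. It assumes the model is published under the repository id `damlab/HIV_BERT` (an assumption, not confirmed by this card) and uses the `fill-mask` pipeline from `transformers`, which loads a masked-LM head for the checkpoint. The helper names and the example peptide are hypothetical.

```python
def mask_residue(sequence, index):
    """Space-separate residues and replace the one at `index` with [MASK]."""
    residues = list(sequence)
    residues[index] = "[MASK]"
    return " ".join(residues)


def predict_residue(sequence, index, model_id="damlab/HIV_BERT", top_k=5):
    """Return the top_k predictions for the residue at `index`.

    model_id is an assumed repository id; replace it with the real one.
    """
    # Imported lazily so mask_residue works without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_id)
    return unmasker(mask_residue(sequence, index), top_k=top_k)


# Example peptide used only for illustration (a V3-loop-like sequence).
example = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"
print(mask_residue(example, 10))
```

Calling `predict_residue(example, 10)` downloads the checkpoint and returns a list of dictionaries; each has a `token_str` (the predicted amino acid) and a `score`.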
|
| 31 |
+
|
| 32 |
+
## Training Data
|
| 33 |
+
|
| 34 |
+
[The damlab/HIV_FLT dataset was used to refine the original Rostlab/prot_bert_bfd model. This dataset contains 1790 full HIV genomes from across the globe. When translated, these genomes contain approximately 3.9 million amino-acid tokens.]
|
| 35 |
+
|
| 36 |
+
## Training Procedure
|
| 37 |
+
|
| 38 |
+
### Preprocessing
|
| 39 |
+
|
| 40 |
+
[As with the Rostlab/prot_bert_bfd model, the rare amino acids U, Z, O, and B were converted to X and spaces were added between each amino acid. All strings were concatenated and split into 256-token chunks for training. A random 20% of chunks were held out for validation.]
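
The preprocessing steps above can be sketched in plain Python. This is an illustration under assumptions, not the authors' script: the function name `preprocess`, the fixed shuffle seed, and the decision to keep a shorter final chunk are all choices made for the example.

```python
import random
import re


def preprocess(proteins, chunk_size=256, val_fraction=0.2, seed=0):
    """Clean, concatenate, chunk, and split protein strings.

    Returns (train_chunks, val_chunks); each chunk is a space-separated
    string with one amino acid per token.
    """
    # Map the rare amino acids U, Z, O, and B to X.
    cleaned = [re.sub(r"[UZOB]", "X", p) for p in proteins]
    # Concatenate all sequences, then cut into fixed-size chunks.
    joined = "".join(cleaned)
    chunks = [
        " ".join(joined[i:i + chunk_size])  # spaces between residues
        for i in range(0, len(joined), chunk_size)
    ]
    # Randomly hold out a fraction of chunks for validation.
    rng = random.Random(seed)
    rng.shuffle(chunks)
    n_val = int(len(chunks) * val_fraction)
    return chunks[n_val:], chunks[:n_val]


# Toy input: 1280 residues, which yields five 256-token chunks.
train, val = preprocess(["ACDU" * 320])
print(len(train), len(val))
```

If the concatenated length is not a multiple of `chunk_size`, the last chunk is shorter. The card does not say whether such a remainder was kept or dropped.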
|
| 41 |
+
|
| 42 |
+
### Training
|
| 43 |
+
|
| 44 |
+
[Training was performed with the HuggingFace training module using the MaskedLM data loader with a 15% masking rate. Training used a learning rate of E-5, 50K warm-up steps, and a cosine_with_restarts learning rate schedule, and continued until the loss on the held-out dataset failed to improve for 3 consecutive epochs.]
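
The stopping rule described above (halt once the held-out loss has not improved for 3 consecutive epochs) can be sketched as a small standalone function. The name `stopping_epoch` and the loss values are hypothetical. In `transformers`, `EarlyStoppingCallback(early_stopping_patience=3)` provides comparable behaviour; the card does not state which mechanism was actually used.

```python
def stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops after `patience` consecutive epochs in which the
    validation loss fails to beat the best value seen so far.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Made-up losses: the best value is reached at epoch 2, so epochs 3, 4,
# and 5 count as non-improving and training stops at epoch 5.
print(stopping_epoch([1.00, 0.90, 0.95, 0.92, 0.91, 0.80]))
```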
|
| 45 |
+
|
| 46 |
+
## Evaluation Results
|
| 47 |
+
|
| 48 |
+
[Table of Prot-Bert and HIV-Bert loss on HIV sequence datasets]
|
| 49 |
+
|
| 50 |
+
## BibTeX Entry and Citation Info
|
| 51 |
+
|
| 52 |
+
[More Information Needed]
|
| 53 |
+
|
| 54 |
---
|
| 55 |
license: mit
|
| 56 |
+
---
|