---
tags:
- biology
- DNA
- genomics
---
This is the official pre-trained model introduced in [DNA language model GROVER learns sequence context in the human genome](https://www.nature.com/articles/s42256-024-00872-0).
|
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("PoetschLab/GROVER")
model = AutoModelForMaskedLM.from_pretrained("PoetschLab/GROVER")
```
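Once loaded, the model can be run as a standard masked language model. A minimal inference sketch follows; the DNA sequence here is an arbitrary illustrative string, not from the paper:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("PoetschLab/GROVER")
model = AutoModelForMaskedLM.from_pretrained("PoetschLab/GROVER")
model.eval()

# Illustrative DNA sequence; the tokenizer applies GROVER's BPE vocabulary
sequence = "ATGCGTACCTGAACGGTTACGATCGGCTA"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, num_tokens, vocab_size)
```

The logits give a per-token distribution over the BPE vocabulary, which can be used for masked-token prediction or, via the hidden states, for sequence embeddings.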
|
Preliminary analysis shows that re-tokenization with Byte Pair Encoding (BPE) changes significantly when a sequence is shorter than 50 nucleotides; even for longer sequences, be careful with the sequence edges.
We advise adding 100 nucleotides at the beginning and end of every sequence to guarantee that it is represented with the same tokens as in the original tokenization.
We also provide the tokenized chromosomes with their respective nucleotide mappers (available in the folder `tokenized chromosomes`).
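The flanking recommendation above can be sketched as follows; the helper name `add_flanks` and its coordinate-based interface are illustrative, not part of GROVER:

```python
def add_flanks(reference: str, start: int, end: int, flank: int = 100) -> str:
    """Return reference[start:end] padded with up to `flank` nucleotides of
    genomic context on each side, so that BPE tokenization of the region
    matches the original whole-chromosome tokenization."""
    left = reference[max(0, start - flank):start]
    right = reference[end:end + flank]
    return left + reference[start:end] + right

# Toy example: a 400-nt "chromosome" and a 20-nt region of interest
reference = "ACGT" * 100
padded = add_flanks(reference, 150, 170)
assert len(padded) == 220                       # 100 + 20 + 100
assert padded[100:120] == reference[150:170]    # region preserved in the middle
```

After tokenizing the padded sequence, map the token positions back to the central region of interest before any downstream analysis, since the flanks exist only to stabilize the tokenization.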
|
### BibTeX entry and citation info
|
```bibtex
@article{sanabria2024dna,
  title={DNA language model GROVER learns sequence context in the human genome},
  author={Sanabria, Melissa and Hirsch, Jonas and Joubert, Pierre M and Poetsch, Anna R},
  journal={Nature Machine Intelligence},
  pages={1--13},
  year={2024},
  publisher={Nature Publishing Group UK London}
}
```
|