Model Card for DNAMotifTokenizer and its Pretrained Model
DNAMotifTokenizer is a tokenizer designed for DNA sequence modeling. Unlike traditional k-mer tokenizers that rely on fixed-length subwords, this tokenizer incorporates motif-level information to preserve biologically meaningful units in DNA. By aligning tokenization with biological motifs, the model aims to improve the interpretability and biological awareness of downstream genomic language models.
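To illustrate the idea of motif-level tokenization, the sketch below shows a greedy longest-match scheme that keeps known motifs as single tokens and falls back to individual bases. The motif vocabulary and matching rule here are purely illustrative assumptions; the actual DNAMotifTokenizer vocabulary and algorithm may differ.

```python
def motif_tokenize(seq, motifs, fallback_k=1):
    """Greedy longest-match sketch: prefer known motifs, fall back to single bases.
    (Hypothetical example, not the actual DNAMotifTokenizer implementation.)"""
    motifs = sorted(motifs, key=len, reverse=True)  # try longest motifs first
    tokens, i = [], 0
    while i < len(seq):
        for m in motifs:
            if seq.startswith(m, i):  # a known motif starts here
                tokens.append(m)
                i += len(m)
                break
        else:
            tokens.append(seq[i:i + fallback_k])  # no motif matched: emit base(s)
            i += fallback_k
    return tokens

# Example: the TATA box "TATAAA" is preserved as one biologically meaningful token
print(motif_tokenize("GGTATAAACC", ["TATAAA", "CACGTG"]))
# -> ['G', 'G', 'TATAAA', 'C', 'C']
```

A fixed k-mer tokenizer would instead split "TATAAA" across arbitrary subword boundaries, which is the behavior motif-aware tokenization is designed to avoid.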
Note: The git-cloned repository is an external, publicly available dependency required to run our code. It is not affiliated with the authors and does not affect anonymity.
Model Description
- Developed by: Anonymous (for peer review)
- Model type: BERT-based masked language model architecture (BertMLM)
- Language(s) (NLP): DNA sequences
How to Get Started with the Model
- Tokenize DNA sequences

Example data file: `Anonymous-843q0u4q08/ExampleData/example.csv`

```bash
bash 00.tokenize.sh
```
- Load our pretrained tokenizer and model for downstream usage

```python
from transformers import AutoTokenizer, AutoModel

model_name = 'Anonymous-843q0u4q08/DNAMotifTokenizer'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```
Training Data
- Human reference genome (hg38)
Training Hyperparameters
```bash
python run_pretrain_nocache_wandb.py \
    --output_dir $output \
    --model_type=motifBert \
    --tokenizer_name=motif \
    --config_name=$config_dir/config.json \
    --project_name=DNAMotifTokenizer \
    --do_train \
    --train_data_file=None \
    --train_data_path=${data_dir} \
    --train_data_prefix=all_tokenized_train_ \
    --do_eval \
    --eval_data_file=${data_dir}/all_tokenized_val_00.txt \
    --mlm \
    --gradient_accumulation_steps 1 \
    --per_gpu_train_batch_size 96 \
    --per_gpu_eval_batch_size 96 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --max_steps 200000 \
    --evaluate_during_training \
    --logging_steps 1000 \
    --line_by_line \
    --learning_rate 4e-5 \
    --block_size 512 \
    --adam_epsilon 1e-6 \
    --weight_decay 0.01 \
    --beta1 0.9 \
    --beta2 0.98 \
    --mlm_probability 0.15 \
    --warmup_steps 10000 \
    --n_process 8 \
    --overwrite_output_dir
```
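The run warms up the learning rate over 10,000 steps toward the 4e-5 peak. Assuming the standard Transformers linear schedule (linear warmup, then linear decay to zero at `max_steps`; the actual scheduler used by the training script may differ), the per-step learning rate can be sketched as:

```python
def linear_schedule_lr(step, peak_lr=4e-5, warmup_steps=10_000, max_steps=200_000):
    """Linear warmup to peak_lr, then linear decay to 0 at max_steps.
    Assumed schedule, mirroring get_linear_schedule_with_warmup defaults."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # warmup phase
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))  # decay phase

print(linear_schedule_lr(5_000))    # halfway through warmup -> 2e-05
print(linear_schedule_lr(10_000))   # peak -> 4e-05
print(linear_schedule_lr(200_000))  # end of training -> 0.0
```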
Evaluation
Downstream task datasets & Metrics
- Genome Understanding Evaluation (GUE): only human data used; metric is the Matthews correlation coefficient (MCC). Downloaded from https://huggingface.co/datasets/leannmlindsey/GUE
- Nucleotide Transformer Benchmarks: metric is MCC. Downloaded from https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks_revised
- Dart-Eval: Tasks 1-3 used; metric is accuracy (ACC). Downloaded from https://www.synapse.org/Synapse:syn59522070/wiki/628450
- Genomic Benchmarks: only human data used; metric is MCC. Downloaded from https://huggingface.co/datasets/katielink/genomic-benchmarks
- SCREEN: the positive sequences of hg38 cCREs are downloaded from https://screen.wenglab.org/. We generated the negative sequences, sampled from hg38, ourselves as described in our paper. Metric is MCC.
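Most of the benchmarks above report MCC, which stays informative under class imbalance. A minimal sketch of the computation from binary confusion-matrix counts (equivalent to `sklearn.metrics.matthews_corrcoef`):

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0  # convention: 0.0 when any marginal is empty

print(mcc(tp=50, fp=0, fn=0, tn=50))    # perfect classifier -> 1.0
print(mcc(tp=25, fp=25, fn=25, tn=25))  # chance-level classifier -> 0.0
```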
Computational resources
- Hardware Type: Nvidia H100 80G
- Hours used for pretraining: ~80 hours