Model Card for DNAMotifTokenizer and its Pretrained Model
DNAMotifTokenizer is a tokenizer designed for DNA sequence modeling. Unlike traditional k-mer tokenizers that rely on fixed-length subwords, this tokenizer incorporates motif-level information to preserve biologically meaningful units in DNA. By aligning tokenization with biological motifs, the model aims to improve the interpretability and biological awareness of downstream genomic language models.
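To illustrate the idea of motif-level tokenization, the sketch below shows a greedy longest-match scheme that keeps known motifs as single tokens and falls back to individual bases. The motif vocabulary and matching rule here are purely illustrative assumptions; the actual DNAMotifTokenizer vocabulary and algorithm may differ.

```python
def motif_tokenize(seq, motifs, fallback_k=1):
    """Greedy longest-match sketch: prefer known motifs, fall back to single bases.
    (Hypothetical example, not the actual DNAMotifTokenizer implementation.)"""
    motifs = sorted(motifs, key=len, reverse=True)  # try longest motifs first
    tokens, i = [], 0
    while i < len(seq):
        for m in motifs:
            if seq.startswith(m, i):  # a known motif starts here
                tokens.append(m)
                i += len(m)
                break
        else:
            tokens.append(seq[i:i + fallback_k])  # no motif matched: emit base(s)
            i += fallback_k
    return tokens

# Example: the TATA box "TATAAA" is preserved as one biologically meaningful token
print(motif_tokenize("GGTATAAACC", ["TATAAA", "CACGTG"]))
# -> ['G', 'G', 'TATAAA', 'C', 'C']
```

A fixed k-mer tokenizer would instead split "TATAAA" across arbitrary subword boundaries, which is the behavior motif-aware tokenization is designed to avoid.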
Note: The git-cloned repository is an external, publicly available dependency required to run our code. It is not affiliated with the authors and does not affect anonymity.
Model Description
- Developed by: Anonymous (for peer review)
- Model type: BERT-based masked language model architecture (BertMLM)
- Language(s) (NLP): DNA sequences
How to Get Started with the Model
- Tokenize DNA sequences

Example data file: `Anonymous-843q0u4q08/ExampleData/example.csv`

```bash
bash 00.tokenize.sh
```
- Load our pretrained tokenizer and model for downstream usage

```python
from transformers import AutoTokenizer, AutoModel

model_name = 'Anonymous-843q0u4q08/DNAMotifTokenizer'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```
Training Data
- Human reference genome (hg38)
Training Hyperparameters
```bash
python run_pretrain_nocache_wandb.py \
    --output_dir $output \
    --model_type=motifBert \
    --tokenizer_name=motif \
    --config_name=$config_dir/config.json \
    --project_name=DNAMotifTokenizer \
    --do_train \
    --train_data_file=None \
    --train_data_path=${data_dir} \
    --train_data_prefix=all_tokenized_train_ \
    --do_eval \
    --eval_data_file=${data_dir}/all_tokenized_val_00.txt \
    --mlm \
    --gradient_accumulation_steps 1 \
    --per_gpu_train_batch_size 96 \
    --per_gpu_eval_batch_size 96 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --max_steps 200000 \
    --evaluate_during_training \
    --logging_steps 1000 \
    --line_by_line \
    --learning_rate 4e-5 \
    --block_size 512 \
    --adam_epsilon 1e-6 \
    --weight_decay 0.01 \
    --beta1 0.9 \
    --beta2 0.98 \
    --mlm_probability 0.15 \
    --warmup_steps 10000 \
    --n_process 8 \
    --overwrite_output_dir
```
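The run warms up the learning rate over 10,000 steps toward the 4e-5 peak. Assuming the standard Transformers linear schedule (linear warmup, then linear decay to zero at `max_steps`; the actual scheduler used by the training script may differ), the per-step learning rate can be sketched as:

```python
def linear_schedule_lr(step, peak_lr=4e-5, warmup_steps=10_000, max_steps=200_000):
    """Linear warmup to peak_lr, then linear decay to 0 at max_steps.
    Assumed schedule, mirroring get_linear_schedule_with_warmup defaults."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # warmup phase
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))  # decay phase

print(linear_schedule_lr(5_000))    # halfway through warmup -> 2e-05
print(linear_schedule_lr(10_000))   # peak -> 4e-05
print(linear_schedule_lr(200_000))  # end of training -> 0.0
```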
Evaluation
Downstream task datasets & Metrics
- Genome Understanding Evaluation (GUE): only human data used; metric is the Matthews correlation coefficient (MCC). Downloaded from https://huggingface.co/datasets/leannmlindsey/GUE
- Nucleotide Transformer Benchmarks: metric is MCC. Downloaded from https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks_revised
- Dart-Eval: Tasks 1-3 used; metric is accuracy (ACC). Downloaded from https://www.synapse.org/Synapse:syn59522070/wiki/628450
- Genomic Benchmarks: only human data used; metric is MCC. Downloaded from https://huggingface.co/datasets/katielink/genomic-benchmarks
- SCREEN: the positive sequences of hg38 cCREs are downloaded from https://screen.wenglab.org/. We generated the negative sequences, sampled from hg38, ourselves as described in our paper. Metric is MCC.
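Most of the benchmarks above report MCC, which stays informative under class imbalance. A minimal sketch of the computation from binary confusion-matrix counts (equivalent to `sklearn.metrics.matthews_corrcoef`):

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0  # convention: 0.0 when any marginal is empty

print(mcc(tp=50, fp=0, fn=0, tn=50))    # perfect classifier -> 1.0
print(mcc(tp=25, fp=25, fn=25, tn=25))  # chance-level classifier -> 0.0
```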
Computational resources
- Hardware Type: Nvidia H100 80G
- Hours used for pretraining: ~80 hours