Model Card for DNAMotifTokenizer and its Pretrained Model

DNAMotifTokenizer is a tokenizer designed for DNA sequence modeling. Unlike traditional k-mer tokenizers that rely on fixed-length subwords, this tokenizer incorporates motif-level information to preserve biologically meaningful units in DNA. By aligning tokenization with biological motifs, the model aims to improve the interpretability and biological awareness of downstream genomic language models.
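To illustrate the general idea (a hypothetical sketch, not the actual DNAMotifTokenizer algorithm or vocabulary): a motif-aware tokenizer can greedily match entries from a motif vocabulary before falling back to single nucleotides, so that biologically meaningful units such as the TATA box survive as single tokens instead of being split across fixed-length k-mers. The motif set below is purely illustrative.

```python
# Hypothetical sketch of motif-aware tokenization: greedy longest-match
# against a small motif vocabulary, falling back to single bases.
# The motif list and matching strategy are illustrative assumptions,
# not the actual DNAMotifTokenizer vocabulary or algorithm.

MOTIFS = {"TATAAA", "CACGTG", "GGGCGG"}  # e.g. TATA box, E-box, GC box
MAX_MOTIF_LEN = max(len(m) for m in MOTIFS)

def tokenize(seq):
    tokens, i = [], 0
    while i < len(seq):
        # Try the longest candidate motif first.
        for k in range(min(MAX_MOTIF_LEN, len(seq) - i), 1, -1):
            if seq[i:i + k] in MOTIFS:
                tokens.append(seq[i:i + k])
                i += k
                break
        else:
            tokens.append(seq[i])  # fall back to a single nucleotide
            i += 1
    return tokens

tokenize("ATATAAAG")  # ['A', 'TATAAA', 'G']
```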

Note: The repository cloned via `git clone` is an external, publicly available dependency required to run our code. It is not affiliated with the authors and does not affect anonymity.

Model Description

  • Developed by: Anonymous (for peer review)
  • Model type: BERT-based masked language model architecture (BertMLM)
  • Language(s) (NLP): DNA sequences

How to Get Started with the Model

  • Tokenize DNA sequences (example data file: Anonymous-843q0u4q08/ExampleData/example.csv)

    ```bash
    bash 00.tokenize.sh
    ```

  • Load our pretrained tokenizer and model for downstream usage

    ```python
    from transformers import AutoTokenizer, AutoModel

    model_name = 'Anonymous-843q0u4q08/DNAMotifTokenizer'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    ```

Training Data

  • Human reference genome (hg38)
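
As an illustration of the kind of preprocessing genome-scale pretraining typically requires (a sketch under stated assumptions, not the authors' actual pipeline): the reference sequence can be split into fixed-length windows, discarding windows dominated by ambiguous `N` bases. The window size and `N`-fraction threshold below are hypothetical.

```python
# Illustrative sketch (not the authors' pipeline): split a genome
# sequence into fixed-length windows for pretraining, skipping windows
# with too many ambiguous 'N' bases. Window size and threshold are
# assumptions chosen for illustration.
def windows(seq, size=512, max_n_frac=0.1):
    out = []
    for start in range(0, len(seq) - size + 1, size):
        chunk = seq[start:start + size]
        if chunk.count("N") / size <= max_n_frac:
            out.append(chunk)
    return out
```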

Training Hyperparameters

```bash
python run_pretrain_nocache_wandb.py \
    --output_dir $output \
    --model_type=motifBert \
    --tokenizer_name=motif \
    --config_name=$config_dir/config.json \
    --project_name=DNAMotifTokenizer \
    --do_train \
    --train_data_file=None \
    --train_data_path=${data_dir} \
    --train_data_prefix=all_tokenized_train_ \
    --do_eval \
    --eval_data_file=${data_dir}/all_tokenized_val_00.txt \
    --mlm \
    --gradient_accumulation_steps 1 \
    --per_gpu_train_batch_size 96 \
    --per_gpu_eval_batch_size 96 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --max_steps 200000 \
    --evaluate_during_training \
    --logging_steps 1000 \
    --line_by_line \
    --learning_rate 4e-5 \
    --block_size 512 \
    --adam_epsilon 1e-6 \
    --weight_decay 0.01 \
    --beta1 0.9 \
    --beta2 0.98 \
    --mlm_probability 0.15 \
    --warmup_steps 10000 \
    --n_process 8 \
    --overwrite_output_dir
```
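
A quick sanity check on the training budget these flags imply (assuming a single GPU, since only per-GPU sizes are given):

```python
# Back-of-envelope token budget implied by the flags above, assuming a
# single GPU (per_gpu_train_batch_size=96, gradient_accumulation_steps=1).
per_gpu_batch = 96
grad_accum = 1
block_size = 512        # tokens per sequence
max_steps = 200_000

effective_batch = per_gpu_batch * grad_accum      # 96 sequences per step
tokens_per_step = effective_batch * block_size    # 49,152 tokens per step
total_tokens = tokens_per_step * max_steps        # ~9.8e9 tokens seen
```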

Evaluation

Downstream task datasets & Metrics

Computational resources

  • Hardware Type: NVIDIA H100 80GB
  • Hours used for pretraining: ~80