
mRNA2vec

This is the code for the AAAI 2025 paper mRNA2vec.

Pre-training stage

We collect mRNA sequences from five species (human, rat, mouse, chicken, and zebrafish) from the NIH using the datasets API.

You can also download the pre-training data used in this paper.

The pre-training took approximately 3 hours on four Nvidia GeForce RTX 4090 GPUs.

torchrun --nproc_per_node=4 pretrain_mrna2vec.py
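The pre-training operates on the 5'UTR-CDS region of each mRNA sequence. As an illustration of the kind of preprocessing such a pipeline typically needs, here is a minimal sketch of codon-level tokenization; the function name `codon_tokens` and the exact normalization steps are assumptions for illustration, not the repo's actual implementation.

```python
# Hypothetical preprocessing sketch (not the repo's actual code): split an
# mRNA sequence into non-overlapping codon (3-mer) tokens for a language model.
def codon_tokens(seq, k=3):
    """Split an mRNA sequence into non-overlapping k-mer (codon) tokens."""
    seq = seq.upper().replace("U", "T")  # normalize RNA to the DNA alphabet
    # step through the sequence k bases at a time, dropping any trailing remainder
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

print(codon_tokens("AUGGCCAAA"))  # -> ['ATG', 'GCC', 'AAA']
```

A vocabulary built over these tokens (4^3 = 64 codons plus special tokens) is what a codon-level encoder would consume.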

Downstream task

Using the checkpoint from pre-training as the encoder, we fine-tune the model on different downstream tasks.

You can also download our checkpoint pre-trained on 510K sequences.

For example, for the Translation Efficiency task on the HEK dataset, set task_name = "HEK_TE". All downstream task data can be downloaded.

python sft_exp.py --task_name "HEK_TE" --exp_name "d2v" --data_path "data1" --model_name "model_d2v_mfe0.1_ss0.001_specific.pt" --load_model True --cuda_device "3"
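The command above implies an argument interface roughly like the following sketch. The flag names and example values are taken from the invocation itself; the parser structure, defaults, and the boolean parsing of --load_model are assumptions for illustration, not necessarily how sft_exp.py defines them.

```python
# Hypothetical sketch of the flag interface implied by the sft_exp.py command
# above; defaults and types are assumptions, only the flag names come from it.
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="Fine-tune mRNA2vec on a downstream task")
    p.add_argument("--task_name", default="HEK_TE")    # e.g. HEK translation efficiency
    p.add_argument("--exp_name", default="d2v")
    p.add_argument("--data_path", default="data1")
    p.add_argument("--model_name", default="model_d2v_mfe0.1_ss0.001_specific.pt")
    # accept the literal strings "True"/"False" from the command line
    p.add_argument("--load_model", type=lambda s: s.lower() == "true", default=True)
    p.add_argument("--cuda_device", default="0")       # GPU index as a string
    return p

args = build_parser().parse_args(["--task_name", "HEK_TE", "--cuda_device", "3"])
print(args.task_name, args.cuda_device)  # -> HEK_TE 3
```

Passing strings such as "True" for --load_model is why a string-to-bool conversion is sketched here; argparse's plain `type=bool` would treat any non-empty string as true.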

License

This code is free to use for research purposes, but commercial use requires explicit permission from the author.

If you use this code in your research, please cite our paper:

@article{zhang2024mrna2vec,
  title={mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design},
  author={Zhang, Honggen and Gao, Xiangrui and Zhang, June and Lai, Lipeng},
  journal={arXiv preprint arXiv:2408.09048},
  year={2024}
}