# mRNA2vec
This is the code for the AAAI25 paper [mRNA2vec](https://arxiv.org/pdf/2408.09048).

## Pre-training stage
We collected mRNA sequences from five species (human, rat, mouse, chicken, and zebrafish) from the NIH with the
[datasets API](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/datasets/).
You can also download the [pre-training data](https://drive.google.com/drive/folders/1zTUZ9qGdjJJqdmjjzdZlmBUU8FgpTLtb?usp=sharing) used in this paper.
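The exact download commands are not part of this repository; the sketch below shows one way to pull per-species mRNA FASTA files with the NCBI `datasets` CLI. The taxon spellings and output paths are assumptions for illustration, not the settings used for the paper.
```bash
# Hedged sketch: fetch reference-assembly mRNA FASTA for each species with the
# NCBI datasets CLI. Taxon names and output filenames are illustrative only.
for taxon in "human" "rat" "mouse" "chicken" "zebrafish"; do
  datasets download genome taxon "$taxon" --reference --include rna \
    --filename "${taxon}_rna.zip"
  unzip -o "${taxon}_rna.zip" -d "${taxon}_rna"  # rna.fna lands under ncbi_dataset/data/
done
```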
The pre-training took approximately 3 hours on four NVIDIA GeForce RTX 4090 GPUs:
```bash
torchrun --nproc_per_node=4 pretrain_mrna2vec.py
```
## Downstream tasks
Using the checkpoint from the pre-training as the encoder, we fine-tune the model on different downstream tasks.
You can also download our [checkpoint](./checkpoint/model_d2v_mfe0.01_ss0.001_warmup.pt) pre-trained on 510K sequences.
For example, for the Translation Efficiency (TE) problem on the HEK dataset, set `task_name = "HEK_TE"`. All downstream task data can be [downloaded here](https://drive.google.com/drive/folders/1zTUZ9qGdjJJqdmjjzdZlmBUU8FgpTLtb?usp=sharing).
```bash
python sft_exp.py --task_name "HEK_TE" --exp_name "d2v" --data_path "data1" --model_name "model_d2v_mfe0.1_ss0.001_specific.pt" --load_model True --cuda_device "3"
```
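Before launching a fine-tuning run, it can help to confirm that the downloaded checkpoint deserializes. Below is a minimal sketch; the path matches the checkpoint linked above, and the state-dict layout is an assumption, so it only reports what was loaded.
```bash
# Sanity-check the pre-trained checkpoint with a short inline Python snippet.
python - <<'EOF'
import torch

# Path of the checkpoint linked above; adjust if you saved it elsewhere.
state = torch.load("checkpoint/model_d2v_mfe0.01_ss0.001_warmup.pt", map_location="cpu")
# The internal layout is not documented here, so just report what we got.
if isinstance(state, dict):
    print(f"{len(state)} top-level entries, e.g.: {list(state)[:5]}")
else:
    print(type(state))
EOF
```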
## License
This code is free to use for research purposes, but commercial use requires explicit permission from the author.
If you use this code in your research, please cite our paper:
```
@article{zhang2024mrna2vec,
  title={mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design},
  author={Zhang, Honggen and Gao, Xiangrui and Zhang, June and Lai, Lipeng},
  journal={arXiv preprint arXiv:2408.09048},
  year={2024}
}
```