# mRNA2vec
This repository contains the code for the AAAI 2025 paper [mRNA2vec](https://arxiv.org/pdf/2408.09048).
![mRNA2vec diagram](./diagram_mRNA2vec.png)

## Pre-training stage 
We collected mRNA sequences for five species (human, rat, mouse, chicken, and zebrafish) from the NIH with the
[datasets API](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/datasets/).

You can also download the [pre-training data](https://drive.google.com/drive/folders/1zTUZ9qGdjJJqdmjjzdZlmBUU8FgpTLtb?usp=sharing) used in this paper.
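
If you collect the sequences yourself, NCBI `datasets` packages typically deliver them as FASTA files. The following is a minimal sketch of reading such a file with only the Python standard library. The function name `read_fasta` and the record IDs in the example are hypothetical and not part of this repository, and the preprocessing actually used for the paper may differ.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} mapping.

    Hypothetical helper for illustration; not part of the mRNA2vec code.
    """
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current_id is not None:
                records[current_id] = "".join(chunks)
            header = line[1:].split()
            # Use the first whitespace-delimited token as the record ID.
            current_id = header[0] if header else f"record_{len(records)}"
            chunks = []
        elif current_id is not None:
            # Sequence lines may be wrapped; normalize case and join later.
            chunks.append(line.upper())
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


# Tiny in-memory example with made-up records.
example = """>seq1 made-up record
ATGGCC
TAA
>seq2 another made-up record
atgtga
"""
sequences = read_fasta(example.splitlines())
```

For a file on disk, pass the open file handle instead, for example `read_fasta(open(path))`, where `path` points to a FASTA file extracted from the downloaded package.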

The pre-training took approximately 3 hours on four Nvidia GeForce RTX 4090 GPUs.
```bash
torchrun --nproc_per_node=4 pretrain_mrna2vec.py
```

## Downstream task
Using the checkpoint from pre-training as the encoder, we fine-tune the model on different downstream tasks.

You can also download our [checkpoint](./checkpoint/model_d2v_mfe0.01_ss0.001_warmup.pt) pre-trained on 510K sequences.

For example, for the translation efficiency (TE) task on the HEK dataset, pass `--task_name "HEK_TE"`. All downstream task data can be [downloaded](https://drive.google.com/drive/folders/1zTUZ9qGdjJJqdmjjzdZlmBUU8FgpTLtb?usp=sharing) from the same shared folder as the pre-training data.
```bash
python sft_exp.py --task_name "HEK_TE" --exp_name "d2v" --data_path "data1" --model_name "model_d2v_mfe0.1_ss0.001_specific.pt" --load_model True --cuda_device "3"
```
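
To script fine-tuning runs, the command above can be assembled programmatically. The sketch below only builds the argument list from the flags shown in this README. The helper name `build_sft_command` and its default values are assumptions made for illustration, not part of this repository.

```python
import shlex
from typing import List


def build_sft_command(
    task_name: str,
    model_name: str,
    data_path: str = "data1",
    exp_name: str = "d2v",
    cuda_device: str = "0",
    load_model: bool = True,
) -> List[str]:
    """Build the argument list for sft_exp.py (hypothetical helper).

    Only flags that appear in the README command are used; values are
    passed as strings, exactly as they would be typed on the command line.
    """
    return [
        "python", "sft_exp.py",
        "--task_name", task_name,
        "--exp_name", exp_name,
        "--data_path", data_path,
        "--model_name", model_name,
        "--load_model", str(load_model),
        "--cuda_device", cuda_device,
    ]


# Reproduce the HEK_TE command shown above.
cmd = build_sft_command(
    task_name="HEK_TE",
    model_name="model_d2v_mfe0.1_ss0.001_specific.pt",
    cuda_device="3",
)
print(shlex.join(cmd))  # shell-safe rendering of the argument list
```

To launch the run from Python, pass the list to `subprocess.run(cmd, check=True)` from the repository root.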

## License
This code is free to use for research purposes, but commercial use requires explicit permission from the author.

If you use this code in your research, please cite our paper:
```
@article{zhang2024mrna2vec,
  title={mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design},
  author={Zhang, Honggen and Gao, Xiangrui and Zhang, June and Lai, Lipeng},
  journal={arXiv preprint arXiv:2408.09048},
  year={2024}
}
```