# mRNA2vec

This is the code for the AAAI 2025 paper *mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design*.

## Pre-training stage

We collected mRNA sequences from five species (human, rat, mouse, chicken, and zebrafish) from the NIH using the `datasets` API.
You can also download the pre-training data used in this paper.
Pre-training took approximately 3 hours on four Nvidia GeForce RTX 4090 GPUs:
```shell
torchrun --nproc_per_node=4 pretrain_mrna2vec.py
```
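Before pre-training, each 5'UTR+CDS sequence must be turned into tokens. A minimal sketch of codon-level tokenization is shown below; the special tokens, vocabulary layout, and function names are illustrative assumptions, not the repository's actual API.

```python
# Hypothetical sketch: codon-level tokenization of an mRNA sequence.
# Vocabulary layout and special tokens are assumptions for illustration.

def codon_tokenize(seq: str) -> list[str]:
    """Split an mRNA sequence into 3-nt codon tokens (trailing bases dropped)."""
    seq = seq.upper().replace("U", "T")  # normalize RNA (U) to DNA-style (T)
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

# Simple codon vocabulary: 4^3 = 64 codons plus a few special tokens.
BASES = "ACGT"
VOCAB = ["<pad>", "<mask>", "<cls>"] + [a + b + c for a in BASES for b in BASES for c in BASES]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode(seq: str) -> list[int]:
    """Map a sequence to token ids, prepending a <cls> token."""
    return [TOKEN_TO_ID["<cls>"]] + [TOKEN_TO_ID[c] for c in codon_tokenize(seq)]
```

Codon-level tokens keep the vocabulary small (64 codons) while aligning token boundaries with the reading frame of the CDS.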
## Downstream tasks

Using the checkpoint from pre-training as the encoder, we fine-tune the model on different downstream tasks.
You can also download our checkpoint pre-trained on 510K sequences.
For example, for the Translation Efficiency (TE) problem on the HEK dataset, set `task_name = "HEK_TE"`. All downstream task data can be downloaded.
```shell
python sft_exp.py --task_name "HEK_TE" --exp_name "d2v" --data_path "data1" --model_name "model_d2v_mfe0.1_ss0.001_specific.pt" --load_model True --cuda_device "3"
```
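The fine-tuning setup can be sketched as a pre-trained encoder feeding a small regression head that predicts a scalar (e.g. translation efficiency). The head below is a hypothetical illustration using random tensors in place of real encoder outputs; the actual head architecture in `sft_exp.py` may differ.

```python
# Hypothetical sketch: regression head for a TE-style downstream task.
# Dimensions and architecture are illustrative assumptions.
import torch
import torch.nn as nn

class TERegressionHead(nn.Module):
    """Mean-pool encoder outputs over valid positions, then regress a scalar."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.GELU(),
            nn.Linear(hidden_dim // 2, 1),
        )

    def forward(self, hidden_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim); mask: (batch, seq_len)
        mask = mask.unsqueeze(-1).float()
        pooled = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1.0)
        return self.proj(pooled).squeeze(-1)  # (batch,)

# Toy usage with random tensors standing in for encoder outputs:
head = TERegressionHead(hidden_dim=256)
h = torch.randn(4, 128, 256)               # pretend encoder hidden states
m = torch.ones(4, 128, dtype=torch.long)   # attention mask (all valid)
pred = head(h, m)                          # one TE prediction per sequence
```

Masked mean-pooling keeps padding positions from biasing the pooled representation when sequences in a batch have different lengths.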
## License

This code is free to use for research purposes; commercial use requires explicit permission from the authors.
If you use this code in your research, please cite our paper:
```bibtex
@article{zhang2024mrna2vec,
  title={mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design},
  author={Zhang, Honggen and Gao, Xiangrui and Zhang, June and Lai, Lipeng},
  journal={arXiv preprint arXiv:2408.09048},
  year={2024}
}
```