	# mRNA2vec
	This repository contains the code for the AAAI 2025 paper [mRNA2vec](https://arxiv.org/pdf/2408.09048).
	![Alt text](./diagram_mRNA2vec.png)

	## Pre-training stage
	We collected mRNA sequences from five species (human, rat, mouse, chicken, and zebrafish) from the NIH using the
	[datasets API](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/datasets/).

	You can also download the [pre-training data](https://drive.google.com/drive/folders/1zTUZ9qGdjJJqdmjjzdZlmBUU8FgpTLtb?usp=sharing) used in the paper.

	Pre-training took approximately 3 hours on four NVIDIA GeForce RTX 4090 GPUs:
	```bash
	torchrun --nproc_per_node=4 pretrain_mrna2vec.py
	```
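If a different number of GPUs is available, the process count passed to `torchrun` can be adjusted. The following is a minimal sketch, assuming the script needs no other changes for a different GPU count (the README only reports the four-GPU run). `build_pretrain_cmd` and `NUM_GPUS` are hypothetical names used for illustration and are not part of the repository.

```bash
# Sketch: build the torchrun launch command for a given GPU count.
# build_pretrain_cmd and NUM_GPUS are illustrative names (assumptions).
build_pretrain_cmd() {
  # First argument: number of processes (one per GPU); defaults to 4 as in the README.
  num_gpus="${1:-4}"
  echo "torchrun --nproc_per_node=${num_gpus} pretrain_mrna2vec.py"
}

# Print the command so it can be inspected before launching.
build_pretrain_cmd "${NUM_GPUS:-4}"
```

Printing the command first makes it easy to check the process count before starting a multi-hour run. Whether hyperparameters such as batch size should change with the GPU count is not stated in the README.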

	## Downstream task
	Using the pre-trained checkpoint as the encoder, we fine-tune the model on different downstream tasks.

	You can also download our [checkpoint](./checkpoint/model_d2v_mfe0.01_ss0.001_warmup.pt) pre-trained on 510K sequences.

	For example, for the translation efficiency (TE) task on the HEK dataset, use `--task_name "HEK_TE"`. All downstream task data can be [downloaded here](https://drive.google.com/drive/folders/1zTUZ9qGdjJJqdmjjzdZlmBUU8FgpTLtb?usp=sharing).
	```bash
	python sft_exp.py --task_name "HEK_TE" --exp_name "d2v" --data_path "data1" --model_name "model_d2v_mfe0.1_ss0.001_specific.pt" --load_model True --cuda_device "3"
	```
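To switch tasks or GPUs without retyping the full command, the invocation above can be wrapped in a small helper. This is a sketch only: `build_sft_cmd` is a hypothetical function, and the flags and default values are copied from the example command rather than from the script's documentation.

```bash
# Sketch: assemble the fine-tuning command shown in the README from variables.
# build_sft_cmd is an illustrative helper (assumption), not part of the repository.
build_sft_cmd() {
  task_name="$1"                                             # e.g. "HEK_TE"
  cuda_device="${2:-0}"                                      # GPU index, passed as a string
  model_name="${3:-model_d2v_mfe0.1_ss0.001_specific.pt}"    # checkpoint name from the README example
  printf 'python sft_exp.py --task_name "%s" --exp_name "d2v" --data_path "data1" --model_name "%s" --load_model True --cuda_device "%s"\n' \
    "$task_name" "$model_name" "$cuda_device"
}

# Reproduce the README example: HEK translation efficiency on GPU 3.
build_sft_cmd "HEK_TE" 3
```

The helper only prints the command. Pipe its output to `sh` or copy it into the terminal once the checkpoint and data paths are in place.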

	## License
	This code is free to use for research purposes, but commercial use requires explicit permission from the author.

	If you use this code in your research, please cite our paper:
	```
	@article{zhang2024mrna2vec,
	title={mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design},
	author={Zhang, Honggen and Gao, Xiangrui and Zhang, June and Lai, Lipeng},
	journal={arXiv preprint arXiv:2408.09048},
	year={2024}
	}
	```