T5 Biological Sequence + English Mixed Model
A T5-small model was trained on a mixture of DNA, protein sequences, and English text data, primarily for downstream fine-tuning tasks such as sequence function prediction.
Tokenizer Training
T5 uses the Unigram tokenizer. The input data consists of DNA sequences, protein sequences, and English text.
The specific training script is: t5_token_gene_eng.py.
Tokenizer training requires more than 128GB of memory and can be time-consuming.
You may use the pre-trained tokenizer directly:
trained_t5_gene_eng_tokenizer
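The training script itself is not reproduced here, but the core of Unigram tokenizer training can be sketched with the Hugging Face `tokenizers` library. This is a minimal sketch: the corpus lines, vocabulary size, and special tokens below are illustrative placeholders, not the actual settings used by t5_token_gene_eng.py.

```python
# Sketch of Unigram tokenizer training on a mixed DNA/protein/English corpus.
# Corpus lines, vocab_size, and special tokens are illustrative placeholders,
# not the settings used by t5_token_gene_eng.py.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# A tiny stand-in for the real mixed corpus (DNA, protein, English).
corpus = [
    "ATGGCGTACGTTAGCGGATCCATGAAACCCGGGTTT",            # DNA
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQD",  # protein
    "the quick brown fox jumps over the lazy dog",      # English
    "a t5 model reads text and writes text",            # English
]

tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.UnigramTrainer(
    vocab_size=100,  # T5 normally uses a vocabulary of ~32k pieces
    special_tokens=["<pad>", "</s>", "<unk>"],
    unk_token="<unk>",
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Tokenize a mixed-modality string with the freshly trained vocabulary.
encoding = tokenizer.encode("ATGGCG the fox")
print(encoding.tokens)
```

A single shared vocabulary like this lets one model consume DNA, protein, and English inputs without modality-specific preprocessing.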
Pre-training the T5 Model
The T5 model was trained from scratch on a mixed dataset of DNA, protein sequences, and English text. The steps are as follows:
- Obtain the T5 configuration by running get_t5_config.ipynb.
- Prepare the mixed training data by running combine_data.ipynb.
- Launch the pre-training script: ./run_pt.sh
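The data-mixing step above can be sketched as follows. This is a conceptual sketch only: the file paths and the uniform line-level shuffle are assumptions, not necessarily what combine_data.ipynb does.

```python
# Illustrative sketch of mixing DNA, protein, and English corpora into one
# shuffled pre-training file. File paths and the uniform line-level shuffle
# are assumptions, not taken from combine_data.ipynb.
import random

def combine_corpora(sources, out_path, seed=42):
    """Read non-empty lines from each source file, shuffle them together,
    and write a single mixed training file. Returns the line count."""
    lines = []
    for path in sources:
        with open(path, encoding="utf-8") as f:
            lines.extend(line.strip() for line in f if line.strip())
    random.Random(seed).shuffle(lines)  # deterministic shuffle for reproducibility
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)
```

Shuffling at the line level keeps each pre-training batch a mixture of all three modalities, rather than long single-modality runs.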
Training takes approximately 5 hours on 8 NVIDIA RTX 4090 GPUs.
Fine-tuning the T5 Model
- Protein Function Prediction: t5_gene_eng_abstract_ft_protein_fun.ipynb
- Amazon Review Summarization (for reference): t5_gene_eng_abstract_ft_review.ipynb
- CNN Article Summarization (for reference): t5_gene_eng_abstract_ft_cnn.ipynb
- DNA-Protein Coding Prediction (experimental, poor performance, for reference only): t5_gene_eng_abstract_ft_dna_protein.ipynb
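All of these notebooks follow T5's text-to-text convention, where every task is cast as an input string mapped to a target string. A minimal sketch of how a protein-function example might be formatted; the "predict protein function:" prefix and the truncation length are assumptions, not necessarily the choices made in t5_gene_eng_abstract_ft_protein_fun.ipynb:

```python
# Sketch: casting protein function prediction as a T5 text-to-text pair.
# The task prefix and max_len are assumed conventions, not necessarily those
# used in t5_gene_eng_abstract_ft_protein_fun.ipynb.
def to_text_pair(sequence, function, max_len=512):
    """Return (input_text, target_text) for T5-style fine-tuning."""
    input_text = "predict protein function: " + sequence[:max_len]
    target_text = function
    return input_text, target_text

src, tgt = to_text_pair("MKTAYIAKQR", "ATP binding")
# src == "predict protein function: MKTAYIAKQR", tgt == "ATP binding"
```

Because every task shares this string-in/string-out interface, the same pre-trained checkpoint can be fine-tuned on protein function, review summarization, or article summarization by changing only the data formatting.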
Additional Experiments
Directory: multi_trans_lab
This contains experimental tasks exploring cross-modal and cross-lingual transfer capabilities, such as English-to-Spanish summarization and even English-to-DNA sequence generation. These are research-oriented and provided for academic reference only.
- NC_000001.11_chapter_1.fna.p1: Partial human genome sequence data.
- get_dna_summary.ipynb: Generates summaries for genomic DNA sequences (different fine-tuned models can be used; see the fine-tuning section above).
- get_gene_summary.ipynb: Generates summaries for coding DNA regions (the model can be swapped).
- dna_abstract_search_bench.ipynb: Indirectly evaluates summary quality via search-based methods. Results are currently poor; research is ongoing.
- abstract_trans_en_es.ipynb: Baseline test for transferring English summarization capability to Spanish.
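The search-based evaluation idea can be sketched as follows: a summary is scored by whether it retrieves its own source text ahead of the others (top-1 retrieval accuracy). The word-overlap similarity below is a deliberate simplification; the actual retrieval method in dna_abstract_search_bench.ipynb may differ.

```python
# Sketch of search-based summary evaluation: a good summary should retrieve
# its own source document ahead of all others. Jaccard word overlap is a
# simplification of whatever retrieval dna_abstract_search_bench.ipynb uses.
def overlap(a, b):
    """Jaccard similarity over lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def top1_retrieval_accuracy(summaries, documents):
    """Fraction of summaries whose best-matching document is their own source
    (summaries[i] is assumed to summarize documents[i])."""
    hits = 0
    for i, summary in enumerate(summaries):
        scores = [overlap(summary, doc) for doc in documents]
        hits += int(scores.index(max(scores)) == i)
    return hits / len(summaries)
```

An accuracy near 1.0 means summaries are specific enough to identify their sources; an accuracy near chance level indicates the summaries carry little sequence-specific information.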