# T5 Biological Sequence + English Mixed Model

A T5-small model was trained on a mixture of DNA sequences, protein sequences, and English text, primarily for downstream fine-tuning tasks such as sequence function prediction.

## Tokenizer Training

T5 uses a Unigram tokenizer. The training corpus consists of DNA sequences, protein sequences, and English text.

The training script is `t5_token_gene_eng.py`.

Tokenizer training requires more than 128 GB of memory and can be time-consuming. Alternatively, you can use the pre-trained tokenizer directly: **trained_t5_gene_eng_tokenizer**.
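The training script itself is not reproduced here; below is a minimal sketch of training a Unigram tokenizer on a mixed corpus with the Hugging Face `tokenizers` library. The corpus contents, vocabulary size, and special tokens are illustrative assumptions, not values taken from `t5_token_gene_eng.py`.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Illustrative mixed corpus: DNA, protein, and English samples.
corpus = [
    "ATGCGTACGTTAGCTAGGCTA",                          # DNA
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIE",                # protein
    "The quick brown fox jumps over the lazy dog.",   # English
] * 100

tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Tiny vocab for the demo; a real run would use a T5-scale vocabulary.
trainer = trainers.UnigramTrainer(
    vocab_size=100,
    special_tokens=["<pad>", "</s>", "<unk>"],
    unk_token="<unk>",
)
tokenizer.train_from_iterator(corpus, trainer)

ids = tokenizer.encode("ATGC MKTA hello").ids
```

The memory cost mentioned above comes from holding the full mixed corpus and the Unigram candidate vocabulary in RAM, which is why reusing the shipped tokenizer is usually preferable.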
## Pre-training the T5 Model

The T5-small model is trained from scratch on the mixed dataset of DNA sequences, protein sequences, and English text. The steps are as follows:

1. Obtain the T5 configuration by running `get_t5_config.ipynb`.
2. Prepare the mixed training data by running `combine_data.ipynb`.
3. Launch the pre-training script: `./run_pt.sh`.

Training takes approximately 5 hours on 8x NVIDIA RTX 4090 GPUs.
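The real configuration comes from `get_t5_config.ipynb`; the sketch below only shows the general pattern of initializing a randomly weighted T5 model from a config with Hugging Face `transformers`. The tiny dimensions are placeholders for illustration, not the settings used in this repo.

```python
from transformers import T5Config, T5ForConditionalGeneration

# Placeholder dimensions; the real values come from get_t5_config.ipynb
# and the trained tokenizer's vocabulary size.
config = T5Config(
    vocab_size=1000,
    d_model=64,
    d_ff=128,
    num_layers=2,
    num_heads=4,
)

# Constructing the model from a config gives random weights,
# i.e. training truly starts from scratch (no pre-trained checkpoint).
model = T5ForConditionalGeneration(config)
n_params = sum(p.numel() for p in model.parameters())
```

Because the model is built from a config rather than `from_pretrained(...)`, no checkpoint is downloaded and all weights are freshly initialized before `./run_pt.sh` starts training.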
## Fine-tuning the T5 Model

1. **Protein Function Prediction**: `t5_gene_eng_abstract_ft_protein_fun.ipynb`
2. **Amazon Review Summarization** (for reference): `t5_gene_eng_abstract_ft_review.ipynb`
3. **CNN Article Summarization** (for reference): `t5_gene_eng_abstract_ft_cnn.ipynb`
4. **DNA-to-Protein Coding Prediction** (experimental; poor performance, for reference only): `t5_gene_eng_abstract_ft_dna_protein.ipynb`
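All four tasks follow T5's text-to-text pattern: each example becomes an input string with a task prefix and a plain-text target. A stdlib-only sketch of how a protein function example might be formatted is shown below; the prefix wording and field names are assumptions for illustration, not taken from the notebooks.

```python
def format_protein_example(sequence: str, function: str) -> dict:
    """Cast protein function prediction as a text-to-text pair.

    The 'predict protein function:' prefix is an illustrative choice;
    the actual notebook may use a different prompt.
    """
    return {
        "input_text": f"predict protein function: {sequence}",
        "target_text": function,
    }

example = format_protein_example(
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIE",
    "DNA-binding transcription regulator",
)
```

The summarization tasks differ only in the prefix and target (article in, summary out), which is why the same fine-tuning scaffolding can be reused across notebooks.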
## Additional Experiments

Directory: `multi_trans_lab`

This directory contains experimental tasks exploring cross-modal and cross-lingual transfer, such as English-to-Spanish summarization and even English-to-DNA sequence generation. These are research-oriented and provided for academic reference only.

- `NC_000001.11_chapter_1.fna.p1`: partial human genome sequence data.
- `get_dna_summary.ipynb`: generates summaries for genomic DNA sequences (different fine-tuned models can be used; see the fine-tuning section above).
- `get_gene_summary.ipynb`: generates summaries for coding DNA regions (the model can be swapped).
- `dna_abstract_search_bench.ipynb`: indirectly evaluates summary quality via search-based methods. Results are currently poor; research is ongoing.
- `abstract_trans_en_es.ipynb`: baseline test for transferring English summarization capability to Spanish.
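The `.fna` naming suggests FASTA-formatted nucleotide data; a small stdlib-only helper for pulling `(header, sequence)` records out of such input is sketched below. This assumes plain FASTA text — the `.p1` suffix on the repo's file may indicate different packaging, so adjust accordingly.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, parts = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif line:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)

# Tiny illustrative FASTA snippet (not real data from the repo's file).
demo = io.StringIO(">chr1 demo\nATGCGTAC\nGGTATTAG\n")
records = list(read_fasta(demo))
```

Sequences extracted this way can then be chunked and fed to whichever fine-tuned model the summary notebooks are configured to use.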
------
# T5 Biological Sequence + English Mixed Model / Chinese

A T5-small model was trained on a mixture of DNA sequences, protein sequences, and English text, used for fine-tuning tests such as sequence function prediction.

## Tokenizer Training

T5 uses a Unigram tokenizer; the input is DNA sequences, protein sequences, and English text.

The training script is `t5_token_gene_eng.py`.

Training is time-consuming and requires more than 128 GB of memory. You can use the pre-trained tokenizer directly: **trained_t5_gene_eng_tokenizer**.

## Pre-training the T5 Model

Train a T5 model from scratch on the mixed DNA, protein, and English data. The steps are:

1. Obtain the T5 configuration by running `get_t5_config.ipynb`.
2. Combine the training data by running `combine_data.ipynb`.
3. Run the training script `./run_pt.sh`. Training takes roughly 5 hours on 8x 4090 GPUs.

## Fine-tuning the T5 Model

1. Protein sequence function prediction: `t5_gene_eng_abstract_ft_protein_fun.ipynb`
2. Amazon review summarization, mainly for reference: `t5_gene_eng_abstract_ft_review.ipynb`
3. CNN article summarization, mainly for reference: `t5_gene_eng_abstract_ft_cnn.ipynb`
4. DNA-to-protein coding prediction; the results are poor, for reference only: `t5_gene_eng_abstract_ft_dna_protein.ipynb`

## Additional Experiments

Directory: `multi_trans_lab`

These experiments explore transferring summarization capability across languages and modalities, including English-to-Spanish and English-to-DNA. They are research-oriented and for reference only.

- `NC_000001.11_chapter_1.fna.p1`: partial human genome data.
- `get_dna_summary.ipynb`: generates summaries for genomic DNA sequences; the fine-tuned model can be swapped (see the fine-tuning section above).
- `get_gene_summary.ipynb`: generates summaries for coding DNA regions; the model can be swapped.
- `dna_abstract_search_bench.ipynb`: indirectly evaluates summary quality via search; results are currently poor, research ongoing.
- `abstract_trans_en_es.ipynb`: control test for transferring English summarization capability to Spanish.