---
license: apache-2.0
---

# T5 Biological Sequence + English Mixed Model

A T5-small model trained on a mixture of DNA sequences, protein sequences, and English text. It is intended primarily as a base for downstream fine-tuning tasks such as sequence function prediction.

## Tokenizer Training

T5 uses a Unigram tokenizer. Here the tokenizer is trained on DNA sequences, protein sequences, and English text.

The training script is `t5_token_gene_eng.py`.

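As a rough illustration of what training a Unigram tokenizer on mixed DNA, protein, and English text can look like, here is a minimal sketch using the Hugging Face `tokenizers` library. It is not the contents of `t5_token_gene_eng.py`: the function names `iter_chunks` and `train_unigram`, the chunk size, the vocabulary size, the special tokens, and the assumption that sequence files are FASTA-like (header lines starting with `>`) are all chosen for the example. Streaming chunks from an iterator rather than loading whole files is one way to keep memory use down, though the Unigram trainer itself can still be memory-hungry on large corpora.

```python
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[str], max_chars: int = 1000) -> Iterator[str]:
    """Yield cleaned text chunks from raw lines (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and FASTA-style header lines (assumed input format).
        if not line or line.startswith(">"):
            continue
        # Cut very long lines (e.g. genomic DNA) into fixed-size chunks.
        for start in range(0, len(line), max_chars):
            yield line[start:start + max_chars]


def train_unigram(lines: Iterable[str], vocab_size: int = 32000,
                  out_file: str = "tokenizer.json"):
    """Train a Unigram tokenizer on the given lines and save it as JSON."""
    # Imported here so that iter_chunks works without the `tokenizers` package.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.Unigram())
    tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
    trainer = trainers.UnigramTrainer(
        vocab_size=vocab_size,  # assumed value, not taken from the project
        special_tokens=["<pad>", "</s>", "<unk>"],
        unk_token="<unk>",
    )
    tokenizer.train_from_iterator(iter_chunks(lines), trainer=trainer)
    tokenizer.save(out_file)
    return tokenizer
```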
Training the tokenizer requires more than 128 GB of memory and can take a long time, so you may prefer to use the pre-trained tokenizer directly:

**trained_t5_gene_eng_tokenizer**

## Pre-training the T5 Model

A T5-large model was trained from scratch on a mixed dataset of DNA sequences, protein sequences, and English text. The steps are as follows:

1. Generate the T5 configuration by running `get_t5_config.ipynb`.
2. Prepare the mixed training data by running `combine_data.ipynb`.
3. Launch pre-training with `./run_pt.sh`.

Training takes approximately 5 hours on 8 NVIDIA RTX 4090 GPUs.

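To make the data-mixing step more concrete, the following is a minimal sketch of one way to sample a mixed corpus from several sources with fixed proportions. It does not reproduce `combine_data.ipynb`; the function name `mix_corpora`, the weights, and the toy sequences are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def mix_corpora(corpora: Dict[str, Sequence[str]],
                weights: Dict[str, float],
                n_samples: int,
                seed: int = 0) -> List[str]:
    """Draw n_samples lines, picking the source corpus according to weights."""
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    names = list(corpora)
    probs = [weights[name] for name in names]
    mixed = []
    for _ in range(n_samples):
        # Choose a source (DNA, protein, English, ...) then a line from it.
        name = rng.choices(names, weights=probs, k=1)[0]
        mixed.append(rng.choice(corpora[name]))
    return mixed


if __name__ == "__main__":
    toy = {
        "dna": ["ACGTTGCA", "TTGACCAG"],
        "protein": ["MKVLAAGIV"],
        "english": ["The gene encodes a kinase."],
    }
    # Assumed proportions for illustration only.
    sample = mix_corpora(toy, {"dna": 0.4, "protein": 0.3, "english": 0.3}, 5)
    for line in sample:
        print(line)
```

Sampling with replacement keeps the sketch short; a real pipeline would more likely shuffle and concatenate full datasets, or stream them from disk.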
## Fine-tuning the T5 Model

1. **Protein Function Prediction**: `t5_gene_eng_abstract_ft_protein_fun.ipynb`
2. **Amazon Review Summarization** (for reference): `t5_gene_eng_abstract_ft_review.ipynb`
3. **CNN Article Summarization** (for reference): `t5_gene_eng_abstract_ft_cnn.ipynb`
4. **DNA-Protein Coding Prediction** (experimental; performance is poor, for reference only): `t5_gene_eng_abstract_ft_dna_protein.ipynb`

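T5 treats every task as text-to-text, so fine-tuning data for a task like protein function prediction is typically expressed as input/target string pairs. The sketch below shows one possible way to build such pairs; the function name `to_text_pairs`, the task prefix, and the field names are hypothetical and are not taken from the notebooks above.

```python
from typing import Dict, List, Tuple


def to_text_pairs(records: List[Tuple[str, str]],
                  prefix: str = "protein function: ") -> List[Dict[str, str]]:
    """Turn (sequence, description) pairs into T5-style input/target strings."""
    return [
        {
            # The prefix tells the model which task the input belongs to.
            "input_text": prefix + sequence.strip().upper(),
            "target_text": description.strip(),
        }
        for sequence, description in records
    ]


if __name__ == "__main__":
    # Toy record for illustration only.
    pairs = to_text_pairs([("mkvlaagiv", "ATP-binding protein")])
    print(pairs[0]["input_text"])
    print(pairs[0]["target_text"])
```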
## Additional Experiments

Directory: `multi_trans_lab`

This directory contains experiments on cross-modal and cross-lingual transfer, such as English-to-Spanish summarization and English-to-DNA sequence generation. They are research-oriented and provided for academic reference only.

- `NC_000001.11_chapter_1.fna.p1`: Partial human genome sequence data.
- `get_dna_summary.ipynb`: Generates summaries for genomic DNA sequences (works with different fine-tuned models; see the fine-tuning section above).
- `get_gene_summary.ipynb`: Generates summaries for coding DNA regions (the model can be swapped).
- `dna_abstract_search_bench.ipynb`: Evaluates summary quality indirectly via search-based methods. Results are currently poor; this is ongoing research.
- `abstract_trans_en_es.ipynb`: Baseline test for transferring English summarization capability to Spanish.