---
license: apache-2.0
---

# T5 Biological Sequence + English Mixed Model

A T5 model was trained on a mixture of DNA sequences, protein sequences, and English text, primarily for downstream fine-tuning tasks such as sequence function prediction.

## Tokenizer Training

T5 uses a Unigram tokenizer. The tokenizer training data consists of DNA sequences, protein sequences, and English text.

The training script is `t5_token_gene_eng.py`.

Tokenizer training requires more than 128 GB of memory and can be time-consuming, so you may use the pre-trained tokenizer directly: **trained_t5_gene_eng_tokenizer**.
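The training script itself is not reproduced here, but training a Unigram tokenizer over a mixed corpus can be sketched with the Hugging Face `tokenizers` library. The corpus lines, vocabulary size, and special tokens below are illustrative assumptions, not the actual training setup:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Illustrative mini-corpus mixing DNA, protein, and English text
# (the real training data is far larger).
corpus = [
    "ATGGCGTACGTTAGCCGTAACGGTTACG",
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "the model was trained on mixed biological and natural language data",
] * 50

tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.UnigramTrainer(
    vocab_size=100,  # toy size; a real vocabulary is much larger
    special_tokens=["<pad>", "</s>", "<unk>"],
    unk_token="<unk>",
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("ATGGCG MKTAYI mixed text").tokens)
```

In practice the trained tokenizer is saved and reloaded for pre-training, which is why the pre-trained **trained_t5_gene_eng_tokenizer** is provided.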

## Pre-training the T5 Model

A T5-large model was trained from scratch on the mixed dataset of DNA sequences, protein sequences, and English text. The steps are as follows:

1. Obtain the T5 configuration by running `get_t5_config.ipynb`.
2. Prepare the mixed training data by running `combine_data.ipynb`.
3. Launch the pre-training script: `./run_pt.sh`.

Training takes approximately 5 hours on 8x NVIDIA 4090 GPUs.
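The notebook `get_t5_config.ipynb` is not shown here; a minimal sketch of what building and saving a from-scratch T5 configuration with `transformers` might look like follows. The hyperparameter values and output path are illustrative assumptions, not necessarily the repo's actual settings:

```python
from transformers import T5Config

# Illustrative hyperparameters (assumed; not necessarily the repo's values).
config = T5Config(
    vocab_size=32128,          # must match the trained tokenizer's vocabulary size
    d_model=1024,
    d_ff=4096,
    num_layers=24,
    num_heads=16,
    decoder_start_token_id=0,  # T5 starts decoding from the pad token
)

# Save config.json so the pre-training script can load it.
config.save_pretrained("t5_gene_eng_config")
```

A key detail when pre-training from scratch is that `vocab_size` must match the tokenizer trained in the previous section, otherwise embedding lookups will go out of range.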

## Fine-tuning the T5 Model

1. **Protein Function Prediction**: `t5_gene_eng_abstract_ft_protein_fun.ipynb`
2. **Amazon Review Summarization** (for reference): `t5_gene_eng_abstract_ft_review.ipynb`
3. **CNN Article Summarization** (for reference): `t5_gene_eng_abstract_ft_cnn.ipynb`
4. **DNA-Protein Coding Prediction** (experimental; performance is poor, for reference only): `t5_gene_eng_abstract_ft_dna_protein.ipynb`
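T5 casts every fine-tuning task as text-to-text. The notebooks above are not reproduced here, but the data formatting for a task like protein function prediction can be sketched as follows; the task prefix, helper name, and example sequence are assumptions for illustration and may differ from the actual notebook:

```python
# Hypothetical text-to-text formatting for protein function prediction.
# The prefix used in the actual fine-tuning notebook may differ.
def make_example(sequence: str, function: str) -> dict:
    return {
        "input_text": f"predict protein function: {sequence}",
        "target_text": function,
    }

example = make_example(
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",     # toy protein sequence
    "DNA-binding transcriptional regulator",  # toy label
)
print(example["input_text"])
```

Because every task is reduced to string-to-string pairs like this, the same pre-trained checkpoint can be fine-tuned on protein function, review summarization, or article summarization without architectural changes.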

## Additional Experiments

Directory: `multi_trans_lab`

This directory contains experimental tasks exploring cross-modal and cross-lingual transfer, such as English-to-Spanish summarization and even English-to-DNA sequence generation. These are research-oriented and provided for academic reference only.

- `NC_000001.11_chapter_1.fna.p1`: Partial human genome sequence data.
- `get_dna_summary.ipynb`: Generates summaries for genomic DNA sequences (different fine-tuned models can be used; see the fine-tuning section above).
- `get_gene_summary.ipynb`: Generates summaries for coding DNA regions (the model can be swapped).
- `dna_abstract_search_bench.ipynb`: Indirectly evaluates summary quality via search-based methods. Results are currently poor; this is ongoing research.
- `abstract_trans_en_es.ipynb`: Baseline test for transferring English summarization capability to Spanish.
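The summarization notebooks feed long genomic sequences to a model with a bounded input length. A minimal sketch of reading FASTA-style records (the format of files like the `.fna` data above) and splitting them into fixed-size chunks might look like this; the helper names, toy record, and chunk size are illustrative assumptions:

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into {header: sequence}."""
    records, header, parts = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(parts)
            header, parts = line[1:], []
        elif line:
            parts.append(line)
    if header is not None:
        records[header] = "".join(parts)
    return records

def chunk_sequence(seq: str, size: int = 512) -> list:
    """Split a long sequence into model-sized pieces."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

# Toy record standing in for real genome data.
fasta = ">toy_record\nATGCGT" + "ACGT" * 300
records = parse_fasta(fasta)
chunks = chunk_sequence(records["toy_record"], size=512)
print(len(chunks), len(chunks[0]))  # → 3 512
```

Each chunk can then be prefixed and summarized independently, with the per-chunk outputs concatenated or re-summarized downstream.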