marisming committed on
Commit 9133bd5 · verified · 1 Parent(s): ba38a19

Update README.md

Files changed (1)
  1. README.md +39 -3
README.md CHANGED
@@ -1,3 +1,39 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+ # T5 Biological Sequence + English Mixed Model
+
+ A T5-small model trained on a mixture of DNA sequences, protein sequences, and English text. It is intended primarily as a base for downstream fine-tuning tasks such as sequence function prediction.
+
+ ## Tokenizer Training
+ T5 uses a Unigram tokenizer. Here it is trained on DNA sequences, protein sequences, and English text.
+
+ The training script is `t5_token_gene_eng.py`.
+
+ Tokenizer training requires more than 128 GB of memory and can be time-consuming.
+ You can instead use the pre-trained tokenizer directly:
+
+ **trained_t5_gene_eng_tokenizer**
+
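To illustrate what a Unigram tokenizer does once it has been trained, here is a minimal pure-Python sketch of the segmentation step: given a vocabulary of pieces with log-probabilities, it picks the split with the highest total score using a Viterbi search. This is not the code in `t5_token_gene_eng.py`, and it does not show vocabulary learning. The function name, the toy vocabulary, and its scores are all hypothetical, and real implementations such as SentencePiece handle whitespace and unknown characters differently.

```python
import math


def unigram_segment(text, log_probs, max_piece_len=8, unk_log_prob=-20.0):
    """Return the highest-scoring segmentation of `text` under a unigram model.

    `log_probs` maps each vocabulary piece to its log-probability.
    Characters missing from the vocabulary fall back to single-character
    pieces with the (assumed) penalty `unk_log_prob`.
    """
    n = len(text)
    # best[i] = (best score for text[:i], start index of the last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_piece_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = log_probs[piece]
            elif end - start == 1:
                score = unk_log_prob  # unknown single character
            else:
                continue
            total = best[start][0] + score
            if total > best[end][0]:
                best[end] = (total, start)
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Hypothetical toy vocabulary mixing DNA pieces and English pieces.
TOY_LOG_PROBS = {
    "A": -3.0, "C": -3.0, "G": -3.0, "T": -3.0,
    "ATG": -4.0, "GCC": -4.5, "TAA": -4.5,
    "the": -2.0, "gene": -3.5, " ": -1.5,
}

print(unigram_segment("ATGGCCTAA", TOY_LOG_PROBS))  # ['ATG', 'GCC', 'TAA']
print(unigram_segment("the gene", TOY_LOG_PROBS))   # ['the', ' ', 'gene']
```

A single shared vocabulary of this kind is what lets one model read DNA, protein, and English input: each string is split into whichever pieces score best, regardless of which kind of data it came from.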
+ ## Pre-training the T5 Model
+ The T5 model was trained from scratch on a mixed dataset of DNA sequences, protein sequences, and English text. The steps are as follows:
+ 1. Obtain the T5 configuration by running `get_t5_config.ipynb`.
+ 2. Prepare the mixed training data by running `combine_data.ipynb`.
+ 3. Launch pre-training with `./run_pt.sh`.
+ Training takes approximately 5 hours on 8x NVIDIA 4090 GPUs.
+
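The data-mixing step can be pictured with a small sketch. The contents of `combine_data.ipynb` are not shown in this commit, so the code below only illustrates one plausible approach: pooling records from several sources and shuffling them with a fixed seed, so that training batches interleave DNA, protein, and English examples. The function name `mix_corpora` and the sample records are hypothetical, and the real notebook may weight or sample the sources differently.

```python
import random


def mix_corpora(corpora, seed=0):
    """Pool records from several named corpora and shuffle them reproducibly.

    `corpora` maps a source name (e.g. "dna") to a list of text records.
    Returns a list of (source_name, record) pairs in shuffled order.
    """
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    mixed = [
        (name, record)
        for name, records in corpora.items()
        for record in records
    ]
    rng.shuffle(mixed)
    return mixed


# Hypothetical example records, for illustration only.
example = {
    "dna": ["ATGGCCTAA", "GGCCTTAA"],
    "protein": ["MKVLAA"],
    "english": ["The gene encodes a protein."],
}

for source, record in mix_corpora(example, seed=42):
    print(f"{source}\t{record}")
```

Keeping the source name next to each record makes it easy to check the final mixture ratio before launching a multi-hour training run.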
+ ## Fine-tuning the T5 Model
+ 1. **Protein Function Prediction**: `t5_gene_eng_abstract_ft_protein_fun.ipynb`
+ 2. **Amazon Review Summarization** (for reference): `t5_gene_eng_abstract_ft_review.ipynb`
+ 3. **CNN Article Summarization** (for reference): `t5_gene_eng_abstract_ft_cnn.ipynb`
+ 4. **DNA-Protein Coding Prediction** (experimental, poor performance, for reference only): `t5_gene_eng_abstract_ft_dna_protein.ipynb`
+
+ ## Additional Experiments
+ Directory: `multi_trans_lab`
+ This directory contains experimental tasks exploring cross-modal and cross-lingual transfer, such as English-to-Spanish summarization and English-to-DNA sequence generation. They are research-oriented and provided for academic reference only.
+
+ - `NC_000001.11_chapter_1.fna.p1`: Partial human genome sequence data.
+ - `get_dna_summary.ipynb`: Generates summaries for genomic DNA sequences (works with any of the fine-tuned models from the fine-tuning section above).
+ - `get_gene_summary.ipynb`: Generates summaries for coding DNA regions (the model can be swapped).
+ - `dna_abstract_search_bench.ipynb`: Indirectly evaluates summary quality via search-based methods. Results are currently poor; research is ongoing.
+ - `abstract_trans_en_es.ipynb`: Baseline test for transferring English summarization capability to Spanish.