marisming committed on
Commit 26f5689 · verified · 1 Parent(s): a5e200c

Update README.md

Files changed (1)
  1. README.md +66 -61
README.md CHANGED
@@ -1,62 +1,67 @@
- ---
- license: apache-2.0
- ---
- DNAGPT2: The Best Beginner's Guide to Gene Sequence Large Language Models
-
- ### 1. Overview
- Large language models have long since moved beyond NLP research and become a cornerstone of AI for science. Of all the data in bioinformatics, gene sequences most closely resemble natural language, which has made applying large models to biological sequences an active research direction in recent years. The 2024 Nobel Prize in Chemistry, awarded in part for AlphaFold's protein structure prediction, has further highlighted this path for biological research.
-
- However, for most biologists, large models remain unfamiliar territory. Until 2023, models like GPT were a niche topic within NLP research and gained public attention only with the emergence of ChatGPT.
-
- Most research combining biology and large models has appeared since 2023, but the wide interdisciplinary gap means these studies are typically collaborations by large companies and teams. Reproducing or learning from this work is difficult for many researchers, as the GitHub issue sections of top papers show.
-
- On the one hand, large models are almost certain to shape the future of biological research; on the other, many researchers hesitate at the threshold. A bridge across this gap is urgently needed.
-
- DNAGPT2 serves as this bridge, aiming to help more biologists overcome the large-model barrier and use these powerful tools to advance their work.
-
- ### 2. Tutorial Characteristics
- This tutorial is characterized by:
-
- 1. **Simplicity**: Simple code built entirely with Hugging Face’s standard libraries.
- 2. **Simplicity**: Basic theoretical explanations, fully illustrated.
- 3. **Simplicity**: Classic, easy-to-understand case studies from papers.
-
- Despite its simplicity, the tutorial is comprehensive: it covers building tokenizers, constructing GPT and BERT models from scratch, fine-tuning LLaMA models, basic DeepSpeed multi-GPU distributed training, and applying SOTA models such as LucaOne and ESM3. It works through typical biological tasks such as sequence classification, structure prediction, and regression analysis, building up step by step.
-
- ### Target Audience:
- 1. Researchers and students in the field of biology, especially bioinformatics.
- 2. Beginners learning large models; the material applies beyond biology.
-
- ### 3. Tutorial Outline
- #### 1 Data and Environment
- 1.1 Introduction to Large Model Runtime Environments
- 1.2 Gene-Related Pre-training and Fine-tuning Data
- 1.3 Basic Usage of the Datasets Library
-
- #### 2 Building DNA GPT2/BERT Large Models from Scratch
- 2.1 Building a DNA Tokenizer
- 2.2 Training a DNA GPT2 Model from Scratch
- 2.3 Training a DNA BERT Model from Scratch
- 2.4 Feature Extraction for Biological Sequences Using Gene Large Models
- 2.5 Building Large Models Based on Multimodal Data
-
- #### 3 Biological Sequence Tasks Using Gene Large Models
- 3.1 Sequence Classification Task
- 3.2 Structure Prediction Task
- 3.3 Multi-sequence Interaction Analysis
- 3.4 Function Prediction Task
- 3.5 Regression Tasks
-
- #### 4 Entering the ChatGPT Era: Gene Instruction Building and Fine-tuning
- 4.1 Expanding the LLaMA Vocabulary Based on Gene Data
- 4.2 Introduction to DeepSpeed Distributed Training
- 4.3 Continued Pre-training of the LLaMA Model Based on Gene Data
- 4.4 Classification Task Using the LLaMA-gene Large Model
- 4.5 Instruction Fine-tuning Based on the LLaMA-gene Large Model
-
- #### 5 Overview of SOTA Large Model Applications in Biology
- 5.1 Application of DNABERT2
- 5.2 Usage of LucaOne
- 5.3 Usage of ESM3
- 5.4 Application of MedGPT
+ ---
+ license: apache-2.0
+ ---
+ DNAGPT2: The Best Beginner's Guide to Gene Sequence Large Language Models
+
+ ### 1. Overview
+ Large language models have long since moved beyond NLP research and become a cornerstone of AI for science. Of all the data in bioinformatics, gene sequences most closely resemble natural language, which has made applying large models to biological sequences an active research direction in recent years. The 2024 Nobel Prize in Chemistry, awarded in part for AlphaFold's protein structure prediction, has further highlighted this path for biological research.
+
+ However, for most biologists, large models remain unfamiliar territory. Until 2023, models like GPT were a niche topic within NLP research and gained public attention only with the emergence of ChatGPT.
+
+ Most research combining biology and large models has appeared since 2023, but the wide interdisciplinary gap means these studies are typically collaborations by large companies and teams. Reproducing or learning from this work is difficult for many researchers, as the GitHub issue sections of top papers show.
+
+ On the one hand, large models are almost certain to shape the future of biological research; on the other, many researchers hesitate at the threshold. A bridge across this gap is urgently needed.
+
+ DNAGPT2 serves as this bridge, aiming to help more biologists overcome the large-model barrier and use these powerful tools to advance their work.
+
+ ### Video Lectures
+ Video lectures (DNAGPT2: An Introduction to Large-Scale Models for Gene Sequences):
+
+ https://www.bilibili.com/video/BV1zEktYXEPK/?vd_source=ecb3dac8a5835b71df2462be9d8e102e
+
+ ### 2. Tutorial Characteristics
+ This tutorial is characterized by:
+
+ 1. **Simplicity**: Simple code built entirely with Hugging Face’s standard libraries.
+ 2. **Simplicity**: Basic theoretical explanations, fully illustrated.
+ 3. **Simplicity**: Classic, easy-to-understand case studies from papers.
+
+ Despite its simplicity, the tutorial is comprehensive: it covers building tokenizers, constructing GPT and BERT models from scratch, fine-tuning LLaMA models, basic DeepSpeed multi-GPU distributed training, and applying SOTA models such as LucaOne and ESM3. It works through typical biological tasks such as sequence classification, structure prediction, and regression analysis, building up step by step.
+
+ ### Target Audience:
+ 1. Researchers and students in the field of biology, especially bioinformatics.
+ 2. Beginners learning large models; the material applies beyond biology.
+
+ ### 3. Tutorial Outline
+ #### 1 Data and Environment
+ 1.1 Introduction to Large Model Runtime Environments
+ 1.2 Gene-Related Pre-training and Fine-tuning Data
+ 1.3 Basic Usage of the Datasets Library
+
+ #### 2 Building DNA GPT2/BERT Large Models from Scratch
+ 2.1 Building a DNA Tokenizer
+ 2.2 Training a DNA GPT2 Model from Scratch
+ 2.3 Training a DNA BERT Model from Scratch
+ 2.4 Feature Extraction for Biological Sequences Using Gene Large Models
+ 2.5 Building Large Models Based on Multimodal Data
+
+ #### 3 Biological Sequence Tasks Using Gene Large Models
+ 3.1 Sequence Classification Task
+ 3.2 Structure Prediction Task
+ 3.3 Multi-sequence Interaction Analysis
+ 3.4 Function Prediction Task
+ 3.5 Regression Tasks
+
+ #### 4 Entering the ChatGPT Era: Gene Instruction Building and Fine-tuning
+ 4.1 Expanding the LLaMA Vocabulary Based on Gene Data
+ 4.2 Introduction to DeepSpeed Distributed Training
+ 4.3 Continued Pre-training of the LLaMA Model Based on Gene Data
+ 4.4 Classification Task Using the LLaMA-gene Large Model
+ 4.5 Instruction Fine-tuning Based on the LLaMA-gene Large Model
+
+ #### 5 Overview of SOTA Large Model Applications in Biology
+ 5.1 Application of DNABERT2
+ 5.2 Usage of LucaOne
+ 5.3 Usage of ESM3
+ 5.4 Application of MedGPT
  5.5 Application of LLaMA-gene