---
license: apache-2.0
---

# BioBridge: Bridging Proteins and Language for Enhanced Biological Reasoning with LLMs

## Paper Introduction
BioBridge is a domain-adaptive continual pretraining framework designed to fuse the strengths of Protein Language Models (PLMs) and general-purpose Large Language Models (LLMs). It addresses two core challenges in biological reasoning:
1. The **biological knowledge barrier** of general LLMs (lack of domain-specific protein understanding).
2. The **poor generalization** of specialized PLMs (limited adaptability to multi-task scenarios).

Key innovations of BioBridge include:
- **Domain-Incremental Continual Pre-training (DICP)**: Infuses biomedical knowledge into LLMs via specialized corpora (e.g., biology textbooks, PubMed articles) while mitigating catastrophic forgetting of general language capabilities.
- **PLM-Projector Module**: Uses ESM2 (a state-of-the-art PLM) as a protein encoder and a cross-modal projector to map protein sequence embeddings into the LLM's semantic space, enabling effective protein-text alignment.
- **End-to-End Optimization**: Unifies the pre-training and alignment stages to support multi-task biological reasoning (e.g., protein property prediction, knowledge question answering) without task-specific retraining.

Extensive experiments show that BioBridge performs comparably to mainstream PLMs (e.g., ESM2) on protein benchmarks (PFMBench) while maintaining strong general language capabilities on benchmarks such as MMLU and RACE, demonstrating its value in balancing domain adaptability with general reasoning competency.

## Installation
1. Clone the repository:
```bash
git clone https://github.com/Yuccaaa/biobridge.git
cd biobridge
```

2. Install dependencies (Python 3.10):
```bash
# Install FlashAttention from a prebuilt wheel (replace with the actual file name)
pip install flash_attn-2.5.8-cp310-cp310-linux_x86_64.whl
# Install LAVIS and evaluation utilities
pip install rouge_score nltk salesforce-lavis
# Install the remaining dependencies
pip install -U transformers pytorch-lightning
```

## Data
The training data for BioBridge integrates multiple sources to ensure comprehensive biomedical coverage and retention of general reasoning ability. For detailed data collection, preprocessing pipelines, and format specifications, refer to the **Materials and Methods** section of the original paper. Key data sources include:
- Biomedical corpora: biology textbooks, PubMed Central articles/abstracts, and sequence-augmented sentences (via BERN2 named entity recognition).
- Protein-text pairs: 90K Swiss-Prot entries and 422K OntoProtein pairs (covering molecular functions and biological processes).
- General reasoning data: the Mixture of Thoughts (MoT) corpus (93K math, 83K code, and 173K scientific problems) to prevent catastrophic forgetting.

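For illustration only, a protein-text pair from such corpora can be pictured as a JSONL record like the one below. The field names and values here are hypothetical — the actual schema is defined in the paper's Materials and Methods:

```shell
# Write a purely illustrative protein-text pair record; the "sequence" and
# "text" field names and values are hypothetical, not the real schema.
cat > example_pair.jsonl <<'EOF'
{"sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "text": "Example functional annotation of the protein."}
EOF
```
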
## Usage
Modify the experimental settings (model hyperparameters, data paths, training configurations) in `configure.py` before running the code. The framework comprises three core training stages, outlined below.

### 1. Domain-Incremental Continual Pre-training (DICP)
This stage adapts the base LLM (Qwen2.5-7B-Instruct) to biomedical data while preserving general language capabilities. The pre-training implementation is based on ModelScope's SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) framework: https://github.com/modelscope/swift

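The repository does not pin an exact SWIFT command. A hypothetical continual pre-training invocation might look as follows — the model name comes from this README, but the dataset path, flags, and output directory are placeholders, and the exact interface depends on your SWIFT version (consult the SWIFT documentation):

```shell
# Hypothetical SWIFT continual pre-training invocation; the dataset path
# and output directory are placeholders to adapt to your setup.
swift pt \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset /path/to/biomedical_corpus.jsonl \
    --output_dir ./dicp_checkpoints
```
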
### 2. PLM-Projector Cross-Modal Alignment
Uses ESM2 (a frozen protein encoder) and a Q-Former to align protein embeddings with the LLM's semantic space via contrastive learning.
- Run command:
```bash
python stage1.py \
--devices $DEVICES \
--mode $MODE \
--filename $FILENAME \
--num_query_token $NUM_QUERY_TOKEN \
--plm_name $PLM_NAME \
--bert_name $BERT_NAME \
--save_every_n_epochs $SAVE_EVERY \
--batch_size $BATCH_SIZE \
--precision $PRECISION \
--mix_dataset \
--num_workers $NUM_WORKERS \
--strategy $STRATEGY \
--use_wandb_logger
```

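The placeholder variables in the stage1.py command must be defined before running it. The values below are a sketch only — the ESM2 checkpoint name, Q-Former text encoder, and Lightning strategy are assumptions to substitute with your own:

```shell
# Illustrative values for the stage1.py placeholders; every value here is
# an assumption to adapt to your environment.
DEVICES='0,1,2,3'
MODE='train'
FILENAME='stage1_run'
NUM_QUERY_TOKEN=8                        # matches the stage2 example in this README
PLM_NAME='facebook/esm2_t30_150M_UR50D'  # hypothetical ESM2 checkpoint
BERT_NAME='bert-base-uncased'            # hypothetical Q-Former text encoder
SAVE_EVERY=2
BATCH_SIZE=4
PRECISION='bf16-mixed'
NUM_WORKERS=8
STRATEGY='ddp'                           # hypothetical PyTorch Lightning strategy
```
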
### 3. End-to-End Fine-Tuning
Unifies the pre-trained LLM and the alignment module to enable multi-task biological reasoning (no downstream task-specific data required).
- Run command:
```bash
python stage2.py \
--devices '0,1,2,3,4,5,6,7' \
--mode train \
--filename stage2_07301646_2datasets_construct \
--num_query_token 8 \
--save_every_n_epochs 2 \
--max_epochs 10 \
--batch_size 4 \
--precision 'bf16-mixed' \
--num_workers 8 \
--plm_model /nas/shared/kilab/wangyujia/ProtT3/plm_model/esm2-150m \
--bert_name /nas/shared/kilab/wangyujia/ProtT3/plm_model/microsoft \
--llm_name /oss/wangyujia/BIO/construction_finetuning/alpaca/v1-20250609-141541/checkpoint-50-merged \
--llm_tune mid_lora \
--stage1_path /nas/shared/kilab/wangyujia/ProtT3/all_checkpoints/stage1_07041727_2dataset/epoch=29.ckpt/converted.ckpt \
--use_wandb_logger \
--dataset swiss-prot
```

97
+
98
+ ### Model Weights
99
+ Pretrained and fine-tuned model weights are available for download at:
100
+ https://huggingface.co/yuccaaa/biobridge
101
+
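One way to fetch the weights locally is with the standard Hugging Face CLI; the target directory below is arbitrary:

```shell
# Download the released weights with the Hugging Face CLI
# (pip install -U huggingface_hub); the local directory is arbitrary.
huggingface-cli download yuccaaa/biobridge --local-dir ./biobridge_weights
```
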