DennisHuang648 committed on
Commit 7e2b0e8 · verified · 1 Parent(s): a86a523

Update README.md

Files changed (1)
  1. README.md +258 -3
README.md CHANGED
@@ -1,3 +1,258 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - zh
+ pipeline_tag: text-classification
+ ---
+ <div align="center">
+ <h1>SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules</h1>
+
+ <a href='SciCore_Mol_Technical_Report.pdf'><img src='https://img.shields.io/badge/Paper-Technical_Report-red'></a>
+ <a href='https://huggingface.co/openbmb/SciCore-Mol'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a>
+ <a href='https://chenyxyx-scicore-mol.hf.space'><img src='https://img.shields.io/badge/Demo-Hugging_Face_Space-0f8b7d'></a>
+
+ **Yuxuan Chen**<sup>1</sup>, **Changwei Lv**<sup>2</sup>, **Yunduo Xiao**<sup>2</sup>, **Yukun Yan**<sup>2</sup>, **Zheni Zeng**<sup>*3</sup>, and **Zhiyuan Liu**<sup>2</sup>
+
+ <sup>1</sup>School of Electronic and Computer Engineering, Peking University, Shenzhen, China<br>
+ <sup>2</sup>Tsinghua University, Beijing, China<br>
+ <sup>3</sup>School of Intelligence Science and Technology, Nanjing University, Nanjing, China<br>
+ <sup>*</sup>Corresponding author: zengzn@nju.edu.cn
+
+ </div>
+
+ ## 📖 Introduction
+
+ Large language models (LLMs) are increasingly used in professional domains, but they face a fundamental tension when handling heterogeneous scientific data: LLMs are designed for discrete, sequential natural-language symbols, whereas scientific entities such as molecules are inherently topological and geometric. Forcing these structures into linear text inevitably loses information and introduces semantic noise that interferes with the LLM's reasoning.
+
+ We propose **SciCore-Mol**, a paradigm that augments an LLM with pluggable external cognitive modules: a **GVP encoder**, a **diffusion generator**, and a **numerically sensitive Transformer** (the Reaction Transformer). This architecture preserves the LLM's general capabilities while adding specialized molecular perception. Through a two-stage alignment mechanism, the external modules are invoked via special tokens and fused at the hidden-state level, so the LLM can understand molecular information in depth without sacrificing its core reasoning process.
+
+ <p align="center"><img src="figs/fig3.png" width="90%"></p>
+
+ ## ⚙️ Setup
+
+ ### Prerequisites
+
+ - Python 3.10
+ - CUDA 12.1
+ - 8x A800/A100 80GB GPUs (recommended for full training)
+
+ ### Installation
+
+ ```bash
+ git clone https://github.com/ChenYX24/SciCore-Mol.git
+ cd SciCore-Mol
+
+ # Option A: Install with uv (recommended)
+ pip install uv
+ uv sync
+ uv sync --extra graph      # GVP-GNN dependencies (torch-geometric, torch-scatter, torch-cluster)
+ uv sync --extra flashattn  # FlashAttention (requires CUDA)
+ uv sync --group train      # DeepSpeed for distributed training
+
+ # Option B: Install with pip
+ python -m venv .venv
+ source .venv/bin/activate
+ pip install -e .
+ pip install -e ".[graph]"      # optional: GVP-GNN
+ pip install -e ".[flashattn]"  # optional: FlashAttention
+ pip install deepspeed swanlab  # optional: distributed training
+ ```
+
+ ### Environment Variables
+
+ ```bash
+ cp configs/env.example.sh configs/env.sh
+ # Edit configs/env.sh to set your paths, then:
+ source configs/env.sh
+ ```
+
+ | Variable | Description |
+ |----------|-------------|
+ | `SCICORE_ROOT` | Project root directory |
+ | `MODEL_DIR` | Base model directory (e.g., Qwen3-8B) |
+ | `CHECKPOINT_DIR` | Trained checkpoint directory |
+ | `DATA_DIR` | Training and evaluation data |
+ | `GVP_CHECKPOINT` | Pretrained GVP-GNN weights |
+ | `OPENAI_API_KEY` | API key for GPT baseline evaluation |
+
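The variables in the table above could be set and sanity-checked along the following lines. This is only a sketch: the actual contents of `configs/env.example.sh` are not shown here, so every path, the `gvp.pt` file name, and the `check_env` helper are hypothetical placeholders rather than the repository's real values.

```bash
# Hypothetical configs/env.sh; all paths below are placeholders to adapt.
export SCICORE_ROOT="${SCICORE_ROOT:-$HOME/SciCore-Mol}"
export MODEL_DIR="${MODEL_DIR:-$SCICORE_ROOT/models}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:-$SCICORE_ROOT/checkpoints}"
export DATA_DIR="${DATA_DIR:-$SCICORE_ROOT/data}"
export GVP_CHECKPOINT="${GVP_CHECKPOINT:-$CHECKPOINT_DIR/gvp.pt}"
# OPENAI_API_KEY is only needed for the GPT baseline, so it is not set here.

# check_env (hypothetical helper): report every named variable that is
# unset or empty on stderr, and return non-zero if any was missing.
check_env() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_env SCICORE_ROOT MODEL_DIR CHECKPOINT_DIR DATA_DIR GVP_CHECKPOINT
```

Running the check right after `source configs/env.sh` surfaces a forgotten path before a long training job starts, instead of partway through it.
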
+ ## 🔧 Training
+
+ SciCore-Mol follows a **three-stage training pipeline** (see figure above):
+
+ ### Stage 1: Component Pre-training
+
+ Pre-train each component independently before joint training.
+
+ - **GVP Encoder + MLP Adapter**: Align GVP molecular embeddings to LLM hidden space.
+   ```bash
+   bash scripts/run/gvp_mlp_pretrain_qwen.sh
+   ```
+ - **Reaction Transformer (Layer2)**: Train on reaction data for yield prediction and embedding reconstruction.
+   ```bash
+   python scripts/layer2/train_layer2.py \
+       --config scripts/layer2/layer2_train_config.yaml
+   ```
+
+ ### Stage 2: Cross-Modal Alignment Training
+
+ Joint SFT training with all modules connected. The LLM learns to invoke external modules via special `<mol>` tokens.
+
+ ```bash
+ # Configure training in configs/qwen3_sft_epoch2_1.yaml
+ # Uses DeepSpeed ZeRO-3 for multi-GPU training
+ torchrun --nproc_per_node=4 \
+     cotrain_llm_diffusion/train_step1_llm.py \
+     --config configs/qwen3_sft_epoch2_1.yaml
+ ```
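
The command above hard-codes four processes, while the prerequisites recommend eight GPUs for full training. One way to keep the process count in step with the hardware is sketched below, assuming GPUs are selected through a comma-separated `CUDA_VISIBLE_DEVICES`. The `count_visible_gpus` helper is a hypothetical name, not part of the repository, and the final line only prints the resulting command rather than launching it.

```bash
# count_visible_gpus (hypothetical helper): number of comma-separated
# entries in CUDA_VISIBLE_DEVICES, or 0 when it is unset or empty.
count_visible_gpus() {
  devices="${CUDA_VISIBLE_DEVICES:-}"
  if [ -z "$devices" ]; then
    echo 0
    return
  fi
  # N entries are separated by N-1 commas, so count commas and add one.
  commas=$(printf '%s' "$devices" | tr -cd ',' | wc -c)
  echo $((commas + 1))
}

NPROC=$(count_visible_gpus)
# Dry run: show the launch command with the derived process count.
echo "torchrun --nproc_per_node=${NPROC} cotrain_llm_diffusion/train_step1_llm.py --config configs/qwen3_sft_epoch2_1.yaml"
```

With `CUDA_VISIBLE_DEVICES=0,1,2,3` exported, the printed command matches the four-process invocation shown earlier.
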
108
+
109
+ **Key config fields** (in `configs/qwen3_sft_epoch2_*.yaml`):
110
+ - `paths.llm_name_or_path`: Base LLM checkpoint
111
+ - `paths.gnn_state_dict_path`: Pretrained GVP weights
112
+ - `paths.deepspeed_config`: DeepSpeed config (ZeRO-2 or ZeRO-3)
113
+ - `training.freeze_strategy`: Control which modules are frozen/trainable
114
+
+ ### Stage 3: Task-Specific Fine-tuning
+
+ Fine-tune Layer2 (Reaction Transformer) on downstream tasks with configurable module freezing:
+
+ ```bash
+ python scripts/layer2/train_layer2.py \
+     --config scripts/layer2/layer2_train_config_stage2_v7b.yaml
+ ```
+
+ After training, split the checkpoint into LLM and extra components:
+ ```bash
+ python scripts/ckpt/split_llm_extras.py \
+     --checkpoint_path ${CHECKPOINT_DIR}/your-checkpoint/ \
+     --output_dir ${CHECKPOINT_DIR}/your-checkpoint/
+ ```
+
+ ## 📊 Evaluation
+
+ ### ChemBench4K (Product / Retrosynthesis / Yield / Captioning)
+
+ ```bash
+ # Evaluate all 5 tasks with logprob scoring
+ bash scripts/run/run_chembench_all_tasks.sh
+
+ # Or run individual tasks:
+ python scripts/eval/eval_layer2_chembench.py \
+     --checkpoint_dir ${CHECKPOINT_DIR}/your-checkpoint \
+     --task product \
+     --output_dir eval_results/chembench/
+ ```
+
+ ### MMLU Chemistry Subsets (5 subjects)
+
+ ```bash
+ python scripts/eval/eval_mmlu_interns1mini_5subsets.py \
+     --model_path ${MODEL_DIR}/your-model \
+     --output_dir eval_results/mmlu/
+ ```
+
+ ### ORD Reaction Prediction (Full Pipeline)
+
+ ```bash
+ # Run Layer2-LLM integrated pipeline
+ bash scripts/layer2_llm/run_full_pipeline.sh
+
+ # Score predictions
+ python scripts/postprocess/score_only.py \
+     --pred_dir eval_results/ord/
+ ```
+
165
+ ### SMolInstruct (7 molecular tasks)
166
+
167
+ ```bash
168
+ # Automated multi-task evaluation with GPU scheduling
169
+ bash scripts/run/eval_smol_task_list.sh
170
+ ```
171
+
172
+ ### Drug Optimization (ADMET scoring)
173
+
174
+ ```bash
175
+ # LLM-based drug optimization
176
+ python eval/drug_optim/eval_admet.py \
177
+ --config eval/drug_optim/config/llm_cpt_sft.yaml
178
+
179
+ # Diffusion-based drug optimization
180
+ python eval/drug_optim/eval_diffusion.py \
181
+ --config eval/drug_optim/config/diffusion_sft.yaml
182
+ ```
183
+
+ ## 📁 Repository Structure
+
+ ```
+ SciCore-Mol/
+ ├── configs/                       # Training and DeepSpeed configs
+ │   ├── qwen3_sft_epoch2_*.yaml    # Stage 2 SFT configs
+ │   ├── deepspeed_zero*.json       # DeepSpeed ZeRO-2/3 configs
+ │   └── env.example.sh             # Environment variable template
+ ├── cotrain_llm_diffusion/         # Stage 2: LLM-Diffusion co-training
+ │   ├── train_step1_llm.py         # Joint SFT training script
+ │   └── generate_reasoning*.py     # Diffusion data generation
+ ├── eval/                          # Evaluation suite
+ │   ├── drug_optim/                # Drug optimization (ADMET)
+ │   ├── eval_smolinstruct.py       # SMolInstruct benchmark
+ │   └── eval_*.py                  # Other benchmarks
+ ├── modules/                       # Core model components
+ │   ├── mol_aware_lm.py            # MolAware language model wrapper
+ │   ├── model_init.py              # Model/tokenizer initialization
+ │   ├── data_loader.py             # Data loading & mol-span processing
+ │   ├── gnn.py                     # GVP-GNN encoder
+ │   ├── mlp.py                     # MLP adapter (GVP → LLM)
+ │   ├── tools.py                   # SMILES extraction & NER tools
+ │   ├── layer2_component/          # Reaction Transformer module
+ │   │   ├── model.py               # Transformer encoder architecture
+ │   │   ├── Layer2Trainer.py       # Training loop
+ │   │   └── Layer2Inferer.py       # Inference & embedding generation
+ │   └── ldmol_component/           # Diffusion decoder module
+ │       ├── LDMolTrainer.py        # Diffusion training
+ │       ├── LDMolInferer.py        # Molecule generation
+ │       └── DiT/                   # DiT backbone
+ ├── scripts/
+ │   ├── train/                     # Training entry scripts
+ │   ├── eval/                      # Evaluation scripts
+ │   ├── layer2/                    # Layer2 configs & training
+ │   ├── layer2_llm/                # Layer2-LLM integration pipeline
+ │   ├── preprocess/                # Data preprocessing
+ │   ├── postprocess/               # Scoring & post-processing
+ │   └── ckpt/                      # Checkpoint utilities
+ ├── utils/                         # Shared utilities (metrics, SMILES)
+ ├── vendor/gvp-pytorch-main/       # GVP-GNN (vendored dependency)
+ ├── figs/                          # Paper figures
+ ├── LICENSE-MIT                    # MIT License
+ ├── LICENSE-APACHE                 # Apache 2.0 License
+ ├── pyproject.toml                 # Project & dependency config
+ └── README.md
+ ```
+
+ ## 📄 Acknowledgement
+
+ - [GVP-GNN](https://github.com/drorlab/gvp-pytorch): Geometric Vector Perceptron for molecular structure encoding
+ - [LDMol](https://github.com/jinhojsk515/LDMol): Latent Diffusion for molecular generation
+ - [SMolInstruct](https://github.com/osu-nlp-group/SMolInstruct): Molecular instruction tuning benchmark
+ - [ChemBench](https://github.com/lamalab-org/chem-bench): Chemistry benchmark suite
+
+ ## 🥰 Citation
+
+ ```bibtex
+ @article{chen2026scicoremol,
+   title={SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules},
+   author={},
+   journal={arXiv preprint arXiv:XXXX.XXXXX},
+   year={2026}
+ }
+ ```
+
249
+ ## πŸ“§ Contact
250
+
251
+ If you have questions, suggestions, or bug reports, please open an issue or email:
252
+ ```
253
+ chenyuxuan225@gmail.com
254
+ ```
255
+
256
+ ## πŸ“œ License
257
+
258
+ This project is dual-licensed under [MIT](LICENSE-MIT) and [Apache 2.0](LICENSE-APACHE).