Update README.md
---
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-classification
---
<div align="center">
<h1>SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules</h1>

<a href='SciCore_Mol_Technical_Report.pdf'><img src='https://img.shields.io/badge/Paper-Technical_Report-red'></a>
<a href='https://huggingface.co/openbmb/SciCore-Mol'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a>
<a href='https://chenyxyx-scicore-mol.hf.space'><img src='https://img.shields.io/badge/Demo-Hugging_Face_Space-0f8b7d'></a>

**Yuxuan Chen**<sup>1</sup>, **Changwei Lv**<sup>2</sup>, **Yunduo Xiao**<sup>2</sup>, **Yukun Yan**<sup>2</sup>, **Zheni Zeng**<sup>*3</sup>, and **Zhiyuan Liu**<sup>2</sup>

<sup>1</sup>School of Electronic and Computer Engineering, Peking University, Shenzhen, China<br>
<sup>2</sup>Tsinghua University, Beijing, China<br>
<sup>3</sup>School of Intelligence Science and Technology, Nanjing University, Nanjing, China<br>
<sup>*</sup>Corresponding author: zengzn@nju.edu.cn

</div>

## Introduction

Large language models (LLMs) are increasingly used in professional domains, but they face a fundamental tension when handling heterogeneous scientific data: LLMs are designed for discrete sequences of natural-language symbols, whereas scientific entities such as molecules are inherently topological and geometric. Forcing these structures into linear text inevitably loses information and introduces semantic noise that interferes with the LLM's reasoning.

We propose **SciCore-Mol**, a paradigm that augments an LLM with pluggable external cognitive modules: a **GVP encoder**, a **diffusion generator**, and a **numerically sensitive Transformer** (the Reaction Transformer). This architecture preserves the LLM's general capabilities while adding specialized molecular perception. With a two-stage alignment mechanism, the external modules are invoked via special tokens and fused at the hidden-state level, enabling the LLM to understand molecular information in depth without sacrificing its core reasoning process.

<p align="center"><img src="figs/fig3.png" width="90%"></p>

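
To make "fused at the hidden-state level" concrete, here is a toy sketch in plain Python. It is not the repository's implementation: `MOL_TOKEN_ID`, `project`, `fuse_mol_embeddings`, the tiny dimensions, and the choice to overwrite (rather than add to) the hidden state at each `<mol>` position are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]

MOL_TOKEN_ID = 9999  # hypothetical id of the special <mol> token


def project(weight: Matrix, vec: Vector) -> Vector:
    """Linear adapter: map a molecular embedding into the LLM hidden size."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def fuse_mol_embeddings(
    token_ids: List[int],
    hidden: List[Vector],
    mol_embeddings: List[Vector],
    weight: Matrix,
) -> List[Vector]:
    """Overwrite the hidden state at each <mol> position, in order of appearance."""
    positions = [i for i, tok in enumerate(token_ids) if tok == MOL_TOKEN_ID]
    if len(positions) != len(mol_embeddings):
        raise ValueError("number of <mol> tokens and molecular embeddings differ")
    fused = [row[:] for row in hidden]  # copy so the input stays untouched
    for pos, emb in zip(positions, mol_embeddings):
        fused[pos] = project(weight, emb)
    return fused


# Toy example: 4 tokens, hidden size 2, molecular embedding size 3.
token_ids = [11, MOL_TOKEN_ID, 12, 13]
hidden = [[0.1, 0.2], [0.0, 0.0], [0.3, 0.4], [0.5, 0.6]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # (hidden size) x (mol size)
mol_embeddings = [[2.0, 3.0, 4.0]]

fused = fuse_mol_embeddings(token_ids, hidden, mol_embeddings, weight)
print(fused[1])  # [2.0, 7.0]
```

Only the `<mol>` position changes; the text-token hidden states pass through unmodified, which is the sense in which the module is "pluggable" in this sketch.
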
## Setup

### Prerequisites

- Python 3.10
- CUDA 12.1
- 8x A800/A100 80GB GPUs (recommended for full training)

### Installation

```bash
git clone https://github.com/ChenYX24/SciCore-Mol.git
cd SciCore-Mol

# Option A: Install with uv (recommended)
pip install uv
uv sync
uv sync --extra graph       # GVP-GNN dependencies (torch-geometric, torch-scatter, torch-cluster)
uv sync --extra flashattn   # FlashAttention (requires CUDA)
uv sync --group train       # DeepSpeed for distributed training

# Option B: Install with pip
python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -e ".[graph]"        # optional: GVP-GNN
pip install -e ".[flashattn]"    # optional: FlashAttention
pip install deepspeed swanlab    # optional: distributed training
```

### Environment Variables

```bash
cp configs/env.example.sh configs/env.sh
# Edit configs/env.sh to set your paths, then:
source configs/env.sh
```

| Variable | Description |
|----------|-------------|
| `SCICORE_ROOT` | Project root directory |
| `MODEL_DIR` | Base model directory (e.g., Qwen3-8B) |
| `CHECKPOINT_DIR` | Trained checkpoint directory |
| `DATA_DIR` | Training and evaluation data |
| `GVP_CHECKPOINT` | Pretrained GVP-GNN weights |
| `OPENAI_API_KEY` | API key for GPT baseline evaluation |

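
Before launching a long job, it can help to confirm that the variables in the table are actually set. The check below is a small sketch and not part of the repository; the helper name `missing_variables` and the split into required and optional variables are assumptions (here `OPENAI_API_KEY` is treated as optional because the table ties it to GPT baseline evaluation only).

```python
import os
from typing import List, Mapping, Sequence

# Assumed split: everything except the API key is treated as required.
REQUIRED_VARS = [
    "SCICORE_ROOT",
    "MODEL_DIR",
    "CHECKPOINT_DIR",
    "DATA_DIR",
    "GVP_CHECKPOINT",
]
OPTIONAL_VARS = ["OPENAI_API_KEY"]  # only used for GPT baseline evaluation


def missing_variables(
    env: Mapping[str, str], required: Sequence[str] = REQUIRED_VARS
) -> List[str]:
    """Return the required variable names that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


# Check the current process environment (run after `source configs/env.sh`).
missing = missing_variables(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```
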
## Training

SciCore-Mol follows a **three-stage training pipeline** (see figure above):

### Stage 1: Component Pre-training

Pre-train each component independently before joint training.

- **GVP Encoder + MLP Adapter**: Align GVP molecular embeddings to the LLM hidden space.
  ```bash
  bash scripts/run/gvp_mlp_pretrain_qwen.sh
  ```
- **Reaction Transformer (Layer2)**: Train on reaction data for yield prediction and embedding reconstruction.
  ```bash
  python scripts/layer2/train_layer2.py \
      --config scripts/layer2/layer2_train_config.yaml
  ```

### Stage 2: Cross-Modal Alignment Training

Joint SFT training with all modules connected. The LLM learns to invoke external modules via special `<mol>` tokens.

```bash
# Configure training in configs/qwen3_sft_epoch2_1.yaml
# Uses DeepSpeed ZeRO-3 for multi-GPU training
torchrun --nproc_per_node=4 \
    cotrain_llm_diffusion/train_step1_llm.py \
    --config configs/qwen3_sft_epoch2_1.yaml
```

**Key config fields** (in `configs/qwen3_sft_epoch2_*.yaml`):
- `paths.llm_name_or_path`: Base LLM checkpoint
- `paths.gnn_state_dict_path`: Pretrained GVP weights
- `paths.deepspeed_config`: DeepSpeed config (ZeRO-2 or ZeRO-3)
- `training.freeze_strategy`: Control which modules are frozen/trainable

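
The dotted names above suggest nested sections in the YAML file (a `paths` section and a `training` section). As a sketch of how such a field could be looked up once the file has been parsed into a dictionary, the snippet below resolves a dotted key in a nested mapping. The helper `get_field`, the placeholder paths, and the `freeze_strategy` value are all hypothetical; consult the shipped config files for the real values.

```python
from collections.abc import Mapping
from typing import Any


def get_field(config: Mapping, dotted_key: str) -> Any:
    """Resolve a key such as 'paths.llm_name_or_path' in a nested mapping."""
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, Mapping) or part not in node:
            raise KeyError(f"missing config field: {dotted_key}")
        node = node[part]
    return node


# Stand-in for a parsed YAML config; every value here is a placeholder.
example_config = {
    "paths": {
        "llm_name_or_path": "/path/to/Qwen3-8B",
        "gnn_state_dict_path": "/path/to/gvp_weights.pt",
        "deepspeed_config": "/path/to/deepspeed_config.json",
    },
    "training": {
        "freeze_strategy": "placeholder-strategy",
    },
}

print(get_field(example_config, "paths.llm_name_or_path"))  # /path/to/Qwen3-8B
```

Failing fast with a `KeyError` that names the full dotted key makes a misspelled or missing field easier to spot than a bare `None`.
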
### Stage 3: Task-Specific Fine-tuning

Fine-tune Layer2 (Reaction Transformer) on downstream tasks with configurable module freezing:

```bash
python scripts/layer2/train_layer2.py \
    --config scripts/layer2/layer2_train_config_stage2_v7b.yaml
```

After training, split the checkpoint into LLM and extra components:
```bash
python scripts/ckpt/split_llm_extras.py \
    --checkpoint_path ${CHECKPOINT_DIR}/your-checkpoint/ \
    --output_dir ${CHECKPOINT_DIR}/your-checkpoint/
```

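
Conceptually, splitting a joint checkpoint means partitioning its parameters into those of the base LLM and those of the added modules. The sketch below illustrates the idea on a plain dictionary and is not what `split_llm_extras.py` necessarily does: the `llm.` key prefix, the example parameter names, and the function `split_state_dict` are assumptions, and strings stand in for tensors.

```python
from typing import Any, Dict, Tuple

LLM_PREFIX = "llm."  # hypothetical prefix marking base-LLM parameters


def split_state_dict(
    state: Dict[str, Any], llm_prefix: str = LLM_PREFIX
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Partition a flat state dict into (LLM weights, extra-module weights)."""
    llm: Dict[str, Any] = {}
    extras: Dict[str, Any] = {}
    for key, value in state.items():
        if key.startswith(llm_prefix):
            # Drop the prefix so the LLM part can be loaded on its own.
            llm[key[len(llm_prefix):]] = value
        else:
            extras[key] = value
    return llm, extras


# Toy checkpoint with made-up parameter names.
joint = {
    "llm.embed_tokens.weight": "tensor-0",
    "llm.layers.0.attn.weight": "tensor-1",
    "gvp_encoder.layer0.weight": "tensor-2",
    "mlp_adapter.fc.weight": "tensor-3",
}
llm_part, extra_part = split_state_dict(joint)
print(sorted(llm_part))    # ['embed_tokens.weight', 'layers.0.attn.weight']
print(sorted(extra_part))  # ['gvp_encoder.layer0.weight', 'mlp_adapter.fc.weight']
```
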
## Evaluation

### ChemBench4K (Product / Retrosynthesis / Yield / Captioning)

```bash
# Evaluate all 5 tasks with logprob scoring
bash scripts/run/run_chembench_all_tasks.sh

# Or run individual tasks:
python scripts/eval/eval_layer2_chembench.py \
    --checkpoint_dir ${CHECKPOINT_DIR}/your-checkpoint \
    --task product \
    --output_dir eval_results/chembench/
```

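
For readers unfamiliar with logprob scoring of multiple-choice questions, the general idea is to score each candidate answer by the log-probability the model assigns to its tokens and pick the highest-scoring option. The sketch below shows one common variant under stated assumptions: the function names, the hard-coded log-probabilities, and the choice of length normalization are illustrative and may differ from what the evaluation scripts do.

```python
from typing import Dict, List


def score_option(token_logprobs: List[float], length_normalize: bool = True) -> float:
    """Score one answer option from the log-probabilities of its tokens."""
    if not token_logprobs:
        raise ValueError("an option needs at least one token log-probability")
    total = sum(token_logprobs)
    # Dividing by length avoids systematically favoring shorter options.
    return total / len(token_logprobs) if length_normalize else total


def pick_answer(option_logprobs: Dict[str, List[float]]) -> str:
    """Return the label of the option with the highest score."""
    return max(option_logprobs, key=lambda label: score_option(option_logprobs[label]))


# Made-up per-token log-probabilities for three options.
options = {
    "A": [-0.2, -0.4],
    "B": [-1.5],
    "C": [-0.1, -0.1, -0.3],
}
print(pick_answer(options))  # C
```
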
### MMLU Chemistry Subsets (5 subjects)

```bash
python scripts/eval/eval_mmlu_interns1mini_5subsets.py \
    --model_path ${MODEL_DIR}/your-model \
    --output_dir eval_results/mmlu/
```

### ORD Reaction Prediction (Full Pipeline)

```bash
# Run Layer2-LLM integrated pipeline
bash scripts/layer2_llm/run_full_pipeline.sh

# Score predictions
python scripts/postprocess/score_only.py \
    --pred_dir eval_results/ord/
```

### SMolInstruct (7 molecular tasks)

```bash
# Automated multi-task evaluation with GPU scheduling
bash scripts/run/eval_smol_task_list.sh
```

### Drug Optimization (ADMET scoring)

```bash
# LLM-based drug optimization
python eval/drug_optim/eval_admet.py \
    --config eval/drug_optim/config/llm_cpt_sft.yaml

# Diffusion-based drug optimization
python eval/drug_optim/eval_diffusion.py \
    --config eval/drug_optim/config/diffusion_sft.yaml
```

## Repository Structure

```
SciCore-Mol/
├── configs/                        # Training and DeepSpeed configs
│   ├── qwen3_sft_epoch2_*.yaml     # Stage 2 SFT configs
│   ├── deepspeed_zero*.json        # DeepSpeed ZeRO-2/3 configs
│   └── env.example.sh              # Environment variable template
├── cotrain_llm_diffusion/          # Stage 2: LLM-Diffusion co-training
│   ├── train_step1_llm.py          # Joint SFT training script
│   └── generate_reasoning*.py      # Diffusion data generation
├── eval/                           # Evaluation suite
│   ├── drug_optim/                 # Drug optimization (ADMET)
│   ├── eval_smolinstruct.py        # SMolInstruct benchmark
│   └── eval_*.py                   # Other benchmarks
├── modules/                        # Core model components
│   ├── mol_aware_lm.py             # MolAware language model wrapper
│   ├── model_init.py               # Model/tokenizer initialization
│   ├── data_loader.py              # Data loading & mol-span processing
│   ├── gnn.py                      # GVP-GNN encoder
│   ├── mlp.py                      # MLP adapter (GVP → LLM)
│   ├── tools.py                    # SMILES extraction & NER tools
│   ├── layer2_component/           # Reaction Transformer module
│   │   ├── model.py                # Transformer encoder architecture
│   │   ├── Layer2Trainer.py        # Training loop
│   │   └── Layer2Inferer.py        # Inference & embedding generation
│   └── ldmol_component/            # Diffusion decoder module
│       ├── LDMolTrainer.py         # Diffusion training
│       ├── LDMolInferer.py         # Molecule generation
│       └── DiT/                    # DiT backbone
├── scripts/
│   ├── train/                      # Training entry scripts
│   ├── eval/                       # Evaluation scripts
│   ├── layer2/                     # Layer2 configs & training
│   ├── layer2_llm/                 # Layer2-LLM integration pipeline
│   ├── preprocess/                 # Data preprocessing
│   ├── postprocess/                # Scoring & post-processing
│   └── ckpt/                       # Checkpoint utilities
├── utils/                          # Shared utilities (metrics, SMILES)
├── vendor/gvp-pytorch-main/        # GVP-GNN (vendored dependency)
├── figs/                           # Paper figures
├── LICENSE-MIT                     # MIT License
├── LICENSE-APACHE                  # Apache 2.0 License
├── pyproject.toml                  # Project & dependency config
└── README.md
```

## Acknowledgement

- [GVP-GNN](https://github.com/drorlab/gvp-pytorch): Geometric Vector Perceptron for molecular structure encoding
- [LDMol](https://github.com/jinhojsk515/LDMol): Latent Diffusion for molecular generation
- [SMolInstruct](https://github.com/osu-nlp-group/SMolInstruct): Molecular instruction tuning benchmark
- [ChemBench](https://github.com/lamalab-org/chem-bench): Chemistry benchmark suite

## Citation

```bibtex
@article{chen2026scicoremol,
  title={SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules},
  author={},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```

## Contact

If you have questions, suggestions, or bug reports, please open an issue or email:
```
chenyuxuan225@gmail.com
```

## License

This project is dual-licensed under [MIT](LICENSE-MIT) and [Apache 2.0](LICENSE-APACHE).