---
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-generation
---
<div align="center">
<h1>SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules</h1>
<a href='https://github.com/OpenBMB/SciCore-Mol/blob/main/SciCore_Mol_Technical_Report.pdf'><img src='https://img.shields.io/badge/Paper-Technical_Report-red'></a>
<a href="https://github.com/OpenBMB/SciCore-Mol"><img src="https://img.shields.io/badge/GitHub-OpenBMB%2FSciCore--Mol-181717?style=flat&logo=github&logoColor=white" alt="OpenBMB/SciCore-Mol"></a>
<a href='https://huggingface.co/openbmb/SciCore-Mol'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a>
<a href='https://chenyxyx-scicore-mol.hf.space'><img src='https://img.shields.io/badge/Demo-Hugging_Face_Space-0f8b7d'></a>
**Yuxuan Chen**<sup>1</sup>, **Changwei Lv**<sup>2</sup>, **Yunduo Xiao**<sup>2</sup>, **Yukun Yan**<sup>2</sup>, **Zheni Zeng**<sup>*3</sup>, and **Zhiyuan Liu**<sup>2</sup>
<sup>1</sup>School of Electronic and Computer Engineering, Peking University, Shenzhen, China<br>
<sup>2</sup>Tsinghua University, Beijing, China<br>
<sup>3</sup>School of Intelligence Science and Technology, Nanjing University, Nanjing, China<br>
<sup>*</sup>Corresponding author: zengzn@nju.edu.cn
</div>
## Introduction
Large language models (LLMs) are increasingly used in professional domains, but they face a fundamental tension when handling heterogeneous scientific data: LLMs are designed for discrete, sequential natural-language symbols, whereas scientific entities such as molecules are inherently topological and geometric. Forcing these structures into linear text inevitably causes information loss and introduces semantic noise that interferes with the LLM's reasoning.
We propose **SciCore-Mol**, a paradigm that augments an LLM with pluggable external cognitive modules: a **GVP encoder**, a **diffusion generator**, and a **numerically sensitive Transformer** (the Reaction Transformer). This architecture preserves the LLM's general capabilities while adding specialized molecular perception. With a two-stage alignment mechanism, external modules are invoked via special tokens and fused at the hidden-state level, enabling the LLM to understand molecular information in depth without sacrificing its core reasoning process.
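To make the fusion idea concrete, here is a minimal sketch (not the project's implementation) of how an embedding produced by an external module could replace the hidden vector at a special-token position. The `<mol>` token name appears in the training section of this README; the function names, the toy dimensions, and the linear adapter weights are assumptions made purely for illustration.

```python
from typing import List

Vector = List[float]

MOL_TOKEN = "<mol>"  # special token named in the Stage 2 description


def project(mol_embedding: Vector, weight: List[Vector]) -> Vector:
    """Hypothetical linear adapter: map a molecular embedding to the LLM hidden size."""
    return [sum(w * x for w, x in zip(row, mol_embedding)) for row in weight]


def fuse_hidden_states(tokens, hidden_states, mol_embeddings, weight):
    """Replace the hidden vector at each <mol> position with a projected molecular embedding."""
    mol_iter = iter(mol_embeddings)
    fused = []
    for token, hidden in zip(tokens, hidden_states):
        if token == MOL_TOKEN:
            fused.append(project(next(mol_iter), weight))
        else:
            fused.append(list(hidden))
    return fused


tokens = ["Describe", MOL_TOKEN, "briefly"]
hidden_states = [[0.1, 0.2], [0.0, 0.0], [0.3, 0.4]]  # toy hidden size 2
mol_embeddings = [[1.0, 2.0, 3.0]]                    # toy encoder output, size 3
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]           # toy 2x3 adapter matrix

fused = fuse_hidden_states(tokens, hidden_states, mol_embeddings, weight)
print(fused[1])
```

Only the `<mol>` position changes; the text-token vectors pass through untouched, which is one way to read the claim that the general language capabilities are preserved.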
## Setup
### Prerequisites
- Python 3.10
- CUDA 12.1
- 8x A800/A100 80GB GPUs (recommended for full training)
### Installation
```bash
git clone https://github.com/ChenYX24/SciCore-Mol.git
cd SciCore-Mol
# Option A: Install with uv (recommended)
pip install uv
uv sync
uv sync --extra graph # GVP-GNN dependencies (torch-geometric, torch-scatter, torch-cluster)
uv sync --extra flashattn # FlashAttention (requires CUDA)
uv sync --group train # DeepSpeed for distributed training
# Option B: Install with pip
python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -e ".[graph]" # optional: GVP-GNN
pip install -e ".[flashattn]" # optional: FlashAttention
pip install deepspeed swanlab # optional: distributed training
```
### Environment Variables
```bash
cp configs/env.example.sh configs/env.sh
# Edit configs/env.sh to set your paths, then:
source configs/env.sh
```
| Variable | Description |
|----------|-------------|
| `SCICORE_ROOT` | Project root directory |
| `MODEL_DIR` | Base model directory (e.g., Qwen3-8B) |
| `CHECKPOINT_DIR` | Trained checkpoint directory |
| `DATA_DIR` | Training and evaluation data |
| `GVP_CHECKPOINT` | Pretrained GVP-GNN weights |
| `OPENAI_API_KEY` | API key for GPT baseline evaluation |
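Before launching a long job it can help to confirm that these variables are actually exported. The following is a small, hypothetical helper, not part of the repository; the variable names come from the table above, while the split into required and optional variables is an assumption.

```python
import os

# Variable names are taken from the table above.
REQUIRED_VARS = [
    "SCICORE_ROOT",
    "MODEL_DIR",
    "CHECKPOINT_DIR",
    "DATA_DIR",
    "GVP_CHECKPOINT",
]
# Assumed optional: only needed for the GPT baseline evaluation.
OPTIONAL_VARS = ["OPENAI_API_KEY"]


def missing_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(REQUIRED_VARS)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Run it in the same shell after `source configs/env.sh`, so that the check sees the exported values.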
## Training
SciCore-Mol follows a **three-stage training pipeline** (see figure above):
### Stage 1: Component Pre-training
Pre-train each component independently before joint training.
- **GVP Encoder + MLP Adapter**: Align GVP molecular embeddings to LLM hidden space.
```bash
bash scripts/run/gvp_mlp_pretrain_qwen.sh
```
- **Reaction Transformer (Layer2)**: Train on reaction data for yield prediction and embedding reconstruction.
```bash
python scripts/layer2/train_layer2.py \
--config scripts/layer2/layer2_train_config.yaml
```
### Stage 2: Cross-Modal Alignment Training
Joint SFT training with all modules connected. The LLM learns to invoke external modules via special `<mol>` tokens.
```bash
# Configure training in configs/qwen3_sft_epoch2_1.yaml
# Uses DeepSpeed ZeRO-3 for multi-GPU training
torchrun --nproc_per_node=4 \
cotrain_llm_diffusion/train_step1_llm.py \
--config configs/qwen3_sft_epoch2_1.yaml
```
**Key config fields** (in `configs/qwen3_sft_epoch2_*.yaml`):
- `paths.llm_name_or_path`: Base LLM checkpoint
- `paths.gnn_state_dict_path`: Pretrained GVP weights
- `paths.deepspeed_config`: DeepSpeed config (ZeRO-2 or ZeRO-3)
- `training.freeze_strategy`: Control which modules are frozen/trainable
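As a rough illustration of how these fields might be checked before a run, the sketch below treats the dotted names as paths into a nested mapping (for example, the result of loading the YAML file). That nesting, the helper names, and every value shown are assumptions; the actual layout and the allowed `freeze_strategy` values are defined by the repository's config files.

```python
REQUIRED_KEYS = [
    "paths.llm_name_or_path",
    "paths.gnn_state_dict_path",
    "paths.deepspeed_config",
    "training.freeze_strategy",
]


def get_dotted(config, dotted_key):
    """Look up a nested value such as 'paths.llm_name_or_path'; return None if absent."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def missing_keys(config):
    """List the required dotted keys that are not present in the config."""
    return [key for key in REQUIRED_KEYS if get_dotted(config, key) is None]


# Placeholder values only; real paths and strategy names come from the repository.
example_config = {
    "paths": {
        "llm_name_or_path": "/path/to/base-llm",
        "gnn_state_dict_path": "/path/to/gvp-weights",
        "deepspeed_config": "/path/to/deepspeed.json",
    },
    "training": {},
}

print(missing_keys(example_config))
```

Here the example deliberately omits `training.freeze_strategy`, so the check reports that one key as missing.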
### Stage 3: Task-Specific Fine-tuning
Fine-tune Layer2 (Reaction Transformer) on downstream tasks with configurable module freezing:
```bash
python scripts/layer2/train_layer2.py \
--config scripts/layer2/layer2_train_config_stage2_v7b.yaml
```
After training, split the checkpoint into LLM and extra components:
```bash
python scripts/ckpt/split_llm_extras.py \
--checkpoint_path ${CHECKPOINT_DIR}/your-checkpoint/ \
--output_dir ${CHECKPOINT_DIR}/your-checkpoint/
```
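The internals of `split_llm_extras.py` are not described here. Conceptually, though, a split of this kind can be pictured as partitioning a flat parameter dictionary by name. The sketch below shows that idea on toy data; the `llm.` prefix and the parameter names are hypothetical and may not match the real checkpoint layout.

```python
def split_state_dict(state_dict, llm_prefix="llm."):
    """Partition a flat parameter dict into LLM weights and everything else.

    The 'llm.' prefix is a hypothetical naming convention chosen for this example.
    """
    llm_part, extras = {}, {}
    for name, value in state_dict.items():
        if name.startswith(llm_prefix):
            # Strip the prefix so the LLM weights can be loaded on their own.
            llm_part[name[len(llm_prefix):]] = value
        else:
            extras[name] = value
    return llm_part, extras


# Toy parameters; real checkpoints hold tensors rather than lists.
toy = {
    "llm.embed.weight": [0.1],
    "llm.layers.0.weight": [0.2],
    "gvp_encoder.weight": [0.3],
    "mlp_adapter.weight": [0.4],
}
llm_part, extras = split_state_dict(toy)
print(sorted(llm_part))
print(sorted(extras))
```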
## Evaluation
### ChemBench4K (Product / Retrosynthesis / Yield / Captioning)
```bash
# Evaluate all 5 tasks with logprob scoring
bash scripts/run/run_chembench_all_tasks.sh
# Or run individual tasks:
python scripts/eval/eval_layer2_chembench.py \
--checkpoint_dir ${CHECKPOINT_DIR}/your-checkpoint \
--task product \
--output_dir eval_results/chembench/
```
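For readers unfamiliar with logprob scoring, the sketch below shows the general idea, assuming a multiple-choice format: each answer option receives a log-probability from the model, and the highest-scoring option is taken as the prediction. The numbers and helper names are made up for illustration and are not taken from the evaluation scripts.

```python
def pick_option(option_logprobs):
    """Return the option label with the highest log-probability."""
    return max(option_logprobs, key=option_logprobs.get)


def accuracy(predictions, answers):
    """Fraction of predictions that match the reference answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Made-up log-probabilities for two multiple-choice questions.
questions = [
    {"A": -2.3, "B": -0.4, "C": -3.1, "D": -1.9},
    {"A": -0.7, "B": -1.2, "C": -2.8, "D": -2.0},
]
answers = ["B", "C"]

predictions = [pick_option(q) for q in questions]
print(predictions, accuracy(predictions, answers))
```

Scoring options this way avoids parsing free-form generations, since the model never has to emit the answer letter itself.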
### MMLU Chemistry Subsets (5 subjects)
```bash
python scripts/eval/eval_mmlu_interns1mini_5subsets.py \
--model_path ${MODEL_DIR}/your-model \
--output_dir eval_results/mmlu/
```
### ORD Reaction Prediction (Full Pipeline)
```bash
# Run Layer2-LLM integrated pipeline
bash scripts/layer2_llm/run_full_pipeline.sh
# Score predictions
python scripts/postprocess/score_only.py \
--pred_dir eval_results/ord/
```
### SMolInstruct (7 molecular tasks)
```bash
# Automated multi-task evaluation with GPU scheduling
bash scripts/run/eval_smol_task_list.sh
```
### Drug Optimization (ADMET scoring)
```bash
# LLM-based drug optimization
python eval/drug_optim/eval_admet.py \
--config eval/drug_optim/config/llm_cpt_sft.yaml
# Diffusion-based drug optimization
python eval/drug_optim/eval_diffusion.py \
--config eval/drug_optim/config/diffusion_sft.yaml
```
## Acknowledgement
- [GVP-GNN](https://github.com/drorlab/gvp-pytorch): Geometric Vector Perceptron for molecular structure encoding
- [LDMol](https://github.com/jinhojsk515/LDMol): Latent Diffusion for molecular generation
- [SMolInstruct](https://github.com/osu-nlp-group/SMolInstruct): Molecular instruction tuning benchmark
- [ChemBench](https://github.com/lamalab-org/chem-bench): Chemistry benchmark suite
## Citation
```bibtex
@article{chen2026scicoremol,
title={SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules},
author={Chen, Yuxuan and Lv, Changwei and Xiao, Yunduo and Yan, Yukun and Zeng, Zheni and Liu, Zhiyuan},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2026}
}
```
## Contact
If you have questions, suggestions, or bug reports, please open an issue or email:
```
chenyuxuan225@gmail.com
```
## License
This project is dual-licensed under [MIT](LICENSE-MIT) and [Apache 2.0](LICENSE-APACHE).