# SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules

**Yuxuan Chen**<sup>1</sup>, **Changwei Lv**<sup>2</sup>, **Yunduo Xiao**<sup>2</sup>, **Yukun Yan**<sup>2</sup>, **Zheni Zeng**<sup>*3</sup>, and **Zhiyuan Liu**<sup>2</sup>

<sup>1</sup>School of Electronic and Computer Engineering, Peking University, Shenzhen, China
<sup>2</sup>Tsinghua University, Beijing, China
<sup>3</sup>School of Intelligence Science and Technology, Nanjing University, Nanjing, China

<sup>*</sup>Corresponding author: zengzn@nju.edu.cn

## 📖 Introduction
Large language models (LLMs) are increasingly used in professional domains, but they face a fundamental cognitive tension when handling heterogeneous scientific data: LLMs are designed for discrete, sequential natural-language symbols, whereas scientific entities such as molecules are inherently topological and geometric. Forcing these structures into linear text inevitably loses information and introduces semantic noise that interferes with the LLM's reasoning.
We propose **SciCore-Mol**, a paradigm that augments an LLM with pluggable external cognitive modules: a **GVP encoder**, a **diffusion generator**, and a **numerically sensitive Transformer** (Reaction Transformer). This architecture preserves the LLM's general capabilities while adding specialized molecular perception. With a two-stage alignment mechanism, external modules are invoked via special tokens and fused at the hidden-state level, so the LLM can deeply understand molecular information without sacrificing its core reasoning process.
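
To make "invoked via special tokens and fused at the hidden-state level" more concrete, here is a minimal, dependency-free sketch of one possible fusion scheme. It is an illustration under stated assumptions, not the repository's implementation: the token name `<mol>`, the function names `project` and `fuse_hidden_states`, and the choice to *replace* the hidden state at the special-token position (rather than, say, add to it) are all hypothetical.

```python
from typing import List, Sequence

# Hypothetical special token; the actual token names used by SciCore-Mol may differ.
MOL_TOKEN = "<mol>"


def project(embedding: Sequence[float], weight: Sequence[Sequence[float]]) -> List[float]:
    """Linear adapter: map a module embedding into the LLM hidden size.

    `weight` has shape (hidden_size, module_dim).
    """
    return [sum(w * x for w, x in zip(row, embedding)) for row in weight]


def fuse_hidden_states(
    tokens: Sequence[str],
    hidden_states: Sequence[Sequence[float]],
    module_embeddings: Sequence[Sequence[float]],
    weight: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Return a copy of `hidden_states` in which each MOL_TOKEN position
    holds the projected embedding from an external module (assumed scheme)."""
    positions = [i for i, tok in enumerate(tokens) if tok == MOL_TOKEN]
    if len(positions) != len(module_embeddings):
        raise ValueError("need exactly one module embedding per special token")

    fused = [list(h) for h in hidden_states]  # copy; the input is left untouched
    for pos, emb in zip(positions, module_embeddings):
        fused[pos] = project(emb, weight)
    return fused
```

In this sketch, a prompt such as `["Predict", "<mol>", "yield"]` keeps its ordinary hidden states at positions 0 and 2, while position 1 carries the projected output of an external module such as the GVP encoder.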
## ⚙️ Setup
### Prerequisites
- Python 3.10
- CUDA 12.1
- 8x A800/A100 80GB GPUs (recommended for full training)
### Installation
```bash
git clone https://github.com/ChenYX24/SciCore-Mol.git
cd SciCore-Mol
# Option A: Install with uv (recommended)
pip install uv
uv sync
uv sync --extra graph # GVP-GNN dependencies (torch-geometric, torch-scatter, torch-cluster)
uv sync --extra flashattn # FlashAttention (requires CUDA)
uv sync --group train # DeepSpeed for distributed training
# Option B: Install with pip
python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -e ".[graph]" # optional: GVP-GNN
pip install -e ".[flashattn]" # optional: FlashAttention
pip install deepspeed swanlab # optional: distributed training (DeepSpeed) and experiment tracking (SwanLab)
```
### Environment Variables
```bash
cp configs/env.example.sh configs/env.sh
# Edit configs/env.sh to set your paths, then:
source configs/env.sh
```
| Variable | Description |
|----------|-------------|
| `SCICORE_ROOT` | Project root directory |
| `MODEL_DIR` | Base model directory (e.g., Qwen3-8B) |
| `CHECKPOINT_DIR` | Trained checkpoint directory |
| `DATA_DIR` | Training and evaluation data |
| `GVP_CHECKPOINT` | Pretrained GVP-GNN weights |
| `OPENAI_API_KEY` | API key for GPT baseline evaluation |
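
After sourcing `configs/env.sh`, a quick check like the one below can confirm that the variables from the table are set before launching a long job. This is a small sketch, not a script shipped with the repository: the helper name `missing_env_vars` is hypothetical, and treating `OPENAI_API_KEY` as optional is an assumption based on its description (it is only needed for GPT baseline evaluation).

```python
import os
from typing import List, Mapping, Optional

# Variable names are taken from the table above.
# The required/optional split is an assumption, not something the project specifies.
REQUIRED_VARS = [
    "SCICORE_ROOT",
    "MODEL_DIR",
    "CHECKPOINT_DIR",
    "DATA_DIR",
    "GVP_CHECKPOINT",
]
OPTIONAL_VARS = ["OPENAI_API_KEY"]


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or empty in `env`.

    Defaults to the current process environment when `env` is None.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required environment variables are set.")
```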
## 🔧 Training
SciCore-Mol follows a **three-stage training pipeline** (see figure above):
### Stage 1: Component Pre-training
Pre-train each component independently before joint training.
- **GVP Encoder + MLP Adapter**: Align GVP molecular embeddings to LLM hidden space.
```bash
bash scripts/run/gvp_mlp_pretrain_qwen.sh
```
- **Reaction Transformer (Layer2)**: Train on reaction data for yield prediction and embedding reconstruction.
```bash
python scripts/layer2/train_layer2.py \
--config scripts/layer2/layer2_train_config.yaml
```
### Stage 2: Cross-Modal Alignment Training
Joint SFT training with all modules connected. The LLM learns to invoke external modules via special `