---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
license: cc-by-nc-3.0
datasets:
- KU-AGI/Mol-LLM
language:
- en
metrics:
- bleu
- meteor
- rouge
- roc_auc
- mae
base_model:
- mistralai/Mistral-7B-Instruct-v0.3
---
# Mol-LLM: Multimodal Generalist Molecular LLM
Mol-LLM is a generalist multimodal molecular large language model for chemistry that jointly uses 1D molecular sequences and 2D molecular graphs to solve a wide range of molecular tasks in a single unified framework.
It introduces **Molecular structure Preference Optimization (MolPO)**, which trains the LLM to prefer correct molecular graphs over structurally perturbed ones, resolving the **“graph-bypass”** issue common in prior multimodal molecular LLMs, where a model answers from the text sequence alone and effectively ignores its graph input.
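The full MolPO objective is defined in the paper; the snippet below is only a minimal sketch of the underlying idea, assuming a DPO-style pairwise objective over answers conditioned on the correct versus a perturbed graph. The function name, the `beta` temperature, and the reference-model terms are illustrative assumptions, not the released implementation.

```python
import torch.nn.functional as F

def molpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Sketch of a pairwise molecular-structure preference loss.

    Each argument is the summed token log-probability of the target answer,
    shape (batch,): `pos` conditions on the correct molecular graph, `neg` on
    a structurally perturbed one; `ref_*` come from a frozen reference model,
    as in DPO.
    """
    policy_margin = logp_pos - logp_neg        # how much the policy prefers the correct graph
    ref_margin = ref_logp_pos - ref_logp_neg   # the same margin under the frozen reference
    # Sigmoid-log loss over the margin gain relative to the reference (DPO-style).
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```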
## Model summary
- **Backbone**: Mistral-7B-Instruct-v0.3.
- **Modalities**:
- Text (natural language instructions).
- 1D molecular sequences (SELFIES; SMILES supported via translation).
- 2D molecular graphs encoded by a hybrid GNN (GINE + TokenGT).
- **Architecture**: Mol-LLM uses a BLIP-2–style design in which a Q-Former (32 query tokens) projects graph embeddings into the LLM token space.
- **LLM**:
- Mistral-7B-Instruct-v0.3 as text backbone.
- Extended tokenizer with SELFIES and numeric tokens, plus task tags for heterogeneous outputs (discrete labels, floats, descriptions).
- **Hybrid graph encoder**:
- GINE for local structural patterns.
- TokenGT (transformer-based) for global structural dependencies and large graphs.
  - Both encoders produce graph-level, node-level, and edge-level embeddings; the concatenated embeddings are fed into the Q-Former.
- **Q-Former**:
- 5-layer SciBERT-style transformer with 32 learnable queries.
  - Cross-attends to the graph embeddings and outputs a fixed-length sequence of graph tokens appended after the SELFIES tokens in the LLM input (see the sketch after this list).
- Selected over an MLP projector due to better alignment and graph-token efficiency.
- **Tokenizer extensions**: 3K SELFIES tokens, numeric tokens, and task tags for `[SELFIES]`, `[BOOLEAN]`, `[FLOAT]`, `[DESCRIPTION]`, and reaction-direction symbols.
- **Training data**: ~3.3M instruction-tuning examples over 27 tasks, with ~40K held-out test instances.
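To make the graph-to-LLM interface concrete, here is a minimal single-layer sketch of the Q-Former idea described above: 32 learnable queries cross-attend to variable-length graph embeddings and emit a fixed-length token sequence for the LLM. The real module is a 5-layer SciBERT-style transformer; the hidden sizes and head count below are placeholder assumptions.

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Single cross-attention layer standing in for the 5-layer Q-Former."""

    def __init__(self, d_graph=300, d_model=768, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))  # 32 learnable query tokens
        self.proj_in = nn.Linear(d_graph, d_model)                    # lift graph embeddings to model width
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, graph_emb):
        # graph_emb: (batch, n_graph_tokens, d_graph) from the hybrid GINE/TokenGT encoder
        kv = self.proj_in(graph_emb)
        q = self.queries.unsqueeze(0).expand(graph_emb.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)  # (batch, 32, d_model): fixed length regardless of graph size
        return out
```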
Across the most comprehensive generalist molecular-LLM benchmark suite evaluated to date, Mol-LLM achieves state-of-the-art or comparable performance, including in out-of-distribution (OOD) settings.
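The `pytorch_model_hub_mixin` tag above means the checkpoint was pushed with `huggingface_hub`'s `PyTorchModelHubMixin`, so loading goes through the model class defined in the Mol-LLM codebase rather than `transformers.AutoModel`. A minimal sketch of the pattern; the stub class below is a placeholder, not the real Mol-LLM class:

```python
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin

class MolLLMStub(nn.Module, PyTorchModelHubMixin):
    """Placeholder only: the real Mol-LLM class lives in the project codebase.
    Any nn.Module mixing in PyTorchModelHubMixin gains from_pretrained /
    save_pretrained / push_to_hub."""
    def __init__(self):
        super().__init__()

# With the actual class from the Mol-LLM repository, loading would look like:
# model = MolLLM.from_pretrained("KU-AGI/Mol-LLM")
```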
## Intended use
Mol-LLM is intended to solve a broad range of **molecular tasks** with a single multitask model.
Supported task families:
- **Reaction prediction**:
- Forward synthesis (product prediction, FS)
- Retrosynthesis (reactant prediction, RS)
- Reagent prediction (RP)
- **Property prediction**:
- Regression: LogS, LogD, HOMO, LUMO, HOMO–LUMO gap
- Classification: BACE, BBBP, ClinTox, HIV, SIDER
- **Text–molecule tasks**:
- Description-guided molecule generation
- Molecule captioning
- IUPAC/SELFIES/formula translation as auxiliary tasks
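Since Mol-LLM consumes SELFIES, a SMILES input is first translated, e.g., with the `selfies` package, and then wrapped in a task-tagged instruction. The prompt wording below is illustrative only; the exact instruction templates come from the released instruction-tuning data.

```python
import selfies as sf  # pip install selfies

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin (kekulized SMILES)
selfies_str = sf.encoder(smiles)     # "[C][C][=Branch1]..." -- the 1D input the model expects

# Hypothetical prompt: the [BOOLEAN] task tag signals a yes/no property
# prediction (here, BBBP-style blood-brain-barrier penetration).
prompt = (
    "Does the following molecule penetrate the blood-brain barrier? "
    f"Molecule: {selfies_str} Respond with [BOOLEAN]."
)
```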