---
tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
license: cc-by-nc-3.0
datasets:
  - KU-AGI/Mol-LLM
language:
  - en
metrics:
  - bleu
  - meteor
  - rouge
  - roc_auc
  - mae
base_model:
  - mistralai/Mistral-7B-Instruct-v0.3
---

# Mol-LLM: Multimodal Generalist Molecular LLM

Mol-LLM is a multimodal generalist molecular large language model for chemistry that jointly uses molecular 1D sequences and 2D molecular graphs to solve a wide range of molecular tasks in a single unified framework. It introduces Molecular structure Preference Optimization (MolPO) to force the LLM to prefer correct molecular graphs over perturbed ones, resolving the “graph-bypass” issue common in prior multimodal molecular LLMs.
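
The exact MolPO objective is defined in the Mol-LLM paper; the sketch below only illustrates the general idea with a DPO-style preference loss, where the same target answer is scored once conditioned on the correct graph and once on a perturbed graph. The function name `molpo_loss`, the `beta` temperature, and the reference-model terms are illustrative assumptions, not the released training code.

```python
# Illustrative sketch of a DPO-style preference loss over molecular graph inputs.
import torch
import torch.nn.functional as F

def molpo_loss(
    logp_correct: torch.Tensor,      # log p_theta(answer | instruction, correct graph)
    logp_perturbed: torch.Tensor,    # log p_theta(answer | instruction, perturbed graph)
    ref_logp_correct: torch.Tensor,  # same quantities under a frozen reference model
    ref_logp_perturbed: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Encourage the LLM to assign higher likelihood to the target answer when it
    is conditioned on the correct 2D graph than when it sees a perturbed one."""
    margin = (logp_correct - ref_logp_correct) - (logp_perturbed - ref_logp_perturbed)
    return -F.logsigmoid(beta * margin).mean()

# Example with per-sequence summed log-probabilities for a batch of 4 examples.
loss = molpo_loss(
    logp_correct=torch.tensor([-12.3, -8.1, -15.0, -9.7]),
    logp_perturbed=torch.tensor([-13.0, -8.4, -14.8, -11.2]),
    ref_logp_correct=torch.tensor([-12.5, -8.3, -15.1, -10.0]),
    ref_logp_perturbed=torch.tensor([-12.9, -8.5, -15.0, -10.9]),
)
```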

## Model summary

- Backbone: Mistral-7B-Instruct-v0.3.
- Modalities:
  - Text (natural language instructions).
  - 1D molecular sequences (SELFIES; SMILES supported via translation).
  - 2D molecular graphs encoded by a hybrid GNN (GINE + TokenGT).
- Architecture: a BLIP-2-style design in which a Q-Former (32 query tokens) projects graph embeddings into the LLM token space.
  - LLM:
    - Mistral-7B-Instruct-v0.3 as the text backbone.
    - Tokenizer extended with SELFIES and numeric tokens, plus task tags for heterogeneous outputs (discrete labels, floats, descriptions).
  - Hybrid graph encoder:
    - GINE for local structural patterns.
    - TokenGT (transformer-based) for global structural dependencies and large graphs.
    - Both encoders produce graph and node (and edge) embeddings; the concatenated embeddings are fed into the Q-Former.
  - Q-Former:
    - 5-layer SciBERT-style transformer with 32 learnable queries.
    - Cross-attends to the graph embeddings and outputs fixed-length graph tokens appended after the SELFIES tokens in the LLM input (see the sketch after this list).
    - Selected over an MLP projector due to better alignment and graph-token efficiency.
- Tokenizer extensions: 3K SELFIES tokens, numeric tokens, and task tags for [SELFIES], [BOOLEAN], [FLOAT], [DESCRIPTION], plus reaction-direction symbols.
- Training data: ~3.3M instruction-tuning examples over 27 tasks, with ~40K held-out test instances.
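
As a rough illustration of the graph-to-LLM projection described above, the sketch below has 32 learnable queries cross-attend to node embeddings from the hybrid graph encoder and projects the result to the Mistral embedding width (4096). The query count, layer count, and LLM width follow the description above; the class name, the use of `nn.TransformerDecoder` for cross-attention, and the remaining dimensions are assumptions, not the released implementation.

```python
# Minimal sketch of a Q-Former-style projector for graph tokens.
import torch
import torch.nn as nn

class GraphQFormerProjector(nn.Module):
    def __init__(self, graph_dim=768, hidden_dim=768, llm_dim=4096,
                 num_queries=32, num_layers=5, num_heads=12):
        super().__init__()
        # Learnable query tokens that become the fixed-length graph tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        # Project concatenated GINE/TokenGT embeddings to the Q-Former width.
        self.graph_proj = nn.Linear(graph_dim, hidden_dim)
        # Transformer layers whose cross-attention reads the graph embeddings.
        layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Map the resulting query states into the LLM token-embedding space.
        self.to_llm = nn.Linear(hidden_dim, llm_dim)

    def forward(self, graph_embeddings):
        # graph_embeddings: [batch, num_nodes(+edges), graph_dim] from the hybrid encoder.
        memory = self.graph_proj(graph_embeddings)
        queries = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        graph_tokens = self.qformer(tgt=queries, memory=memory)
        return self.to_llm(graph_tokens)  # [batch, 32, llm_dim]

# The 32 projected graph tokens would be appended after the embedded SELFIES
# tokens before the sequence is fed to the Mistral backbone.
projector = GraphQFormerProjector()
node_embeddings = torch.randn(2, 40, 768)  # 2 molecules, 40 node/edge embeddings each
print(projector(node_embeddings).shape)    # torch.Size([2, 32, 4096])
```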

Mol-LLM achieves state-of-the-art or comparable performance among generalist molecular LLMs on the most comprehensive benchmark suite evaluated to date, including out-of-distribution (OOD) settings.

## Intended use

Mol-LLM is intended to solve molecular tasks via a single multitask model.

Supported task families:

- Reaction prediction:
  - Forward synthesis (product prediction, FS)
  - Retrosynthesis (reactant prediction, RS)
  - Reagent prediction (RP)
- Property prediction:
  - Regression: LogS, LogD, HOMO, LUMO, HOMO–LUMO gap
  - Classification: BACE, BBBP, ClinTox, HIV, SIDER
- Text–molecule tasks:
  - Description-guided molecule generation
  - Molecule captioning
  - IUPAC/SELFIES/formula translation as auxiliary tasks
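
The exact instruction templates come from the KU-AGI/Mol-LLM instruction-tuning data; the snippet below only sketches how a task-tagged prompt might be composed using the tokenizer extensions listed under Model summary. The question wording and the `build_instruction` helper are illustrative assumptions, not the released prompt templates.

```python
# Illustrative only: task tags [SELFIES], [BOOLEAN], [FLOAT], [DESCRIPTION] are the
# tokenizer extensions described above; the template wording here is hypothetical.
def build_instruction(question: str, selfies: str, answer_tag: str) -> str:
    """Compose a task-tagged instruction around a SELFIES-encoded molecule."""
    return f"{question} {selfies} {answer_tag}"

# Property classification (e.g. BBBP) expecting a boolean answer.
print(build_instruction(
    "Does this molecule permeate the blood-brain barrier?",
    "[C][C][O]",   # toy SELFIES string (ethanol)
    "[BOOLEAN]",
))

# Property regression (e.g. LogD) expecting a float answer.
print(build_instruction(
    "Predict the LogD value of this molecule.",
    "[C][C][O]",
    "[FLOAT]",
))
```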