---
tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
license: cc-by-nc-3.0
datasets:
  - KU-AGI/Mol-LLM
language:
  - en
metrics:
  - bleu
  - meteor
  - rouge
  - roc_auc
  - mae
base_model:
  - mistralai/Mistral-7B-Instruct-v0.3
---

# Mol-LLM: Multimodal Generalist Molecular LLM

Mol-LLM is a multimodal generalist molecular large language model for chemistry that jointly uses molecular 1D sequences and 2D molecular graphs to solve a wide range of molecular tasks in a single unified framework. It introduces Molecular structure Preference Optimization (MolPO) to force the LLM to prefer correct molecular graphs over perturbed ones, resolving the “graph-bypass” issue common in prior multimodal molecular LLMs.
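
The exact MolPO objective is defined in the Mol-LLM paper; the sketch below only illustrates the general idea with a DPO-style preference loss, where the same target answer is scored once conditioned on the correct graph and once on a perturbed graph. The function name `molpo_loss`, the `beta` temperature, and the reference-model terms are illustrative assumptions, not the released training code.

```python
# Illustrative sketch of a DPO-style preference loss over molecular graph inputs.
import torch
import torch.nn.functional as F

def molpo_loss(
    logp_correct: torch.Tensor,      # log p_theta(answer | instruction, correct graph)
    logp_perturbed: torch.Tensor,    # log p_theta(answer | instruction, perturbed graph)
    ref_logp_correct: torch.Tensor,  # same quantities under a frozen reference model
    ref_logp_perturbed: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Encourage the LLM to assign higher likelihood to the target answer when it
    is conditioned on the correct 2D graph than when it sees a perturbed one."""
    margin = (logp_correct - ref_logp_correct) - (logp_perturbed - ref_logp_perturbed)
    return -F.logsigmoid(beta * margin).mean()

# Example with per-sequence summed log-probabilities for a batch of 4 examples.
loss = molpo_loss(
    logp_correct=torch.tensor([-12.3, -8.1, -15.0, -9.7]),
    logp_perturbed=torch.tensor([-13.0, -8.4, -14.8, -11.2]),
    ref_logp_correct=torch.tensor([-12.5, -8.3, -15.1, -10.0]),
    ref_logp_perturbed=torch.tensor([-12.9, -8.5, -15.0, -10.9]),
)
```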

## Model summary

- Backbone: Mistral-7B-Instruct-v0.3.
- Modalities:
  - Text (natural language instructions).
  - 1D molecular sequences (SELFIES; SMILES supported via translation).
  - 2D molecular graphs encoded by a hybrid GNN (GINE + TokenGT).
- Architecture: a BLIP-2-style design in which a Q-Former (32 query tokens) projects graph embeddings into the LLM token space.
  - LLM:
    - Mistral-7B-Instruct-v0.3 as the text backbone.
    - Tokenizer extended with SELFIES and numeric tokens, plus task tags for heterogeneous outputs (discrete labels, floats, descriptions).
  - Hybrid graph encoder:
    - GINE for local structural patterns.
    - TokenGT (transformer-based) for global structural dependencies and large graphs.
    - Both encoders produce graph and node (and edge) embeddings; the concatenated embeddings are fed into the Q-Former.
  - Q-Former:
    - 5-layer SciBERT-style transformer with 32 learnable queries.
    - Cross-attends to the graph embeddings and outputs fixed-length graph tokens appended after the SELFIES tokens in the LLM input (see the sketch after this list).
    - Selected over an MLP projector due to better alignment and graph-token efficiency.
- Tokenizer extensions: 3K SELFIES tokens, numeric tokens, and task tags for [SELFIES], [BOOLEAN], [FLOAT], [DESCRIPTION], plus reaction-direction symbols.
- Training data: ~3.3M instruction-tuning examples over 27 tasks, with ~40K held-out test instances.
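
As a rough illustration of the graph-to-LLM projection described above, the sketch below has 32 learnable queries cross-attend to node embeddings from the hybrid graph encoder and projects the result to the Mistral embedding width (4096). The query count, layer count, and LLM width follow the description above; the class name, the use of `nn.TransformerDecoder` for cross-attention, and the remaining dimensions are assumptions, not the released implementation.

```python
# Minimal sketch of a Q-Former-style projector for graph tokens.
import torch
import torch.nn as nn

class GraphQFormerProjector(nn.Module):
    def __init__(self, graph_dim=768, hidden_dim=768, llm_dim=4096,
                 num_queries=32, num_layers=5, num_heads=12):
        super().__init__()
        # Learnable query tokens that become the fixed-length graph tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        # Project concatenated GINE/TokenGT embeddings to the Q-Former width.
        self.graph_proj = nn.Linear(graph_dim, hidden_dim)
        # Transformer layers whose cross-attention reads the graph embeddings.
        layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Map the resulting query states into the LLM token-embedding space.
        self.to_llm = nn.Linear(hidden_dim, llm_dim)

    def forward(self, graph_embeddings):
        # graph_embeddings: [batch, num_nodes(+edges), graph_dim] from the hybrid encoder.
        memory = self.graph_proj(graph_embeddings)
        queries = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        graph_tokens = self.qformer(tgt=queries, memory=memory)
        return self.to_llm(graph_tokens)  # [batch, 32, llm_dim]

# The 32 projected graph tokens would be appended after the embedded SELFIES
# tokens before the sequence is fed to the Mistral backbone.
projector = GraphQFormerProjector()
node_embeddings = torch.randn(2, 40, 768)  # 2 molecules, 40 node/edge embeddings each
print(projector(node_embeddings).shape)    # torch.Size([2, 32, 4096])
```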

Mol-LLM achieves state-of-the-art or comparable performance among generalist molecular LLMs on the most comprehensive benchmark suite evaluated to date, including out-of-distribution (OOD) settings.

## Intended use

Mol-LLM is intended to solve molecular tasks via a single multitask model.

Supported task families:

- Reaction prediction:
  - Forward synthesis (product prediction, FS)
  - Retrosynthesis (reactant prediction, RS)
  - Reagent prediction (RP)
- Property prediction:
  - Regression: LogS, LogD, HOMO, LUMO, HOMO–LUMO gap
  - Classification: BACE, BBBP, ClinTox, HIV, SIDER
- Text–molecule tasks:
  - Description-guided molecule generation
  - Molecule captioning
  - IUPAC/SELFIES/formula translation as auxiliary tasks
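
The exact instruction templates come from the KU-AGI/Mol-LLM instruction-tuning data; the snippet below only sketches how a task-tagged prompt might be composed using the tokenizer extensions listed under Model summary. The question wording and the `build_instruction` helper are illustrative assumptions, not the released prompt templates.

```python
# Illustrative only: task tags [SELFIES], [BOOLEAN], [FLOAT], [DESCRIPTION] are the
# tokenizer extensions described above; the template wording here is hypothetical.
def build_instruction(question: str, selfies: str, answer_tag: str) -> str:
    """Compose a task-tagged instruction around a SELFIES-encoded molecule."""
    return f"{question} {selfies} {answer_tag}"

# Property classification (e.g. BBBP) expecting a boolean answer.
print(build_instruction(
    "Does this molecule permeate the blood-brain barrier?",
    "[C][C][O]",   # toy SELFIES string (ethanol)
    "[BOOLEAN]",
))

# Property regression (e.g. LogD) expecting a float answer.
print(build_instruction(
    "Predict the LogD value of this molecule.",
    "[C][C][O]",
    "[FLOAT]",
))
```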