---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
license: cc-by-nc-3.0
datasets:
- KU-AGI/Mol-LLM
language:
- en
metrics:
- bleu
- meteor
- rouge
- roc_auc
- mae
base_model:
- mistralai/Mistral-7B-Instruct-v0.3
---
|
|
|
|
|
|
|
|
# Mol-LLM: Multimodal Generalist Molecular LLM
|
|
|
|
|
Mol-LLM is a multimodal generalist molecular large language model for chemistry that jointly uses 1D molecular sequences and 2D molecular graphs to solve a wide range of molecular tasks in a single unified framework.
It introduces **Molecular structure Preference Optimization (MolPO)**, which trains the LLM to prefer correct molecular graphs over perturbed ones, resolving the **“graph-bypass”** issue common in prior multimodal molecular LLMs.
|
|
|
|
|
## Model summary
|
|
|
|
|
- **Backbone**: Mistral-7B-Instruct-v0.3.
- **Modalities**:
  - Text (natural language instructions).
  - 1D molecular sequences (SELFIES; SMILES supported via translation, as shown in the sketch after this list).
  - 2D molecular graphs encoded by a hybrid GNN (GINE + TokenGT).
- **Architecture**: Mol-LLM uses a BLIP-2–style architecture in which a Q-Former (32 query tokens) projects graph embeddings into the LLM token space.
- **LLM**:
  - Mistral-7B-Instruct-v0.3 as the text backbone.
  - Extended tokenizer with SELFIES and numeric tokens, plus task tags for heterogeneous outputs (discrete labels, floats, descriptions).
- **Hybrid graph encoder**:
  - GINE for local structural patterns.
  - TokenGT (transformer-based) for global structural dependencies and large graphs.
  - Both encoders produce graph-, node-, and edge-level embeddings; the concatenated embeddings are fed into the Q-Former.
- **Q-Former**:
  - 5-layer SciBERT-style transformer with 32 learnable queries.
  - Cross-attends to graph embeddings and outputs a fixed-length set of tokens appended after the SELFIES tokens in the LLM input.
  - Selected over an MLP projector due to better alignment and graph-token efficiency.
- **Tokenizer extensions**: 3K SELFIES tokens, numeric tokens, and task tags for `[SELFIES]`, `[BOOLEAN]`, `[FLOAT]`, `[DESCRIPTION]`, and reaction-direction symbols.
- **Training data**: ~3.3M instruction-tuning examples over 27 tasks, with ~40K held-out test instances.
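
Since the model consumes molecules as SELFIES strings (with SMILES handled via translation), the open-source `selfies` package can perform the conversion. A minimal sketch is shown below; the exact preprocessing pipeline (canonicalization, tokenization into the extended vocabulary) used by Mol-LLM may differ.

```python
# Minimal SMILES -> SELFIES conversion sketch using the open-source `selfies` package.
# This only illustrates the 1D sequence format; Mol-LLM's actual preprocessing may differ.
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"           # aspirin
selfies_str = sf.encoder(smiles)            # e.g. "[C][C][=Branch1]..."
recovered_smiles = sf.decoder(selfies_str)  # round-trip back to SMILES

print(selfies_str)
print(recovered_smiles)
```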
|
|
|
|
|
Mol-LLM achieves state-of-the-art or comparable performance among **generalist** molecular LLMs on the most comprehensive benchmark suite evaluated to date, including out-of-distribution (OOD) settings.
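
The `pytorch_model_hub_mixin` tag indicates the checkpoint was pushed with `PyTorchModelHubMixin`, so it can in principle be loaded through the mixin's `from_pretrained`. The sketch below is hypothetical: the model class name, its import path, and the repository id are assumptions, so consult the project code for the real entry point.

```python
# Hypothetical loading sketch. `MolLLM` stands in for the project's actual model class,
# which must subclass huggingface_hub.PyTorchModelHubMixin for `from_pretrained` to work;
# "KU-AGI/Mol-LLM" is an assumed repository id. Check the project code for the real names.
from mol_llm import MolLLM  # placeholder import path

model = MolLLM.from_pretrained("KU-AGI/Mol-LLM")  # downloads config and weights from the Hub
model.eval()
```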
|
|
|
|
|
## Intended use
|
|
|
|
|
Mol-LLM is intended to solve **molecular tasks** via a single multitask model.

Supported task families:
|
|
|
|
|
- **Reaction prediction**:
  - Forward synthesis (product prediction, FS)
  - Retrosynthesis (reactant prediction, RS)
  - Reagent prediction (RP)
- **Property prediction**:
  - Regression: LogS, LogD, HOMO, LUMO, HOMO–LUMO gap
  - Classification: BACE, BBBP, ClinTox, HIV, SIDER
- **Text–molecule tasks**:
  - Description-guided molecule generation
  - Molecule captioning
  - IUPAC/SELFIES/formula translation as auxiliary tasks
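
As a purely illustrative example, the sketch below combines an instruction, a SELFIES-encoded molecule, and one of the task tags listed above into a classification-style query. The actual instruction templates, tag placement, and the injection of graph tokens are defined by the Mol-LLM training data and code and are not shown here.

```python
# Hypothetical prompt sketch for a BBBP-style blood-brain-barrier classification query.
# The real Mol-LLM instruction templates and tag usage may differ; this only illustrates
# how a text instruction, a SELFIES input, and a task tag could be combined conceptually.
import selfies as sf

instruction = "Predict whether this molecule can penetrate the blood-brain barrier."
molecule = sf.encoder("CC(=O)Oc1ccccc1C(=O)O")  # aspirin as a stand-in input

prompt = f"{instruction}\nMolecule: {molecule}\nRespond with a [BOOLEAN] answer."
print(prompt)
```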