MarkushGrapher-2 is an end-to-end multimodal model for recognizing chemical structures from patent document images. It jointly encodes vision, text, and layout information to convert Markush structure images into machine-readable CXSMILES representations.
## Overview
MarkushGrapher-2 is a transformer-based model that integrates two complementary encoders:
- A Vision Encoder (Swin-B), pretrained for Optical Chemical Structure Recognition (OCSR), taken from MolScribe*
- A Vision-Text-Layout (VTL) Encoder (T5-base), trained for Markush feature extraction
The model also includes ChemicalOCR, a dedicated OCR module fine-tuned for chemical images, enabling fully end-to-end processing without external OCR dependencies.
*MolScribe: https://github.com/thomas0809/MolScribe
## Architecture
The input image is processed through two parallel pipelines:
- Vision pipeline: The image is encoded by the OCSR vision encoder and projected via an MLP.
- VTL pipeline: The image is passed through ChemicalOCR to extract text and bounding boxes, which are then fused with image patches in the VTL encoder.
The outputs of both pipelines are concatenated and fed to a text decoder that autoregressively generates a CXSMILES sequence describing the Markush backbone and a substituent table.
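The dual-pipeline flow described above can be sketched schematically. The following is a pure-Python stand-in with toy token lists; all function names, dimensions, and values are illustrative assumptions, not the model's actual implementation:

```python
# Illustrative sketch of the dual-pipeline fusion (not the real model code).
# Token embeddings are stand-ins: lists of floats of length D_MODEL.

D_MODEL = 8  # toy embedding size; the real model uses a much larger dimension

def ocsr_vision_encoder(image):
    # Stand-in for the Swin-B OCSR encoder: one embedding per image patch.
    return [[0.1] * 16 for _ in range(4)]  # 4 patch tokens, dim 16

def mlp_projector(tokens):
    # Project vision tokens into the decoder's embedding space (toy: truncate).
    return [t[:D_MODEL] for t in tokens]

def chemical_ocr(image):
    # Stand-in for ChemicalOCR: text cells with bounding boxes.
    return [("R1", (10, 20, 30, 40)), ("CH3", (50, 20, 70, 40))]

def vtl_encoder(image, ocr_cells):
    # Stand-in: one fused vision-text-layout token per OCR cell.
    return [[0.2] * D_MODEL for _ in ocr_cells]

def encode(image):
    vision_tokens = mlp_projector(ocsr_vision_encoder(image))
    vtl_tokens = vtl_encoder(image, chemical_ocr(image))
    # The two token sequences are concatenated before autoregressive decoding.
    return vision_tokens + vtl_tokens

decoder_input = encode(image=None)
print(len(decoder_input))  # 4 vision tokens + 2 VTL tokens = 6
```

The key point the sketch captures is that the decoder attends to a single concatenated sequence, so both pipelines contribute context for every generated CXSMILES token.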
## Two-Stage Training Strategy
- Phase 1 (Adaptation): The Vision Encoder is frozen while the projector and text decoder are trained on 243K real-world image-SMILES pairs for standard molecular structure recognition.
- Phase 2 (Fusion): The VTL encoder is introduced and trained jointly with the text decoder on 235K synthetic and 145K real-world Markush structure samples for CXSMILES and substituent table prediction.
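As a rough sketch, the two-phase schedule amounts to toggling which components receive gradient updates. The module names below are illustrative labels, and treating the projector as trainable in both phases is an assumption not stated explicitly in this card:

```python
# Toy bookkeeping of which modules are trainable in each phase
# (illustrative only; the real training code is not shown in this card).
MODULES = ["vision_encoder", "projector", "vtl_encoder", "text_decoder"]

def trainable_modules(phase):
    if phase == 1:
        # Adaptation: vision encoder frozen; projector + decoder trained
        # on real-world image-SMILES pairs.
        return {"projector", "text_decoder"}
    if phase == 2:
        # Fusion: the VTL encoder joins training; the vision encoder
        # stays frozen (assumed: the projector remains trainable).
        return {"projector", "vtl_encoder", "text_decoder"}
    raise ValueError(f"unknown phase: {phase}")

for phase in (1, 2):
    frozen = [m for m in MODULES if m not in trainable_modules(phase)]
    print(f"Phase {phase}: frozen={frozen}")
```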
## Performance

### Markush Structure Recognition (CXSMILES Accuracy)
| Benchmark | MarkushGrapher-2 | MolParser-Base | MolScribe | MarkushGrapher-1 | DeepSeek-OCR | GPT-5 |
|---|---|---|---|---|---|---|
| M2S (103) | 56 | 39 | 21 | 38 | 0 | 3 |
| USPTO-M (74) | 55 | 30 | 7 | 32 | 0 | - |
| WildMol-M (10K) | 48.0 | 38.1 | 28.1 | - | 1.9 | - |
| IP5-M (1K) | 53.7 | 47.7 | 22.3 | - | 0.0 | - |
### Molecular Structure Recognition (SMILES Accuracy)
| Benchmark | MarkushGrapher-2 | MolParser-Base | MolScribe | MolGrapher |
|---|---|---|---|---|
| WildMol (10K) | 68.4 | 76.9 | 66.4 | 45.5 |
| JPO (450) | 71.0 | 78.9 | 76.2 | 67.5 |
| UOB (5.7K) | 96.6 | 91.8 | 87.4 | 94.9 |
| USPTO (5.7K) | 89.8 | 93.0 | 93.1 | 91.5 |
## Model Details
- Parameters: 831M total (744M trainable)
- Vision Encoder: Swin-B ViT (from MolScribe, frozen during Phase 2)
- VTL Encoder/Decoder: T5-base architecture with UDOP fusion
- OCR Module: ChemicalOCR (256M parameters, based on SmolDocling)
- Input: Patent document image crop (1024x1024)
- Output: CXSMILES sequence + substituent table
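The output format can be illustrated with a minimal parser that splits a CXSMILES string into its plain-SMILES body and its `|...|` extension block (where, e.g., `$R1$`-style atom labels mark R-group attachment points). The helper and example string below are illustrative, not taken from the model's actual output:

```python
def split_cxsmiles(cxsmiles: str):
    """Split a CXSMILES string into (smiles, extensions).

    CXSMILES appends extension data between '|' characters after the
    plain SMILES body, e.g. atom labels like $R1$ for R-groups.
    """
    smiles, sep, rest = cxsmiles.partition("|")
    if not sep:
        # No extension block: this is a plain SMILES string.
        return cxsmiles.strip(), None
    extensions = rest.rsplit("|", 1)[0]
    return smiles.strip(), extensions

# Illustrative Markush-style CXSMILES: a benzene ring with an R1 attachment.
example = "*c1ccccc1 |$R1;;;;;;$|"
body, ext = split_cxsmiles(example)
print(body)  # *c1ccccc1
print(ext)   # $R1;;;;;;$
```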
## Usage

```python
# Load the model checkpoint
from transformers import AutoModel

model = AutoModel.from_pretrained("docling-project/MarkushGrapher-2")
```
For full inference pipeline usage, see the MarkushGrapher repository.
## Datasets
Training and evaluation datasets are available at docling-project/MarkushGrapher-2-Datasets.
## Citation

```bibtex
@inproceedings{strohmeyer2026markushgrapher2,
  title     = {MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures},
  author    = {Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Valery and Nassar, Ahmed and Staar, Peter W. J.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```
## License
This model is released under the Apache 2.0 License.