MarkushGrapher-2

MarkushGrapher-2 is an end-to-end multimodal model for recognizing chemical structures from patent document images. It jointly encodes vision, text, and layout information to convert Markush structure images into machine-readable CXSMILES representations.

Overview

MarkushGrapher-2 is a transformer-based model that integrates two complementary encoders:

  • A Vision Encoder (Swin-B ViT) pretrained for Optical Chemical Structure Recognition (OCSR), taken from MolScribe*
  • A Vision-Text-Layout (VTL) Encoder (T5-base), trained for Markush feature extraction

The model also includes ChemicalOCR, a dedicated OCR module fine-tuned for chemical images, enabling fully end-to-end processing without external OCR dependencies.

*MolScribe: https://github.com/thomas0809/MolScribe

Architecture

The input image is processed through two parallel pipelines:

  1. Vision pipeline: The image is encoded by the OCSR vision encoder and projected via an MLP.
  2. VTL pipeline: The image is passed through ChemicalOCR to extract text and bounding boxes, which are then fused with image patches in the VTL encoder.

The outputs of both pipelines are concatenated and fed to a text decoder, which autoregressively generates the CXSMILES sequence describing the Markush backbone together with a substituent table.
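The fusion step above can be sketched as follows. This is an illustrative toy, not the released code: the function names, token shapes, and feature dimension are assumptions, and plain list concatenation stands in for concatenating encoder outputs along the sequence dimension (e.g. `torch.cat(..., dim=1)`).

```python
# Illustrative sketch (not the actual implementation): fusing the two
# encoder outputs into a single memory the text decoder attends over.

def fuse_encoder_outputs(vision_tokens, vtl_tokens):
    """Concatenate projected OCSR vision tokens with VTL encoder tokens
    along the sequence axis; the decoder cross-attends over the result."""
    return vision_tokens + vtl_tokens  # stands in for torch.cat(dim=1)

# Toy stand-ins: 4 vision tokens (after the MLP projector) and
# 3 vision-text-layout tokens, each a small feature vector (dim=2 here).
vision_tokens = [[0.1, 0.2]] * 4   # vision pipeline output
vtl_tokens    = [[0.3, 0.4]] * 3   # OCR text + bounding boxes + patches
memory = fuse_encoder_outputs(vision_tokens, vtl_tokens)
print(len(memory))  # 7 tokens for the decoder to cross-attend over
```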

Two-Stage Training Strategy

  • Phase 1 (Adaptation): The Vision Encoder is frozen while the projector and text decoder are trained on 243K real-world image-SMILES pairs for standard molecular structure recognition.
  • Phase 2 (Fusion): The VTL encoder is introduced and trained jointly with the text decoder on 235K synthetic and 145K real-world Markush structure samples for CXSMILES and substituent table prediction.
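The Phase-1 setup can be sketched as a standard parameter-freezing pattern. The module names and parameter counts below are illustrative stand-ins, not the real model API; in a PyTorch implementation the same effect comes from setting `requires_grad = False` on the vision encoder's parameters.

```python
# Hedged sketch of Phase 1 (Adaptation): freeze the pretrained OCSR
# vision encoder so only the projector and text decoder receive updates.

class Module:
    """Minimal stand-in for a sub-module holding trainable parameters."""
    def __init__(self, n_params):
        self.params = [{"requires_grad": True} for _ in range(n_params)]

model = {
    "vision_encoder": Module(5),  # pretrained on OCSR, kept frozen
    "projector": Module(2),       # MLP bridging vision and decoder
    "decoder": Module(4),         # autoregressive text decoder
}

# Phase 1: freeze every vision-encoder parameter.
for p in model["vision_encoder"].params:
    p["requires_grad"] = False

trainable = sum(p["requires_grad"]
                for m in model.values() for p in m.params)
print(trainable)  # 6: only projector + decoder parameters update
```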

Performance

Markush Structure Recognition (CXSMILES Accuracy)

| Benchmark       | MarkushGrapher-2 | MolParser-Base | MolScribe | MarkushGrapher-1 | DeepSeek-OCR | GPT-5 |
|-----------------|------------------|----------------|-----------|------------------|--------------|-------|
| M2S (103)       | 56               | 39             | 21        | 38               | 0            | 3     |
| USPTO-M (74)    | 55               | 30             | 7         | 32               | 0            | -     |
| WildMol-M (10K) | 48.0             | 38.1           | 28.1      | -                | 1.9          | -     |
| IP5-M (1K)      | 53.7             | 47.7           | 22.3      | -                | 0.0          | -     |

Molecular Structure Recognition (SMILES Accuracy)

| Benchmark     | MarkushGrapher-2 | MolParser-Base | MolScribe | MolGrapher |
|---------------|------------------|----------------|-----------|------------|
| WildMol (10K) | 68.4             | 76.9           | 66.4      | 45.5       |
| JPO (450)     | 71.0             | 78.9           | 76.2      | 67.5       |
| UOB (5.7K)    | 96.6             | 91.8           | 87.4      | 94.9       |
| USPTO (5.7K)  | 89.8             | 93.0           | 93.1      | 91.5       |

Model Details

  • Parameters: 831M total (744M trainable)
  • Vision Encoder: Swin-B ViT (from MolScribe, frozen during Phase 2)
  • VTL Encoder/Decoder: T5-base architecture with UDOP fusion
  • OCR Module: ChemicalOCR (256M parameters, based on SmolDocling)
  • Input: Patent document image crop (1024×1024)
  • Output: CXSMILES sequence + substituent table
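To make the output format concrete: CXSMILES (ChemAxon extended SMILES) appends an annotation block `|...|` to a plain SMILES string, where a `$...$` section carries semicolon-separated per-atom labels such as R-group attachment points. The snippet below parses that label block; the example string is a hypothetical illustration, not actual model output.

```python
import re

def atom_labels(cxsmiles):
    """Return the per-atom label list from a CXSMILES |$...$| block,
    one entry per atom (empty string = unlabeled atom)."""
    m = re.search(r"\|\$(.*?)\$\|", cxsmiles)
    return m.group(1).split(";") if m else []

# Hypothetical example: a phenyl ring with one R1 attachment point.
cx = "*C1=CC=CC=C1 |$R1;;;;;;$|"
print(atom_labels(cx))  # ['R1', '', '', '', '', '', '']
```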

Usage

# Load the model checkpoint
from transformers import AutoModel

model = AutoModel.from_pretrained("docling-project/MarkushGrapher-2")

For full inference pipeline usage, see the MarkushGrapher repository.

Datasets

Training and evaluation datasets are available at docling-project/MarkushGrapher-2-Datasets.

Citation

@inproceedings{strohmeyer2026markushgrapher2,
  title     = {MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures},
  author    = {Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Valery and Nassar, Ahmed and Staar, Peter W. J.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

License

This model is released under the Apache 2.0 License.
