---
license: apache-2.0
language:
  - en
tags:
  - chemistry
  - markush
  - cxsmiles
  - molecular-structure
  - ocr
  - document-understanding
  - image-to-text
  - vision-language-model
  - patent-analysis
library_name: transformers
pipeline_tag: image-to-text
datasets:
  - docling-project/MarkushGrapher-2-Datasets
---

<p align="center">
  <img src="markushgrapher_2_repo_banner.png" alt="MarkushGrapher-2 Banner" width="100%">
</p>

**MarkushGrapher-2** is an end-to-end multimodal model for recognizing chemical structures from patent document images. It jointly encodes vision, text, and layout information to convert Markush structure images into machine-readable CXSMILES representations.

## Overview

MarkushGrapher-2 is a transformer-based model that integrates two complementary encoders:
- A **Vision Encoder** (Swin-B ViT), pretrained for Optical Chemical Structure Recognition (OCSR) and taken from MolScribe*
- A **Vision-Text-Layout (VTL) Encoder** (T5-base), trained for Markush feature extraction

The model also includes **ChemicalOCR**, a dedicated OCR module fine-tuned for chemical images, enabling fully end-to-end processing without external OCR dependencies.

*MolScribe: https://github.com/thomas0809/MolScribe

### Architecture

The input image is processed through two parallel pipelines:
1. **Vision pipeline**: The image is encoded by the OCSR vision encoder and projected via an MLP.
2. **VTL pipeline**: The image is passed through ChemicalOCR to extract text and bounding boxes, which are then fused with image patches in the VTL encoder.

The outputs of both pipelines are concatenated and fed to a text decoder that autoregressively generates a CXSMILES sequence describing the Markush backbone and a substituent table.
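The concatenation step above can be sketched with toy feature arrays. All dimensions and the projection here are illustrative assumptions, not the released configuration:

```python
import numpy as np

# Illustrative dimensions only (assumptions, not the released config).
d_model = 768          # shared hidden size expected by the text decoder
n_vision_tokens = 64   # patch tokens from the OCSR vision encoder
n_vtl_tokens = 96      # fused image/text/layout tokens from the VTL encoder

rng = np.random.default_rng(0)

# 1. Vision pipeline: OCSR encoder output projected to d_model by an MLP
#    (a single linear layer stands in for the projector here).
vision_feats = rng.standard_normal((n_vision_tokens, 512))
proj_w = rng.standard_normal((512, d_model)) * 0.02
vision_tokens = vision_feats @ proj_w                      # (64, 768)

# 2. VTL pipeline: tokens already in the decoder's hidden size.
vtl_tokens = rng.standard_normal((n_vtl_tokens, d_model))  # (96, 768)

# 3. Concatenate along the sequence axis; the decoder cross-attends to
#    this combined memory while generating the CXSMILES sequence.
decoder_memory = np.concatenate([vision_tokens, vtl_tokens], axis=0)
print(decoder_memory.shape)  # (160, 768)
```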

### Two-Stage Training Strategy

- **Phase 1 (Adaptation)**: The Vision Encoder is frozen while the projector and text decoder are trained on 243K real-world image-SMILES pairs for standard molecular structure recognition.
- **Phase 2 (Fusion)**: The VTL encoder is introduced and trained jointly with the text decoder on 235K synthetic and 145K real-world Markush structure samples for CXSMILES and substituent table prediction.

## Performance

### Markush Structure Recognition (CXSMILES Accuracy)

| Benchmark | MarkushGrapher-2 | MolParser-Base | MolScribe | MarkushGrapher-1 | DeepSeek-OCR | GPT-5 |
|---|---|---|---|---|---|---|
| **M2S** (103) | **56** | 39 | 21 | 38 | 0 | 3 |
| **USPTO-M** (74) | **55** | 30 | 7 | 32 | 0 | - |
| **WildMol-M** (10K) | **48.0** | 38.1 | 28.1 | - | 1.9 | - |
| **IP5-M** (1K) | **53.7** | 47.7 | 22.3 | - | 0.0 | - |

### Molecular Structure Recognition (SMILES Accuracy)

| Benchmark | MarkushGrapher-2 | MolParser-Base | MolScribe | MolGrapher |
|---|---|---|---|---|
| **WildMol** (10K) | 68.4 | **76.9** | 66.4 | 45.5 |
| **JPO** (450) | 71.0 | **78.9** | 76.2 | 67.5 |
| **UOB** (5.7K) | **96.6** | 91.8 | 87.4 | 94.9 |
| **USPTO** (5.7K) | 89.8 | 93.0 | **93.1** | 91.5 |

## Model Details

- **Parameters**: 831M total (744M trainable)
- **Vision Encoder**: Swin-B ViT (from MolScribe, frozen during Phase 2)
- **VTL Encoder/Decoder**: T5-base architecture with UDOP fusion
- **OCR Module**: ChemicalOCR (256M parameters, based on SmolDocling)
- **Input**: Patent document image crop (1024x1024)
- **Output**: CXSMILES sequence + substituent table
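As a rough illustration of the output format: a CXSMILES string carries the backbone SMILES followed by extensions between `|...|` delimiters (e.g. `$...$` atom labels marking R-groups). The example string and helper below are illustrative assumptions, not the model's exact serialization:

```python
def split_cxsmiles(cxsmiles: str) -> tuple[str, str]:
    """Split a CXSMILES string into (backbone SMILES, extension block)."""
    smiles, sep, ext = cxsmiles.partition("|")
    if not sep:
        return cxsmiles.strip(), ""  # plain SMILES, no extensions
    return smiles.strip(), ext.rstrip().rstrip("|")

# Toy Markush backbone: a ring with two attachment points labeled R1/R2
# in the extension block (one label field per atom, separated by ';').
backbone, ext = split_cxsmiles("*C1CCCCC1* |$R1;;;;;;;R2$|")
print(backbone)  # *C1CCCCC1*
print(ext)       # $R1;;;;;;;R2$
```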

## Usage

```python
# Load the model checkpoint
from transformers import AutoModel

model = AutoModel.from_pretrained("docling-project/MarkushGrapher-2")
```

For full inference pipeline usage, see the [MarkushGrapher repository](https://github.com/DS4SD/MarkushGrapher).
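Since the model expects a 1024x1024 image crop (see Model Details), a minimal preprocessing sketch is shown below. The white-padding strategy is an assumption for illustration; defer to the repository's processor for the exact transform:

```python
from PIL import Image

def prepare_crop(image: Image.Image, size: int = 1024) -> Image.Image:
    """Pad a document crop to a square white canvas, then resize.

    Padding with white before resizing avoids distorting the drawn
    structure; the exact preprocessing is an assumption -- see the
    MarkushGrapher repository for the pipeline actually used.
    """
    canvas = Image.new("RGB", (max(image.size),) * 2, "white")
    canvas.paste(image.convert("RGB"), (0, 0))
    return canvas.resize((size, size))

crop = prepare_crop(Image.new("RGB", (800, 600), "white"))
print(crop.size)  # (1024, 1024)
```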

## Datasets

Training and evaluation datasets are available at [docling-project/MarkushGrapher-2-Datasets](https://huggingface.co/datasets/docling-project/MarkushGrapher-2-Datasets).

## Citation

```bibtex
@inproceedings{strohmeyer2026markushgrapher2,
  title     = {MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures},
  author    = {Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Valery and Nassar, Ahmed and Staar, Peter W. J.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).