	---
	license: apache-2.0
	language:
	- en
	tags:
	- chemistry
	- markush
	- cxsmiles
	- molecular-structure
	- ocr
	- document-understanding
	- image-to-text
	- vision-language-model
	- patent-analysis
	library_name: transformers
	pipeline_tag: image-to-text
	datasets:
	- docling-project/MarkushGrapher-2-Datasets
	---

	<p align="center">
	<img src="markushgrapher_2_repo_banner.png" alt="MarkushGrapher-2 Banner" width="100%">
	</p>

	MarkushGrapher-2 is an end-to-end multimodal model for recognizing chemical structures from patent document images. It jointly encodes vision, text, and layout information to convert Markush structure images into machine-readable CXSMILES representations.

	## Overview

	MarkushGrapher-2 is a transformer-based model that integrates two complementary encoders:
	- A Vision Encoder (Swin-B ViT) pretrained for Optical Chemical Structure Recognition (OCSR), taken from MolScribe*
	- A Vision-Text-Layout (VTL) Encoder (T5-base), trained for Markush feature extraction

	The model also includes ChemicalOCR, a dedicated OCR module fine-tuned for chemical images, enabling fully end-to-end processing without external OCR dependencies.

	*MolScribe: https://github.com/thomas0809/MolScribe

	### Architecture

	The input image is processed through two parallel pipelines:
	1. Vision pipeline: The image is encoded by the OCSR vision encoder and projected via an MLP.
	2. VTL pipeline: The image is passed through ChemicalOCR to extract text and bounding boxes, which are then fused with image patches in the VTL encoder.

	The outputs of both pipelines are concatenated and fed to a text decoder that autoregressively generates a CXSMILES sequence describing the Markush backbone and a substituent table.
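
The data flow described above can be sketched at the shape level. This is a minimal illustration in plain Python, not the model's implementation: the token counts, the hidden widths, the order of concatenation, and the single linear layer standing in for the MLP projector are all assumptions chosen to keep the example small.

```python
import random


def linear(tokens, weight):
    """Apply a linear map to every token: (n, d_in) x (d_in, d_out) -> (n, d_out)."""
    return [
        [sum(x * w for x, w in zip(token, column)) for column in zip(*weight)]
        for token in tokens
    ]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of random floats (stand-in for real activations)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)

# Hypothetical sizes, not the real model dimensions.
N_VISION, D_VISION = 4, 6  # tokens and width produced by the OCSR vision encoder
N_VTL, D_MODEL = 5, 8      # tokens and width produced by the VTL encoder

vision_tokens = random_matrix(N_VISION, D_VISION, rng)  # stand-in for vision encoder output
vtl_tokens = random_matrix(N_VTL, D_MODEL, rng)         # stand-in for VTL encoder output

# Projector: maps vision features to the decoder width.
# A single linear layer is used here in place of the MLP.
projector = random_matrix(D_VISION, D_MODEL, rng)
projected = linear(vision_tokens, projector)

# Concatenate along the sequence axis. The text decoder would attend over
# this joint sequence while generating the CXSMILES tokens.
decoder_memory = projected + vtl_tokens

print(len(decoder_memory), len(decoder_memory[0]))
```

The point of the projection is that both token streams must share one width before they can be joined into a single sequence for the decoder.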

	### Two-Phase Training Strategy

	- Phase 1 (Adaptation): The Vision Encoder is frozen while the projector and text decoder are trained on 243K real-world image-SMILES pairs for standard molecular structure recognition.
	- Phase 2 (Fusion): The VTL encoder is introduced and trained jointly with the text decoder on 235K synthetic and 145K real-world Markush structure samples for CXSMILES and substituent table prediction.
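
The two phases can be summarized as a small configuration table. The sketch below is illustrative only: the module names (`vision_encoder`, `projector`, `vtl_encoder`, `text_decoder`) are hypothetical labels, not identifiers from the training code, and the sample counts are the rounded figures quoted above. The status of the projector in Phase 2 is not stated here, so it is deliberately left out of both sets for that phase.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingPhase:
    name: str
    trainable: set
    frozen: set
    samples: dict = field(default_factory=dict)  # data source -> approximate count

    def total_samples(self) -> int:
        """Sum the approximate sample counts over all data sources."""
        return sum(self.samples.values())


PHASES = [
    TrainingPhase(
        name="adaptation",
        trainable={"projector", "text_decoder"},
        frozen={"vision_encoder"},
        samples={"real image-SMILES pairs": 243_000},
    ),
    TrainingPhase(
        name="fusion",
        trainable={"vtl_encoder", "text_decoder"},
        # The vision encoder is listed as frozen in Phase 2 under Model Details.
        frozen={"vision_encoder"},
        samples={"synthetic Markush": 235_000, "real Markush": 145_000},
    ),
]

for phase in PHASES:
    print(phase.name, sorted(phase.trainable), phase.total_samples())
```

Under these figures, Phase 2 sees roughly 380K Markush samples in total (235K synthetic plus 145K real).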

	## Performance

	### Markush Structure Recognition (CXSMILES Accuracy)

	| Benchmark | MarkushGrapher-2 | MolParser-Base | MolScribe | MarkushGrapher-1 | DeepSeek-OCR | GPT-5 |
	|---|---|---|---|---|---|---|
	| M2S (103) | 56 | 39 | 21 | 38 | 0 | 3 |
	| USPTO-M (74) | 55 | 30 | 7 | 32 | 0 | - |
	| WildMol-M (10K) | 48.0 | 38.1 | 28.1 | - | 1.9 | - |
	| IP5-M (1K) | 53.7 | 47.7 | 22.3 | - | 0.0 | - |

	### Molecular Structure Recognition (SMILES Accuracy)

	| Benchmark | MarkushGrapher-2 | MolParser-Base | MolScribe | MolGrapher |
	|---|---|---|---|---|
	| WildMol (10K) | 68.4 | 76.9 | 66.4 | 45.5 |
	| JPO (450) | 71.0 | 78.9 | 76.2 | 67.5 |
	| UOB (5.7K) | 96.6 | 91.8 | 87.4 | 94.9 |
	| USPTO (5.7K) | 89.8 | 93.0 | 93.1 | 91.5 |

	## Model Details

	- Parameters: 831M total (744M trainable)
	- Vision Encoder: Swin-B ViT (from MolScribe, frozen during Phase 2)
	- VTL Encoder/Decoder: T5-base architecture with UDOP fusion
	- OCR Module: ChemicalOCR (256M parameters, based on SmolDocling)
	- Input: Patent document image crop (1024×1024 pixels)
	- Output: CXSMILES sequence + substituent table
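
To make the output format more concrete, here is a minimal sketch of how a CXSMILES string could be split into its SMILES part and its extension block. It relies only on the general CXSMILES layout (a SMILES string, a space, then extensions enclosed in `|...|`, with per-atom labels in a `$...$` field separated by semicolons). The example string is hypothetical and is not actual model output; how the model serializes the substituent table is not covered, and the helper names are made up for illustration.

```python
def split_cxsmiles(cxsmiles: str) -> tuple[str, str]:
    """Split 'SMILES |extensions|' into (smiles, extensions without the bars)."""
    smiles, _, tail = cxsmiles.strip().partition(" ")
    tail = tail.strip()
    if len(tail) >= 2 and tail.startswith("|") and tail.endswith("|"):
        return smiles, tail[1:-1]
    return smiles, ""


def atom_labels(extensions: str) -> list[str]:
    """Return the per-atom labels from the first '$...$' field, or [] if absent.

    Simplification: other '$'-delimited fields (such as atom values) are not
    distinguished from atom labels here.
    """
    start = extensions.find("$")
    end = extensions.find("$", start + 1) if start != -1 else -1
    if end == -1:
        return []
    return extensions[start + 1:end].split(";")


# Hypothetical example: a benzene ring with one R-group placeholder atom.
example = "*c1ccccc1 |$_R1;;;;;;$|"
smiles, extensions = split_cxsmiles(example)
labels = atom_labels(extensions)

print(smiles)
print(labels)
```

Each entry in `labels` corresponds to one atom in the SMILES string, in order, so an empty entry means the atom carries no extra label.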

	## Usage

	```python
	# Load the model checkpoint
	from transformers import AutoModel

	model = AutoModel.from_pretrained("docling-project/MarkushGrapher-2")
	```

	For the full inference pipeline, see the [MarkushGrapher repository](https://github.com/DS4SD/MarkushGrapher).

	## Datasets

	Training and evaluation datasets are available at [docling-project/MarkushGrapher-2-Datasets](https://huggingface.co/datasets/docling-project/MarkushGrapher-2-Datasets).

	## Citation

	```bibtex
	@inproceedings{strohmeyer2026markushgrapher2,
	title = {MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures},
	author = {Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Valery and Nassar, Ahmed and Staar, Peter W. J.},
	booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
	year = {2026}
	}
	```

	## License

	This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).