---
license: apache-2.0
language:
- en
tags:
- chemistry
- markush
- cxsmiles
- molecular-structure
- ocr
- document-understanding
- image-to-text
- vision-language-model
- patent-analysis
library_name: transformers
pipeline_tag: image-to-text
datasets:
- docling-project/MarkushGrapher-2-Datasets
---

<p align="center">
<img src="markushgrapher_2_repo_banner.png" alt="MarkushGrapher-2 Banner" width="100%">
</p>
|
|
**MarkushGrapher-2** is an end-to-end multimodal model for recognizing chemical structures from patent document images. It jointly encodes vision, text, and layout information to convert Markush structure images into machine-readable CXSMILES representations.
|
|
## Overview

MarkushGrapher-2 is a transformer-based model that integrates two complementary encoders:
- A **Vision Encoder** (Swin-B ViT) pretrained for Optical Chemical Structure Recognition (OCSR), taken from [MolScribe](https://github.com/thomas0809/MolScribe)
- A **Vision-Text-Layout (VTL) Encoder** (T5-base) trained for Markush feature extraction

The model also includes **ChemicalOCR**, a dedicated OCR module fine-tuned for chemical images, enabling fully end-to-end processing without external OCR dependencies.
|
|
### Architecture

The input image is processed through two parallel pipelines:
1. **Vision pipeline**: The image is encoded by the OCSR vision encoder and projected via an MLP.
2. **VTL pipeline**: The image is passed through ChemicalOCR to extract text and bounding boxes, which are then fused with image patches in the VTL encoder.

The outputs of both pipelines are concatenated and fed to a text decoder that autoregressively generates a CXSMILES sequence describing the Markush backbone and a substituent table.
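The fusion step can be pictured at the level of tensor shapes. The following is a minimal, dependency-free sketch, assuming the vision features are projected to the hidden size of the VTL encoder and the two token sequences are then concatenated along the sequence axis. The toy dimensions, the single linear layer standing in for the MLP, and all function names are hypothetical and are not taken from the MarkushGrapher code.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (plain Python lists)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def project(tokens, weight, bias):
    """Project every token embedding in a sequence to a new hidden size."""
    return [linear(tok, weight, bias) for tok in tokens]

def fuse(vision_tokens, vtl_tokens):
    """Concatenate two token sequences along the sequence axis."""
    if vision_tokens and vtl_tokens and len(vision_tokens[0]) != len(vtl_tokens[0]):
        raise ValueError("hidden sizes must match before concatenation")
    return vision_tokens + vtl_tokens

# Toy sizes (assumptions for illustration only, not the real model sizes).
VISION_DIM, HIDDEN_DIM = 6, 4
N_VISION_TOKENS, N_VTL_TOKENS = 5, 3

rng = random.Random(0)

def rand_vec(n):
    """Random vector used as a stand-in for learned features or weights."""
    return [rng.uniform(-1, 1) for _ in range(n)]

vision_out = [rand_vec(VISION_DIM) for _ in range(N_VISION_TOKENS)]  # vision encoder output
vtl_out = [rand_vec(HIDDEN_DIM) for _ in range(N_VTL_TOKENS)]        # VTL encoder output

# Projection weights: HIDDEN_DIM rows, each of length VISION_DIM.
W = [rand_vec(VISION_DIM) for _ in range(HIDDEN_DIM)]
b = [0.0] * HIDDEN_DIM

decoder_input = fuse(project(vision_out, W, b), vtl_out)
print(len(decoder_input), len(decoder_input[0]))  # 8 4
```

The decoder then sees one sequence of 5 + 3 = 8 tokens that share a hidden size, which is the only property this sketch is meant to show.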
|
|
### Two-Stage Training Strategy

- **Phase 1 (Adaptation)**: The Vision Encoder is frozen while the projector and text decoder are trained on 243K real-world image-SMILES pairs for standard molecular structure recognition.
- **Phase 2 (Fusion)**: The VTL encoder is introduced and trained jointly with the text decoder on 235K synthetic and 145K real-world Markush structure samples for CXSMILES and substituent table prediction.
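The two phases can be summarized as a small configuration sketch. The module keys (`vision_encoder`, `projector`, `vtl_encoder`, `text_decoder`) and the dictionary layout are hypothetical labels chosen for illustration; sample counts are the rounded figures quoted above, and any state this card does not specify is marked `None` rather than guessed.

```python
# Training state per module: True = trained, False = frozen,
# "absent" = not yet part of the model, None = not stated in this card.
PHASES = {
    "phase_1_adaptation": {
        "modules": {
            "vision_encoder": False,
            "projector": True,
            "vtl_encoder": "absent",
            "text_decoder": True,
        },
        # Approximate counts (K = thousand).
        "samples": {"real_image_smiles": 243_000},
    },
    "phase_2_fusion": {
        "modules": {
            "vision_encoder": False,  # listed as frozen under Model Details
            "projector": None,        # not specified in this card
            "vtl_encoder": True,
            "text_decoder": True,
        },
        "samples": {"synthetic_markush": 235_000, "real_markush": 145_000},
    },
}

def trained_modules(phase):
    """Names of the modules explicitly described as trained in a phase."""
    modules = PHASES[phase]["modules"]
    return sorted(name for name, state in modules.items() if state is True)

def total_samples(phase):
    """Total number of training samples used in a phase."""
    return sum(PHASES[phase]["samples"].values())

for phase in PHASES:
    print(phase, trained_modules(phase), total_samples(phase))
```

Phase 2 therefore uses roughly 380K Markush samples in total (235K synthetic plus 145K real-world).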
|
|
## Performance

### Markush Structure Recognition (CXSMILES Accuracy)
|
|
| | Benchmark | MarkushGrapher-2 | MolParser-Base | MolScribe | MarkushGrapher-1 | DeepSeek-OCR | GPT-5 | |
| |---|---|---|---|---|---|---| |
| | **M2S** (103) | **56** | 39 | 21 | 38 | 0 | 3 | |
| | **USPTO-M** (74) | **55** | 30 | 7 | 32 | 0 | - | |
| | **WildMol-M** (10K) | **48.0** | 38.1 | 28.1 | - | 1.9 | - | |
| | **IP5-M** (1K) | **53.7** | 47.7 | 22.3 | - | 0.0 | - | |
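
To make the margins in this table explicit, the short script below recomputes, for each benchmark, the gap between MarkushGrapher-2 and the best competing score. All numbers are copied from the table; entries reported as `-` are left out. The helper name is hypothetical.

```python
# Scores copied from the table above; "-" entries are omitted.
SCORES = {
    "M2S": {"MarkushGrapher-2": 56, "MolParser-Base": 39, "MolScribe": 21,
            "MarkushGrapher-1": 38, "DeepSeek-OCR": 0, "GPT-5": 3},
    "USPTO-M": {"MarkushGrapher-2": 55, "MolParser-Base": 30, "MolScribe": 7,
                "MarkushGrapher-1": 32, "DeepSeek-OCR": 0},
    "WildMol-M": {"MarkushGrapher-2": 48.0, "MolParser-Base": 38.1,
                  "MolScribe": 28.1, "DeepSeek-OCR": 1.9},
    "IP5-M": {"MarkushGrapher-2": 53.7, "MolParser-Base": 47.7,
              "MolScribe": 22.3, "DeepSeek-OCR": 0.0},
}

def margin_over_best_baseline(row, model="MarkushGrapher-2"):
    """Return (best competing model, score gap) for one benchmark row."""
    baselines = {name: score for name, score in row.items() if name != model}
    best_name = max(baselines, key=baselines.get)
    return best_name, round(row[model] - baselines[best_name], 1)

for bench, row in SCORES.items():
    name, gap = margin_over_best_baseline(row)
    print(f"{bench}: +{gap} over {name}")
```

On M2S, for example, this gives a gap of 17 over MolParser-Base.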
|
|
### Molecular Structure Recognition (SMILES Accuracy)
|
|
| | Benchmark | MarkushGrapher-2 | MolParser-Base | MolScribe | MolGrapher | |
| |---|---|---|---|---| |
| | **WildMol** (10K) | 68.4 | **76.9** | 66.4 | 45.5 | |
| | **JPO** (450) | 71.0 | **78.9** | 76.2 | 67.5 | |
| | **UOB** (5.7K) | **96.6** | 91.8 | 87.4 | 94.9 | |
| | **USPTO** (5.7K) | 89.8 | 93.0 | **93.1** | 91.5 | |
|
|
## Model Details

- **Parameters**: 831M total (744M trainable)
- **Vision Encoder**: Swin-B ViT (from MolScribe, frozen during Phase 2)
- **VTL Encoder/Decoder**: T5-base architecture with UDOP fusion
- **OCR Module**: ChemicalOCR (256M parameters, based on SmolDocling)
- **Input**: Patent document image crop (1024x1024)
- **Output**: CXSMILES sequence + substituent table
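
Because the model expects a 1024x1024 crop, an arbitrary page region has to be brought to that size first. The sketch below computes an aspect-preserving resize with symmetric padding (letterboxing). This is an assumption for illustration only: this card does not say how crops are resized, and the preprocessing in the MarkushGrapher repository may differ. The function name is hypothetical.

```python
TARGET = 1024  # input side length stated in the model details

def letterbox_geometry(width, height, target=TARGET):
    """Return (new_w, new_h, pad_left, pad_top) for an aspect-preserving fit."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Scale so that the longer side exactly fills the target square.
    scale = min(target / width, target / height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Center the resized image; any odd leftover pixel goes to the right/bottom.
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return new_w, new_h, pad_left, pad_top

print(letterbox_geometry(2048, 1024))  # (1024, 512, 0, 256)
print(letterbox_geometry(600, 800))    # (768, 1024, 128, 0)
```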
|
|
## Usage

```python
from transformers import AutoModel

# Load the model checkpoint from the Hugging Face Hub
model = AutoModel.from_pretrained("docling-project/MarkushGrapher-2")
```

For the full inference pipeline, see the [MarkushGrapher repository](https://github.com/DS4SD/MarkushGrapher).
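
Downstream code often needs to separate the SMILES core of a CXSMILES string from its extension block. The sketch below assumes the common CXSMILES layout: a SMILES string, a space, then extensions enclosed in `|...|`, with per-atom labels in a `$...$` field separated by semicolons. The example string is hand-written for illustration and is not actual model output; the exact output format of MarkushGrapher-2, including the substituent table, is defined by the repository. Both helper names are hypothetical.

```python
import re

def split_cxsmiles(cxsmiles):
    """Split a CXSMILES string into (smiles, extension block without pipes)."""
    smiles, _, rest = cxsmiles.strip().partition(" ")
    match = re.fullmatch(r"\|(.*)\|", rest.strip())
    return smiles, (match.group(1) if match else "")

def atom_labels(extension):
    """Return per-atom labels from the $...$ field, or [] if there is none."""
    match = re.search(r"\$(.*?)\$", extension)
    return match.group(1).split(";") if match else []

# Hypothetical example: a phenyl ring with one R-group attachment point.
example = "*c1ccccc1 |$_R1;;;;;;$|"
smiles, ext = split_cxsmiles(example)
labels = atom_labels(ext)
print(smiles)  # *c1ccccc1
print(labels)  # ['_R1', '', '', '', '', '', '']
```

The label list has one entry per atom in the SMILES string, so the first entry (`_R1`) belongs to the `*` atom.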
|
|
## Datasets

Training and evaluation datasets are available at [docling-project/MarkushGrapher-2-Datasets](https://huggingface.co/datasets/docling-project/MarkushGrapher-2-Datasets).
|
|
## Citation

```bibtex
@inproceedings{strohmeyer2026markushgrapher2,
  title     = {MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures},
  author    = {Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Valery and Nassar, Ahmed and Staar, Peter W. J.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```
|
|
## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
|
|