---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- chemistry
- cheminformatics
- optical-chemical-structure-recognition
- ocsr
- molecule-recognition
- smiles
- transformer
- swin-transformer
- minimum-risk-training
- molecular-graph
datasets:
- Keylab/COMO
metrics:
- exact_match
- tanimoto_similarity
---
# COMO: Closed-Loop Optical Molecule Recognition
COMO (Closed-loop Optical Molecule recOgnition) is a deep learning framework for
Optical Chemical Structure Recognition (OCSR). It recognizes chemical structure
diagrams from images and predicts SMILES strings with atom-level 2D coordinates
and bond matrices. COMO uses Minimum Risk Training (MRT) to directly optimize
molecular-level, non-differentiable objectives, closing the gap between
token-level training and molecular-level evaluation.
## Model Summary
- **Architecture:** Swin-B encoder → 6-layer Transformer decoder → bond MLP
- **Input:** 384×384 RGB image of a chemical structure diagram
- **Output:** SMILES string + atom coordinates + bond matrix
- **Vocabulary:** chartok_coords format (200 tokens: SMILES chars + 64 X/Y bins)
- **Parameters:** ~94M
- **Training data:** 1M PubChem + 652K USPTO (MLE) + 83K MolParser-SFT (MRT)
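The 64 X/Y bins in the vocabulary imply that atom coordinates are discretized before being emitted as tokens. A minimal sketch of such a scheme follows, assuming uniform bins over the 384-pixel input and bin-center decoding. The function names and the exact binning rule are illustrative assumptions, not taken from the COMO code.

```python
N_BINS = 64      # number of X/Y bins, from the model summary
IMAGE_SIZE = 384  # input resolution, from the model summary


def quantize(coord: float, size: int = IMAGE_SIZE, n_bins: int = N_BINS) -> int:
    """Map a pixel coordinate to a bin index in [0, n_bins - 1] (assumed uniform bins)."""
    frac = min(max(coord / size, 0.0), 1.0)  # clamp to the image extent
    return min(int(frac * n_bins), n_bins - 1)


def dequantize(bin_idx: int, size: int = IMAGE_SIZE, n_bins: int = N_BINS) -> float:
    """Return the pixel coordinate at the center of a bin."""
    return (bin_idx + 0.5) * size / n_bins
```

Under these assumptions each bin spans 6 pixels, so a decoded coordinate is at most 3 pixels away from the original value.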
## Available Checkpoints
All checkpoints are from the **joint MLE+MRT** training pipeline (30 epochs,
interleaved MLE/MRT from scratch). Three reward variants are provided:
| Checkpoint | Reward Mode | Description |
|-----------|-------------|-------------|
| `models/tanimoto/final.pth` | Tanimoto | Morgan fingerprint Tanimoto similarity reward |
| `models/tanimoto/best.pth` | Tanimoto | Best validation epoch |
| `models/edit_distance/final.pth` | Edit Distance | Levenshtein string-similarity reward |
| `models/edit_distance/best.pth` | Edit Distance | Best validation epoch |
| `models/visual/final.pth` | Visual | Siamese visual-encoder cosine-similarity reward |
| `models/visual/best.pth` | Visual | Best validation epoch |
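For intuition about the Tanimoto and edit-distance rewards, the two similarity measures can be sketched in plain Python. This is a dependency-free illustration: the fingerprint is represented as a set of on-bit indices (in practice these would come from a cheminformatics toolkit), and the length normalization of the edit distance and the handling of empty fingerprints are assumptions, since the model card does not specify them.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; the normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    union = len(bits_a | bits_b)
    # Returning 1.0 for two empty fingerprints is a convention chosen here.
    return 1.0 if union == 0 else len(bits_a & bits_b) / union
```

For example, `tanimoto({1, 2, 3}, {2, 3, 4})` shares two of four distinct bits and evaluates to 0.5.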
## Architecture
```
Image (384×384)
→ Swin-B backbone (ImageNet pretrained)
→ 2D sinusoidal positional encoding
→ 6-layer Transformer decoder (d=256, 8 heads)
→ chartok_coords tokens → SMILES + coordinates
→ Bond MLP (2-layer, GELU) → 7-class bond matrix
→ Graph reconstruction → canonical SMILES
```
The model outputs a molecular graph $G = (A, B)$ where:
- $A = \{(l_i, x_i, y_i)\}$: atom SMILES labels with 2D image coordinates
- $B$: pairwise bond types (none, single, double, triple, aromatic, wedge, dash)
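One way to picture the graph output is as a small data structure. The sketch below is illustrative only: the class names, the integer values assigned to the seven bond classes, and the `edges` helper are assumptions, not the types used inside COMO.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Tuple


class BondType(IntEnum):
    # Order follows the list above; the real class indices are an assumption.
    NONE = 0
    SINGLE = 1
    DOUBLE = 2
    TRIPLE = 3
    AROMATIC = 4
    WEDGE = 5
    DASH = 6


@dataclass
class Atom:
    label: str  # SMILES label l_i
    x: float    # image coordinate x_i
    y: float    # image coordinate y_i


@dataclass
class MolGraph:
    atoms: List[Atom]             # the set A
    bonds: List[List[BondType]]   # the matrix B, bonds[i][j] for atoms i and j

    def edges(self) -> List[Tuple[int, int, BondType]]:
        """List bonded atom pairs (i < j), skipping the NONE class."""
        n = len(self.atoms)
        return [(i, j, self.bonds[i][j])
                for i in range(n) for j in range(i + 1, n)
                if self.bonds[i][j] != BondType.NONE]


# Ethanol (CCO) laid out on a horizontal line, with made-up coordinates.
N, S = BondType.NONE, BondType.SINGLE
ethanol = MolGraph(
    atoms=[Atom("C", 100, 192), Atom("C", 192, 192), Atom("O", 284, 192)],
    bonds=[[N, S, N], [S, N, S], [N, S, N]],
)
```

Here `ethanol.edges()` lists the two single bonds, C–C and C–O.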
## Training
### MLE Phase
- **Data:** 1M PubChem SMILES (synthetic) + 652K USPTO patent molecules
- **Augmentation:** Indigo-rendered images with random styles, functional group
substitution, R-group insertion, wavy bonds, scan shadows, multilingual comments
- **Optimizer:** AdamW, lr=4×10⁻⁴ (encoder & decoder), weight decay=10⁻⁶
- **Schedule:** 2% linear warmup → cosine decay, batch size 64/GPU
- **Loss:** Label-smoothed cross-entropy (ε=0.1) + bond classification CE
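The warmup-then-cosine schedule can be written as a function of the training step. The sketch below assumes per-step updates, a peak learning rate of 4×10⁻⁴, and decay to zero at the end of training. The final learning rate and the function name are assumptions, since only the warmup fraction and the schedule shape are stated above.

```python
import math


def lr_at(step: int, total_steps: int,
          peak_lr: float = 4e-4, warmup_frac: float = 0.02) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

With `total_steps=1000`, the first 20 steps ramp up linearly and the remaining 980 follow the cosine curve.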
### MRT Phase
- **Data:** 83K real-world molecular images (MolParser-SFT)
- **Candidates:** N=32 per image, multinomial sampling at τ=0.5
- **Reward weights:** validity=0.1, similarity=0.5, exact match=0.4
- **Sharpening:** α=1.0, loss weight λ=0.1
- **Schedule:** First 5 epochs MLE-only warmup, then interleaved MLE+MRT
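To make the hyperparameters above concrete, here is a sketch of how a composite reward and an MRT-style expected risk could be computed over the sampled candidates. It follows the standard minimum risk training formulation (candidate probabilities sharpened by α and renormalized over the sample set, with cost taken as 1 minus reward). The card does not spell out COMO's exact loss, so the formula, the cost definition, and the function names are assumptions.

```python
import math
from typing import Sequence


def reward(valid: bool, similarity: float, exact: bool,
           weights: Sequence[float] = (0.1, 0.5, 0.4)) -> float:
    """Weighted sum of validity, similarity, and exact-match terms."""
    w_valid, w_sim, w_exact = weights
    return w_valid * float(valid) + w_sim * similarity + w_exact * float(exact)


def expected_risk(log_probs: Sequence[float], rewards: Sequence[float],
                  alpha: float = 1.0) -> float:
    """Expected cost under the sharpened, renormalized candidate distribution."""
    scaled = [alpha * lp for lp in log_probs]
    peak = max(scaled)                               # subtract max for stability
    unnorm = [math.exp(s - peak) for s in scaled]
    total = sum(unnorm)
    # Cost of a candidate is assumed to be 1 - reward.
    return sum(u / total * (1.0 - r) for u, r in zip(unnorm, rewards))
```

Under this reading, the training objective during interleaved epochs would be the MLE loss plus λ times the expected risk, with λ=0.1.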
## Evaluation Results
Exact match accuracy (%) on 10 benchmarks (COMO-Tanimoto variant):
| Benchmark | Images | Synthetic/Real | COMO-Tanimoto |
|-----------|--------|----------------|---------------|
| Indigo | 5,719 | Synthetic | 98.6 |
| ChemDraw | 5,719 | Synthetic | 96.5 |
| CLEF | 992 | Real (patents) | 94.8 |
| JPO | 450 | Real (patents) | 88.4 |
| UOB | 5,740 | Real (academic) | 98.0* |
| USPTO | 5,719 | Real (patents) | 93.4 |
| USPTO-10K | 10,000 | Real (patents) | 96.1 |
| Staker | 50,000 | Real | 87.4 |
| ACS | 331 | Real (publications) | 84.6 |
| WildMol-10K | 10,000 | Real (wild) | 77.1 |
*\*UOB results after tautomer standardization.*
See the [paper](#citation) for full comparison with MolScribe, MolParser,
SwinOCSR, and other baselines.
## Installation
```bash
pip install como-ocsr
```
## Usage
```python
import como
# Download checkpoint from HuggingFace:
# huggingface-cli download Keylab/COMO models/tanimoto/final.pth
model = como.load_model("models/tanimoto/final.pth", device="cuda")
# Single image prediction
smiles = como.predict(model, "molecule.png")
print(smiles) # "CC(=O)O"
# Batch prediction
smiles_list = como.predict_batch(model, ["mol1.png", "mol2.png"])
# Benchmark evaluation
metrics = como.evaluate(model, "benchmark/USPTO/", "benchmark/USPTO.csv")
print(f"Exact Match: {metrics['postprocess/exact_match_acc']:.2%}")
```
Full documentation: [GitHub](https://github.com/netknowledge/COMO) and [PyPI](https://pypi.org/project/como-ocsr/)
## Benchmarks
Benchmark datasets are available in the `benchmarks/` directory of this
repository. Each dataset contains `.png` images and a CSV file with columns
`image_id` and `SMILES`.
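Given that layout, the label file can be read with the standard library and compared against predictions. The sketch below relies only on the `image_id` and `SMILES` column names stated above. The helper names are hypothetical, and plain string equality is used to keep the example dependency-free, whereas a real evaluation would normally canonicalize both SMILES strings first.

```python
import csv
from typing import Dict


def load_labels(csv_path: str) -> Dict[str, str]:
    """Read a benchmark CSV into a mapping from image_id to SMILES."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row["image_id"]: row["SMILES"] for row in csv.DictReader(f)}


def exact_match(predictions: Dict[str, str], labels: Dict[str, str]) -> float:
    """Fraction of labeled images whose predicted string equals the label."""
    if not labels:
        return 0.0
    hits = sum(predictions.get(image_id) == smiles
               for image_id, smiles in labels.items())
    return hits / len(labels)
```

Images with no prediction count as misses, so the score is always relative to the full label set.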
**Note:** These benchmarks are collected from existing public OCSR datasets.
Please refer to the original sources for attribution:
| Dataset | Source |
|---------|--------|
| USPTO, CLEF, JPO, UOB, Staker | [Rajan et al., 2020](https://github.com/Kohulan/OCSR_Review), [Xiong et al., 2023](https://github.com/jiachengxiong/alpha-Extractor) |
| Indigo, ChemDraw, ACS, Staker | [Qian et al., 2023](https://github.com/thomas0809/MolScribe) |
| USPTO-10K | [Morin et al., 2023](https://huggingface.co/datasets/docling-project/USPTO-30K) |
| WildMol-10K | [Fang et al., 2025](https://github.com/orgs/Chem-Struct-ML/repositories) |
## License
- **Model Weights:** CC BY-NC 4.0 (non-commercial use only)
- **Code:** MIT License
- **Benchmarks:** See original sources for applicable terms
## Citation
```bibtex
@article{lyu2026closed,
title={COMO: Closed-Loop Optical Molecule Recognition with Minimum Risk Training},
author={Lyu, Zhuoqi and Ke, Qing},
journal={arXiv preprint arXiv:2604.23546},
year={2026}
}
```