---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- chemistry
- cheminformatics
- optical-chemical-structure-recognition
- ocsr
- molecule-recognition
- smiles
- transformer
- swin-transformer
- minimum-risk-training
- molecular-graph
datasets:
- Keylab/COMO
metrics:
- exact_match
- tanimoto_similarity
---

# COMO: Closed-Loop Optical Molecule Recognition

COMO (Closed-loop Optical Molecule recOgnition) is a deep learning framework for
Optical Chemical Structure Recognition (OCSR). It recognizes chemical structure
diagrams from images and predicts SMILES strings with atom-level 2D coordinates
and bond matrices. COMO uses Minimum Risk Training (MRT) to directly optimize
molecular-level, non-differentiable objectives, closing the gap between
token-level training and molecular-level evaluation.

## Model Summary

- **Architecture:** Swin-B encoder → 6-layer Transformer decoder → bond MLP
- **Input:** 384×384 RGB image of a chemical structure diagram
- **Output:** SMILES string + atom coordinates + bond matrix
- **Vocabulary:** chartok_coords format (200 tokens: SMILES chars + 64 X/Y bins)
- **Parameters:** ~94M
- **Training data:** 1M PubChem + 652K USPTO (MLE) + 83K MolParser-SFT (MRT)
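
The vocabulary entry above mentions 64 X/Y bins for atom coordinates. As a minimal sketch of how a continuous coordinate could be turned into such a bin token and back, assuming coordinates normalized to [0, 1] and uniform bins (the function names and the rounding rule are illustrative assumptions, not the package's API):

```python
N_BINS = 64  # number of X/Y bins stated in the model summary


def coord_to_bin(value: float, n_bins: int = N_BINS) -> int:
    """Map a normalized coordinate in [0, 1] to a bin index in [0, n_bins - 1]."""
    value = min(max(value, 0.0), 1.0)  # clamp out-of-range inputs
    return round(value * (n_bins - 1))


def bin_to_coord(index: int, n_bins: int = N_BINS) -> float:
    """Map a bin index back to the normalized coordinate it represents."""
    return index / (n_bins - 1)


# Round trip: the reconstruction error is at most half a bin width.
x = 0.25
x_hat = bin_to_coord(coord_to_bin(x))
```

Under this scheme the quantization error per axis is bounded by half a bin width, about 0.8% of the image side.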

## Available Checkpoints

All checkpoints are from the **joint MLE+MRT** training pipeline (30 epochs,
interleaved MLE/MRT from scratch). Three reward variants are provided:

| Checkpoint | Reward Mode | Description |
|-----------|-------------|-------------|
| `models/tanimoto/final.pth` | Tanimoto | Morgan fingerprint Tanimoto similarity reward |
| `models/tanimoto/best.pth` | Tanimoto | Best validation epoch |
| `models/edit_distance/final.pth` | Edit Distance | Levenshtein string-similarity reward |
| `models/edit_distance/best.pth` | Edit Distance | Best validation epoch |
| `models/visual/final.pth` | Visual | Siamese visual-encoder cosine-similarity reward |
| `models/visual/best.pth` | Visual | Best validation epoch |
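
The Tanimoto and edit-distance rewards in the table are both similarity scores in [0, 1]. As a rough illustration only, not the package's implementation, the sketch below computes Tanimoto similarity over sets of fingerprint bit indices and a length-normalized Levenshtein similarity over SMILES strings. The function names, the set-based fingerprint representation, and the normalization by the longer string are assumptions made for this example.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto similarity |A & B| / |A | B| over sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(bits_a & bits_b) / len(union)


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus the edit distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

In practice the fingerprint bits would come from a cheminformatics toolkit (the table names Morgan fingerprints); the sets here stand in for that output.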

## Architecture

```
Image (384×384)
  → Swin-B backbone (ImageNet pretrained)
    → 2D sinusoidal positional encoding
      → 6-layer Transformer decoder (d=256, 8 heads)
        → chartok_coords tokens → SMILES + coordinates
        → Bond MLP (2-layer, GELU) → 7-class bond matrix
          → Graph reconstruction → canonical SMILES
```

The model outputs a molecular graph $G = (A, B)$ where:
- $A = \{(l_i, x_i, y_i)\}$: atom SMILES labels with 2D image coordinates
- $B$: pairwise bond types (none, single, double, triple, aromatic, wedge, dash)
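
To make the $G = (A, B)$ output concrete, here is a small sketch of one way to hold the atoms and the 7-class bond matrix in plain Python and extract the bonded pairs. The `Atom` class, the class-index order (taken from the order of the list above), and reading only the upper triangle of the matrix are assumptions for illustration; directional wedge/dash bonds may need both triangles in a real implementation.

```python
from dataclasses import dataclass

# Assumed index order, following the bond types as listed in the text.
BOND_TYPES = ["none", "single", "double", "triple", "aromatic", "wedge", "dash"]


@dataclass
class Atom:
    label: str  # SMILES atom label, e.g. "C" or "O"
    x: float    # image x coordinate
    y: float    # image y coordinate


def bond_list(bond_matrix):
    """Return (i, j, bond_type) for every pair i < j whose class is not 'none'."""
    edges = []
    n = len(bond_matrix)
    for i in range(n):
        for j in range(i + 1, n):
            cls = bond_matrix[i][j]
            if cls != 0:
                edges.append((i, j, BOND_TYPES[cls]))
    return edges


# Toy three-atom chain C-C=O with hypothetical coordinates.
atoms = [Atom("C", 0.2, 0.5), Atom("C", 0.5, 0.5), Atom("O", 0.8, 0.5)]
bonds = [
    [0, 1, 0],
    [1, 0, 2],
    [0, 2, 0],
]
edges = bond_list(bonds)
```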

## Training

### MLE Phase
- **Data:** 1M PubChem SMILES (synthetic) + 652K USPTO patent molecules
- **Augmentation:** Indigo-rendered images with random styles, functional group
  substitution, R-group insertion, wavy bonds, scan shadows, multilingual comments
- **Optimizer:** AdamW, lr=4×10⁻⁴ (encoder & decoder), weight decay=10⁻⁶
- **Schedule:** 2% linear warmup → cosine decay, batch size 64/GPU
- **Loss:** Label-smoothed cross-entropy (ε=0.1) + bond classification CE
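
The warmup-then-cosine schedule listed above can be sketched as a function of the training step. This is a minimal illustration assuming the learning rate starts at zero, peaks at the stated 4×10⁻⁴ after the first 2% of steps, and decays to zero at the end; the function name and the zero floor are assumptions, not details confirmed by the source.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 4e-4, warmup_frac: float = 0.02) -> float:
    """Linear warmup over the first warmup_frac of steps, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# With 1,000 total steps the peak is reached at step 20.
schedule = [lr_at_step(s, 1000) for s in (0, 10, 20, 510, 1000)]
```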

### MRT Phase
- **Data:** 83K real-world molecular images (MolParser-SFT)
- **Candidates:** N=32 per image, multinomial sampling at τ=0.5
- **Reward weights:** validity=0.1, similarity=0.5, exact match=0.4
- **Sharpening:** α=1.0, loss weight λ=0.1
- **Schedule:** First 5 epochs MLE-only warmup, then interleaved MLE+MRT
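
The MRT settings above can be read against the standard minimum-risk formulation: candidate probabilities are sharpened by α and renormalized over the sampled set, and the loss is the expected risk under that distribution. The sketch below follows that textbook form with the stated reward weights. Defining risk as `1 - reward`, the function names, and combining the terms as `mle + lam * mrt` are assumptions for illustration, not the confirmed training code.

```python
import math


def composite_reward(valid: bool, similarity: float, exact: bool,
                     w_valid: float = 0.1, w_sim: float = 0.5,
                     w_exact: float = 0.4) -> float:
    """Weighted sum of validity, similarity, and exact-match terms."""
    return w_valid * float(valid) + w_sim * similarity + w_exact * float(exact)


def mrt_loss(log_probs, rewards, alpha: float = 1.0) -> float:
    """Expected risk under the alpha-sharpened, renormalized candidate distribution."""
    scaled = [alpha * lp for lp in log_probs]
    shift = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    q = [w / total for w in weights]
    return sum(qi * (1.0 - r) for qi, r in zip(q, rewards))


# Two hypothetical candidates: one exact match, one valid near miss.
rewards = [composite_reward(True, 1.0, True), composite_reward(True, 0.8, False)]
log_probs = [math.log(0.75), math.log(0.25)]
mrt = mrt_loss(log_probs, rewards)
mle = 1.2   # placeholder MLE loss value for the example
lam = 0.1   # loss weight from the list above
loss = mle + lam * mrt
```

A larger α concentrates the distribution on the most probable candidates, so the loss depends more on their rewards.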

## Evaluation Results

Exact match accuracy (%) on 10 benchmarks (COMO-Tanimoto variant):

| Benchmark | Images | Synthetic/Real | COMO-Tanimoto |
|-----------|--------|----------------|---------------|
| Indigo | 5,719 | Synthetic | 98.6 |
| ChemDraw | 5,719 | Synthetic | 96.5 |
| CLEF | 992 | Real (patents) | 94.8 |
| JPO | 450 | Real (patents) | 88.4 |
| UOB | 5,740 | Real (academic) | 98.0* |
| USPTO | 5,719 | Real (patents) | 93.4 |
| USPTO-10K | 10,000 | Real (patents) | 96.1 |
| Staker | 50,000 | Real | 87.4 |
| ACS | 331 | Real (publications) | 84.6 |
| WildMol-10K | 10,000 | Real (wild) | 77.1 |

*\*UOB results after tautomer standardization.*

See the [paper](#citation) for full comparison with MolScribe, MolParser,
SwinOCSR, and other baselines.

## Installation

```bash
pip install como-ocsr
```

## Usage

```python
import como

# Download the checkpoint from Hugging Face into the current directory first:
# huggingface-cli download Keylab/COMO models/tanimoto/final.pth --local-dir .

model = como.load_model("models/tanimoto/final.pth", device="cuda")

# Single image prediction
smiles = como.predict(model, "molecule.png")
print(smiles)  # "CC(=O)O"

# Batch prediction
smiles_list = como.predict_batch(model, ["mol1.png", "mol2.png"])

# Benchmark evaluation
metrics = como.evaluate(model, "benchmark/USPTO/", "benchmark/USPTO.csv")
print(f"Exact Match: {metrics['postprocess/exact_match_acc']:.2%}")
```

Full documentation is available on [GitHub](https://github.com/netknowledge/COMO) and [PyPI](https://pypi.org/project/como-ocsr/).

## Benchmarks

Benchmark datasets are available in the `benchmarks/` directory of this
repository. Each dataset contains `.png` images and a CSV file with columns
`image_id` and `SMILES`.
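
As a minimal sketch of reading one of these label files with the standard library, the helper below pairs each row with an image path. It assumes `image_id` is the file name without the `.png` extension, which the source does not state; the function name and example paths are hypothetical.

```python
import csv
from pathlib import Path


def read_pairs(lines, image_dir):
    """Return (image_path, smiles) pairs from CSV lines with image_id and SMILES columns."""
    pairs = []
    for row in csv.DictReader(lines):
        # Assumption: image_id has no extension, so ".png" is appended here.
        image_path = Path(image_dir) / f"{row['image_id']}.png"
        pairs.append((image_path, row["SMILES"]))
    return pairs


# Works on any iterable of lines, including an open file:
# with open("benchmarks/USPTO.csv", newline="", encoding="utf-8") as f:
#     pairs = read_pairs(f, "benchmarks/USPTO")
example = read_pairs(["image_id,SMILES", "mol1,CCO", "mol2,CC(=O)O"], "benchmarks/USPTO")
```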

**Note:** These benchmarks are collected from existing public OCSR datasets.
Please refer to the original sources for attribution:

| Dataset | Source |
|---------|--------|
| USPTO, CLEF, JPO, UOB, Staker | [Rajan et al., 2020](https://github.com/Kohulan/OCSR_Review), [Xiong et al., 2023](https://github.com/jiachengxiong/alpha-Extractor) |
| Indigo, ChemDraw, ACS, Staker | [Qian et al., 2023](https://github.com/thomas0809/MolScribe) |
| USPTO-10K | [Morin et al., 2023](https://huggingface.co/datasets/docling-project/USPTO-30K) |
| WildMol-10K | [Fang et al., 2025](https://github.com/orgs/Chem-Struct-ML/repositories) |

## License

- **Model Weights:** CC BY-NC 4.0 (non-commercial use only)
- **Code:** MIT License
- **Benchmarks:** See original sources for applicable terms

## Citation

```bibtex
@article{lyu2026closed,
  title={COMO: Closed-Loop Optical Molecule Recognition with Minimum Risk Training},
  author={Lyu, Zhuoqi and Ke, Qing},
  journal={arXiv preprint arXiv:2604.23546},
  year={2026}
}
```