lyuzhuoqi committed on
Commit
7abc4a3
·
verified ·
1 Parent(s): 4217611

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +176 -0
README.md CHANGED
@@ -1,3 +1,179 @@
1
  ---
 
2
  license: cc-by-nc-4.0
3
  ---
1
  ---
2
+ language: en
3
  license: cc-by-nc-4.0
4
+ library_name: pytorch
5
+ tags:
6
+ - chemistry
7
+ - cheminformatics
8
+ - optical-chemical-structure-recognition
9
+ - ocsr
10
+ - molecule-recognition
11
+ - smiles
12
+ - transformer
13
+ - swin-transformer
14
+ - minimum-risk-training
15
+ - molecular-graph
16
+ datasets:
17
+ - Keylab/COMO
18
+ metrics:
19
+ - exact_match
20
+ - tanimoto_similarity
21
+ - tautomer_match
22
  ---
23
+
24
+ # COMO: Closed-Loop Optical Molecule Recognition
25
+
26
+ COMO (Closed-loop Optical Molecule recOgnition) is a deep learning framework for
27
+ Optical Chemical Structure Recognition (OCSR). It recognizes chemical structure
28
+ diagrams from images and predicts SMILES strings with atom-level 2D coordinates
29
+ and bond matrices. COMO uses Minimum Risk Training (MRT) to directly optimize
30
+ molecular-level, non-differentiable objectives, closing the gap between
31
+ token-level training and molecular-level evaluation.
32
+
33
+ ## Model Summary
34
+
35
+ - **Architecture:** Swin-B encoder → 6-layer Transformer decoder → bond MLP
36
+ - **Input:** 384×384 RGB image of a chemical structure diagram
37
+ - **Output:** SMILES string + atom coordinates + bond matrix
38
+ - **Vocabulary:** chartok_coords format (200 tokens: SMILES chars + 64 X/Y bins)
39
+ - **Parameters:** ~94M
40
+ - **Training data:** 1M PubChem + 652K USPTO (MLE) + 83K MolParser-SFT (MRT)
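
The 64 X/Y bins in the vocabulary imply that atom coordinates are discretized before being emitted as tokens. The sketch below shows one way this could work, assuming uniform bins over the 384-pixel input. The function names and the uniform-bin choice are illustrative assumptions, not taken from the COMO code.

```python
def quantize(value: float, size: int = 384, n_bins: int = 64) -> int:
    """Map a pixel coordinate in [0, size] to a bin index in [0, n_bins)."""
    # Clamp first so out-of-range coordinates still land in a valid bin.
    ratio = min(max(value / size, 0.0), 1.0)
    return min(int(ratio * n_bins), n_bins - 1)


def dequantize(bin_index: int, size: int = 384, n_bins: int = 64) -> float:
    """Return the pixel coordinate at the centre of a bin."""
    return (bin_index + 0.5) * size / n_bins
```

Under these assumptions each bin covers 6 pixels, so a decoded coordinate is at most 3 pixels away from the original.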
41
+
42
+ ## Available Checkpoints
43
+
44
+ All checkpoints are from the **joint MLE+MRT** training pipeline (30 epochs,
45
+ interleaved MLE/MRT from scratch). Three reward variants are provided:
46
+
47
+ | Checkpoint | Reward Mode | Description |
48
+ |-----------|-------------|-------------|
49
+ | `models/tanimoto/final.pth` | Tanimoto | Morgan fingerprint Tanimoto similarity reward |
50
+ | `models/tanimoto/best.pth` | Tanimoto | Best validation epoch |
51
+ | `models/edit_distance/final.pth` | Edit Distance | Levenshtein string-similarity reward |
52
+ | `models/edit_distance/best.pth` | Edit Distance | Best validation epoch |
53
+ | `models/visual/final.pth` | Visual | Siamese visual-encoder cosine-similarity reward |
54
+ | `models/visual/best.pth` | Visual | Best validation epoch |
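
To make the first two reward modes concrete, here is a minimal sketch in pure Python. It assumes a fingerprint is represented as a set of "on" bit indices and that the edit-distance reward is normalized by the longer string. Both are illustrative assumptions; the actual rewards are computed on Morgan fingerprints and may be normalized differently.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto similarity between two sets of fingerprint bit indices."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen for this sketch only
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```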
55
+
56
+ ## Architecture
57
+
58
+ ```
59
+ Image (384×384)
60
+ → Swin-B backbone (ImageNet pretrained)
61
+ → 2D sinusoidal positional encoding
62
+ → 6-layer Transformer decoder (d=256, 8 heads)
63
+ → chartok_coords tokens → SMILES + coordinates
64
+ → Bond MLP (2-layer, GELU) → 7-class bond matrix
65
+ → Graph reconstruction → canonical SMILES
66
+ ```
67
+
68
+ The model outputs a molecular graph $G = (A, B)$ where:
69
+ - $A = \{(l_i, x_i, y_i)\}$: atom SMILES labels with 2D image coordinates
70
+ - $B$: pairwise bond types (none, single, double, triple, aromatic, wedge, dash)
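
As an illustration of how a predicted bond matrix could be turned into an edge list for graph reconstruction, consider the sketch below. It assumes the class indices follow the order listed above and that only the upper triangle is read. Both are assumptions made for this example. Reading one triangle discards the direction of wedge and dash bonds, which a real implementation would need to keep.

```python
# Assumed class order, copied from the list above; the real index mapping may differ.
BOND_TYPES = ["none", "single", "double", "triple", "aromatic", "wedge", "dash"]


def bonds_from_matrix(matrix):
    """Convert an N x N matrix of bond class indices into (i, j, type) edges."""
    edges = []
    n = len(matrix)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only
            label = BOND_TYPES[matrix[i][j]]
            if label != "none":
                edges.append((i, j, label))
    return edges


# Three atoms: 0-1 single bond, 1-2 double bond.
example = [
    [0, 1, 0],
    [1, 0, 2],
    [0, 2, 0],
]
print(bonds_from_matrix(example))
```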
71
+
72
+ ## Training
73
+
74
+ ### MLE Phase
75
+ - **Data:** 1M PubChem SMILES (synthetic) + 652K USPTO patent molecules
76
+ - **Augmentation:** Indigo-rendered images with random styles, functional group
77
+ substitution, R-group insertion, wavy bonds, scan shadows, multilingual comments
78
+ - **Optimizer:** AdamW, lr=4×10⁻⁴ (encoder & decoder), weight decay=10⁻⁶
79
+ - **Schedule:** 2% linear warmup → cosine decay, batch size 64/GPU
80
+ - **Loss:** Label-smoothed cross-entropy (ε=0.1) + bond classification CE
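
The warmup and decay schedule can be sketched as a function of the training step. This is a minimal sketch, assuming the learning rate rises linearly from zero over the first 2% of steps and then follows a half-cosine down to zero. The function name and the zero floor are assumptions, not details confirmed by the source.

```python
import math


def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 4e-4, warmup_ratio: float = 0.02) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

With `total_steps=1000`, the peak of 4e-4 is reached at step 20 and the rate returns to zero at step 1000.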
81
+
82
+ ### MRT Phase
83
+ - **Data:** 83K real-world molecular images (MolParser-SFT)
84
+ - **Candidates:** N=32 per image, multinomial sampling at τ=0.5
85
+ - **Reward weights:** validity=0.1, similarity=0.5, exact match=0.4
86
+ - **Sharpening:** α=1.0, loss weight λ=0.1
87
+ - **Schedule:** First 5 epochs MLE-only warmup, then interleaved MLE+MRT
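
The MRT objective can be illustrated with a small pure-Python sketch. It assumes the standard minimum-risk formulation: candidate probabilities are raised to the sharpening power and renormalized over the sampled set, and the loss is the expected risk, with risk taken as one minus the reward. The function names and the `1 - reward` risk are assumptions made for this example. The loss-weight term and gradient computation are omitted.

```python
import math


def combined_reward(valid: bool, similarity: float, exact: bool) -> float:
    """Weighted reward using the weights listed above (0.1 / 0.5 / 0.4)."""
    return 0.1 * float(valid) + 0.5 * similarity + 0.4 * float(exact)


def mrt_risk(log_probs, rewards, alpha: float = 1.0) -> float:
    """Expected risk over sampled candidates under a sharpened distribution."""
    scaled = [alpha * lp for lp in log_probs]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return sum(w / total * (1.0 - r) for w, r in zip(weights, rewards))


# Two equally likely candidates: one perfect, one valid but only 40% similar.
rewards = [combined_reward(True, 1.0, True), combined_reward(True, 0.4, False)]
print(mrt_risk([math.log(0.5), math.log(0.5)], rewards))
```

Minimizing this quantity shifts probability mass toward candidates with higher reward, which is how a non-differentiable metric can shape training.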
88
+
89
+ ## Evaluation Results
90
+
91
+ Exact match accuracy (%) on 10 benchmarks (COMO-Tanimoto variant):
92
+
93
+ | Benchmark | Images | Synthetic/Real | COMO-Tanimoto |
94
+ |-----------|--------|----------------|---------------|
95
+ | Indigo | 5,719 | Synthetic | 98.6 |
96
+ | ChemDraw | 5,719 | Synthetic | 96.5 |
97
+ | CLEF | 992 | Real (patents) | 94.8 |
98
+ | JPO | 450 | Real (patents) | 88.4 |
99
+ | UOB | 5,740 | Real (academic) | 98.0* |
100
+ | USPTO | 5,719 | Real (patents) | 93.4 |
101
+ | USPTO-10K | 10,000 | Real (patents) | 96.1 |
102
+ | Staker | 50,000 | Real | 87.4 |
103
+ | ACS | 331 | Real (publications) | 84.6 |
104
+ | WildMol-10K | 10,000 | Real (wild) | 77.1 |
105
+
106
+ *\*UOB results after tautomer standardization.*
107
+
108
+ See the [paper](#citation) for a full comparison with MolScribe, MolParser,
109
+ SwinOCSR, and other baselines.
110
+
111
+ ## Usage
112
+
113
+ ```python
114
+ import como
115
+
116
+ # Download checkpoint from HuggingFace:
117
+ # huggingface-cli download Keylab/COMO models/tanimoto/final.pth
118
+
119
+ model = como.load_model("models/tanimoto/final.pth", device="cuda")
120
+
121
+ # Single image prediction
122
+ smiles = como.predict(model, "molecule.png")
123
+ print(smiles) # "CC(=O)O"
124
+
125
+ # Batch prediction
126
+ smiles_list = como.predict_batch(model, ["mol1.png", "mol2.png"])
127
+
128
+ # Benchmark evaluation
129
+ metrics = como.evaluate(model, "benchmark/USPTO/", "benchmark/USPTO.csv")
130
+ print(f"Exact Match: {metrics['postprocess/exact_match_acc']:.2%}")
131
+ ```
132
+
133
+ Full documentation: [como-ocsr on PyPI](https://pypi.org/project/como-ocsr/)
134
+
135
+ ## Benchmarks
136
+
137
+ Benchmark datasets are available in the `benchmarks/` directory of this
138
+ repository. Each dataset contains `.png` images and a CSV file with columns
139
+ `image_id` and `SMILES`.
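
Given that column layout, a label file can be read with the standard library alone. The sketch below loads the labels and computes a plain string-level exact match. The helper names are hypothetical. The comparison is a raw string match, whereas a real evaluation would presumably canonicalize both SMILES strings first.

```python
import csv
import io


def load_labels(csv_file) -> dict:
    """Read a benchmark CSV with `image_id` and `SMILES` columns into a dict."""
    reader = csv.DictReader(csv_file)
    return {row["image_id"]: row["SMILES"] for row in reader}


def exact_match_accuracy(predictions: dict, labels: dict) -> float:
    """Fraction of labelled images whose predicted string equals the label."""
    if not labels:
        return 0.0
    hits = sum(predictions.get(key) == value for key, value in labels.items())
    return hits / len(labels)


# In-memory stand-in for a benchmark CSV file.
demo_csv = io.StringIO("image_id,SMILES\nmol1,CCO\nmol2,CC(=O)O\n")
labels = load_labels(demo_csv)
predictions = {"mol1": "CCO", "mol2": "CC=O"}
print(exact_match_accuracy(predictions, labels))
```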
140
+
141
+ **Note:** These benchmarks are collected from existing public OCSR datasets.
142
+ Please refer to the original sources for attribution:
143
+
144
+ | Dataset | Source |
145
+ |---------|--------|
146
+ | USPTO, CLEF, JPO, UOB, Staker | [Rajan et al., 2020](https://github.com/Kohulan/DECIMER-Image_Transformer) |
147
+ | Indigo, ChemDraw, ACS | [Qian et al., 2023](https://github.com/thomas0809/MolScribe) |
148
+ | USPTO-10K | [Morin et al., 2023](https://github.com/DS4SD/molgrapher) |
149
+ | WildMol-10K | [Fang et al., 2025](https://github.com/orgs/Chem-Struct-ML/repositories) |
150
+
151
+ ## Limitations
152
+
153
+ 1. **Functional group abbreviations** (e.g., "Allyl", "Boc"): COMO may fail to
154
+ expand abbreviations that are rare in the training distribution.
155
+ 2. **Charged species**: Formally charged functional groups (diazonium, azide) are
156
+ sometimes confused with their neutral counterparts.
157
+ 3. **Document context**: Neighboring text or reaction labels can contaminate
158
+ predictions (hallucinated fragments).
159
+ 4. **Stereochemistry**: While postprocessing restores chirality from predicted
160
+ coordinates, complex E/Z isomerism may be unreliable.
161
+ 5. **Multi-molecule images**: The model is designed for single-molecule images.
162
+ Multi-molecule or reaction diagrams are out of scope.
163
+
164
+ ## License
165
+
166
+ - **Model Weights:** CC BY-NC 4.0 (non-commercial use only)
167
+ - **Code:** MIT License
168
+ - **Benchmarks:** See original sources for applicable terms
169
+
170
+ ## Citation
171
+
172
+ ```bibtex
173
+ @article{lyu2026closed,
174
+ title={COMO: Closed-Loop Optical Molecule Recognition with Minimum Risk Training},
175
+ author={Lyu, Zhuoqi and Ke, Qing},
176
+ journal={arXiv preprint arXiv:2604.23546},
177
+ year={2026}
178
+ }
179
+ ```