 ---
+language: en
 license: cc-by-nc-4.0
+library_name: pytorch
+tags:
+  - chemistry
+  - cheminformatics
+  - optical-chemical-structure-recognition
+  - ocsr
+  - molecule-recognition
+  - smiles
+  - transformer
+  - swin-transformer
+  - minimum-risk-training
+  - molecular-graph
+datasets:
+  - Keylab/COMO
+metrics:
+  - exact_match
+  - tanimoto_similarity
+  - tautomer_match
 ---
+# COMO: Closed-Loop Optical Molecule Recognition
+COMO (Closed-loop Optical Molecule recOgnition) is a deep learning framework for
+Optical Chemical Structure Recognition (OCSR). It recognizes chemical structure
+diagrams from images and predicts SMILES strings with atom-level 2D coordinates
+and bond matrices. COMO uses Minimum Risk Training (MRT) to directly optimize
+molecular-level, non-differentiable objectives, closing the gap between
+token-level training and molecular-level evaluation.
+## Model Summary
+- **Architecture:** Swin-B encoder → 6-layer Transformer decoder → bond MLP
+- **Input:** 384×384 RGB image of a chemical structure diagram
+- **Output:** SMILES string + atom coordinates + bond matrix
+- **Vocabulary:** chartok_coords format (200 tokens: SMILES chars + 64 X/Y bins)
+- **Parameters:** ~94M
+- **Training data:** 1M PubChem + 652K USPTO (MLE) + 83K MolParser-SFT (MRT)
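The 64 X/Y bins in the vocabulary suggest that atom coordinates are discretized into tokens. A minimal sketch of such a quantization, assuming coordinates are normalized to [0, 1] and binned uniformly with 64 bins per axis (the function names and the bin-center decoding are illustrative and not taken from the COMO code):

```python
N_BINS = 64  # bins per axis (assumed reading of "64 X/Y bins" in the summary)


def coord_to_bin(value: float, n_bins: int = N_BINS) -> int:
    """Map a normalized coordinate in [0, 1] to a bin index in [0, n_bins - 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("coordinate must be normalized to [0, 1]")
    # value == 1.0 would land in bin n_bins, so clamp it to the last bin.
    return min(int(value * n_bins), n_bins - 1)


def bin_to_coord(index: int, n_bins: int = N_BINS) -> float:
    """Return the center of a bin as a normalized coordinate."""
    if not 0 <= index < n_bins:
        raise ValueError("bin index out of range")
    return (index + 0.5) / n_bins
```

Under these assumptions the round-trip error is at most half a bin, which corresponds to 3 pixels on a 384-pixel image side.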
+## Available Checkpoints
+All checkpoints come from the **joint MLE+MRT** training pipeline (30 epochs from
+scratch: MLE-only warmup, then interleaved MLE/MRT). Three reward variants are
+provided, each with a final and a best-validation checkpoint:
+| Checkpoint | Reward Mode | Description |
+|-----------|-------------|-------------|
+| `models/tanimoto/final.pth` | Tanimoto | Morgan fingerprint Tanimoto similarity reward |
+| `models/tanimoto/best.pth` | Tanimoto | Best validation epoch |
+| `models/edit_distance/final.pth` | Edit Distance | Levenshtein string-similarity reward |
+| `models/edit_distance/best.pth` | Edit Distance | Best validation epoch |
+| `models/visual/final.pth` | Visual | Siamese visual-encoder cosine-similarity reward |
+| `models/visual/best.pth` | Visual | Best validation epoch |
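The paths in the table follow a regular `models/<reward>/<variant>.pth` pattern, so a checkpoint can be selected programmatically. A small sketch, assuming the `huggingface_hub` package is installed; `checkpoint_filename` and `download_checkpoint` are hypothetical helper names, while `hf_hub_download` is the library's own function:

```python
REWARD_MODES = ("tanimoto", "edit_distance", "visual")
VARIANTS = ("final", "best")


def checkpoint_filename(reward_mode: str, variant: str = "final") -> str:
    """Build the in-repo path of a checkpoint listed in the table."""
    if reward_mode not in REWARD_MODES:
        raise ValueError(f"unknown reward mode: {reward_mode!r}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"models/{reward_mode}/{variant}.pth"


def download_checkpoint(reward_mode: str, variant: str = "final") -> str:
    """Download one checkpoint and return its local path (needs network access)."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id="Keylab/COMO",
        filename=checkpoint_filename(reward_mode, variant),
    )
```

Calling `download_checkpoint("tanimoto", "best")` would fetch the file into the local Hugging Face cache and return the cached path.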
+## Architecture
+```
+Image (384×384)
+  → Swin-B backbone (ImageNet pretrained)
+    → 2D sinusoidal positional encoding
+      → 6-layer Transformer decoder (d=256, 8 heads)
+        → chartok_coords tokens → SMILES + coordinates
+        → Bond MLP (2-layer, GELU) → 7-class bond matrix
+          → Graph reconstruction → canonical SMILES
+```
+The model outputs a molecular graph $G = (A, B)$ where:
+- $A = \{(l_i, x_i, y_i)\}$ — atom SMILES labels with 2D image coordinates
+- $B$ — pairwise bond types (none, single, double, triple, aromatic, wedge, dash)
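To make the graph output concrete, the sketch below turns a matrix of bond-class indices into an edge list. It assumes the class order given above (index 0 is "none") and a symmetric matrix, so only the upper triangle is read; directional wedge/dash bonds may need both triangles in the actual model. The function name is illustrative.

```python
BOND_CLASSES = ("none", "single", "double", "triple", "aromatic", "wedge", "dash")


def bonds_from_matrix(bond_matrix):
    """Convert an n x n matrix of bond-class indices into (i, j, bond_type) tuples."""
    n = len(bond_matrix)
    edges = []
    for i in range(n):
        # Read only the upper triangle: the matrix is assumed symmetric.
        for j in range(i + 1, n):
            cls = bond_matrix[i][j]
            if cls != 0:  # class 0 means "no bond"
                edges.append((i, j, BOND_CLASSES[cls]))
    return edges


# Acetic acid, CC(=O)O: atoms as (label, x, y) with made-up coordinates.
atoms = [("C", 0.2, 0.5), ("C", 0.4, 0.5), ("O", 0.5, 0.3), ("O", 0.5, 0.7)]
bond_matrix = [
    [0, 1, 0, 0],
    [1, 0, 2, 1],
    [0, 2, 0, 0],
    [0, 1, 0, 0],
]
edges = bonds_from_matrix(bond_matrix)
```

An edge list of this kind, together with the atom labels, is what a toolkit such as RDKit would need to rebuild the molecule and emit canonical SMILES.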
+## Training
+### MLE Phase
+- **Data:** 1M PubChem SMILES (synthetic) + 652K USPTO patent molecules
+- **Augmentation:** Indigo-rendered images with random styles, functional group
+  substitution, R-group insertion, wavy bonds, scan shadows, multilingual comments
+- **Optimizer:** AdamW, lr=4×10⁻⁴ (encoder & decoder), weight decay=10⁻⁶
+- **Schedule:** 2% linear warmup → cosine decay, batch size 64/GPU
+- **Loss:** Label-smoothed cross-entropy (ε=0.1) + bond classification CE
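The warmup-then-cosine schedule can be written as a function of the training step. This is a sketch under stated assumptions: the schedule is applied per step, warmup covers the first 2% of steps, and the rate decays to zero at the end; none of these details are confirmed by the model card beyond the 2% warmup and the peak rate.

```python
import math


def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 4e-4, warmup_frac: float = 0.02) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

With `total_steps=1000`, warmup lasts 20 steps, the rate peaks at 4×10⁻⁴ on step 19, and it is halved midway through the cosine phase.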
+### MRT Phase
+- **Data:** 83K real-world molecular images (MolParser-SFT)
+- **Candidates:** N=32 per image, multinomial sampling at τ=0.5
+- **Reward weights:** validity=0.1, similarity=0.5, exact match=0.4
+- **Sharpening:** α=1.0, loss weight λ=0.1
+- **Schedule:** First 5 epochs MLE-only warmup, then interleaved MLE+MRT
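The settings above map onto the commonly used MRT formulation: each sampled candidate gets a scalar reward, candidate probabilities are sharpened by α and renormalized over the sample, and the loss is the expected cost. The sketch below uses plain floats to show the arithmetic only; the exact objective used by COMO may differ, the cost `1 - reward` is an assumption, and a real implementation would operate on tensors so that gradients reach the model.

```python
import math

# Reward weights from the MRT settings above.
W_VALID, W_SIM, W_EXACT = 0.1, 0.5, 0.4


def composite_reward(valid: bool, similarity: float, exact: bool) -> float:
    """Weighted sum of validity, similarity in [0, 1], and exact match."""
    return W_VALID * float(valid) + W_SIM * similarity + W_EXACT * float(exact)


def mrt_risk(log_probs, rewards, alpha: float = 1.0) -> float:
    """Expected cost under the sharpened, renormalized candidate distribution."""
    if not log_probs or len(log_probs) != len(rewards):
        raise ValueError("need one reward per candidate log-probability")
    scaled = [alpha * lp for lp in log_probs]
    shift = max(scaled)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    # Cost is taken as 1 - reward (an assumption made for this sketch).
    return sum(w / total * (1.0 - r) for w, r in zip(weights, rewards))
```

In training there would be N=32 candidates per image, and the loss weight λ=0.1 presumably scales this term when it is combined with the MLE loss.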
+## Evaluation Results
+Exact match accuracy (%) on 10 benchmarks (COMO-Tanimoto variant):
+| Benchmark | Images | Synthetic/Real | COMO-Tanimoto |
+|-----------|--------|----------------|---------------|
+| Indigo | 5,719 | Synthetic | 98.6 |
+| ChemDraw | 5,719 | Synthetic | 96.5 |
+| CLEF | 992 | Real (patents) | 94.8 |
+| JPO | 450 | Real (patents) | 88.4 |
+| UOB | 5,740 | Real (academic) | 98.0* |
+| USPTO | 5,719 | Real (patents) | 93.4 |
+| USPTO-10K | 10,000 | Real (patents) | 96.1 |
+| Staker | 50,000 | Real | 87.4 |
+| ACS | 331 | Real (publications) | 84.6 |
+| WildMol-10K | 10,000 | Real (wild) | 77.1 |
+*\*UOB results after tautomer standardization.*
+See the [paper](#citation) for a full comparison with MolScribe, MolParser,
+SwinOCSR, and other baselines.
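For orientation, the two quantities behind the metrics can be sketched in a few lines. The sketch assumes SMILES strings have already been canonicalized (for example with RDKit) and that fingerprints are given as sets of on-bit indices; the helper names are illustrative and this is not COMO's evaluation code.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto similarity |A ∩ B| / |A ∪ B| between two sets of on-bits."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(bits_a | bits_b)


def exact_match_accuracy(predicted, reference) -> float:
    """Percentage of predictions identical to their canonical reference SMILES."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * hits / len(reference)
```

String comparison is only meaningful after canonicalization: `CC(=O)O` and `CC(O)=O` denote the same molecule but differ as strings.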
+## Usage
+```python
+# Install the package from PyPI first: pip install como-ocsr
+import como
+# Download a checkpoint from Hugging Face into the current directory:
+# huggingface-cli download Keylab/COMO models/tanimoto/final.pth --local-dir .
+model = como.load_model("models/tanimoto/final.pth", device="cuda")
+# Single-image prediction
+smiles = como.predict(model, "molecule.png")
+print(smiles)  # e.g. CC(=O)O
+# Batch prediction
+smiles_list = como.predict_batch(model, ["mol1.png", "mol2.png"])
+# Benchmark evaluation: image directory, then the CSV with reference SMILES
+metrics = como.evaluate(model, "benchmark/USPTO/", "benchmark/USPTO.csv")
+print(f"Exact Match: {metrics['postprocess/exact_match_acc']:.2%}")
+```
+Full documentation: [como-ocsr on PyPI](https://pypi.org/project/como-ocsr/)
+## Benchmarks
+Benchmark datasets are available in the `benchmarks/` directory of this
+repository. Each dataset contains `.png` images and a CSV file with columns
+`image_id` and `SMILES`.
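A label file with these two columns can be read with the standard library alone. The sketch below assumes a comma-separated file with a header row; `read_labels` is a hypothetical helper, and how `image_id` maps to an image file name is not specified here.

```python
import csv
import io


def read_labels(csv_file) -> dict:
    """Map each image_id to its reference SMILES from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    missing = {"image_id", "SMILES"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV is missing columns: {sorted(missing)}")
    return {row["image_id"]: row["SMILES"] for row in reader}


# In-memory example; a real file would be opened with open(path, newline="").
example = io.StringIO("image_id,SMILES\n1,CCO\n2,CC(=O)O\n")
labels = read_labels(example)
```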
+**Note:** These benchmarks are drawn from existing public OCSR datasets.
+Please refer to the original sources for attribution:
+| Dataset | Source |
+|---------|--------|
+| USPTO, CLEF, JPO, UOB, Staker | [Rajan et al., 2020](https://github.com/Kohulan/DECIMER-Image_Transformer) |
+| Indigo, ChemDraw, ACS | [Qian et al., 2023](https://github.com/thomas0809/MolScribe) |
+| USPTO-10K | [Morin et al., 2023](https://github.com/DS4SD/molgrapher) |
+| WildMol-10K | [Fang et al., 2025](https://github.com/orgs/Chem-Struct-ML/repositories) |
+## Limitations
+1. **Functional group abbreviations** (e.g., "Allyl", "Boc"): COMO may fail to
+   expand abbreviations that are rare in the training distribution.
+2. **Charged species**: Formally charged functional groups (diazonium, azide) are
+   sometimes confused with their neutral counterparts.
+3. **Document context**: Neighboring text or reaction labels can contaminate
+   predictions, producing hallucinated fragments.
+4. **Stereochemistry**: Postprocessing restores chirality from the predicted
+   coordinates, but predictions for complex E/Z isomerism may be unreliable.
+5. **Single-molecule scope**: The model is designed for single-molecule images.
+   Multi-molecule or reaction diagrams are out of scope.
+## License
+- **Model Weights:** CC BY-NC 4.0 (non-commercial use only)
+- **Code:** MIT License
+- **Benchmarks:** See original sources for applicable terms
+## Citation
+```bibtex
+@article{lyu2026closed,
+  title={COMO: Closed-Loop Optical Molecule Recognition with Minimum Risk Training},
+  author={Lyu, Zhuoqi and Ke, Qing},
+  journal={arXiv preprint arXiv:2604.23546},
+  year={2026}
+}
+```