# MassSpecGym De Novo: Generative SMILES Transformer
This is an encoder-decoder Transformer that "translates" mass spectrometry peaks directly into chemical structures (SMILES strings), without relying on a candidate database.
## Model Details
- Architecture: Spectral Encoder-Decoder with Intensity Rank Embeddings.
- Objective: Label-smoothed cross-entropy, trained with teacher forcing.
- Inference: Beam Search (k=5) with Length Penalty.
- Output: Generative SMILES strings representing the molecular structure.
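The intensity rank embeddings listed above replace each peak's raw intensity with its rank among all peaks in the spectrum, so the encoder input is invariant to absolute intensity scale. A minimal sketch of the ranking step (function and variable names are illustrative, not the repository's API):

```python
def intensity_ranks(intensities):
    """Map raw peak intensities to rank indices (0 = strongest peak).

    The rank, rather than the raw intensity, indexes a learned embedding
    table, so the encoder sees the same input regardless of how the
    spectrum was normalized.
    """
    # Positions of peaks in descending-intensity order
    order = sorted(range(len(intensities)), key=lambda i: -intensities[i])
    ranks = [0] * len(intensities)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks

# Each peak token would then be embedded as, e.g.:
#   mz_embedding(mz_bin) + rank_embedding(rank)
print(intensity_ranks([0.2, 0.9, 0.5]))  # [2, 0, 1]
```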
## Performance (MassSpecGym Test Set)
The model outperforms standard generative baselines:
- Top-1 Tanimoto Similarity: 0.108
- Top-5 Tanimoto Similarity: 0.130
- MCES Distance (Top-1): 37.93 (lower is better)
- Exact Match Accuracy: 0.0% (consistent with state-of-the-art results on this benchmark)
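The Tanimoto similarity reported above is the Jaccard index over molecular fingerprint bits; in practice it is computed with a cheminformatics toolkit such as RDKit on Morgan fingerprints of the predicted and true molecules. The pure-Python sketch below just shows the underlying formula on sets of "on" bit indices:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets.

    fp_a, fp_b: iterables of "on" bit indices. Returns |A & B| / |A | B|,
    ranging from 0.0 (no shared bits) to 1.0 (identical fingerprints).
    """
    a, b = set(fp_a), set(fp_b)
    if not a and not b:  # two empty fingerprints: define similarity as 0
        return 0.0
    return len(a & b) / len(a | b)

print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5 (2 shared bits / 4 total)
```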
## Key Features
- Intensity Rank Embeddings: Prioritizes strong spectral signals to guide the generation process.
- Beam Search Decoding: Explores multiple candidate structures in parallel, yielding higher-likelihood outputs than greedy decoding.
- Teacher Forcing: Stabilizes training and speeds convergence on complex chemical grammar.
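The beam search decoding with length penalty described above can be sketched as follows. This is a minimal pure-Python version using a GNMT-style penalty; the actual decoder, tokenizer, and any hyperparameters beyond k=5 are assumptions for illustration:

```python
def beam_search(next_logprobs, bos, eos, k=5, max_len=20, alpha=0.6):
    """Beam search with a GNMT-style length penalty.

    next_logprobs(seq) -> {token: log P(token | seq)} stands in for the
    trained decoder's next-token distribution.
    """
    def lp(length):
        # Length penalty normalizer; alpha > 0 favors longer sequences
        return ((5 + length) / 6) ** alpha

    beams = [([bos], 0.0)]  # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, logp in next_logprobs(seq).items():
                candidates.append((seq + [tok], score + logp))
        # Keep the k best expansions by length-penalized score
        candidates.sort(key=lambda c: c[1] / lp(len(c[0])), reverse=True)
        beams = []
        for seq, score in candidates[:k]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # include any unfinished beams as fallbacks
    finished.sort(key=lambda c: c[1] / lp(len(c[0])), reverse=True)
    return finished[0][0]

# Toy transition "model": next-token log-probs depend only on the last token.
toy = {"<s>": {"C": -0.2, "N": -1.8},
       "C":   {"C": -1.0, "</s>": -0.5},
       "N":   {"</s>": -0.3}}
best = beam_search(lambda seq: toy[seq[-1]], "<s>", "</s>", k=2, max_len=5)
print(best)  # ['<s>', 'C', '</s>'] — the short high-probability path wins
```

With a real model, `next_logprobs` would run one decoder step over the spectrum encoding and the partial SMILES, and the top-k finished beams would be returned as the Top-5 candidates.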
## Usage
Full training and generation scripts are available in the accompanying GitHub repository.