MassSpecGym De Novo: Generative SMILES Transformer

This is an Encoder-Decoder Transformer designed to "translate" mass spectrometry peaks directly into chemical structures (SMILES strings) without a candidate database.

Model Details

  • Architecture: Spectral Encoder-Decoder with Intensity Rank Embeddings.
  • Objective: Token-level cross-entropy with label smoothing, trained with teacher forcing.
  • Inference: Beam Search (k=5) with Length Penalty.
  • Output: De novo SMILES strings representing the predicted molecular structure.
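As a rough illustration of the intensity-rank input scheme: before spectral peaks reach the encoder, each peak can be assigned an integer rank by descending intensity, which is then looked up in a learned embedding table. The exact preprocessing used by this model is not specified here, so the helper below is a hypothetical sketch of the rank-assignment step only.

```python
def intensity_ranks(peaks):
    """Assign rank indices to peaks by descending intensity.

    peaks: list of (mz, intensity) tuples.
    Returns integer ranks aligned with the input order,
    where rank 0 is the most intense peak.
    """
    # Sort peak indices so the strongest peak comes first.
    order = sorted(range(len(peaks)), key=lambda i: -peaks[i][1])
    ranks = [0] * len(peaks)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks

# Toy spectrum: three (m/z, relative intensity) peaks.
spectrum = [(77.04, 0.12), (105.03, 1.00), (51.02, 0.05)]
print(intensity_ranks(spectrum))  # [1, 0, 2]
```

Ranks (rather than raw intensities) make the representation invariant to instrument-dependent intensity scaling, which is the motivation behind rank embeddings.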

Performance (MassSpecGym Test Set)

The model outperforms standard generative baselines:

  • Top-1 Tanimoto Similarity: 0.108
  • Top-5 Tanimoto Similarity: 0.130
  • MCES Distance (Top-1): 37.93
  • Exact Match Accuracy: 0.0% (consistent with state-of-the-art results on this benchmark, where exact de novo structure matches remain out of reach).
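For reference, the Tanimoto scores above compare molecular fingerprints of the predicted and true structures (in practice computed with a cheminformatics toolkit such as RDKit). The metric itself is just the Jaccard index over "on" fingerprint bits, as this minimal pure-Python sketch shows:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Toy fingerprints: sets of "on" bit indices (real ones come from
# e.g. Morgan fingerprints of the predicted vs. true molecule).
pred = {1, 4, 7, 9}
true = {1, 4, 8}
print(tanimoto(pred, true))  # 2 shared bits / 5 total bits = 0.4
```

A score of 1.0 means identical fingerprints; the ~0.1 averages reported above reflect how hard de novo generation from spectra remains.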

Key Features

  • Intensity Rank Embeddings: Prioritizes strong spectral signals to guide the generation process.
  • Beam Search Decoding: Explores multiple structural paths in parallel, recovering higher-scoring candidates than greedy decoding.
  • Teacher Forcing: Ensures stable training and fast convergence for complex chemical grammar.
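The decoding loop behind the beam-search feature can be sketched as follows. This is a generic, self-contained illustration (not the repository's implementation): `step_fn` stands in for the decoder's next-token distribution, and the length penalty uses the common GNMT-style form; both are assumptions for the sake of the example.

```python
import math

def beam_search(step_fn, bos, eos, k=5, max_len=10, alpha=0.6):
    """Minimal beam search with a GNMT-style length penalty.

    step_fn(seq) -> list of (token, log_prob) continuations.
    Returns the completed sequence with the best penalized score.
    """
    beams = [([bos], 0.0)]  # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, logp in step_fn(seq):
                new_seq = seq + [tok]
                if tok == eos:
                    # Length penalty keeps short sequences from dominating.
                    lp = ((5 + len(new_seq)) / 6) ** alpha
                    finished.append((new_seq, (score + logp) / lp))
                else:
                    candidates.append((new_seq, score + logp))
        # Keep only the k highest-scoring open beams.
        beams = sorted(candidates, key=lambda c: -c[1])[:k]
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[1])[0]

# Toy "decoder": emits C or O, then strongly prefers ending the sequence.
def toy_step(seq):
    if len(seq) >= 3:
        return [("<eos>", math.log(0.9)), ("C", math.log(0.1))]
    return [("C", math.log(0.6)), ("O", math.log(0.4))]

print(beam_search(toy_step, "<bos>", "<eos>", k=2, max_len=5))
# ['<bos>', 'C', 'C', '<eos>']
```

With k=1 this reduces to greedy decoding; wider beams trade compute for a better chance of finding a chemically valid, high-likelihood SMILES string.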

Usage

Full training and generation scripts are available at the GitHub Repository.
