---
license: mit
datasets:
- k050506koch/SPECTR
pipeline_tag: image-text-to-text
---

# SPECTR

A deep learning model for predicting chemical formulas from mass spectrometry data using a CNN-Transformer architecture.

## Overview

SPECTR is a machine learning system that accepts mass spectral data (m/z peaks and intensities) and predicts the corresponding chemical formula. The model uses a convolutional neural network encoder to process spectral peaks and a transformer decoder to generate chemical formulas token by token.

## Key Features

- CNN-based encoder for processing mass spectrometry peaks
- Transformer decoder for sequential chemical formula generation
- Support for all chemical elements from H to Og
- Handles up to 300 peaks per spectrum
- Multiple decoding strategies: greedy, beam search, top-k, and top-p sampling
- Training with gradient accumulation and learning rate scheduling
- Integration with Weights & Biases for experiment tracking
- Checkpoint saving and resumption support

## Architecture

The model consists of two main components:

### Encoder
- 4-layer CNN with batch normalization and max pooling
- Processes m/z and intensity pairs
- Global adaptive pooling for consistent output size
- Output: encoded representation of spectral data

### Decoder
- Transformer decoder with multi-head attention
- Token embedding with positional encoding
- Causal masking for autoregressive generation
- Vocabulary: special tokens, chemical elements (H-Og), and digits (0-9)
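
The vocabulary and causal masking described above can be sketched in plain Python. This is an illustration only, not the repository's tokenizer: `tokenize_formula` and `causal_mask` are hypothetical names, and splitting multi-digit counts into single-digit tokens is an assumption based on the vocabulary containing only the digits 0-9.

```python
import re

# Hypothetical tokenizer: one token per element symbol, one token per digit.
# Assumes element symbols are one uppercase letter plus an optional lowercase letter.
TOKEN_PATTERN = re.compile(r"[A-Z][a-z]?|\d")


def tokenize_formula(formula):
    """Split a formula such as 'C6H12O6' into element and single-digit tokens."""
    tokens = TOKEN_PATTERN.findall(formula)
    # Reject input containing characters the pattern did not consume.
    if "".join(tokens) != formula:
        raise ValueError(f"unsupported characters in formula: {formula!r}")
    return ["<SOS>"] + tokens + ["<EOS>"]


def causal_mask(length):
    """Return a mask where position i may attend only to positions j <= i."""
    return [[j <= i for j in range(length)] for i in range(length)]


tokens = tokenize_formula("C6H12O6")
mask = causal_mask(len(tokens))
```

With a mask of this shape, the decoder cannot look at later tokens of the formula while predicting the current one, which is what makes token-by-token generation possible at inference time.
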

## Requirements

- Python 3.8+
- PyTorch 2.0+
- NumPy
- pandas
- scikit-learn
- wandb (for experiment tracking)
- tqdm (optional, for progress bars)

Additional dependencies for data preparation:
- requests
- beautifulsoup4

## Installation

1. Clone the repository:
```bash
git clone https://github.com/krll-corp/SPECTR.git
cd SPECTR
```

2. Install dependencies:
```bash
pip install torch numpy pandas scikit-learn wandb tqdm requests beautifulsoup4
```

## Data Format

The model expects data in JSONL format with the following structure:

```json
{"formula": "C6H12O6", "peaks": [{"m/z": 180.063, "intensity": 999}, {"m/z": 145.050, "intensity": 450}]}
```

Each line contains:
- `formula`: Chemical formula as a string (e.g., "C6H12O6")
- `peaks`: List of peak objects with `m/z` (mass-to-charge ratio) and `intensity` values
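
As a rough illustration of reading this format, the sketch below parses one JSONL line and pads or truncates the peak list to the 300-peak limit mentioned in this README. The function name `parse_record`, the choice to keep the most intense peaks when truncating, and the zero padding are all assumptions; the repository's own loader may handle these cases differently.

```python
import json

MAX_PEAKS = 300  # per-spectrum limit stated in this README


def parse_record(line, max_peaks=MAX_PEAKS):
    """Parse one JSONL line into (formula, fixed-length list of (m/z, intensity))."""
    record = json.loads(line)
    peaks = [(float(p["m/z"]), float(p["intensity"])) for p in record["peaks"]]
    # Assumption: when a spectrum has too many peaks, keep the most intense ones.
    peaks = sorted(peaks, key=lambda p: p[1], reverse=True)[:max_peaks]
    # Restore ascending m/z order after the intensity-based selection.
    peaks.sort(key=lambda p: p[0])
    # Assumption: pad short spectra with (0.0, 0.0) pairs.
    padding = [(0.0, 0.0)] * (max_peaks - len(peaks))
    return record["formula"], peaks + padding


line = '{"formula": "C6H12O6", "peaks": [{"m/z": 180.063, "intensity": 999}, {"m/z": 145.050, "intensity": 450}]}'
formula, peaks = parse_record(line)
```
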

## Usage

### Training

Train the model using the `train_conv.py` script:

```bash
python train_conv.py
```

Training options:
- `--resume`: Resume from previous run if available
- `--checkpoint`: Path to checkpoint file (default: `checkpoint_last.pt`)
- `--save-every`: Save checkpoint every N steps (default: 500)

The script expects a data file at `../mona_massbank_dataset.jsonl`. Modify the `data_file` variable in the script to point to your dataset.

Training hyperparameters (configurable in the script):
- Model dimension: 256
- Attention heads: 8
- Encoder/decoder layers: 8
- Batch size: 64
- Learning rate: 1e-4
- Max training steps: 10,000
- Gradient accumulation: 16 steps
- Warmup steps: 3% of max steps
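
To make these numbers concrete, the sketch below works out the warmup length and the effective batch size, and shows one possible warmup-plus-cosine-decay schedule (cosine decay is listed under Training Features). The linear warmup shape, the decay to zero, and the function name `learning_rate` are assumptions; the schedule in `train_conv.py` may differ in detail.

```python
import math

MAX_STEPS = 10_000
BASE_LR = 1e-4
WARMUP_STEPS = round(0.03 * MAX_STEPS)      # 3% of max steps
BATCH_SIZE = 64
ACCUM_STEPS = 16
EFFECTIVE_BATCH = BATCH_SIZE * ACCUM_STEPS  # samples per optimizer update


def learning_rate(step):
    """Assumed schedule: linear warmup to BASE_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, MAX_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))
```

With the defaults above, warmup lasts 300 steps and each optimizer update sees 64 × 16 = 1024 samples.
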

### Evaluation

Evaluate a trained model using the `eval_conv3.py` script:

```bash
python eval_conv3.py --checkpoint <path_to_checkpoint> --data <path_to_data> --device cuda
```

Options:
- `--checkpoint`: Path to trained model checkpoint (default: `checkpoint_best.pt`)
- `--data`: Path to evaluation data file (default: `massbank_dataset.jsonl`)
- `--device`: Device to use (choices: `cpu`, `cuda`, `mps`)
- `--strategy`: Decoding strategy (choices: `greedy`, `beam`, `top_k`, `top_p`; default: `greedy`)
- `--beam-width`: Beam width for beam search (default: 3)
- `--top-k`: Top-k value for top-k sampling (default: 5)
- `--top-p`: Top-p value for nucleus sampling (default: 0.9)
- `--limit`: Limit number of samples to evaluate (optional)
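
For readers unfamiliar with the sampling strategies, the following sketch shows how greedy, top-k, and top-p (nucleus) selection differ for a single decoding step. It works on a plain dictionary of token probabilities and is not taken from `eval_conv3.py`; all function names are hypothetical, and only the defaults of 5 and 0.9 come from the options above.

```python
import random


def greedy_pick(probs):
    """Greedy decoding: always take the most probable token."""
    return max(probs, key=probs.get)


def top_k_filter(probs, k=5):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def sample(probs, rng=random):
    """Draw one token from a (filtered) distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Made-up next-token distribution, for illustration only.
step_probs = {"C": 0.5, "H": 0.3, "O": 0.15, "N": 0.05}
next_token = sample(top_p_filter(step_probs, p=0.9))
```

Beam search is not shown here; it tracks several partial formulas at once instead of committing to one token per step.
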

### Data Preparation

The `data/` directory contains utilities for data collection and preparation:

- `crawler.py`: Extract mass spectrometry data from MassBank web records
- `filtering.py`: Process and filter MoNA and MassBank datasets into training format
- `analyze_massbank.py`: Analyze MassBank dataset statistics

Example usage for data crawling:
```bash
cd data
python crawler.py
```

## Model Specifications

Default configuration:
- Input: Up to 300 mass spectrometry peaks (m/z, intensity pairs)
- Output: Chemical formula up to 50 tokens
- Vocabulary size: 135 tokens (4 special + 118 elements + 10 digits + 3 reserved)
- Model parameters: ~12M (base configuration)

Supported chemical elements: All elements from H (Hydrogen) to Og (Oganesson)

Special tokens:
- `<PAD>`: Padding token
- `<SOS>`: Start of sequence
- `<EOS>`: End of sequence
- `<UNK>`: Unknown token

## Training Features

- Automatic mixed precision training support
- TF32 precision for Ampere GPUs
- Learning rate warmup and cosine decay
- Gradient accumulation for effective larger batch sizes
- Token-level accuracy metric (excluding padding)
- Validation every 1000 steps
- Automatic checkpoint saving
- Resume capability from saved checkpoints
- Weights & Biases integration for tracking

## Performance

The model is evaluated using:
- Cross-entropy loss (ignoring padding tokens)
- Token-level accuracy
- Perplexity
- Exact match accuracy (during evaluation)
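
The metrics above can be written down compactly. The sketch below is illustrative rather than the repository's evaluation code: the function names are hypothetical, perplexity is taken as the exponential of the mean cross-entropy in nats, and exact match is assumed to mean plain string equality between predicted and reference formulas.

```python
import math

PAD = "<PAD>"


def token_accuracy(predicted, target, pad=PAD):
    """Fraction of non-padding target positions predicted correctly."""
    pairs = [(p, t) for p, t in zip(predicted, target) if t != pad]
    if not pairs:
        return 0.0
    return sum(p == t for p, t in pairs) / len(pairs)


def perplexity(mean_cross_entropy):
    """Perplexity from a mean cross-entropy loss measured in nats."""
    return math.exp(mean_cross_entropy)


def exact_match(predicted_formulas, target_formulas):
    """Share of formulas reproduced exactly (assumed: string equality)."""
    matches = sum(p == t for p, t in zip(predicted_formulas, target_formulas))
    return matches / len(target_formulas)
```

Under string equality, a prediction that lists the right atoms in a different order (for example `H12C6O6` for `C6H12O6`) counts as a miss, so exact match is a stricter measure than token-level accuracy.
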

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

Copyright (c) 2025 Kyryll Kochkin

## Citation

If you use SPECTR in your research, please cite this repository:

```bibtex
@software{spectr2025,
  author = {Kochkin, Kyryll},
  title = {SPECTR: Mass Spectrometry to Chemical Formula Prediction},
  year = {2025},
  url = {https://github.com/krll-corp/SPECTR}
}
```

## Contributing

Contributions are welcome. Please open an issue or submit a pull request for any improvements or bug fixes.

## Acknowledgments

This project uses data from:
- MassBank: A public repository of mass spectra
- MoNA (MassBank of North America): Mass spectral database