---
license: mit
datasets:
- k050506koch/SPECTR
pipeline_tag: image-text-to-text
---
# SPECTR
A deep learning model for predicting chemical formulas from mass spectrometry data using a CNN-Transformer architecture.
## Overview
SPECTR is a machine learning system that accepts mass spectral data (m/z peaks and intensities) and predicts the corresponding chemical formula. The model uses a convolutional neural network encoder to process spectral peaks and a transformer decoder to generate chemical formulas token by token.
## Key Features
- CNN-based encoder for processing mass spectrometry peaks
- Transformer decoder for sequential chemical formula generation
- Support for all chemical elements from H to Og
- Handles up to 300 peaks per spectrum
- Multiple decoding strategies: greedy, beam search, top-k, and top-p sampling
- Training with gradient accumulation and learning rate scheduling
- Integration with Weights & Biases for experiment tracking
- Checkpoint saving and resumption support
## Architecture
The model consists of two main components:
### Encoder
- 4-layer CNN with batch normalization and max pooling
- Processes m/z and intensity pairs
- Global adaptive pooling for consistent output size
- Output: encoded representation of spectral data
### Decoder
- Transformer decoder with multi-head attention
- Token embedding with positional encoding
- Causal masking for autoregressive generation
- Vocabulary: special tokens, chemical elements (H-Og), and digits (0-9)
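The encoder-decoder pairing described above can be sketched as follows. This is an illustrative toy version, not the repository's actual classes; the names (`SpectrumEncoder`, `FormulaDecoder`), layer counts, and channel sizes are assumptions:

```python
import torch
import torch.nn as nn

class SpectrumEncoder(nn.Module):
    """Toy CNN encoder: (batch, 2, n_peaks) -> (batch, 1, d_model)."""
    def __init__(self, d_model=256):
        super().__init__()
        layers, ch = [], 2  # two input channels: m/z and intensity
        for out_ch in (32, 64, 128, d_model):  # 4 conv blocks
            layers += [nn.Conv1d(ch, out_ch, 3, padding=1),
                       nn.BatchNorm1d(out_ch), nn.ReLU(), nn.MaxPool1d(2)]
            ch = out_ch
        self.cnn = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool1d(1)  # global pooling -> fixed size

    def forward(self, peaks):  # peaks: (B, 2, n_peaks)
        return self.pool(self.cnn(peaks)).transpose(1, 2)  # (B, 1, d_model)

class FormulaDecoder(nn.Module):
    """Toy transformer decoder generating formula tokens autoregressively."""
    def __init__(self, vocab_size=135, d_model=256, nhead=8, layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.dec = nn.TransformerDecoder(dec_layer, layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, memory):  # tokens: (B, T)
        T = tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T)  # causal
        h = self.dec(self.emb(tokens), memory, tgt_mask=mask)
        return self.out(h)  # (B, T, vocab_size)
```

The causal mask ensures each position attends only to earlier formula tokens, which is what makes token-by-token generation consistent between training and inference.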
## Requirements
- Python 3.8+
- PyTorch 2.0+
- NumPy
- pandas
- scikit-learn
- wandb (for experiment tracking)
- tqdm (optional, for progress bars)
Additional dependencies for data preparation:
- requests
- beautifulsoup4
## Installation
1. Clone the repository:
```bash
git clone https://github.com/krll-corp/SPECTR.git
cd SPECTR
```
2. Install dependencies:
```bash
pip install torch numpy pandas scikit-learn wandb tqdm requests beautifulsoup4
```
## Data Format
The model expects data in JSONL format with the following structure:
```json
{"formula": "C6H12O6", "peaks": [{"m/z": 180.063, "intensity": 999}, {"m/z": 145.050, "intensity": 450}]}
```
Each line contains:
- `formula`: Chemical formula as a string (e.g., "C6H12O6")
- `peaks`: List of peak objects with `m/z` (mass-to-charge ratio) and `intensity` values
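Reading this format takes only a few lines. The sketch below sorts peaks by descending intensity and truncates to a peak budget, a common convention; the repository's actual loader may order or normalize differently:

```python
import json

def parse_spectra(lines, max_peaks=300):
    """Parse JSONL records into (formula, peaks) pairs.

    Peaks are sorted by descending intensity and truncated to max_peaks
    (illustrative; the project's own loader may differ).
    """
    for line in lines:
        rec = json.loads(line)
        peaks = sorted(rec["peaks"], key=lambda p: -p["intensity"])
        yield rec["formula"], peaks[:max_peaks]
```

Used on the example record above, `parse_spectra` yields the formula string together with its intensity-ordered peak list.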
## Usage
### Training
Train the model using the `train_conv.py` script:
```bash
python train_conv.py
```
Training options:
- `--resume`: Resume from previous run if available
- `--checkpoint`: Path to checkpoint file (default: `checkpoint_last.pt`)
- `--save-every`: Save checkpoint every N steps (default: 500)
The script expects the dataset at `../mona_massbank_dataset.jsonl`; edit the `data_file` variable in the script to point at your own dataset.
Training hyperparameters (configurable in the script):
- Model dimension: 256
- Attention heads: 8
- Encoder/decoder layers: 8
- Batch size: 64
- Learning rate: 1e-4
- Max training steps: 10,000
- Gradient accumulation: 16 steps
- Warmup steps: 3% of max steps
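With these numbers (1e-4 peak learning rate, 10,000 max steps, 3% warmup), the warmup-then-cosine-decay schedule mentioned under Training Features can be sketched as a plain function of the step index. This is an illustrative reconstruction, not the script's exact scheduler:

```python
import math

def lr_at(step, max_steps=10_000, base_lr=1e-4, warmup_frac=0.03):
    """Linear warmup to base_lr, then cosine decay toward zero (illustrative)."""
    warmup = max(1, round(warmup_frac * max_steps))  # 300 steps here
    if step < warmup:
        return base_lr * (step + 1) / warmup  # linear ramp
    progress = (step - warmup) / max(1, max_steps - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay
```

In PyTorch this shape is usually wired in via `torch.optim.lr_scheduler.LambdaLR` with a multiplier of `lr_at(step) / base_lr`.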
### Evaluation
Evaluate a trained model using the `eval_conv3.py` script:
```bash
python eval_conv3.py --checkpoint <path_to_checkpoint> --data <path_to_data> --device cuda
```
Options:
- `--checkpoint`: Path to trained model checkpoint (default: `checkpoint_best.pt`)
- `--data`: Path to evaluation data file (default: `massbank_dataset.jsonl`)
- `--device`: Device to use (choices: `cpu`, `cuda`, `mps`)
- `--strategy`: Decoding strategy (choices: `greedy`, `beam`, `top_k`, `top_p`; default: `greedy`)
- `--beam-width`: Beam width for beam search (default: 3)
- `--top-k`: Top-k value for top-k sampling (default: 5)
- `--top-p`: Top-p value for nucleus sampling (default: 0.9)
- `--limit`: Limit number of samples to evaluate (optional)
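The `top_k` and `top_p` strategies both work by masking the logit vector before sampling. A minimal sketch of the two filters over a plain list of logits (illustrative, not the evaluation script's code):

```python
import math

def top_k_filter(logits, k):
    """Keep the k largest logits; mask the rest to -inf."""
    thresh = sorted(logits, reverse=True)[k - 1]
    return [x if x >= thresh else float("-inf") for x in logits]

def top_p_filter(logits, p):
    """Nucleus sampling: keep the smallest set of tokens whose
    cumulative probability reaches p; mask the rest to -inf."""
    z = max(logits)  # subtract max for numerical stability
    probs = [math.exp(x - z) for x in logits]
    total = sum(probs)
    order = sorted(range(len(logits)), key=lambda i: -probs[i])
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i] / total
        if cum >= p:
            break
    return [x if i in keep else float("-inf") for i, x in enumerate(logits)]
```

After filtering, the remaining logits are softmaxed and a token is drawn from that restricted distribution; greedy and beam search instead take the arg-max (or the top `--beam-width` continuations) deterministically.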
### Data Preparation
The `data/` directory contains utilities for data collection and preparation:
- `crawler.py`: Extract mass spectrometry data from MassBank web records
- `filtering.py`: Process and filter MoNA and MassBank datasets into training format
- `analyze_massbank.py`: Analyze MassBank dataset statistics
Example usage for data crawling:
```bash
cd data
python crawler.py
```
## Model Specifications
Default configuration:
- Input: Up to 300 mass spectrometry peaks (m/z, intensity pairs)
- Output: Chemical formula up to 50 tokens
- Vocabulary size: 135 tokens (4 special + 118 elements + 10 digits + 3 reserved)
- Model parameters: ~12M (base configuration)
Supported chemical elements: all elements from H (hydrogen) to Og (oganesson)
Special tokens:
- `<PAD>`: Padding token
- `<SOS>`: Start of sequence
- `<EOS>`: End of sequence
- `<UNK>`: Unknown token
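The vocabulary layout above (specials, then element symbols, then digits) implies a simple longest-match tokenizer: try a two-character element symbol before a one-character one, so "He" is not split into "H" + "e". A sketch with a truncated element list (the full model uses all 118 symbols, H through Og; everything else here is an assumption about the layout):

```python
SPECIALS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]
ELEMENTS = ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne"]  # truncated
DIGITS = [str(d) for d in range(10)]

def build_vocab():
    """Map each token to an integer id, specials first."""
    tokens = SPECIALS + ELEMENTS + DIGITS
    return {tok: i for i, tok in enumerate(tokens)}

def tokenize_formula(formula, vocab):
    """Greedy longest-match tokenization: prefer two-character
    element symbols over single characters."""
    out, i = [], 0
    while i < len(formula):
        if formula[i:i + 2] in vocab:
            out.append(formula[i:i + 2]); i += 2
        elif formula[i] in vocab:
            out.append(formula[i]); i += 1
        else:
            out.append("<UNK>"); i += 1
    return out
```

With the full 118-element table this layout yields the 4 + 118 + 10 = 132 assigned ids, leaving 3 reserved slots in the 135-token vocabulary.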
## Training Features
- Automatic mixed precision training support
- TF32 precision for Ampere GPUs
- Learning rate warmup and cosine decay
- Gradient accumulation for effective larger batch sizes
- Token-level accuracy metric (excluding padding)
- Validation every 1000 steps
- Automatic checkpoint saving
- Resume capability from saved checkpoints
- Weights & Biases integration for tracking
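The token-level accuracy metric listed above must skip padding positions, or short formulas in a padded batch would inflate the score. A minimal sketch over flat id sequences (the `<PAD>` id of 0 is an assumption):

```python
PAD_ID = 0  # assumed id of <PAD>

def token_accuracy(pred_ids, target_ids, pad_id=PAD_ID):
    """Fraction of non-padding target positions predicted correctly."""
    correct = total = 0
    for p, t in zip(pred_ids, target_ids):
        if t == pad_id:
            continue  # padding does not count for or against the model
        total += 1
        correct += (p == t)
    return correct / total if total else 0.0
```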
## Performance
The model is evaluated using:
- Cross-entropy loss (ignoring padding tokens)
- Token-level accuracy
- Perplexity
- Exact match accuracy (during evaluation)
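The first three metrics are tightly linked: perplexity is just the exponential of the mean cross-entropy over non-padding tokens. A small sketch making that relationship explicit (log-probabilities here are per-position dicts for illustration; the real code operates on tensors):

```python
import math

def masked_ce_and_ppl(log_probs, targets, pad_id=0):
    """Mean cross-entropy over non-pad targets, and its exponential
    (perplexity). log_probs[i] maps token id -> log-probability."""
    losses = [-lp[t] for lp, t in zip(log_probs, targets) if t != pad_id]
    ce = sum(losses) / len(losses)
    return ce, math.exp(ce)
```

Exact match accuracy is stricter: the whole decoded formula string must equal the reference, so it can be low even when token-level accuracy is high.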
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
Copyright (c) 2025 Kyryll Kochkin
## Citation
If you use SPECTR in your research, please cite this repository:
```bibtex
@software{spectr2025,
  author = {Kochkin, Kyryll},
  title  = {SPECTR: Mass Spectrometry to Chemical Formula Prediction},
  year   = {2025},
  url    = {https://github.com/krll-corp/SPECTR}
}
```
## Contributing
Contributions are welcome. Please open an issue or submit a pull request for any improvements or bug fixes.
## Acknowledgments
This project uses data from:
- MassBank: A public repository of mass spectra
- MoNA (MassBank of North America): Mass spectral database