--- license: mit datasets: - k050506koch/SPECTR pipeline_tag: image-text-to-text --- # SPECTR A deep learning model for predicting chemical formulas from mass spectrometry data using a CNN-Transformer architecture. ## Overview SPECTR is a machine learning system that accepts mass spectral data (m/z peaks and intensities) and predicts the corresponding chemical formula. The model uses a convolutional neural network encoder to process spectral peaks and a transformer decoder to generate chemical formulas token by token. ## Key Features - CNN-based encoder for processing mass spectrometry peaks - Transformer decoder for sequential chemical formula generation - Support for all chemical elements from H to Og - Handles up to 300 peaks per spectrum - Multiple decoding strategies: greedy, beam search, top-k, and top-p sampling - Training with gradient accumulation and learning rate scheduling - Integration with Weights & Biases for experiment tracking - Checkpoint saving and resumption support ## Architecture The model consists of two main components: ### Encoder - 4-layer CNN with batch normalization and max pooling - Processes m/z and intensity pairs - Global adaptive pooling for consistent output size - Output: encoded representation of spectral data ### Decoder - Transformer decoder with multi-head attention - Token embedding with positional encoding - Causal masking for autoregressive generation - Vocabulary: special tokens, chemical elements (H-Og), and digits (0-9) ## Requirements - Python 3.8+ - PyTorch 2.0+ - NumPy - pandas - scikit-learn - wandb (for experiment tracking) - tqdm (optional, for progress bars) Additional dependencies for data preparation: - requests - beautifulsoup4 ## Installation 1. Clone the repository: ```bash git clone https://github.com/krll-corp/SPECTR.git cd SPECTR ``` 2. Install dependencies: ```bash pip install torch numpy pandas scikit-learn wandb tqdm requests beautifulsoup4 ``` ## Data Format The model expects data in JSONL format with the following structure: ```json {"formula": "C6H12O6", "peaks": [{"m/z": 180.063, "intensity": 999}, {"m/z": 145.050, "intensity": 450}]} ``` Each line contains: - `formula`: Chemical formula as a string (e.g., "C6H12O6") - `peaks`: List of peak objects with `m/z` (mass-to-charge ratio) and `intensity` values ## Usage ### Training Train the model using the `train_conv.py` script: ```bash python train_conv.py ``` Training options: - `--resume`: Resume from previous run if available - `--checkpoint`: Path to checkpoint file (default: `checkpoint_last.pt`) - `--save-every`: Save checkpoint every N steps (default: 500) The script expects a data file at `../mona_massbank_dataset.jsonl`. Modify the `data_file` variable in the script to point to your dataset. Training hyperparameters (configurable in the script): - Model dimension: 256 - Attention heads: 8 - Encoder/decoder layers: 8 - Batch size: 64 - Learning rate: 1e-4 - Max training steps: 10,000 - Gradient accumulation: 16 steps - Warmup steps: 3% of max steps ### Evaluation Evaluate a trained model using the `eval_conv3.py` script: ```bash python eval_conv3.py --checkpoint --data --device cuda ``` Options: - `--checkpoint`: Path to trained model checkpoint (default: `checkpoint_best.pt`) - `--data`: Path to evaluation data file (default: `massbank_dataset.jsonl`) - `--device`: Device to use (choices: `cpu`, `cuda`, `mps`) - `--strategy`: Decoding strategy (choices: `greedy`, `beam`, `top_k`, `top_p`; default: `greedy`) - `--beam-width`: Beam width for beam search (default: 3) - `--top-k`: Top-k value for top-k sampling (default: 5) - `--top-p`: Top-p value for nucleus sampling (default: 0.9) - `--limit`: Limit number of samples to evaluate (optional) ### Data Preparation The `data/` directory contains utilities for data collection and preparation: - `crawler.py`: Extract mass spectrometry data from MassBank web records - `filtering.py`: Process and filter MoNA and MassBank datasets into training format - `analyze_massbank.py`: Analyze MassBank dataset statistics Example usage for data crawling: ```bash cd data python crawler.py ``` ## Model Specifications Default configuration: - Input: Up to 300 mass spectrometry peaks (m/z, intensity pairs) - Output: Chemical formula up to 50 tokens - Vocabulary size: 135 tokens (4 special + 118 elements + 10 digits + 3 reserved) - Model parameters: ~12M (base configuration) Supported chemical elements: All elements from H (Hydrogen) to Og (Oganesson) Special tokens: - ``: Padding token - ``: Start of sequence - ``: End of sequence - ``: Unknown token ## Training Features - Automatic mixed precision training support - TF32 precision for Ampere GPUs - Learning rate warmup and cosine decay - Gradient accumulation for effective larger batch sizes - Token-level accuracy metric (excluding padding) - Validation every 1000 steps - Automatic checkpoint saving - Resume capability from saved checkpoints - Weights & Biases integration for tracking ## Performance The model is evaluated using: - Cross-entropy loss (ignoring padding tokens) - Token-level accuracy - Perplexity - Exact match accuracy (during evaluation) ## License This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details. Copyright (c) 2025 Kyryll Kochkin ## Citation If you use SPECTR in your research, please cite this repository: ``` @software{spectr2025, author = {Kochkin, Kyryll}, title = {SPECTR: Mass Spectrometry to Chemical Formula Prediction}, year = {2025}, url = {https://github.com/krll-corp/SPECTR} } ``` ## Contributing Contributions are welcome. Please open an issue or submit a pull request for any improvements or bug fixes. ## Acknowledgments This project uses data from: - MassBank: A public repository of mass spectra - MoNA (MassBank of North America): Mass spectral database