	---
	license: mit
	datasets:
	- k050506koch/SPECTR
	pipeline_tag: image-text-to-text
	---

	# SPECTR

	A deep learning model for predicting chemical formulas from mass spectrometry data using a CNN-Transformer architecture.

	## Overview

	SPECTR is a machine learning system that accepts mass spectral data (m/z peaks and intensities) and predicts the corresponding chemical formula. The model uses a convolutional neural network encoder to process spectral peaks and a transformer decoder to generate chemical formulas token by token.

	## Key Features

	- CNN-based encoder for processing mass spectrometry peaks
	- Transformer decoder for sequential chemical formula generation
	- Support for all chemical elements from H to Og
	- Handles up to 300 peaks per spectrum
	- Multiple decoding strategies: greedy, beam search, top-k, and top-p sampling
	- Training with gradient accumulation and learning rate scheduling
	- Integration with Weights & Biases for experiment tracking
	- Checkpoint saving and resumption support

	## Architecture

	The model consists of two main components:

	### Encoder
	- 4-layer CNN with batch normalization and max pooling
	- Processes m/z and intensity pairs
	- Global adaptive pooling for consistent output size
	- Output: encoded representation of spectral data

	### Decoder
	- Transformer decoder with multi-head attention
	- Token embedding with positional encoding
	- Causal masking for autoregressive generation
	- Vocabulary: special tokens, chemical elements (H-Og), and digits (0-9)
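
Causal masking means that, while generating token `i`, the decoder may attend only to positions `0..i`. The sketch below builds such a mask in plain Python to illustrate the idea. It is not the repository's implementation: `causal_mask` is a hypothetical helper, and the convention that `True` marks a blocked position is an assumption.

```python
def causal_mask(size):
    """Return a size x size matrix where True marks a blocked (future) position."""
    # Row = query position, column = key position.
    # A query may not look at any key that comes after it.
    return [[col > row for col in range(size)] for row in range(size)]


mask = causal_mask(4)
for row in mask:
    # "x" = blocked, "." = visible
    print("".join("x" if blocked else "." for blocked in row))
```

Each row has one more visible position than the row above it, which is what makes generation autoregressive.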

	## Requirements

	- Python 3.8+
	- PyTorch 2.0+
	- NumPy
	- pandas
	- scikit-learn
	- wandb (for experiment tracking)
	- tqdm (optional, for progress bars)

	Additional dependencies for data preparation:
	- requests
	- beautifulsoup4

	## Installation

	1. Clone the repository:
	```bash
	git clone https://github.com/krll-corp/SPECTR.git
	cd SPECTR
	```

	2. Install dependencies:
	```bash
	pip install torch numpy pandas scikit-learn wandb tqdm requests beautifulsoup4
	```

	## Data Format

	The model expects data in JSONL format with the following structure:

	```json
	{"formula": "C6H12O6", "peaks": [{"m/z": 180.063, "intensity": 999}, {"m/z": 145.050, "intensity": 450}]}
	```

	Each line contains:
	- `formula`: Chemical formula as a string (e.g., "C6H12O6")
	- `peaks`: List of peak objects with `m/z` (mass-to-charge ratio) and `intensity` values
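
As an illustration of this format, the following sketch parses one JSONL line and produces a fixed-length peak list. The 300-peak limit comes from this README. The rest is assumed for the example and may differ from the actual data loader: the helper name `parse_record`, keeping the most intense peaks when a spectrum is too long, and padding with `(0.0, 0.0)`.

```python
import json

MAX_PEAKS = 300  # per-spectrum limit stated in this README


def parse_record(line, max_peaks=MAX_PEAKS):
    """Parse one JSONL line into (formula, fixed-length list of (m/z, intensity))."""
    record = json.loads(line)
    peaks = [(float(p["m/z"]), float(p["intensity"])) for p in record["peaks"]]
    # Assumption: when there are too many peaks, keep the most intense ones.
    peaks = sorted(peaks, key=lambda p: p[1], reverse=True)[:max_peaks]
    # Restore ascending m/z order for the kept peaks.
    peaks.sort(key=lambda p: p[0])
    # Assumption: pad short spectra with (0.0, 0.0) pairs.
    peaks += [(0.0, 0.0)] * (max_peaks - len(peaks))
    return record["formula"], peaks


line = (
    '{"formula": "C6H12O6", "peaks": ['
    '{"m/z": 180.063, "intensity": 999}, {"m/z": 145.050, "intensity": 450}]}'
)
formula, peaks = parse_record(line)
print(formula, len(peaks), peaks[:2])
```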

	## Usage

	### Training

	Train the model using the `train_conv.py` script:

	```bash
	python train_conv.py
	```

	Training options:
	- `--resume`: Resume from previous run if available
	- `--checkpoint`: Path to checkpoint file (default: `checkpoint_last.pt`)
	- `--save-every`: Save checkpoint every N steps (default: 500)

	By default, the script reads its data from `../mona_massbank_dataset.jsonl`. To train on a different dataset, change the `data_file` variable in the script.

	Training hyperparameters (configurable in the script):
	- Model dimension: 256
	- Attention heads: 8
	- Encoder/decoder layers: 8
	- Batch size: 64
	- Learning rate: 1e-4
	- Max training steps: 10,000
	- Gradient accumulation: 16 steps
	- Warmup steps: 3% of max steps
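
Two derived quantities follow from the values above. The short calculation below spells them out. The variable names are illustrative and are not taken from `train_conv.py`, and it assumes that "max steps" counts the same steps that warmup is measured in.

```python
batch_size = 64
grad_accum_steps = 16
max_steps = 10_000
warmup_fraction = 0.03

# Samples contributing to each optimizer update when gradients are accumulated.
effective_batch_size = batch_size * grad_accum_steps
# Number of steps spent ramping the learning rate up.
warmup_steps = round(max_steps * warmup_fraction)

print(effective_batch_size, warmup_steps)  # 1024 300
```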

	### Evaluation

	Evaluate a trained model using the `eval_conv3.py` script:

	```bash
	python eval_conv3.py --checkpoint <path_to_checkpoint> --data <path_to_data> --device cuda
	```

	Options:
	- `--checkpoint`: Path to trained model checkpoint (default: `checkpoint_best.pt`)
	- `--data`: Path to evaluation data file (default: `massbank_dataset.jsonl`)
	- `--device`: Device to use (choices: `cpu`, `cuda`, `mps`)
	- `--strategy`: Decoding strategy (choices: `greedy`, `beam`, `top_k`, `top_p`; default: `greedy`)
	- `--beam-width`: Beam width for beam search (default: 3)
	- `--top-k`: Top-k value for top-k sampling (default: 5)
	- `--top-p`: Top-p value for nucleus sampling (default: 0.9)
	- `--limit`: Limit number of samples to evaluate (optional)
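
To make the sampling options concrete, here is a small sketch of how greedy, top-k, and top-p selection narrow a next-token distribution. It works on a plain list of probabilities. The function names are hypothetical and `eval_conv3.py` may implement these strategies differently. Beam search is not shown.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


# Toy distribution over five token ids.
probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]

greedy_choice = max(range(len(probs)), key=lambda i: probs[i])
print(greedy_choice)                      # 0
print(sorted(top_k_filter(probs, 2)))     # [0, 1]
print(sorted(top_p_filter(probs, 0.9)))   # [0, 1, 2, 3]
```

After filtering, a sampling strategy would draw the next token from the renormalized probabilities, whereas greedy decoding always takes the single most probable token.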

	### Data Preparation

	The `data/` directory contains utilities for data collection and preparation:

	- `crawler.py`: Extract mass spectrometry data from MassBank web records
	- `filtering.py`: Process and filter MoNA and MassBank datasets into training format
	- `analyze_massbank.py`: Analyze MassBank dataset statistics

	Example usage for data crawling:
	```bash
	cd data
	python crawler.py
	```

	## Model Specifications

	Default configuration:
	- Input: Up to 300 mass spectrometry peaks (m/z, intensity pairs)
	- Output: Chemical formula up to 50 tokens
	- Vocabulary size: 135 tokens (4 special + 118 elements + 10 digits + 3 reserved)
	- Model parameters: ~12M (base configuration)

	Supported chemical elements: all 118 elements, from H (hydrogen) to Og (oganesson)

	Special tokens:
	- `<PAD>`: Padding token
	- `<SOS>`: Start of sequence
	- `<EOS>`: End of sequence
	- `<UNK>`: Unknown token
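
The vocabulary arithmetic above (4 + 118 + 10 + 3 = 135) can be checked with a small sketch that also tokenizes a formula digit by digit. Several details are assumptions made for illustration and are not confirmed by the source: the token ordering, the reserved-token names (`<RES0>` to `<RES2>`), the `encode` helper, and whether the 50-token limit includes `<SOS>` and `<EOS>`.

```python
import re

SPECIAL = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]
ELEMENTS = (
    "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca "
    "Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr "
    "Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La Ce Pr Nd "
    "Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg "
    "Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm "
    "Md No Lr Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og"
).split()
DIGITS = list("0123456789")
RESERVED = ["<RES0>", "<RES1>", "<RES2>"]  # hypothetical placeholder names

# Assumed ordering: special tokens, elements, digits, reserved slots.
VOCAB = SPECIAL + ELEMENTS + DIGITS + RESERVED
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}

# One element symbol (capital + optional lowercase letter) or one digit.
# Characters matching neither pattern are silently skipped.
TOKEN_RE = re.compile(r"[A-Z][a-z]?|\d")


def encode(formula, max_len=50):
    """Turn a formula string into a padded list of token ids."""
    tokens = ["<SOS>"] + TOKEN_RE.findall(formula) + ["<EOS>"]
    # Symbols that are not real elements map to <UNK>.
    # Note: truncation would drop <EOS> for sequences longer than max_len.
    ids = [TOKEN_TO_ID.get(t, TOKEN_TO_ID["<UNK>"]) for t in tokens][:max_len]
    return ids + [TOKEN_TO_ID["<PAD>"]] * (max_len - len(ids))


print(len(VOCAB))              # 135
print(encode("C6H12O6")[:9])   # [1, 9, 128, 4, 123, 124, 11, 128, 2]
```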

	## Training Features

	- Automatic mixed precision training support
	- TF32 precision for Ampere GPUs
	- Learning rate warmup and cosine decay
	- Gradient accumulation for effective larger batch sizes
	- Token-level accuracy metric (excluding padding)
	- Validation every 1000 steps
	- Automatic checkpoint saving
	- Resume capability from saved checkpoints
	- Weights & Biases integration for tracking
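
For readers unfamiliar with warmup followed by cosine decay, the sketch below shows one common form of that schedule using the hyperparameters listed in this README (learning rate 1e-4, 10,000 steps, 3% warmup). The function `lr_at`, the linear warmup shape, and the decay floor of 0 are assumptions. The training script may use a different variant.

```python
import math


def lr_at(step, base_lr=1e-4, max_steps=10_000, warmup_steps=300, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 299, 300, 5150, 10_000):
    print(step, f"{lr_at(step):.2e}")
```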

	## Performance

	The model is evaluated using:
	- Cross-entropy loss (ignoring padding tokens)
	- Token-level accuracy
	- Perplexity
	- Exact match accuracy (during evaluation)
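
The metrics listed above can be sketched in a few lines of plain Python. These helpers are illustrative only and are not the repository's evaluation code. They assume that `<PAD>` has id 0, that perplexity is the exponential of the mean per-token cross-entropy (in nats), and that exact match is a plain string comparison of predicted and reference formulas.

```python
import math

PAD_ID = 0  # assumption: id of the <PAD> token


def token_accuracy(pred_ids, target_ids, pad_id=PAD_ID):
    """Fraction of non-padding target positions predicted correctly."""
    pairs = [(p, t) for p, t in zip(pred_ids, target_ids) if t != pad_id]
    return sum(p == t for p, t in pairs) / len(pairs) if pairs else 0.0


def perplexity(mean_cross_entropy):
    """Perplexity from a mean per-token cross-entropy given in nats."""
    return math.exp(mean_cross_entropy)


def exact_match(pred_formulas, target_formulas):
    """Fraction of formulas reproduced exactly, character for character."""
    matches = sum(p == t for p, t in zip(pred_formulas, target_formulas))
    return matches / len(target_formulas)


print(token_accuracy([5, 6, 7, 0], [5, 6, 8, 0]))
print(perplexity(0.0))
print(exact_match(["C6H12O6", "CH4"], ["C6H12O6", "C2H6"]))
```

Token-level accuracy can be high while exact match stays low, because a single wrong token makes the whole formula incorrect.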

	## License

	This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

	Copyright (c) 2025 Kyryll Kochkin

	## Citation

	If you use SPECTR in your research, please cite this repository:

	```bibtex
	@software{spectr2025,
	  author = {Kochkin, Kyryll},
	  title = {SPECTR: Mass Spectrometry to Chemical Formula Prediction},
	  year = {2025},
	  url = {https://github.com/krll-corp/SPECTR}
	}
	```

	## Contributing

	Contributions are welcome. Please open an issue or submit a pull request for any improvements or bug fixes.

	## Acknowledgments

	This project uses data from:
	- MassBank: A public repository of mass spectra
	- MoNA (MassBank of North America): Mass spectral database