---
license: cc-by-nc-nd-4.0
tags:
- mass-spectrometry
- molecular-formula
- dissolved-organic-matter
- machine-learning
- scikit-learn
library_name: sklearn
datasets:
- SaeedLab/dom-formula-assignment-data
---

# DOM Formula Assignment using Machine Learning


![Model Type](https://img.shields.io/badge/Model-ML-blue)
![Data](https://img.shields.io/badge/Data-FT--ICR_MS-green)
![License](https://img.shields.io/badge/License-CC_BY_NC_ND_4-yellow)
[![GitHub](https://img.shields.io/badge/GitHub-pcdslab/dom--formula--assignment--using--ml-blue?logo=github)](https://github.com/pcdslab/dom-formula-assignment-using-ml)
[![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-yellow)](https://huggingface.co/datasets/SaeedLab/dom-formula-assignment-data)

**Training and Testing Data for A Machine Learning and Benchmarking Approach for Molecular Formula Assignment of Ultra High-Resolution Mass Spectrometry Data from Complex Mixtures**

> **Paper**: Under review
> **Dataset**: [SaeedLab/dom-formula-assignment-data](https://huggingface.co/datasets/SaeedLab/dom-formula-assignment-data)

---

## Abstract
A machine learning approach to molecular formula assignment is crucial for unlocking the full potential of ultra-high resolution mass spectrometry (UHRMS) when analyzing complex mixtures. By combining data-driven models with rigorous benchmarking, the accuracy, consistency, and speed of identifying plausible molecular formulas from vast spectral datasets can be improved. Compared with traditional de novo methods that rely heavily on rule-based heuristics and manual parameter tuning, machine learning approaches can capture complex patterns in data and adapt more readily to diverse sample types. In this paper, we describe the application of a machine learning method using the k-nearest neighbors (KNN) algorithm trained on curated chemical formula datasets from UHRMS analysis of dissolved organic matter (DOM) covering the saline river continuum and tropical wet/dry season variability. The influence of mass accuracy (training sets with 0.15-1 ppm) was evaluated on a blind test set of DOM samples of different geographical origins. A Decision Tree Regressor (DTR) and a Random Forest Regressor (RFR) based on mass accuracy (<1 ppm) were also evaluated. Our ML models annotated 43% more formulas than traditional methods (5,796 vs 4,047), while Model-Synthetic achieved a 99.9% assignment rate and annotated/assigned 2x more formulas (8,268 vs 4,047). DTR and RFR achieved formula-level accuracies (FA) of 86.5% and 60.4%, respectively. Overall, results show an increase in formula assignment when compared with traditional methods. This ultimately enables more reliable characterization of complex natural and engineered systems, supporting advances in fields such as environmental science, metabolomics, and petroleomics. Furthermore, the novel data set produced for this study is made publicly available, establishing an initial benchmark for molecular formula assignment in UHRMS using machine learning.
The dataset and code are publicly available at: https://github.com/pcdslab/dom-formula-assignment-using-ml

## Architecture

![DOM formula assignment architecture](architecture.png)
(a) KNN pipeline for formula assignment/annotation. For each training set, a KNN model is first fitted, and formula assignment/annotation is then performed using the closest peaks in the training set. (b) Decision Tree Regressor (DTR) and Random Forest Regressor (RFR) are trained to predict element counts for CHONS using the squared-error criterion.

## Usage

Install the dependencies:

```bash
pip install "numpy==2.4.1" "scikit-learn==1.8.0" joblib huggingface_hub transformers torch pandas
```


### KNN Formula Assignment

Load the default KNN model:

```python
from transformers import AutoModel
import torch

model = AutoModel.from_pretrained(
    "SaeedLab/dom-formula-assignment-using-ml",
    trust_remote_code=True,
).eval()

# Mass-only input. Each value is one mass to assign.
masses = torch.tensor([350.123456, 421.987654], dtype=torch.float64)

with torch.no_grad():
    outputs = model(features=masses, return_neighbors=True, n_neighbors=3)

print("Predicted formulas:", outputs.predictions)
print("Neighbor indices:", outputs.indices)
```

Load a specific KNN model by passing `model_name`:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "SaeedLab/dom-formula-assignment-using-ml",
    trust_remote_code=True,
    model_name="L1-L3_K3_Manhattan_Ensemble",
).eval()
```

Use the `Model Name` value from the table below. For example, `Synthetic_K3_Euclidean_Ensemble` selects the default synthetic KNN ensemble.

There is no tokenizer step for these KNN models. Inputs are 1D numeric vectors of mass values.

### Get nearest neighbors

KNN models can also return the closest training examples for each input sample:

```python
# Reuses `model` and `masses` from the KNN example above.
neighbor_indices = model.neighbor_indices(masses, n_neighbors=3)

print("Neighbor indices:")
print(neighbor_indices)
```
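
As a rough illustration of what a nearest-neighbor lookup does for mass-only input, the sketch below finds the training peaks closest to a query mass and takes a majority vote over their formulas. It is not the repository's implementation: `nearest_indices`, `assign_formula`, and the toy peak list are hypothetical names and placeholder values. With a single mass feature, Euclidean and Manhattan distances both reduce to the absolute mass difference, which is what the sketch uses.

```python
from bisect import bisect_left
from collections import Counter


def nearest_indices(sorted_masses, query, k=3):
    """Indices of the k masses closest to `query`, nearest first.

    `sorted_masses` must be sorted in ascending order.
    """
    hi = bisect_left(sorted_masses, query)  # first index with mass >= query
    lo = hi - 1
    picked = []
    # Walk outward from the insertion point, always taking the closer side.
    while len(picked) < k and (lo >= 0 or hi < len(sorted_masses)):
        left_gap = query - sorted_masses[lo] if lo >= 0 else float("inf")
        right_gap = (
            sorted_masses[hi] - query if hi < len(sorted_masses) else float("inf")
        )
        if left_gap <= right_gap:
            picked.append(lo)
            lo -= 1
        else:
            picked.append(hi)
            hi += 1
    return picked


def assign_formula(sorted_masses, formulas, query, k=3):
    """Majority vote over the k nearest peaks; ties go to the nearest peak."""
    idx = nearest_indices(sorted_masses, query, k)
    votes = Counter(formulas[i] for i in idx)
    top = max(votes.values())
    winner = next(formulas[i] for i in idx if votes[formulas[i]] == top)
    return winner, idx


# Toy placeholder data (assumption): not taken from the training sets.
toy_masses = [100.0, 100.5, 101.0, 150.0]
toy_formulas = ["formula_A", "formula_A", "formula_B", "formula_C"]

formula, neighbors = assign_formula(toy_masses, toy_formulas, query=100.4, k=3)
print(formula, neighbors)
```

The returned indices play the same role as the neighbor indices reported by the model: they point back to the training peaks that drove the assignment.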

Use the model variant that matches your experiment setup:

- `L1`, `L3`, `L1-L3`, or `Synthetic` for the training dataset.
- `K1` or `K3` for the number of neighbors.
- `Euclidean` or `Manhattan` for the distance metric.
- `Ensemble`, `7T`, `21T`, or `SYN` for combined, field-strength, or synthetic backing models.
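
The naming pattern can be composed programmatically. The helper below is a minimal sketch, assuming the names follow the pattern visible in the model list (`<dataset>_K<k>_<metric>`, with an `_Ensemble` suffix for the `L1-L3` and `Synthetic` variants); `knn_model_name` is a hypothetical convenience function, not part of the repository.

```python
def knn_model_name(dataset, k, metric):
    """Compose a KNN model name following the pattern in the model list."""
    if dataset not in {"L1", "L3", "L1-L3", "Synthetic"}:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if k not in (1, 3):
        raise ValueError(f"unsupported number of neighbors: {k!r}")
    if metric not in {"Euclidean", "Manhattan"}:
        raise ValueError(f"unknown distance metric: {metric!r}")

    name = f"{dataset}_K{k}_{metric}"
    # In the model list, the combined L1-L3 and Synthetic variants
    # carry an `_Ensemble` suffix; the single-dataset ones do not.
    if dataset in {"L1-L3", "Synthetic"}:
        name += "_Ensemble"
    return name


print(knn_model_name("L1-L3", 3, "Manhattan"))
print(knn_model_name("L1", 1, "Euclidean"))
```

The resulting string can be passed as `model_name` to `AutoModel.from_pretrained`.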

### Decision Tree and Random Forest Formula Regressors

The Decision Tree and Random Forest models predict CHONS element counts from mass and ion mobility features. Inputs must be numeric rows in this order:

```text
[mz, inv_k0, ccs]
```

Here, `inv_k0` is `1/k0`, and both `inv_k0` and `ccs` are mobility-derived values.
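
If your peak list stores the mobility `k0` rather than its inverse, the rows can be assembled as in the sketch below. The record keys (`"mz"`, `"k0"`, `"ccs"`) and the helper name `build_feature_rows` are assumptions for illustration, and the sample values are placeholders rather than measured data.

```python
def build_feature_rows(peaks):
    """Return rows ordered as [mz, inv_k0, ccs] from peak records."""
    rows = []
    for peak in peaks:
        k0 = float(peak["k0"])
        if k0 <= 0:
            raise ValueError("k0 must be positive to compute 1/k0")
        rows.append([float(peak["mz"]), 1.0 / k0, float(peak["ccs"])])
    return rows


# Placeholder values (assumption), not measured data.
rows = build_feature_rows([{"mz": 191.03498, "k0": 1.25, "ccs": 150.0}])
print(rows)
```

A nested list in this form can be passed to `torch.tensor(rows, dtype=torch.float64)`.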

```python
from transformers import AutoModel
import torch

model = AutoModel.from_pretrained(
    "SaeedLab/dom-formula-assignment-using-ml",
    trust_remote_code=True,
    model_name="RandomForest",
).eval()

features = torch.tensor(
    [
        [191.03498, 0.6808667247437136, 146.2048497276655],
        [191.071388, 0.7087887538442335, 152.19879035313514],
        [191.10775, 0.6935974811483341, 148.93494377261592],
    ],
    dtype=torch.float64,
)

with torch.no_grad():
    outputs = model(features=features)

print("Predicted formulas:", outputs.formulas)
print("Predicted CHONS counts:", outputs.formula_counts)
```

The CHONS output columns are ordered as:

```text
[C, H, O, N, S]
```
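
A predicted count row can be sanity-checked against the measured `mz` by converting it to a formula and a theoretical mass. The sketch below is illustrative only: the rounding step, the formula string layout, and the `[M-H]-` ion assumption are not stated in this card, and all helper names are hypothetical. The example row is a made-up prediction, not actual model output.

```python
# Monoisotopic masses (u) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "O": 15.99491462,
    "N": 14.00307400,
    "S": 31.97207117,
}
ELEMENT_ORDER = ("C", "H", "O", "N", "S")  # matches the output column order
PROTON_MASS = 1.00727647


def to_integer_counts(row):
    """Round raw predictions to non-negative integers (assumed post-processing)."""
    return [max(0, round(value)) for value in row]


def formula_string(counts):
    """Build a formula string in C, H, O, N, S order, skipping absent elements."""
    parts = []
    for element, n in zip(ELEMENT_ORDER, counts):
        if n > 0:
            parts.append(element if n == 1 else f"{element}{n}")
    return "".join(parts)


def neutral_mass(counts):
    """Monoisotopic mass of the neutral molecule."""
    return sum(MONOISOTOPIC[e] * n for e, n in zip(ELEMENT_ORDER, counts))


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


# Hypothetical prediction row (assumption), not actual model output.
counts = to_integer_counts([10.2, 7.9, 4.0, 0.1, 0.0])
print(formula_string(counts))  # C10H8O4

# Assumes a singly deprotonated ion, [M-H]-; adjust for other ion types.
theoretical_mz = neutral_mass(counts) - PROTON_MASS
print(ppm_error(191.03498, theoretical_mz))
```

A large ppm error for a predicted row is a hint that the counts, or the assumed ion type, do not match the measured peak.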

## KNN Models

This repository contains pre-trained KNN models used for molecular formula assignment. The models vary by:
- **Dataset**: L1 (7T), L3 (21T), L1-L3, Synthetic
- **Neighbors (K)**: 1, 3
- **Distance Metric**: Euclidean, Manhattan
- **Field Strength/Type**: Ensemble, 7T, 21T, SYN (Synthetic)

### Model List

| Model Name | Description |
|---|---|
| `L1_K1_Euclidean` | L1 (7T) KNN model with 1 neighbor and Euclidean distance |
| `L1_K1_Manhattan` | L1 (7T) KNN model with 1 neighbor and Manhattan distance |
| `L1_K3_Euclidean` | L1 (7T) KNN model with 3 neighbors and Euclidean distance |
| `L1_K3_Manhattan` | L1 (7T) KNN model with 3 neighbors and Manhattan distance |
| `L3_K1_Euclidean` | L3 (21T) KNN model with 1 neighbor and Euclidean distance |
| `L3_K1_Manhattan` | L3 (21T) KNN model with 1 neighbor and Manhattan distance |
| `L3_K3_Euclidean` | L3 (21T) KNN model with 3 neighbors and Euclidean distance |
| `L3_K3_Manhattan` | L3 (21T) KNN model with 3 neighbors and Manhattan distance |
| `L1-L3_K1_Euclidean_Ensemble` | Ensemble of L1 (7T) and L3 (21T) KNN models with 1 neighbor and Euclidean distance |
| `L1-L3_K1_Manhattan_Ensemble` | Ensemble of L1 (7T) and L3 (21T) KNN models with 1 neighbor and Manhattan distance |
| `L1-L3_K3_Euclidean_Ensemble` | Ensemble of L1 (7T) and L3 (21T) KNN models with 3 neighbors and Euclidean distance |
| `L1-L3_K3_Manhattan_Ensemble` | Ensemble of L1 (7T) and L3 (21T) KNN models with 3 neighbors and Manhattan distance |
| `Synthetic_K1_Euclidean_Ensemble` | Ensemble of Synthetic 7T, 21T, and SYN KNN models with 1 neighbor and Euclidean distance |
| `Synthetic_K1_Manhattan_Ensemble` | Ensemble of Synthetic 7T, 21T, and SYN KNN models with 1 neighbor and Manhattan distance |
| `Synthetic_K3_Euclidean_Ensemble` | Ensemble of Synthetic 7T, 21T, and SYN KNN models with 3 neighbors and Euclidean distance |
| `Synthetic_K3_Manhattan_Ensemble` | Ensemble of Synthetic 7T, 21T, and SYN KNN models with 3 neighbors and Manhattan distance |

## Decision Tree and Random Forest Models

| Model Name | Description |
|---|---|
| `DecisionTree` | Predicts CHONS counts from `mz`, `inv_k0`, and `ccs` |
| `RandomForest` | Predicts CHONS counts from `mz`, `inv_k0`, and `ccs` |

## License

This model and associated code are released under the CC-BY-NC-ND 4.0 license and may only be used for non-commercial, academic research purposes with proper attribution. Any commercial use, sale, or other monetization of this model and its derivatives, which include models trained on outputs from the model or datasets created from the model, is prohibited and requires prior approval. Downloading the model requires prior registration on Hugging Face and agreeing to the terms of use. By downloading this model, you agree not to distribute, publish or reproduce a copy of the model. If another user within your organization wishes to use the model, they must register as an individual user and agree to comply with the terms of use. Users may not attempt to re-identify the deidentified data used to develop the underlying model. If you are a commercial entity, please contact the corresponding author.

---


## Contact

For any additional questions or comments, contact Fahad Saeed (fsaeed@fiu.edu).

---