---
license: cc-by-nc-nd-4.0
tags:
- mass-spectrometry
- molecular-formula
- dissolved-organic-matter
- machine-learning
- scikit-learn
library_name: sklearn
datasets:
- SaeedLab/dom-formula-assignment-data
---
# DOM Formula Assignment using Machine Learning



[GitHub repository](https://github.com/pcdslab/dom-formula-assignment-using-ml)
[Hugging Face dataset](https://huggingface.co/datasets/SaeedLab/dom-formula-assignment-data)
**Training and Testing Data for A Machine Learning and Benchmarking Approach for Molecular Formula Assignment of Ultra High-Resolution Mass Spectrometry Data from Complex Mixtures**
> **Paper**: Under review
> **Dataset**: [SaeedLab/dom-formula-assignment-data](https://huggingface.co/datasets/SaeedLab/dom-formula-assignment-data)
---
## Abstract
A machine learning approach to molecular formula assignment is crucial for unlocking the full potential of ultra-high resolution mass spectrometry (UHRMS) when analyzing complex mixtures. By combining data-driven models with rigorous benchmarking, the accuracy, consistency, and speed of identifying plausible molecular formulas from vast spectral datasets can be improved. Compared with traditional de novo methods that rely heavily on rule-based heuristics and manual parameter tuning, machine learning approaches can capture complex patterns in data and adapt more readily to diverse sample types. In this paper, we describe the application of a machine learning method using the k-nearest neighbors (KNN) algorithm trained on curated chemical formula datasets from UHRMS analysis of dissolved organic matter (DOM) covering the saline river continuum and tropical wet/dry season variability. The influence of mass accuracy (training sets with 0.15-1 ppm) was evaluated on a blind test set of DOM samples of different geographical origins. A Decision Tree Regressor (DTR) and a Random Forest Regressor (RFR) based on mass accuracy (<1 ppm) were also used. Our ML models annotated 43% more formulas than traditional methods (5,796 vs 4,047), and Model-Synthetic achieved a 99.9% assignment rate and assigned about twice as many formulas (8,268 vs 4,047). DTR and RFR achieved formula-level accuracies (FA) of 86.5% and 60.4%, respectively. Overall, the results show an increase in formula assignment compared with traditional methods. This ultimately enables more reliable characterization of complex natural and engineered systems, supporting advances in fields such as environmental science, metabolomics, and petroleomics. Furthermore, the novel dataset produced for this study is made publicly available, establishing an initial benchmark for molecular formula assignment in UHRMS using machine learning.
The dataset and code are publicly available at: https://github.com/pcdslab/dom-formula-assignment-using-ml
## Architecture

(a) KNN pipeline for formula assignment/annotation. For each training set, a KNN model is first fitted, and formula assignment/annotation is then performed using the closest peaks in the training set. (b) A Decision Tree Regressor (DTR) and a Random Forest Regressor (RFR) are trained to predict CHONS element counts using the squared-error criterion.
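
The idea of assigning a formula from the closest training peak can be illustrated with a small standard-library sketch. This is not the repository's pipeline: the reference list, the `assign_formula` helper, and the `max_ppm` tolerance are hypothetical and exist only to show a 1-nearest-neighbor lookup on a mass axis with a ppm check.

```python
from bisect import bisect_left

# Toy reference peaks (monoisotopic mass, formula). Illustrative values only,
# not taken from the training data.
REFERENCE = sorted([
    (180.06339, "C6H12O6"),
    (194.07904, "C7H14O6"),
    (208.09469, "C8H16O6"),
])
REF_MASSES = [mass for mass, _ in REFERENCE]


def ppm_error(observed: float, reference: float) -> float:
    """Signed mass error in parts per million."""
    return (observed - reference) / reference * 1e6


def assign_formula(mass: float, max_ppm: float = 1.0):
    """Return the formula of the closest reference mass, or None when the
    closest mass is farther away than max_ppm (a hypothetical tolerance)."""
    i = bisect_left(REF_MASSES, mass)
    # The nearest neighbor sits just left or just right of the insertion point.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(REF_MASSES)]
    best = min(candidates, key=lambda j: abs(REF_MASSES[j] - mass))
    if abs(ppm_error(mass, REF_MASSES[best])) > max_ppm:
        return None
    return REFERENCE[best][1]


print(assign_formula(180.06345))  # within 1 ppm of the first reference peak
print(assign_formula(181.0))      # no reference peak within tolerance
```

With `K=3`, the same lookup would collect the three closest reference peaks instead of one; how the pre-trained models combine those neighbors is defined by the repository code, not by this sketch.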
## Usage
Install the dependencies:
```bash
pip install "numpy==2.4.1" "scikit-learn==1.8.0" joblib huggingface_hub transformers torch pandas
```
### KNN Formula Assignment
Load the default KNN model:
```python
from transformers import AutoModel
import torch

# trust_remote_code=True runs the model class shipped in the repository.
model = AutoModel.from_pretrained(
    "SaeedLab/dom-formula-assignment-using-ml",
    trust_remote_code=True,
).eval()

# Mass-only input. Each value is one mass to assign.
masses = torch.tensor([350.123456, 421.987654], dtype=torch.float64)

with torch.no_grad():
    outputs = model(features=masses, return_neighbors=True, n_neighbors=3)

print("Predicted formulas:", outputs.predictions)
print("Neighbor indices:", outputs.indices)
```
Load a specific KNN model by passing `model_name`:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "SaeedLab/dom-formula-assignment-using-ml",
    trust_remote_code=True,
    model_name="L1-L3_K3_Manhattan_Ensemble",
).eval()
```
Use the `Model Name` value from the table below. For example, `Synthetic_K3_Euclidean_Ensemble` selects the default synthetic KNN ensemble.
There is no tokenizer step for these KNN models. Inputs are 1D numeric vectors of mass values.
### Get nearest neighbors
KNN models can also return the closest training examples for each input sample:
```python
neighbor_indices = model.neighbor_indices(masses, n_neighbors=3)
print("Neighbor indices:")
print(neighbor_indices)
```
Use the model variant that matches your experiment setup:
- `L1`, `L3`, `L1-L3`, or `Synthetic` for the training dataset.
- `K1` or `K3` for the number of neighbors.
- `Euclidean` or `Manhattan` for the distance metric.
- `Ensemble`, `7T`, `21T`, or `SYN` for combined, field-strength, or synthetic backing models.
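
The naming scheme above can be checked programmatically before a model is requested. The following is a minimal sketch, assuming names follow the `<dataset>_K<k>_<metric>[_<backing>]` pattern seen in the model table; `parse_model_name` is a hypothetical helper and not part of the repository.

```python
DATASETS = {"L1", "L3", "L1-L3", "Synthetic"}
NEIGHBORS = {"K1", "K3"}
METRICS = {"Euclidean", "Manhattan"}
BACKINGS = {"Ensemble", "7T", "21T", "SYN"}


def parse_model_name(name: str) -> dict:
    """Split a model name into its components and validate each one."""
    parts = name.split("_")
    if len(parts) not in (3, 4):
        raise ValueError(f"unexpected number of components in {name!r}")
    dataset, k_token, metric = parts[:3]
    # Single-dataset models in the table carry no backing suffix.
    backing = parts[3] if len(parts) == 4 else None
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}")
    if k_token not in NEIGHBORS:
        raise ValueError(f"unknown neighbor setting {k_token!r}")
    if metric not in METRICS:
        raise ValueError(f"unknown metric {metric!r}")
    if backing is not None and backing not in BACKINGS:
        raise ValueError(f"unknown backing model {backing!r}")
    return {
        "dataset": dataset,
        "n_neighbors": int(k_token[1:]),
        "metric": metric.lower(),
        "backing": backing,
    }


print(parse_model_name("L1-L3_K3_Manhattan_Ensemble"))
```

A syntactically valid name is not guaranteed to exist in the repository; the model table remains the authoritative list.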
### Decision Tree and Random Forest Formula Regressors
The Decision Tree and Random Forest models predict CHONS element counts from mass and ion mobility features. Inputs must be numeric rows in this order:
```text
[mz, inv_k0, ccs]
```
Here, `inv_k0` is `1/k0`, and both `inv_k0` and `ccs` are mobility-derived values.
```python
from transformers import AutoModel
import torch

model = AutoModel.from_pretrained(
    "SaeedLab/dom-formula-assignment-using-ml",
    trust_remote_code=True,
    model_name="RandomForest",
).eval()

# One row per peak, in the order [mz, inv_k0, ccs].
features = torch.tensor(
    [
        [191.03498, 0.6808667247437136, 146.2048497276655],
        [191.071388, 0.7087887538442335, 152.19879035313514],
        [191.10775, 0.6935974811483341, 148.93494377261592],
    ],
    dtype=torch.float64,
)

with torch.no_grad():
    outputs = model(features=features)

print("Predicted formulas:", outputs.formulas)
print("Predicted CHONS counts:", outputs.formula_counts)
```
The CHONS output columns are ordered as:
```text
[C, H, O, N, S]
```
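
A count vector in this column order can be turned into a formula string and a monoisotopic mass for a quick sanity check against the input `mz`. The sketch below is illustrative only: the helper names are hypothetical, rounding counts to the nearest integer is an assumption, and the string follows the `[C, H, O, N, S]` column order, which may differ from the strings in `outputs.formulas`.

```python
# Monoisotopic masses of the most abundant isotopes, in daltons.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "O": 15.99491462,
    "N": 14.00307401,
    "S": 31.97207117,
}
ELEMENT_ORDER = ("C", "H", "O", "N", "S")


def _to_int_counts(counts):
    """Round each count to an integer and reject malformed vectors."""
    if len(counts) != len(ELEMENT_ORDER):
        raise ValueError("expected five values ordered as [C, H, O, N, S]")
    rounded = [int(round(c)) for c in counts]
    if any(c < 0 for c in rounded):
        raise ValueError("element counts must be non-negative")
    return rounded


def counts_to_formula(counts) -> str:
    """Build a formula string, skipping absent elements and omitting a count of 1."""
    pieces = []
    for element, n in zip(ELEMENT_ORDER, _to_int_counts(counts)):
        if n == 0:
            continue
        pieces.append(element if n == 1 else f"{element}{n}")
    return "".join(pieces)


def monoisotopic_mass(counts) -> float:
    """Neutral monoisotopic mass of the formula described by the counts."""
    return sum(
        MONOISOTOPIC_MASS[element] * n
        for element, n in zip(ELEMENT_ORDER, _to_int_counts(counts))
    )


counts = [6, 12, 6, 0, 0]
print(counts_to_formula(counts), round(monoisotopic_mass(counts), 5))
```

The computed value is a neutral mass; comparing it with a measured `mz` requires accounting for the ion type, which this sketch leaves out.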
## KNN Models
This repository contains pre-trained KNN models used for molecular formula assignment. The models vary by:
- **Dataset**: L1 (7T), L3 (21T), L1-L3, Synthetic
- **Neighbors (K)**: 1, 3
- **Distance Metric**: Euclidean, Manhattan
- **Field Strength/Type**: Ensemble, 7T, 21T, SYN (Synthetic)
### Model List
| Model Name | Description |
|---|---|
| `L1_K1_Euclidean` | L1 (7T) KNN model with 1 neighbor and Euclidean distance |
| `L1_K1_Manhattan` | L1 (7T) KNN model with 1 neighbor and Manhattan distance |
| `L1_K3_Euclidean` | L1 (7T) KNN model with 3 neighbors and Euclidean distance |
| `L1_K3_Manhattan` | L1 (7T) KNN model with 3 neighbors and Manhattan distance |
| `L3_K1_Euclidean` | L3 (21T) KNN model with 1 neighbor and Euclidean distance |
| `L3_K1_Manhattan` | L3 (21T) KNN model with 1 neighbor and Manhattan distance |
| `L3_K3_Euclidean` | L3 (21T) KNN model with 3 neighbors and Euclidean distance |
| `L3_K3_Manhattan` | L3 (21T) KNN model with 3 neighbors and Manhattan distance |
| `L1-L3_K1_Euclidean_Ensemble` | Ensemble of L1 (7T) and L3 (21T) KNN models with 1 neighbor and Euclidean distance |
| `L1-L3_K1_Manhattan_Ensemble` | Ensemble of L1 (7T) and L3 (21T) KNN models with 1 neighbor and Manhattan distance |
| `L1-L3_K3_Euclidean_Ensemble` | Ensemble of L1 (7T) and L3 (21T) KNN models with 3 neighbors and Euclidean distance |
| `L1-L3_K3_Manhattan_Ensemble` | Ensemble of L1 (7T) and L3 (21T) KNN models with 3 neighbors and Manhattan distance |
| `Synthetic_K1_Euclidean_Ensemble` | Ensemble of Synthetic 7T, 21T, and SYN KNN models with 1 neighbor and Euclidean distance |
| `Synthetic_K1_Manhattan_Ensemble` | Ensemble of Synthetic 7T, 21T, and SYN KNN models with 1 neighbor and Manhattan distance |
| `Synthetic_K3_Euclidean_Ensemble` | Ensemble of Synthetic 7T, 21T, and SYN KNN models with 3 neighbors and Euclidean distance |
| `Synthetic_K3_Manhattan_Ensemble` | Ensemble of Synthetic 7T, 21T, and SYN KNN models with 3 neighbors and Manhattan distance |
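
For benchmarking runs that loop over every KNN variant, the names in the table can be generated rather than typed by hand. This is a sketch under the assumption that the table is complete; `knn_model_names` is a hypothetical helper, and the rule that only `L1-L3` and `Synthetic` carry the `_Ensemble` suffix is read off the table rows.

```python
from itertools import product

DATASETS = ("L1", "L3", "L1-L3", "Synthetic")
NEIGHBORS = (1, 3)
METRICS = ("Euclidean", "Manhattan")
# In the table, only these datasets are listed with an "_Ensemble" suffix.
ENSEMBLE_DATASETS = {"L1-L3", "Synthetic"}


def knn_model_names() -> list:
    """Return the KNN model names in the same order as the table."""
    names = []
    for dataset, k, metric in product(DATASETS, NEIGHBORS, METRICS):
        suffix = "_Ensemble" if dataset in ENSEMBLE_DATASETS else ""
        names.append(f"{dataset}_K{k}_{metric}{suffix}")
    return names


for name in knn_model_names():
    print(name)
```

Each generated name can then be passed as `model_name` to `AutoModel.from_pretrained`, as in the usage section.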
## Decision Tree and Random Forest Models
| Model Name | Description |
|---|---|
| `DecisionTree` | Predicts CHONS counts from `mz`, `inv_k0`, and `ccs` |
| `RandomForest` | Predicts CHONS counts from `mz`, `inv_k0`, and `ccs` |
## License
This model and associated code are released under the CC-BY-NC-ND 4.0 license and may only be used for non-commercial, academic research purposes with proper attribution. Any commercial use, sale, or other monetization of this model and its derivatives, which include models trained on outputs from the model or datasets created from the model, is prohibited and requires prior approval. Downloading the model requires prior registration on Hugging Face and agreeing to the terms of use. By downloading this model, you agree not to distribute, publish or reproduce a copy of the model. If another user within your organization wishes to use the model, they must register as an individual user and agree to comply with the terms of use. Users may not attempt to re-identify the deidentified data used to develop the underlying model. If you are a commercial entity, please contact the corresponding author.
---
## Contact
For any additional questions or comments, contact Fahad Saeed (fsaeed@fiu.edu).
---