Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -1,20 +1,32 @@
| 1 |
# DOM Formula Assignment using K-Nearest Neighbors
|
| 2 |
|
| 3 |
|
| 4 |

|
| 5 |

|
| 6 |
-
](https://github.com/pcdslab/dom-formula-assignment-using-ml)
|
| 8 |
|
| 9 |
-
**
|
| 10 |
|
| 11 |
-
>
|
| 12 |
|
| 13 |
---
|
| 14 |
|
| 15 |
-
##
|
|
|
|
| 16 |
|
| 17 |
-
This repository provides **16 pre-trained KNN model variants** for DOM formula assignment from Fourier Transform Ion Cyclotron Resonance Mass Spectrometry (FT-ICR MS) data. Models achieve **up to 99.98% assignment rate**, with 7T-21T ensembles providing the best performance on real-world DOM samples (**95.4% with high confidence**).
|
| 18 |
|
| 19 |

|
| 20 |
|
|
@@ -58,12 +70,12 @@ Results on combined test sets (Suwannee River Fulvic Acid + Pahokee River Fulvic
|
|
| 58 |
|
| 59 |
| Model | True Predictions | New Assignments | False Predictions | **Assignment Rate** |
|
| 60 |
|-------|-----------------|-----------------|-------------------|---------------------|
|
| 61 |
-
| **Synthetic (K=1, Euclidean)** | 2,623 | 1,423 | 1 | **99.975%**
|
| 62 |
-
| **Synthetic (K=1, Manhattan)** | 2,623 | 1,423 | 1 | **99.975%**
|
| 63 |
-
| **Synthetic (K=3, Euclidean)** | 2,631 | 1,415 | 1 | **99.975%**
|
| 64 |
-
| **Synthetic (K=3, Manhattan)** | 2,631 | 1,415 | 1 | **99.975%**
|
| 65 |
-
| **7T-21T (K=1, Euclidean)** | 3,851 | 8 | 188 | **95.355%**
|
| 66 |
-
| **7T-21T (K=1, Manhattan)** | 3,851 | 8 | 188 | **95.355%**
|
| 67 |
| **7T-21T (K=3, Euclidean)** | 3,846 | 10 | 191 | **95.280%** |
|
| 68 |
| **7T-21T (K=3, Manhattan)** | 3,846 | 10 | 191 | **95.280%** |
|
| 69 |
| 21T (K=1, Euclidean) | 3,835 | 10 | 202 | 95.009% |
|
|
@@ -76,8 +88,8 @@ Results on combined test sets (Suwannee River Fulvic Acid + Pahokee River Fulvic
|
|
| 76 |
| 7T (K=3, Manhattan) | 3,201 | 6 | 840 | 79.244% |
|
| 77 |
|
| 78 |
**Key Findings**:
|
| 79 |
-
-
|
| 80 |
-
-
|
| 81 |
- **Recommended for most users**: 7T-21T ensemble (K=1) - optimal balance of accuracy and confidence
|
| 82 |
|
| 83 |
---
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: cc-by-nc-nd-4.0
|
| 3 |
+
tags:
|
| 4 |
+
- mass-spectrometry
|
| 5 |
+
- molecular-formula
|
| 6 |
+
- dissolved-organic-matter
|
| 7 |
+
- knn
|
| 8 |
+
- scikit-learn
|
| 9 |
+
library_name: sklearn
|
| 10 |
+
pipeline_tag: formula-assignment
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
# DOM Formula Assignment using K-Nearest Neighbors
|
| 14 |
|
| 15 |
|
| 16 |

|
| 17 |

|
| 18 |
+

|
| 19 |
[](https://github.com/pcdslab/dom-formula-assignment-using-ml)
|
| 20 |
|
| 21 |
+
**A Machine Learning Approach to Enhanced Molecular Formula Assignment in Fulvic Acid DOM Mass Spectra**
|
| 22 |
|
| 23 |
+
> **Paper**: Under review
|
| 24 |
|
| 25 |
---
|
| 26 |
|
| 27 |
+
## Abstract
|
| 28 |
+
Dissolved organic matter (DOM) is a critical component of aquatic ecosystems, with the fulvic acid fraction (FA-DOM) exhibiting high mobility and ready bioavailability to microbial communities. Understanding its molecular composition is a vital area of study, but the heterogeneity of the material, which contains a vast number of diverse compounds, makes this task challenging. Existing methods often struggle with incomplete formula assignment or reduced coverage, highlighting the need for a better approach. In this study, we developed a machine learning approach using the k-nearest neighbors (KNN) algorithm to predict molecular formulas from ultra-high-resolution mass spectrometry data. The model was trained on chemical formulas assigned to multiple DOM samples using 7 Tesla (7T) and 21 Tesla (21T) Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS) systems, and tested on an independent 9.4 T FT-ICR MS fulvic acid dataset. A synthetic dataset of plausible elemental combinations (C, H, O, N, S) was also generated to enhance generalization. Our approach achieved a 99.9% assignment rate on the labeled test set and assigned 13,605 formulas to unlabeled peaks, compared with 5,914 formulas assigned by the existing approach, an improvement of up to 2.3× in formula assignment coverage.
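
To make the nearest-neighbor idea in the abstract concrete, here is a minimal, dependency-free sketch of a k-nearest-neighbor lookup from an observed neutral mass to candidate formulas. It is not the repository's implementation: the single-feature design (mass only), the helper names `formula_mass` and `nearest_formulas`, and the three candidate formulas are all assumptions chosen for illustration. The released models use scikit-learn and may use different features.

```python
# Toy k-nearest-neighbor formula lookup (illustrative only, not the authors' code).
# Monoisotopic masses of the most abundant isotopes, in daltons.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.99491462,
    "S": 31.9720707,
}


def formula_mass(counts):
    """Return the monoisotopic mass of a formula given as {element: count}."""
    return sum(MONOISOTOPIC_MASS[element] * n for element, n in counts.items())


def nearest_formulas(query_mass, reference, k=1):
    """Return the k reference formulas whose masses are closest to query_mass.

    `reference` maps a formula string to its mass. With a single feature,
    Euclidean and Manhattan distance both reduce to the absolute difference.
    """
    ranked = sorted(reference, key=lambda name: abs(reference[name] - query_mass))
    return ranked[:k]


# Hypothetical reference set: three formulas with nominal mass 180.
reference = {
    "C6H12O6": formula_mass({"C": 6, "H": 12, "O": 6}),
    "C7H16O5": formula_mass({"C": 7, "H": 16, "O": 5}),
    "C10H12O3": formula_mass({"C": 10, "H": 12, "O": 3}),
}

best = nearest_formulas(180.0634, reference, k=1)
```

In this toy setup, a query mass of 180.0634 lands closest to `C6H12O6`, because the other two candidates differ by more than 0.01 Da. A real assignment workflow would also need to handle ion adducts, mass error tolerances, and chemically implausible candidates, none of which are modeled here.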
|
| 29 |
|
|
|
|
| 30 |
|
| 31 |

|
| 32 |
|
|
|
|
| 70 |
|
| 71 |
| Model | True Predictions | New Assignments | False Predictions | **Assignment Rate** |
|
| 72 |
|-------|-----------------|-----------------|-------------------|---------------------|
|
| 73 |
+
| **Synthetic (K=1, Euclidean)** | 2,623 | 1,423 | 1 | **99.975%** |
|
| 74 |
+
| **Synthetic (K=1, Manhattan)** | 2,623 | 1,423 | 1 | **99.975%** |
|
| 75 |
+
| **Synthetic (K=3, Euclidean)** | 2,631 | 1,415 | 1 | **99.975%** |
|
| 76 |
+
| **Synthetic (K=3, Manhattan)** | 2,631 | 1,415 | 1 | **99.975%** |
|
| 77 |
+
| **7T-21T (K=1, Euclidean)** | 3,851 | 8 | 188 | **95.355%** |
|
| 78 |
+
| **7T-21T (K=1, Manhattan)** | 3,851 | 8 | 188 | **95.355%** |
|
| 79 |
| **7T-21T (K=3, Euclidean)** | 3,846 | 10 | 191 | **95.280%** |
|
| 80 |
| **7T-21T (K=3, Manhattan)** | 3,846 | 10 | 191 | **95.280%** |
|
| 81 |
| 21T (K=1, Euclidean) | 3,835 | 10 | 202 | 95.009% |
|
|
|
|
| 88 |
| 7T (K=3, Manhattan) | 3,201 | 6 | 840 | 79.244% |
|
| 89 |
|
| 90 |
**Key Findings**:
|
| 91 |
+
- **Synthetic models** achieve the highest assignment rate (99.975%) and make many new assignments (up to 1,423 novel formulas)
|
| 92 |
+
- **7T-21T ensemble models** provide the best performance on real DOM samples (95.4% assignment rate with as few as 8 new assignments)
|
| 93 |
- **Recommended for most users**: 7T-21T ensemble (K=1) - optimal balance of accuracy and confidence
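
The README does not state how the Assignment Rate column is computed. The tabulated values are consistent with counting both true predictions and new assignments as assigned peaks and dividing by all peaks, so the short sketch below assumes that definition. The function name `assignment_rate` is hypothetical.

```python
def assignment_rate(true_predictions, new_assignments, false_predictions):
    """Assumed definition: (true + new) / (true + new + false), as a percentage."""
    total = true_predictions + new_assignments + false_predictions
    return 100.0 * (true_predictions + new_assignments) / total


# Rows taken from the results table above.
synthetic_k1 = assignment_rate(2623, 1423, 1)
ensemble_k1 = assignment_rate(3851, 8, 188)
seven_t_k3 = assignment_rate(3201, 6, 840)

print(f"Synthetic (K=1): {synthetic_k1:.3f}%")
print(f"7T-21T (K=1):    {ensemble_k1:.3f}%")
print(f"7T (K=3):        {seven_t_k3:.3f}%")
```

Under this reading, the synthetic models score highest largely because they label many peaks as new assignments, which cannot be checked against reference labels. This fits the recommendation to prefer the 7T-21T ensemble when confidence matters.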
|
| 94 |
|
| 95 |
---
|