Commit 0a52369 by bilalsm · verified · 1 Parent(s): 8abb0c2

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +25 -13
README.md CHANGED
@@ -1,20 +1,32 @@
  # DOM Formula Assignment using K-Nearest Neighbors


  ![Model Type](https://img.shields.io/badge/Model-KNN-blue)
  ![Data](https://img.shields.io/badge/Data-FT--ICR_MS-green)
- ![License](https://img.shields.io/badge/License-MIT-yellow)
  [![GitHub](https://img.shields.io/badge/GitHub-pcdslab/dom--formula--assignment--using--ml-blue?logo=github)](https://github.com/pcdslab/dom-formula-assignment-using-ml)

- **K-Nearest Neighbors models for assigning molecular formulas to Dissolved Organic Matter (DOM) from FT-ICR MS data.**

- > 📄 **Paper**: Under review

  ---

- ## Overview

- This repository provides **16 pre-trained KNN model variants** for DOM formula assignment from Fourier Transform Ion Cyclotron Resonance Mass Spectrometry (FT-ICR MS) data. Models achieve **up to 99.98% assignment rate**, with 7T-21T ensembles providing the best performance on real-world DOM samples (**95.4% with high confidence**).

  ![Architecture](architecture.png)
 
@@ -58,12 +70,12 @@ Results on combined test sets (Suwannee River Fulvic Acid + Pahokee River Fulvic

  | Model | True Predictions | New Assignments | False Predictions | **Assignment Rate** |
  |-------|-----------------|-----------------|-------------------|---------------------|
- | **Synthetic (K=1, Euclidean)** | 2,623 | 1,423 | 1 | **99.975%** 🔬 |
- | **Synthetic (K=1, Manhattan)** | 2,623 | 1,423 | 1 | **99.975%** 🔬 |
- | **Synthetic (K=3, Euclidean)** | 2,631 | 1,415 | 1 | **99.975%** 🔬 |
- | **Synthetic (K=3, Manhattan)** | 2,631 | 1,415 | 1 | **99.975%** 🔬 |
- | **7T-21T (K=1, Euclidean)** | 3,851 | 8 | 188 | **95.355%** |
- | **7T-21T (K=1, Manhattan)** | 3,851 | 8 | 188 | **95.355%** |
  | **7T-21T (K=3, Euclidean)** | 3,846 | 10 | 191 | **95.280%** |
  | **7T-21T (K=3, Manhattan)** | 3,846 | 10 | 191 | **95.280%** |
  | 21T (K=1, Euclidean) | 3,835 | 10 | 202 | 95.009% |
@@ -76,8 +88,8 @@ Results on combined test sets (Suwannee River Fulvic Acid + Pahokee River Fulvic
  | 7T (K=3, Manhattan) | 3,201 | 6 | 840 | 79.244% |

  **Key Findings**:
- - 🔬 **Synthetic models** achieve highest assignment rate (99.975%) and make many new predictions (1,423 novel formulas)
- - **7T-21T ensemble models** provide best performance for real DOM samples (95.4% with only 8 new assignments)
  - **Recommended for most users**: 7T-21T ensemble (K=1) - optimal balance of accuracy and confidence

  ---
 
+ ---
+ license: cc-by-nc-nd-4.0
+ tags:
+ - mass-spectrometry
+ - molecular-formula
+ - dissolved-organic-matter
+ - knn
+ - scikit-learn
+ library_name: sklearn
+ pipeline_tag: formula-assignment
+ ---
+
  # DOM Formula Assignment using K-Nearest Neighbors


  ![Model Type](https://img.shields.io/badge/Model-KNN-blue)
  ![Data](https://img.shields.io/badge/Data-FT--ICR_MS-green)
+ ![License](https://img.shields.io/badge/License-CC_BY_NC_ND_4-yellow)
  [![GitHub](https://img.shields.io/badge/GitHub-pcdslab/dom--formula--assignment--using--ml-blue?logo=github)](https://github.com/pcdslab/dom-formula-assignment-using-ml)

+ **A Machine Learning Approach to Enhanced Molecular Formula Assignment in Fulvic Acid DOM Mass Spectra**

+ > **Paper**: Under review

  ---

+ ## Abstract
+ Dissolved organic matter (DOM) is a critical component of aquatic ecosystems, and its fulvic acid fraction (FA-DOM) is highly mobile and readily bioavailable to microbial communities. Characterizing its molecular composition is an important area of study, but the material is heterogeneous and contains a vast number of diverse compounds, which makes the task challenging. Existing methods often leave formulas unassigned or achieve reduced coverage, highlighting the need for a better approach. In this study, we developed a machine learning approach that uses the k-nearest neighbors (KNN) algorithm to predict molecular formulas from ultra-high-resolution mass spectrometry data. The model was trained on chemical formulas assigned to multiple DOM samples using 7 Tesla (7T) and 21 Tesla (21T) Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS) systems, and tested on an independent 9.4 T FT-ICR MS fulvic acid dataset. A synthetic dataset of plausible elemental combinations (C, H, O, N, S) was also generated to enhance generalization. Our approach achieved a 99.9% assignment rate on the labeled test set and assigned a total of 13,605 formulas to unlabeled peaks, compared with 5,914 for the existing approach, an improvement of up to 2.3× in formula assignment coverage.
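
The nearest-neighbor idea in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each peak is described by a single feature (its m/z value) and that the label is a formula string. The feature set actually used by the models is not stated here, and the training values, labels, and the function name `assign_formula` are hypothetical placeholders rather than anything taken from the repository.

```python
from bisect import bisect_left

# Illustrative training set: (m/z, formula label). The numbers and labels are
# placeholders, not real assignments from the repository.
TRAINING_PEAKS = [
    (201.05, "formula_A"),
    (215.07, "formula_B"),
    (229.09, "formula_C"),
]


def assign_formula(mz, training_peaks=TRAINING_PEAKS):
    """Return the label of the single nearest training peak (K=1)."""
    peaks = sorted(training_peaks)
    masses = [m for m, _ in peaks]
    i = bisect_left(masses, mz)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = peaks[max(i - 1, 0):i + 1]
    # With one feature, |a - b| is both the Euclidean and the Manhattan distance.
    nearest = min(candidates, key=lambda p: abs(p[0] - mz))
    return nearest[1]


print(assign_formula(215.0))  # formula_B
```

With a single feature the Euclidean and Manhattan distances coincide, which would be consistent with the identical Euclidean and Manhattan rows in the results table, although the README does not confirm that the models use only one feature.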

  ![Architecture](architecture.png)

  | Model | True Predictions | New Assignments | False Predictions | **Assignment Rate** |
  |-------|-----------------|-----------------|-------------------|---------------------|
+ | **Synthetic (K=1, Euclidean)** | 2,623 | 1,423 | 1 | **99.975%** |
+ | **Synthetic (K=1, Manhattan)** | 2,623 | 1,423 | 1 | **99.975%** |
+ | **Synthetic (K=3, Euclidean)** | 2,631 | 1,415 | 1 | **99.975%** |
+ | **Synthetic (K=3, Manhattan)** | 2,631 | 1,415 | 1 | **99.975%** |
+ | **7T-21T (K=1, Euclidean)** | 3,851 | 8 | 188 | **95.355%** |
+ | **7T-21T (K=1, Manhattan)** | 3,851 | 8 | 188 | **95.355%** |
  | **7T-21T (K=3, Euclidean)** | 3,846 | 10 | 191 | **95.280%** |
  | **7T-21T (K=3, Manhattan)** | 3,846 | 10 | 191 | **95.280%** |
  | 21T (K=1, Euclidean) | 3,835 | 10 | 202 | 95.009% |
 
  | 7T (K=3, Manhattan) | 3,201 | 6 | 840 | 79.244% |

  **Key Findings**:
+ - **Synthetic models** achieve highest assignment rate (99.975%) and make many new predictions (1,423 novel formulas)
+ - **7T-21T ensemble models** provide best performance for real DOM samples (95.4% with only 8 new assignments)
  - **Recommended for most users**: 7T-21T ensemble (K=1) - optimal balance of accuracy and confidence

  ---
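
The Assignment Rate column appears to equal (True Predictions + New Assignments) divided by the total number of peaks. This definition is inferred from the table values rather than stated in the README, and the helper name `assignment_rate` is hypothetical. A quick check under that assumption:

```python
def assignment_rate(true_preds, new_assignments, false_preds):
    """Percentage of peaks that received a formula (inferred definition)."""
    total = true_preds + new_assignments + false_preds
    return 100.0 * (true_preds + new_assignments) / total


# Rows copied from the results table above.
print(round(assignment_rate(2623, 1423, 1), 3))  # Synthetic (K=1): 99.975
print(round(assignment_rate(3851, 8, 188), 3))   # 7T-21T (K=1): 95.355
print(round(assignment_rate(3201, 6, 840), 3))   # 7T (K=3, Manhattan): 79.244
```

Every row shown sums to the same 4,047 peaks, so under this reading the rates are directly comparable across models.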