bilalsm/dom-formula-assignment-using-knn

@@ -1,20 +1,32 @@
 # DOM Formula Assignment using K-Nearest Neighbors
 ![Model Type](https://img.shields.io/badge/Model-KNN-blue)
 ![Data](https://img.shields.io/badge/Data-FT--ICR_MS-green)
-![License](https://img.shields.io/badge/License-MIT-yellow)
 [![GitHub](https://img.shields.io/badge/GitHub-pcdslab/dom--formula--assignment--using--ml-blue?logo=github)](https://github.com/pcdslab/dom-formula-assignment-using-ml)
-**K-Nearest Neighbors models for assigning molecular formulas to Dissolved Organic Matter (DOM) from FT-ICR MS data.**
-> 📄 **Paper**: Under review
 ---
-## Overview
-This repository provides **16 pre-trained KNN model variants** for DOM formula assignment from Fourier Transform Ion Cyclotron Resonance Mass Spectrometry (FT-ICR MS) data. Models achieve **up to 99.98% assignment rate**, with 7T-21T ensembles providing the best performance on real-world DOM samples (**95.4% with high confidence**).
 ![Architecture](architecture.png)
@@ -58,12 +70,12 @@ Results on combined test sets (Suwannee River Fulvic Acid + Pahokee River Fulvic
 | Model | True Predictions | New Assignments | False Predictions | **Assignment Rate** |
 |-------|-----------------|-----------------|-------------------|---------------------|
-| **Synthetic (K=1, Euclidean)** | 2,623 | 1,423 | 1 | **99.975%** 🔬 |
-| **Synthetic (K=1, Manhattan)** | 2,623 | 1,423 | 1 | **99.975%** 🔬 |
-| **Synthetic (K=3, Euclidean)** | 2,631 | 1,415 | 1 | **99.975%** 🔬 |
-| **Synthetic (K=3, Manhattan)** | 2,631 | 1,415 | 1 | **99.975%** 🔬 |
-| **7T-21T (K=1, Euclidean)** | 3,851 | 8 | 188 | **95.355%** ⭐ |
-| **7T-21T (K=1, Manhattan)** | 3,851 | 8 | 188 | **95.355%** ⭐ |
 | **7T-21T (K=3, Euclidean)** | 3,846 | 10 | 191 | **95.280%** |
 | **7T-21T (K=3, Manhattan)** | 3,846 | 10 | 191 | **95.280%** |
 | 21T (K=1, Euclidean) | 3,835 | 10 | 202 | 95.009% |
@@ -76,8 +88,8 @@ Results on combined test sets (Suwannee River Fulvic Acid + Pahokee River Fulvic
 | 7T (K=3, Manhattan) | 3,201 | 6 | 840 | 79.244% |
 **Key Findings**:
-- 🔬 **Synthetic models** achieve highest assignment rate (99.975%) and make many new predictions (1,423 novel formulas)
-- ⭐ **7T-21T ensemble models** provide best performance for real DOM samples (95.4% with only 8 new assignments)
 - **Recommended for most users**: 7T-21T ensemble (K=1) - optimal balance of accuracy and confidence
 ---

+---
+license: cc-by-nc-nd-4.0
+tags:
+- mass-spectrometry
+- molecular-formula
+- dissolved-organic-matter
+- knn
+- scikit-learn
+library_name: sklearn
+pipeline_tag: formula-assignment
+---
 # DOM Formula Assignment using K-Nearest Neighbors
 ![Model Type](https://img.shields.io/badge/Model-KNN-blue)
 ![Data](https://img.shields.io/badge/Data-FT--ICR_MS-green)
+![License](https://img.shields.io/badge/License-CC_BY_NC_ND_4-yellow)
 [![GitHub](https://img.shields.io/badge/GitHub-pcdslab/dom--formula--assignment--using--ml-blue?logo=github)](https://github.com/pcdslab/dom-formula-assignment-using-ml)
+**A Machine Learning Approach to Enhanced Molecular Formula Assignment in Fulvic Acid DOM Mass Spectra**
+> **Paper**: Under review
 ---
+## Abstract
+Dissolved organic matter (DOM) is a critical component of aquatic ecosystems, and its fulvic acid fraction (FA-DOM) is highly mobile and readily bioavailable to microbial communities. Characterizing its molecular composition is an important area of study, but the heterogeneity of the material, which contains a vast number of diverse compounds, makes the task challenging. Existing methods often struggle with incomplete formula assignment or reduced coverage, highlighting the need for a better approach. In this study, we developed a machine learning approach using the k-nearest neighbors (KNN) algorithm to predict molecular formulas from ultra-high-resolution mass spectrometry data. The model was trained on chemical formulas assigned to multiple DOM samples measured on 7 Tesla (7T) and 21 Tesla (21T) Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS) systems, and tested on an independent 9.4 T FT-ICR MS fulvic acid dataset. A synthetic dataset of plausible elemental combinations (C, H, O, N, S) was also generated to enhance generalization. Our approach achieved a 99.9% assignment rate on the labeled test set and assigned 13,605 formulas to unlabeled peaks, compared with 5,914 for the existing approach, an improvement of up to 2.3× in formula assignment coverage.
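The nearest-neighbor idea in the abstract can be illustrated with a small, self-contained sketch. This is not the repository's model: it assumes a single feature (neutral monoisotopic mass), restricts the synthetic reference set to C, H and O, and uses made-up plausibility bounds (`max_c`, `max_o`, O ≤ C, even H up to 2C+2). The helper names `build_reference` and `assign_nearest` are hypothetical.

```python
from bisect import bisect_left

# Monoisotopic masses (Da) of the most abundant isotopes.
MONO = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}


def formula_string(c, h, o):
    """Render element counts as a formula, omitting zero counts and a count of 1."""
    parts = (("C", c), ("H", h), ("O", o))
    return "".join(f"{el}{n if n > 1 else ''}" for el, n in parts if n > 0)


def build_reference(max_c=30, max_o=15):
    """Enumerate a toy synthetic CHO reference set, sorted by neutral mass.

    The bounds (O <= C, even H from 2 to 2C+2) are illustrative assumptions,
    not the rules used to build the synthetic dataset in the paper.
    """
    ref = []
    for c in range(1, max_c + 1):
        for o in range(0, min(c, max_o) + 1):
            for h in range(2, 2 * c + 3, 2):
                mass = c * MONO["C"] + h * MONO["H"] + o * MONO["O"]
                ref.append((mass, formula_string(c, h, o)))
    ref.sort()
    return ref


def assign_nearest(neutral_mass, ref):
    """K=1 nearest neighbour on one feature.

    With a single feature, Euclidean and Manhattan distance are both |a - b|,
    so only the two entries around the insertion point need to be compared.
    """
    masses = [m for m, _ in ref]
    i = bisect_left(masses, neutral_mass)
    neighbours = ref[max(i - 1, 0): i + 1]
    return min(neighbours, key=lambda r: abs(r[0] - neutral_mass))


reference = build_reference()
mass, formula = assign_nearest(180.0634, reference)
print(formula)  # C6H12O6
```

A real workflow would also convert measured ion m/z values to neutral masses and would likely reject matches beyond a mass-error tolerance; both steps are omitted here.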
 ![Architecture](architecture.png)
 | Model | True Predictions | New Assignments | False Predictions | **Assignment Rate** |
 |-------|-----------------|-----------------|-------------------|---------------------|
+| **Synthetic (K=1, Euclidean)** | 2,623 | 1,423 | 1 | **99.975%** |
+| **Synthetic (K=1, Manhattan)** | 2,623 | 1,423 | 1 | **99.975%** |
+| **Synthetic (K=3, Euclidean)** | 2,631 | 1,415 | 1 | **99.975%** |
+| **Synthetic (K=3, Manhattan)** | 2,631 | 1,415 | 1 | **99.975%** |
+| **7T-21T (K=1, Euclidean)** | 3,851 | 8 | 188 | **95.355%** |
+| **7T-21T (K=1, Manhattan)** | 3,851 | 8 | 188 | **95.355%** |
 | **7T-21T (K=3, Euclidean)** | 3,846 | 10 | 191 | **95.280%** |
 | **7T-21T (K=3, Manhattan)** | 3,846 | 10 | 191 | **95.280%** |
 | 21T (K=1, Euclidean) | 3,835 | 10 | 202 | 95.009% |
 | 7T (K=3, Manhattan) | 3,201 | 6 | 840 | 79.244% |
 **Key Findings**:
+- **Synthetic models** achieve the highest assignment rate (99.975%) and make the most new assignments (up to 1,423 novel formulas at K=1)
+- **7T-21T ensemble models** provide the best performance on real DOM samples (95.4% assignment rate at K=1, with only 8 new assignments)
 - **Recommended for most users**: 7T-21T ensemble (K=1) - optimal balance of accuracy and confidence
 ---
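The README does not spell out how the Assignment Rate column is computed. A definition that reproduces the tabulated values is (True Predictions + New Assignments) / (True + New + False), so the sketch below assumes that definition; the function name `assignment_rate` is illustrative. It also checks the coverage ratio quoted in the abstract.

```python
def assignment_rate(true_pred, new_assign, false_pred):
    """Percentage of peaks that received a formula (assumed definition)."""
    total = true_pred + new_assign + false_pred
    return 100.0 * (true_pred + new_assign) / total


# (True Predictions, New Assignments, False Predictions) copied from the table.
rows = {
    "Synthetic (K=1)": (2623, 1423, 1),
    "7T-21T (K=1)": (3851, 8, 188),
    "7T-21T (K=3)": (3846, 10, 191),
    "21T (K=1, Euclidean)": (3835, 10, 202),
    "7T (K=3, Manhattan)": (3201, 6, 840),
}

for name, counts in rows.items():
    print(f"{name}: {assignment_rate(*counts):.3f}%")

# Coverage improvement on unlabeled peaks, from the abstract's counts.
coverage_ratio = 13605 / 5914
print(f"Coverage ratio: {coverage_ratio:.1f}x")
```

Each row sums to 4,047 peaks, which is why models with very different false-prediction counts can be compared directly on this rate.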