bilalsm committed on
Commit
f8e0289
·
verified ·
1 Parent(s): d2db6fc

Upload README.md with huggingface_hub

  1. README.md +180 -0
README.md ADDED
@@ -0,0 +1,180 @@
# DOM Formula Assignment using K-Nearest Neighbors

![Model Type](https://img.shields.io/badge/Model-KNN-blue)
![Data](https://img.shields.io/badge/Data-FT--ICR_MS-green)
![License](https://img.shields.io/badge/License-CC--BY--NC--ND_4.0-yellow)
[![GitHub](https://img.shields.io/badge/GitHub-pcdslab/dom--formula--assignment--using--ml-blue?logo=github)](https://github.com/pcdslab/dom-formula-assignment-using-ml)

**K-Nearest Neighbors models for assigning molecular formulas to Dissolved Organic Matter (DOM) from FT-ICR MS data.**

> 📄 **Paper**: Under review

---

## Overview

This repository provides **16 pre-trained KNN model variants** for DOM formula assignment from Fourier Transform Ion Cyclotron Resonance Mass Spectrometry (FT-ICR MS) data. The Synthetic ensembles reach the highest assignment rate (**99.98%**), while the 7T-21T ensembles perform best on real-world DOM samples (**95.4% assignment rate with only 8 to 10 new assignments**).

![Architecture](architecture.png)

---

## Model Variants

### Single Models (8 variants)
Trained on individual datasets (7T or 21T FT-ICR MS data):

| Data Source | K | Metric | Variant Name |
|-------------|---|--------|--------------|
| 7T | 1 | Euclidean | `knn_7T_k1_euclidean` |
| 7T | 1 | Manhattan | `knn_7T_k1_manhattan` |
| 7T | 3 | Euclidean | `knn_7T_k3_euclidean` |
| 7T | 3 | Manhattan | `knn_7T_k3_manhattan` |
| 21T | 1 | Euclidean | `knn_21T_k1_euclidean` |
| 21T | 1 | Manhattan | `knn_21T_k1_manhattan` |
| 21T | 3 | Euclidean | `knn_21T_k3_euclidean` |
| 21T | 3 | Manhattan | `knn_21T_k3_manhattan` |

### Ensemble Models (8 variants)
Each combines multiple sub-models trained on different data versions:

| Data Source | K | Metric | Variant Name | Sub-models |
|-------------|---|--------|--------------|------------|
| **7T-21T** | 1 | Euclidean | `knn_7T21T_k1_euclidean_ensemble` | 2 (ver2+ver3) |
| **7T-21T** | 1 | Manhattan | `knn_7T21T_k1_manhattan_ensemble` | 2 (ver2+ver3) |
| **7T-21T** | 3 | Euclidean | `knn_7T21T_k3_euclidean_ensemble` | 2 (ver2+ver3) |
| **7T-21T** | 3 | Manhattan | `knn_7T21T_k3_manhattan_ensemble` | 2 (ver2+ver3) |
| **Synthetic** | 1 | Euclidean | `knn_Synthetic_k1_euclidean_ensemble` | 3 (ver2+ver3+synth) |
| **Synthetic** | 1 | Manhattan | `knn_Synthetic_k1_manhattan_ensemble` | 3 (ver2+ver3+synth) |
| **Synthetic** | 3 | Euclidean | `knn_Synthetic_k3_euclidean_ensemble` | 3 (ver2+ver3+synth) |
| **Synthetic** | 3 | Manhattan | `knn_Synthetic_k3_manhattan_ensemble` | 3 (ver2+ver3+synth) |

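
To make the K and Metric columns concrete, here is a minimal pure-Python sketch of k-nearest-neighbor voting. It is not the released implementation: the reference peaks and `formula_*` labels are hypothetical, and `knn_predict`, `euclidean`, and `manhattan` are illustrative names. It also assumes a single m/z feature per peak, as in the Quick Start input; under that assumption both metrics reduce to |a - b| and rank neighbors identically.

```python
from collections import Counter


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=1, metric=euclidean):
    """Return the most common label among the k training rows nearest to query."""
    order = sorted(range(len(train_X)), key=lambda i: metric(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical reference peaks: one feature (m/z) per row, placeholder labels.
train_X = [[200.10], [200.30], [200.34], [350.20], [350.26]]
train_y = ["formula_A", "formula_B", "formula_B", "formula_C", "formula_D"]

query = [200.18]
for k in (1, 3):
    for metric in (euclidean, manhattan):
        print(k, metric.__name__, knn_predict(train_X, train_y, query, k, metric))
# k=1 gives formula_A and k=3 gives formula_B, for both metrics
```

With a single feature the choice of metric does not change the vote, but K does: at K=3 the two nearby `formula_B` peaks outvote the single closest peak.
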

---

## Performance

Results on the combined test sets (Suwannee River Fulvic Acid + Pahokee Peat Fulvic Acid + others):

| Model | True Predictions | New Assignments | False Predictions | **Assignment Rate** |
|-------|-----------------|-----------------|-------------------|---------------------|
| **Synthetic (K=1, Euclidean)** | 2,623 | 1,423 | 1 | **99.975%** 🔬 |
| **Synthetic (K=1, Manhattan)** | 2,623 | 1,423 | 1 | **99.975%** 🔬 |
| **Synthetic (K=3, Euclidean)** | 2,631 | 1,415 | 1 | **99.975%** 🔬 |
| **Synthetic (K=3, Manhattan)** | 2,631 | 1,415 | 1 | **99.975%** 🔬 |
| **7T-21T (K=1, Euclidean)** | 3,851 | 8 | 188 | **95.355%** ⭐ |
| **7T-21T (K=1, Manhattan)** | 3,851 | 8 | 188 | **95.355%** ⭐ |
| **7T-21T (K=3, Euclidean)** | 3,846 | 10 | 191 | **95.280%** |
| **7T-21T (K=3, Manhattan)** | 3,846 | 10 | 191 | **95.280%** |
| 21T (K=1, Euclidean) | 3,835 | 10 | 202 | 95.009% |
| 21T (K=1, Manhattan) | 3,835 | 10 | 202 | 95.009% |
| 21T (K=3, Euclidean) | 3,831 | 11 | 205 | 94.935% |
| 21T (K=3, Manhattan) | 3,831 | 11 | 205 | 94.935% |
| 7T (K=1, Euclidean) | 3,201 | 6 | 840 | 79.244% |
| 7T (K=1, Manhattan) | 3,201 | 6 | 840 | 79.244% |
| 7T (K=3, Euclidean) | 3,201 | 6 | 840 | 79.244% |
| 7T (K=3, Manhattan) | 3,201 | 6 | 840 | 79.244% |

**Key Findings**:
- 🔬 **Synthetic models** achieve the highest assignment rate (99.975%), but about 35% of their assignments are new formulas (1,423 of 4,046 at K=1)
- ⭐ **7T-21T ensemble models** provide the best performance for real DOM samples (95.4% at K=1, with only 8 new assignments)
- **Recommended for most users**: 7T-21T ensemble (K=1), which has the most true predictions and very few new assignments

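
The table does not define Assignment Rate explicitly, but every row is consistent with (True Predictions + New Assignments) divided by the sum of all three counts. A short sketch under that assumption (the function name `assignment_rate` is illustrative) reproduces the reported percentages:

```python
def assignment_rate(true_preds, new_assignments, false_preds):
    """Percentage of peaks that received a true or new formula assignment."""
    total = true_preds + new_assignments + false_preds
    return 100.0 * (true_preds + new_assignments) / total


# (True Predictions, New Assignments, False Predictions) copied from the table.
rows = {
    "Synthetic (K=1)": (2623, 1423, 1),
    "7T-21T (K=1)": (3851, 8, 188),
    "21T (K=1)": (3835, 10, 202),
    "7T (K=1)": (3201, 6, 840),
}

for name, counts in rows.items():
    print(f"{name}: {assignment_rate(*counts):.3f}%")
# Synthetic (K=1): 99.975%
# 7T-21T (K=1): 95.355%
# 21T (K=1): 95.009%
# 7T (K=1): 79.244%
```

Each row sums to 4,047 peaks, so the rates are directly comparable across models.
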

---

## Quick Start

### Installation

```bash
pip install transformers huggingface_hub joblib scikit-learn numpy pandas
```

### Load Default Model

```python
from transformers import AutoModel
import numpy as np

# Load best model (7T-21T, K=1, Euclidean)
model = AutoModel.from_pretrained(
    "pcdslab/dom-knn-models",
    trust_remote_code=True,
)

# Prepare mass data: one m/z value per row, shape (n_peaks, 1)
masses = np.array([[245.1234], [387.2156], [512.3478]])

# Get formula predictions
predictions = model(masses)
print(predictions)
# Output: ['C12H15O6' 'C20H31O8' 'C28H48O9']
```

### Load Specific Variant

```python
from transformers import AutoModel

# Load 21T model with K=1 and Euclidean distance
model = AutoModel.from_pretrained(
    "pcdslab/dom-knn-models",
    data_source="21T",
    k_neighbors=1,
    metric="euclidean",
    trust_remote_code=True,
)

# Load 7T-21T ensemble (automatically loads 2 sub-models)
model = AutoModel.from_pretrained(
    "pcdslab/dom-knn-models",
    data_source="7T-21T",
    k_neighbors=1,
    metric="euclidean",
    trust_remote_code=True,
)
```
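
The variant names listed under Model Variants follow a regular pattern. The helper below is a hypothetical sketch of that pattern, inferred from the tables only; `variant_name` is not part of the repository's loader, and accepting `data_source="Synthetic"` is an assumption based on the Data Source column.

```python
ENSEMBLE_SOURCES = {"7T-21T", "Synthetic"}
SINGLE_SOURCES = {"7T", "21T"}


def variant_name(data_source, k_neighbors, metric):
    """Build a variant name such as 'knn_7T21T_k1_euclidean_ensemble'."""
    if data_source not in ENSEMBLE_SOURCES | SINGLE_SOURCES:
        raise ValueError(f"unknown data_source: {data_source!r}")
    if k_neighbors not in (1, 3):
        raise ValueError("k_neighbors must be 1 or 3")
    if metric not in ("euclidean", "manhattan"):
        raise ValueError("metric must be 'euclidean' or 'manhattan'")
    source = data_source.replace("-", "")  # "7T-21T" appears as "7T21T" in names
    suffix = "_ensemble" if data_source in ENSEMBLE_SOURCES else ""
    return f"knn_{source}_k{k_neighbors}_{metric}{suffix}"


print(variant_name("21T", 1, "euclidean"))     # knn_21T_k1_euclidean
print(variant_name("7T-21T", 1, "euclidean"))  # knn_7T21T_k1_euclidean_ensemble
```

Enumerating both sources sets with K in (1, 3) and both metrics yields the 16 names in the tables, which is a quick way to check an argument combination before loading.
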

### Batch Prediction

```python
import pandas as pd

# Assumes `model` was loaded as shown in the sections above

# Load your peak list (the CSV must contain an 'm/z' column)
peaks = pd.read_csv("my_peaks.csv")
masses = peaks['m/z'].to_numpy().reshape(-1, 1)

# Predict formulas
formulas = model(masses)

# Add to dataframe
peaks['formula'] = formulas
peaks.to_csv("annotated_peaks.csv", index=False)
```

---

## Model Selection Guide

| Use Case | Recommended Model | Why? |
|----------|-------------------|------|
| **Real DOM samples (best overall)** | 7T-21T ensemble (K=1) | Most true predictions (3,851) and a 95.4% assignment rate with minimal new assignments |
| **Maximum assignment rate** | Synthetic ensemble (K=1) | 99.98% assignment rate (note: makes many novel predictions) |
| **21T data only** | 21T (K=1, Euclidean) | Trained only on 21T instrument data |
| **7T data only** | 7T (K=1, Euclidean) | Trained only on 7T instrument data |
| **Synthetic/simulated data** | Synthetic ensemble | Trained on computationally generated formulas |

## License

This model and associated code are released under the CC-BY-NC-ND 4.0 license and may only be used for non-commercial, academic research purposes with proper attribution. Any commercial use, sale, or other monetization of this model and its derivatives, which include models trained on outputs from the model or datasets created from the model, is prohibited and requires prior approval. Downloading the model requires prior registration on Hugging Face and agreeing to the terms of use. By downloading this model, you agree not to distribute, publish or reproduce a copy of the model. If another user within your organization wishes to use the model, they must register as an individual user and agree to comply with the terms of use. Users may not attempt to re-identify the deidentified data used to develop the underlying model. If you are a commercial entity, please contact the corresponding author.

---

## Contact

For any additional questions or comments, contact Fahad Saeed (fsaeed@fiu.edu).

---