codybum committed on
Commit
101c626
·
verified ·
1 Parent(s): 88c24f1

Update README.md

Files changed (1)
  1. README.md +121 -118
README.md CHANGED
@@ -74,6 +74,127 @@ This model is intended for research purposes in the field of neuropathology.
74
  * **Primary Intended Uses:**
75
  * Classification of tissue samples based on the presence/severity of neuropathological changes.
76
  * Feature extraction for quantitative analysis of neuropathology.
77
  ## How to Get Started with the Model
78
 
79
 
@@ -206,124 +327,6 @@ if __name__ == "__main__":
206
 
207
  ```
208
 
209
- ## Training Data
210
-
211
- * **Dataset(s):** The model was trained on data from the University of Kentucky.
212
- * **Name/Identifier:** UK Alzheimer's Disease Center Neuropathology Whole Slide Image Cohort [BDSA TEST v1.0]
213
- * **Source:** [UK-ADRC Neuropathology Lab at the University of Kentucky](https://neuropathlab.createuky.net/)
214
- * **Description:** The dataset contained 61 whole slide images (WSIs) of human post-mortem brain tissue sections. Sections were stained with Hematoxylin and Eosin (H&E).
215
- * **Preprocessing:** WSIs were tiled into non-overlapping 224x224 pixel patches at multiple magnification levels (40x, 10x, 2.5x, and 1.25x). For each magnification level, a maximum of 1000 tiles per annotation label were extracted to ensure balanced representation across pathological features.
216
- * **Annotation:** Regions of interest (ROIs) for Gray Matter, White Matter, Leptomeninges, Exclude, and Superficial Cortex were annotated. Annotations were completed by Allison Neltner using a [web-based tool](https://github.com/pitt-bdsa/webapps) developed by Thomas Pearce, MD (UPMC).
217
-
218
- ## Training Procedure
219
-
220
- * **Training System/Framework:** DINO-MX (Modular & Flexible Self-Supervised Training Framework)
221
- * **Training Infrastructure:** 4 x DGX H100 nodes (32 x H100 GPUs)
222
- * **Base Model (if fine-tuning):** Pretrained `facebook/dinov2-giant` loaded from Hugging Face Hub.
223
- * **Training Objective(s):** Self-supervised learning using the DINO loss and the iBOT masked-image modeling loss.
224
- * **Key Hyperparameters (example):**
225
- * Batch size: 32
226
- * Learning rate: 1.0e-4
227
- * Epochs/Iterations: 5000 Iterations
228
- * Optimizer: AdamW
229
- * Weight decay: 0.04-0.4
230
-
231
- ## Evaluation
232
-
233
- * **Task(s):** Classification, KNN, Clustering, Robustness
234
- * **Metrics:** Accuracy, Precision, Recall, F1
235
- * **Dataset(s):** Neuro Path dataset
236
- * **Results:**
237
- The model achieved strong performance across multiple evaluation methods using the Neuro Path dataset. The model architecture is based on facebook/dinov2-giant.
238
-
239
- **Linear Probe Performance:**
240
- - Accuracy: 80.17%
241
- - Precision: 79.20%
242
- - Recall: 79.60%
243
- - F1 Score: 77.88%
244
-
245
- **K-Nearest Neighbors Classification:**
246
- - Accuracy: 83.76%
247
- - Precision: 83.34%
248
- - Recall: 83.76%
249
- - F1 Score: 83.40%
250
-
251
- **Clustering Quality:**
252
- - Silhouette Score: 0.267
253
- - Adjusted Mutual Information: 0.473
254
-
255
- **Robustness Score:** 0.574
256
-
257
- **Overall Performance Score:** 0.646
258
-
259
- ### Model Comparison
260
-
261
- #### Models Evaluated
262
- * **NP-TEST-0:** Our model
263
- * **dinov2-giant:** Pretrained [Dinov2 Giant](https://huggingface.co/facebook/dinov2-giant)
264
- * **dinov2-giant_distilled_prov:** [Dinov2 Giant](https://huggingface.co/facebook/dinov2-giant) distilled from [prov-gigapath](https://huggingface.co/prov-gigapath/prov-gigapath)
265
- * **dinov2-large_distilled_prov:** [Dinov2 Large](https://huggingface.co/facebook/dinov2-large) distilled from [prov-gigapath](https://huggingface.co/prov-gigapath/prov-gigapath)
266
- * **distilled_prov_finetuned:** dinov2-giant_distilled_prov was used as a base, with additional fine-tuning without freezing the teacher model.
267
- * **prov-gigapath:** [prov-gigapath/prov-gigapath](https://huggingface.co/prov-gigapath/prov-gigapath)
268
- * **UNI:** [MahmoodLab/UNI](https://huggingface.co/MahmoodLab/UNI)
269
- * **UNI2-h:** [MahmoodLab/UNI2-h](https://huggingface.co/MahmoodLab/UNI2-h)
270
-
271
- #### Linear Probe Comparison
272
- | Model | Accuracy | F1 | Precision | Recall |
273
- |---|---|---|---|---|
274
- | NP-TEST-0 | *0.802* | *0.779* | *0.792* | *0.796* |
275
- | dinov2-giant | 0.667 | 0.648 | 0.669 | 0.667 |
276
- | dinov2-giant_distilled_prov | 0.769 | 0.756 | 0.755 | 0.769 |
277
- | dinov2-large_distilled_prov | 0.772 | 0.758 | 0.758 | 0.772 |
278
- | distilled_prov_finetuned | 0.779 | 0.762 | 0.770 | 0.779 |
279
- | prov-gigapath | 0.776 | 0.762 | 0.764 | 0.776 |
280
- | UNI | 0.741 | 0.731 | 0.734 | 0.741 |
281
- | UNI2-h | 0.768 | 0.750 | 0.753 | 0.768 |
282
-
283
- <img src="model_compare_radar.png" alt="chart" width="800"/>
284
-
285
- ### Model Evaluation Details
286
-
287
- The radar chart provides a visual comparison of multiple models across several performance metrics. Each axis extending from the center represents a different metric. The farther a model's line is from the center along a particular axis, the better its score for that specific metric (assuming higher is better for the metric).
288
-
289
- **How to Interpret:**
290
-
291
- * **Axes:** Each spoke of the radar represents a distinct evaluation metric.
292
- * **Lines/Polygons:** Each colored line (forming a polygon) represents a different model.
293
- * **Performance:** A point on an axis closer to the outer edge indicates a higher score for that metric.
294
- * **Overall Comparison:** By comparing the shapes and sizes of the polygons, you can get a quick visual understanding of the strengths and weaknesses of each model relative to others. A larger overall polygon generally suggests better all-around performance on the displayed metrics.
295
-
296
- ---
297
-
298
- **Tests**
299
-
300
- #### 1. Linear Probe
301
-
302
- * **What it is:** This test evaluates the quality of the model's learned features (embeddings). A simple linear classifier is trained on top of these frozen features to perform a classification task.
303
- * **Purpose:** It assesses how well the learned representations can be used for downstream tasks with a minimal amount of additional training. Good performance indicates that the embeddings are linearly separable and capture meaningful information.
304
- * **Metrics:** Accuracy, Precision, Recall, F1-Score (calculated for the linear classifier).
305
-
306
- #### 2. K-Nearest Neighbors (KNN) Evaluation
307
-
308
- * **What it is:** This test also evaluates the quality of the model's embeddings. Instead of training a new classifier, it uses the K-Nearest Neighbors algorithm directly on the embeddings to make predictions. For a given data point, its class is determined by the majority class among its 'k' closest neighbors in the embedding space.
309
- * **Purpose:** It assesses the local structure and similarity relationships within the embedding space. Good KNN performance suggests that similar items are close to each other in the learned representation.
310
- * **Metrics:** Accuracy, Precision, Recall, F1-Score (calculated for the KNN classifier).
311
-
312
- #### 3. Clustering
313
-
314
- * **What it is:** This set of tests evaluates how well the model's embeddings can naturally group similar items together without predefined labels (unsupervised). Algorithms like K-Means are often used to partition the data points based on their embeddings.
315
- * **Purpose:** It assesses the intrinsic structure and separability of the learned representations into meaningful groups.
316
- * **Common Metrics:**
317
- * **Silhouette Score:** Measures how similar an object is to its own cluster compared to other clusters. Ranges from -1 to 1 (higher is better).
318
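- * **Adjusted Mutual Information (AMI):** Measures the agreement between true labels (if available) and clustering assignments, adjusted for chance. Values near 0 indicate chance-level agreement (slightly negative values are possible) and 1 indicates perfect agreement (higher is better).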
319
-
320
- #### 4. Robustness
321
-
322
- * **What it is:** This is a general category of tests designed to measure how well a model maintains its performance when faced with various challenges or changes in the input data.
323
- * **Purpose:** It assesses the model's stability and reliability under non-ideal conditions.
324
- * **Examples of Challenges:** This can include noisy data, adversarial attacks (inputs intentionally designed to fool the model), out-of-distribution samples (data different from what the model was trained on), or other perturbations.
325
- * **Common Metrics:** Often a "Robustness Score" is reported, which could be an accuracy, F1-score, or other relevant metric evaluated on the challenged dataset. The specific calculation depends on the nature of the robustness test. (Higher is generally better).
326
-
327
  ---
328
 
329
  **Acknowledgements:**
 
74
  * **Primary Intended Uses:**
75
  * Classification of tissue samples based on the presence/severity of neuropathological changes.
76
  * Feature extraction for quantitative analysis of neuropathology.
77
+
78
+ ## Training Data
79
+
80
+ * **Dataset(s):** The model was trained on data from the University of Kentucky.
81
+ * **Name/Identifier:** UK Alzheimer's Disease Center Neuropathology Whole Slide Image Cohort [BDSA TEST v1.0]
82
+ * **Source:** [UK-ADRC Neuropathology Lab at the University of Kentucky](https://neuropathlab.createuky.net/)
83
+ * **Description:** The dataset contained 61 whole slide images (WSIs) of human post-mortem brain tissue sections. Sections were stained with Hematoxylin and Eosin (H&E).
84
+ * **Preprocessing:** WSIs were tiled into non-overlapping 224x224 pixel patches at multiple magnification levels (40x, 10x, 2.5x, and 1.25x). For each magnification level, a maximum of 1000 tiles per annotation label were extracted to ensure balanced representation across pathological features.
85
+ * **Annotation:** Regions of interest (ROIs) for Gray Matter, White Matter, Leptomeninges, Exclude, and Superficial Cortex were annotated. Annotations were completed by Allison Neltner using a [web-based tool](https://github.com/pitt-bdsa/webapps) developed by Thomas Pearce, MD (UPMC).
86
+
87
+ ## Training Procedure
88
+
89
+ * **Training System/Framework:** DINO-MX (Modular & Flexible Self-Supervised Training Framework)
90
+ * **Training Infrastructure:** 4 x DGX H100 nodes (32 x H100 GPUs)
91
+ * **Base Model (if fine-tuning):** Pretrained `facebook/dinov2-giant` loaded from Hugging Face Hub.
92
+ * **Training Objective(s):** Self-supervised learning using the DINO loss and the iBOT masked-image modeling loss.
93
+ * **Key Hyperparameters (example):**
94
+ * Batch size: 32
95
+ * Learning rate: 1.0e-4
96
+ * Epochs/Iterations: 5000 Iterations
97
+ * Optimizer: AdamW
98
+ * Weight decay: 0.04-0.4
99
+
100
+ ## Evaluation
101
+
102
+ * **Task(s):** Classification, KNN, Clustering, Robustness
103
+ * **Metrics:** Accuracy, Precision, Recall, F1
104
+ * **Dataset(s):** Neuro Path dataset
105
+ * **Results:**
106
+ The model achieved strong performance across multiple evaluation methods using the Neuro Path dataset.
107
+
108
+ **Linear Probe Performance:**
109
+ - Accuracy: 80.17%
110
+ - Precision: 79.20%
111
+ - Recall: 79.60%
112
+ - F1 Score: 77.88%
113
+
114
+ **K-Nearest Neighbors Classification:**
115
+ - Accuracy: 83.76%
116
+ - Precision: 83.34%
117
+ - Recall: 83.76%
118
+ - F1 Score: 83.40%
119
+
120
+ **Clustering Quality:**
121
+ - Silhouette Score: 0.267
122
+ - Adjusted Mutual Information: 0.473
123
+
124
+ **Robustness Score:** 0.574
125
+
126
+ **Overall Performance Score:** 0.646
127
+
128
+ ### Model Comparison
129
+
130
+ #### Models Evaluated
131
+ * **NP-TEST-0:** Our model
132
+ * **dinov2-giant:** Pretrained [Dinov2 Giant](https://huggingface.co/facebook/dinov2-giant)
133
+ * **dinov2-giant_distilled_prov:** [Dinov2 Giant](https://huggingface.co/facebook/dinov2-giant) distilled from [prov-gigapath](https://huggingface.co/prov-gigapath/prov-gigapath)
134
+ * **dinov2-large_distilled_prov:** [Dinov2 Large](https://huggingface.co/facebook/dinov2-large) distilled from [prov-gigapath](https://huggingface.co/prov-gigapath/prov-gigapath)
135
+ * **distilled_prov_finetuned:** dinov2-giant_distilled_prov was used as a base, with additional fine-tuning without freezing the teacher model.
136
+ * **prov-gigapath:** [prov-gigapath/prov-gigapath](https://huggingface.co/prov-gigapath/prov-gigapath)
137
+ * **UNI:** [MahmoodLab/UNI](https://huggingface.co/MahmoodLab/UNI)
138
+ * **UNI2-h:** [MahmoodLab/UNI2-h](https://huggingface.co/MahmoodLab/UNI2-h)
139
+
140
+ #### Linear Probe Comparison
141
+ | Model | Accuracy | F1 | Precision | Recall |
142
+ |---|---|---|---|---|
143
+ | NP-TEST-0 | *0.802* | *0.779* | *0.792* | *0.796* |
144
+ | dinov2-giant | 0.667 | 0.648 | 0.669 | 0.667 |
145
+ | dinov2-giant_distilled_prov | 0.769 | 0.756 | 0.755 | 0.769 |
146
+ | dinov2-large_distilled_prov | 0.772 | 0.758 | 0.758 | 0.772 |
147
+ | distilled_prov_finetuned | 0.779 | 0.762 | 0.770 | 0.779 |
148
+ | prov-gigapath | 0.776 | 0.762 | 0.764 | 0.776 |
149
+ | UNI | 0.741 | 0.731 | 0.734 | 0.741 |
150
+ | UNI2-h | 0.768 | 0.750 | 0.753 | 0.768 |
151
+
152
+ <img src="model_compare_radar.png" alt="chart" width="800"/>
153
+
154
+ ### Model Evaluation Details
155
+
156
+ The radar chart provides a visual comparison of multiple models across several performance metrics. Each axis extending from the center represents a different metric. The farther a model's line is from the center along a particular axis, the better its score for that specific metric (assuming higher is better for the metric).
157
+
158
+ **How to Interpret:**
159
+
160
+ * **Axes:** Each spoke of the radar represents a distinct evaluation metric.
161
+ * **Lines/Polygons:** Each colored line (forming a polygon) represents a different model.
162
+ * **Performance:** A point on an axis closer to the outer edge indicates a higher score for that metric.
163
+ * **Overall Comparison:** By comparing the shapes and sizes of the polygons, you can get a quick visual understanding of the strengths and weaknesses of each model relative to others. A larger overall polygon generally suggests better all-around performance on the displayed metrics.
164
+
165
+ ---
166
+
167
+ **Tests**
168
+
169
+ #### 1. Linear Probe
170
+
171
+ * **What it is:** This test evaluates the quality of the model's learned features (embeddings). A simple linear classifier is trained on top of these frozen features to perform a classification task.
172
+ * **Purpose:** It assesses how well the learned representations can be used for downstream tasks with a minimal amount of additional training. Good performance indicates that the embeddings are linearly separable and capture meaningful information.
173
+ * **Metrics:** Accuracy, Precision, Recall, F1-Score (calculated for the linear classifier).
174
+
175
+ #### 2. K-Nearest Neighbors (KNN) Evaluation
176
+
177
+ * **What it is:** This test also evaluates the quality of the model's embeddings. Instead of training a new classifier, it uses the K-Nearest Neighbors algorithm directly on the embeddings to make predictions. For a given data point, its class is determined by the majority class among its 'k' closest neighbors in the embedding space.
178
+ * **Purpose:** It assesses the local structure and similarity relationships within the embedding space. Good KNN performance suggests that similar items are close to each other in the learned representation.
179
+ * **Metrics:** Accuracy, Precision, Recall, F1-Score (calculated for the KNN classifier).
180
+
181
+ #### 3. Clustering
182
+
183
+ * **What it is:** This set of tests evaluates how well the model's embeddings can naturally group similar items together without predefined labels (unsupervised). Algorithms like K-Means are often used to partition the data points based on their embeddings.
184
+ * **Purpose:** It assesses the intrinsic structure and separability of the learned representations into meaningful groups.
185
+ * **Common Metrics:**
186
+ * **Silhouette Score:** Measures how similar an object is to its own cluster compared to other clusters. Ranges from -1 to 1 (higher is better).
187
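+ * **Adjusted Mutual Information (AMI):** Measures the agreement between true labels (if available) and clustering assignments, adjusted for chance. Values near 0 indicate chance-level agreement (slightly negative values are possible) and 1 indicates perfect agreement (higher is better).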
188
+
189
+ #### 4. Robustness
190
+
191
+ * **What it is:** This is a general category of tests designed to measure how well a model maintains its performance when faced with various challenges or changes in the input data.
192
+ * **Purpose:** It assesses the model's stability and reliability under non-ideal conditions.
193
+ * **Examples of Challenges:** This can include noisy data, adversarial attacks (inputs intentionally designed to fool the model), out-of-distribution samples (data different from what the model was trained on), or other perturbations.
194
+ * **Common Metrics:** Often a "Robustness Score" is reported, which could be an accuracy, F1-score, or other relevant metric evaluated on the challenged dataset. The specific calculation depends on the nature of the robustness test. (Higher is generally better).
195
+
196
+ ---
197
+
198
  ## How to Get Started with the Model
199
 
200
 
 
327
 
328
  ```
329
 
330
  ---
331
 
332
  **Acknowledgements:**