Kentucky-Open-Science/NP-TEST-0

@@ -74,6 +74,127 @@ This model is intended for research purposes in the field of neuropathology.
 * **Primary Intended Uses:**
     * Classification of tissue samples based on the presence/severity of neuropathological changes.
     * Feature extraction for quantitative analysis of neuropathology.
 ## How to Get Started with the Model
@@ -206,124 +327,6 @@ if __name__ == "__main__":
 ```
-## Training Data
-* **Dataset(s):** The model was trained on data from the University of Kentucky.
-    * **Name/Identifier:** UK Alzheimer's Disease Center Neuropathology Whole Slide Image Cohort [BDSA TEST v1.0]
-    * **Source:** [UK-ADRC Neuropathology Lab at the University of Kentucky](https://neuropathlab.createuky.net/)
-    * **Description:** The dataset contained 61 whole slide images (WSIs) of human post-mortem brain tissue sections. Sections were stained with Hematoxylin and Eosin (H&E).
-    * **Preprocessing:** WSIs were tiled into non-overlapping 224x224 pixel patches at multiple magnification levels (40x, 10x, 2.5x, and 1.25x). For each magnification level, a maximum of 1000 tiles per annotation label were extracted to ensure balanced representation across pathological features.
-    * **Annotation:** Regions of interest (ROIs) for Gray Matter, White Matter, Leptomeninges, Exclude, and Superficial Cortex were annotated. Annotations were completed by Allison Neltner using a [web-based tool](https://github.com/pitt-bdsa/webapps) developed by Thomas Pearce, MD (UPMC).
-## Training Procedure
-* **Training System/Framework:** DINO-MX (Modular & Flexible Self-Supervised Training Framework)
-* **Training Infrastructure:** 4 x DGX H100 nodes (32 x H100 GPUs)
-* **Base Model (if fine-tuning):** Pretrained `facebook/dinov2-giant` loaded from Hugging Face Hub.
-* **Training Objective(s):** Self-supervised learning using DINO loss and iBOT masked-image modeling loss.
-* **Key Hyperparameters (example):**
-    * Batch size: 32
-    * Learning rate: 1.0e-4
-    * Epochs/Iterations: 5000 Iterations
-    * Optimizer: AdamW
-    * Weight decay: 0.04-0.4
-## Evaluation
-* **Task(s):** Classification, KNN, Clustering, Robustness
-* **Metrics:** Accuracy, Precision, Recall, F1
-* **Dataset(s):** Neuro Path dataset
-* **Results:**
-  The model achieved strong performance across multiple evaluation methods using the Neuro Path dataset. The model architecture is based on facebook/dinov2-giant.
-  **Linear Probe Performance:**
-  - Accuracy: 80.17%
-  - Precision: 79.20%
-  - Recall: 79.60%
-  - F1 Score: 77.88%
-  **K-Nearest Neighbors Classification:**
-  - Accuracy: 83.76%
-  - Precision: 83.34%
-  - Recall: 83.76%
-  - F1 Score: 83.40%
-  **Clustering Quality:**
-  - Silhouette Score: 0.267
-  - Adjusted Mutual Information: 0.473
-  **Robustness Score:** 0.574
-  **Overall Performance Score:** 0.646
-### Model Comparison
-#### Models Evaluated
-* **NP-TEST-0:** Our model
-* **dinov2-giant:** Pretrained [DINOv2 Giant](https://huggingface.co/facebook/dinov2-giant)
-* **dinov2-giant_distilled_prov:** [DINOv2 Giant](https://huggingface.co/facebook/dinov2-giant) distilled from [prov-gigapath](https://huggingface.co/prov-gigapath/prov-gigapath)
-* **dinov2-large_distilled_prov:** [DINOv2 Large](https://huggingface.co/facebook/dinov2-large) distilled from [prov-gigapath](https://huggingface.co/prov-gigapath/prov-gigapath)
-* **distilled_prov_finetuned:** dinov2-giant_distilled_prov was used as a base, with additional fine-tuning without freezing the teacher model.
-* **prov-gigapath:** [prov-gigapath/prov-gigapath](https://huggingface.co/prov-gigapath/prov-gigapath)
-* **UNI:** [MahmoodLab/UNI](https://huggingface.co/MahmoodLab/UNI)
-* **UNI2-h:** [MahmoodLab/UNI2-h](https://huggingface.co/MahmoodLab/UNI2-h)
-#### Linear Probe Comparison
-| Model | Accuracy | F1 | Precision | Recall |
-|---|---|---|---|---|
-| NP-TEST-0 | *0.802* | *0.779* | *0.792* | *0.796* |
-| dinov2-giant | 0.667 | 0.648 | 0.669 | 0.667 |
-| dinov2-giant_distilled_prov | 0.769 | 0.756 | 0.755 | 0.769 |
-| dinov2-large_distilled_prov | 0.772 | 0.758 | 0.758 | 0.772 |
-| distilled_prov_finetuned | 0.779 | 0.762 | 0.770 | 0.779 |
-| prov-gigapath | 0.776 | 0.762 | 0.764 | 0.776 |
-| UNI | 0.741 | 0.731 | 0.734 | 0.741 |
-| UNI2-h | 0.768 | 0.750 | 0.753 | 0.768 |
-<img src="model_compare_radar.png" alt="chart" width="800"/>
-### Model Evaluation Details
-The radar chart provides a visual comparison of multiple models across several performance metrics. Each axis extending from the center represents a different metric. The farther a model's line is from the center along a particular axis, the better its score for that specific metric (assuming higher is better for the metric).
-**How to Interpret:**
-* **Axes:** Each spoke of the radar represents a distinct evaluation metric.
-* **Lines/Polygons:** Each colored line (forming a polygon) represents a different model.
-* **Performance:** A point on an axis closer to the outer edge indicates a higher score for that metric.
-* **Overall Comparison:** By comparing the shapes and sizes of the polygons, you can get a quick visual understanding of the strengths and weaknesses of each model relative to others. A larger overall polygon generally suggests better all-around performance on the displayed metrics.
----
-**Tests**
-#### 1. Linear Probe
-* **What it is:** This test evaluates the quality of the model's learned features (embeddings). A simple linear classifier is trained on top of these frozen features to perform a classification task.
-* **Purpose:** It assesses how well the learned representations can be used for downstream tasks with a minimal amount of additional training. Good performance indicates that the embeddings are linearly separable and capture meaningful information.
-* **Metrics:** Accuracy, Precision, Recall, F1-Score (calculated for the linear classifier).
-#### 2. K-Nearest Neighbors (KNN) Evaluation
-* **What it is:** This test also evaluates the quality of the model's embeddings. Instead of training a new classifier, it uses the K-Nearest Neighbors algorithm directly on the embeddings to make predictions. For a given data point, its class is determined by the majority class among its 'k' closest neighbors in the embedding space.
-* **Purpose:** It assesses the local structure and similarity relationships within the embedding space. Good KNN performance suggests that similar items are close to each other in the learned representation.
-* **Metrics:** Accuracy, Precision, Recall, F1-Score (calculated for the KNN classifier).
-#### 3. Clustering
-* **What it is:** This set of tests evaluates how well the model's embeddings can naturally group similar items together without predefined labels (unsupervised). Algorithms like K-Means are often used to partition the data points based on their embeddings.
-* **Purpose:** It assesses the intrinsic structure and separability of the learned representations into meaningful groups.
-* **Common Metrics:**
-    * **Silhouette Score:** Measures how similar an object is to its own cluster compared to other clusters. Ranges from -1 to 1 (higher is better).
-    * **Adjusted Mutual Information (AMI):** Measures the agreement between true labels (if available) and clustering assignments, adjusted for chance. Scores near 0 indicate chance-level agreement and 1 indicates a perfect match; slightly negative values are possible (higher is better).
-#### 4. Robustness
-* **What it is:** This is a general category of tests designed to measure how well a model maintains its performance when faced with various challenges or changes in the input data.
-* **Purpose:** It assesses the model's stability and reliability under non-ideal conditions.
-* **Examples of Challenges:** This can include noisy data, adversarial attacks (inputs intentionally designed to fool the model), out-of-distribution samples (data different from what the model was trained on), or other perturbations.
-* **Common Metrics:** Often a "Robustness Score" is reported, which could be an accuracy, F1-score, or other relevant metric evaluated on the challenged dataset. The specific calculation depends on the nature of the robustness test. (Higher is generally better).
 ---
 **Acknowledgements:**

 * **Primary Intended Uses:**
     * Classification of tissue samples based on the presence/severity of neuropathological changes.
     * Feature extraction for quantitative analysis of neuropathology.
+## Training Data
+* **Dataset(s):** The model was trained on data from the University of Kentucky.
+    * **Name/Identifier:** UK Alzheimer's Disease Center Neuropathology Whole Slide Image Cohort [BDSA TEST v1.0]
+    * **Source:** [UK-ADRC Neuropathology Lab at the University of Kentucky](https://neuropathlab.createuky.net/)
+    * **Description:** The dataset contained 61 whole slide images (WSIs) of human post-mortem brain tissue sections. Sections were stained with Hematoxylin and Eosin (H&E).
+    * **Preprocessing:** WSIs were tiled into non-overlapping 224x224 pixel patches at multiple magnification levels (40x, 10x, 2.5x, and 1.25x). For each magnification level, a maximum of 1000 tiles per annotation label were extracted to ensure balanced representation across pathological features.
+    * **Annotation:** Regions of interest (ROIs) for Gray Matter, White Matter, Leptomeninges, Exclude, and Superficial Cortex were annotated. Annotations were completed by Allison Neltner using a [web-based tool](https://github.com/pitt-bdsa/webapps) developed by Thomas Pearce, MD (UPMC).
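The preprocessing step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names `tile_origins` and `cap_per_label`, the use of random subsampling to apply the cap, and the toy slide dimensions and label layout are all assumptions. It enumerates non-overlapping 224x224 tile origins and keeps at most 1000 tiles per annotation label.

```python
import random
from collections import defaultdict

TILE = 224            # tile edge in pixels, as stated in the model card
MAX_PER_LABEL = 1000  # cap per annotation label at one magnification level


def tile_origins(width, height, tile=TILE):
    """Top-left corners of non-overlapping tiles that fit fully inside the image."""
    return [(x, y)
            for y in range(0, height - tile + 1, tile)
            for x in range(0, width - tile + 1, tile)]


def cap_per_label(labeled_tiles, cap=MAX_PER_LABEL, seed=0):
    """Keep at most `cap` tiles per label (random subsampling is an assumption)."""
    by_label = defaultdict(list)
    for origin, label in labeled_tiles:
        by_label[label].append(origin)
    rng = random.Random(seed)
    return {label: (rng.sample(found, cap) if len(found) > cap else found)
            for label, found in by_label.items()}


# Toy example: a 20000 x 16000 pixel level whose left half is labeled
# "Gray Matter" and right half "White Matter" (hypothetical annotation).
origins = tile_origins(20_000, 16_000)
labeled = [(o, "Gray Matter" if o[0] < 10_000 else "White Matter") for o in origins]
capped = cap_per_label(labeled)
```

Running the same two functions once per magnification level would give the per-level cap the card describes.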
+## Training Procedure
+* **Training System/Framework:** DINO-MX (Modular & Flexible Self-Supervised Training Framework)
+* **Training Infrastructure:** 4 x DGX H100 nodes (32 x H100 GPUs)
+* **Base Model (if fine-tuning):** Pretrained `facebook/dinov2-giant` loaded from Hugging Face Hub.
+* **Training Objective(s):** Self-supervised learning using DINO loss and iBOT masked-image modeling loss.
+* **Key Hyperparameters (example):**
+    * Batch size: 32
+    * Learning rate: 1.0e-4
+    * Epochs/Iterations: 5000 Iterations
+    * Optimizer: AdamW
+    * Weight decay: 0.04-0.4
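The weight-decay entry above is given as a range, which suggests a schedule rather than a single value. The sketch below assumes a cosine ramp from 0.04 to 0.4 over the 5000 training iterations, which is the convention in the public DINO and DINOv2 training recipes. The model card does not state the schedule shape, and `cosine_schedule` is a hypothetical helper, not a DINO-MX API.

```python
import math


def cosine_schedule(start, end, total_iters):
    """Per-iteration values moving from `start` to `end` along a half cosine."""
    if total_iters < 2:
        return [start] * total_iters
    return [end + 0.5 * (start - end) * (1 + math.cos(math.pi * i / (total_iters - 1)))
            for i in range(total_iters)]


# Assumed reading of "Weight decay: 0.04-0.4" over 5000 iterations.
weight_decay = cosine_schedule(0.04, 0.4, 5000)
```

With a schedule like this, the optimizer's weight decay would be updated at every iteration from the corresponding list entry.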
+## Evaluation
+* **Task(s):** Classification, KNN, Clustering, Robustness
+* **Metrics:** Accuracy, Precision, Recall, F1
+* **Dataset(s):** Neuro Path dataset
+* **Results:**
+  The model achieved strong performance across multiple evaluation methods using the Neuro Path dataset.
+  **Linear Probe Performance:**
+  - Accuracy: 80.17%
+  - Precision: 79.20%
+  - Recall: 79.60%
+  - F1 Score: 77.88%
+  **K-Nearest Neighbors Classification:**
+  - Accuracy: 83.76%
+  - Precision: 83.34%
+  - Recall: 83.76%
+  - F1 Score: 83.40%
+  **Clustering Quality:**
+  - Silhouette Score: 0.267
+  - Adjusted Mutual Information: 0.473
+  **Robustness Score:** 0.574
+  **Overall Performance Score:** 0.646
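For reference, the four classification metrics reported above can be computed from true and predicted labels as sketched below. The model card does not say how per-class scores are averaged, so macro averaging is an assumption here, and `macro_metrics` and the toy labels are illustrative only.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels borrowed from the annotation classes; the predictions are made up.
y_true = ["Gray Matter", "Gray Matter", "Gray Matter",
          "White Matter", "White Matter", "Leptomeninges"]
y_pred = ["Gray Matter", "Gray Matter", "White Matter",
          "White Matter", "White Matter", "Gray Matter"]
scores = macro_metrics(y_true, y_pred)
```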
+### Model Comparison
+#### Models Evaluated
+* **NP-TEST-0:** Our model
+* **dinov2-giant:** Pretrained [Dinov2 Giant](https://huggingface.co/facebook/dinov2-giant)
+* **dinov2-giant_distilled_prov:** [DINOv2 Giant](https://huggingface.co/facebook/dinov2-giant) distilled from [prov-gigapath](https://huggingface.co/prov-gigapath/prov-gigapath)
+* **dinov2-large_distilled_prov:** [DINOv2 Large](https://huggingface.co/facebook/dinov2-large) distilled from [prov-gigapath](https://huggingface.co/prov-gigapath/prov-gigapath)
+* **distilled_prov_finetuned:** dinov2-giant_distilled_prov was used as a base, with additional fine-tuning without freezing the teacher model.
+* **prov-gigapath:** [prov-gigapath/prov-gigapath](https://huggingface.co/prov-gigapath/prov-gigapath)
+* **UNI:** [MahmoodLab/UNI](https://huggingface.co/MahmoodLab/UNI)
+* **UNI2-h:** [MahmoodLab/UNI2-h](https://huggingface.co/MahmoodLab/UNI2-h)
+#### Linear Probe Comparison
+| Model | Accuracy | F1 | Precision | Recall |
+|---|---|---|---|---|
+| NP-TEST-0 | *0.802* | *0.779* | *0.792* | *0.796* |
+| dinov2-giant | 0.667 | 0.648 | 0.669 | 0.667 |
+| dinov2-giant_distilled_prov | 0.769 | 0.756 | 0.755 | 0.769 |
+| dinov2-large_distilled_prov | 0.772 | 0.758 | 0.758 | 0.772 |
+| distilled_prov_finetuned | 0.779 | 0.762 | 0.770 | 0.779 |
+| prov-gigapath | 0.776 | 0.762 | 0.764 | 0.776 |
+| UNI | 0.741 | 0.731 | 0.734 | 0.741 |
+| UNI2-h | 0.768 | 0.750 | 0.753 | 0.768 |
+<img src="model_compare_radar.png" alt="chart" width="800"/>
+### Model Evaluation Details
+The radar chart provides a visual comparison of multiple models across several performance metrics. Each axis extending from the center represents a different metric. The farther a model's line is from the center along a particular axis, the better its score for that specific metric (assuming higher is better for the metric).
+**How to Interpret:**
+* **Axes:** Each spoke of the radar represents a distinct evaluation metric.
+* **Lines/Polygons:** Each colored line (forming a polygon) represents a different model.
+* **Performance:** A point on an axis closer to the outer edge indicates a higher score for that metric.
+* **Overall Comparison:** By comparing the shapes and sizes of the polygons, you can get a quick visual understanding of the strengths and weaknesses of each model relative to others. A larger overall polygon generally suggests better all-around performance on the displayed metrics.
+---
+**Tests**
+#### 1. Linear Probe
+* **What it is:** This test evaluates the quality of the model's learned features (embeddings). A simple linear classifier is trained on top of these frozen features to perform a classification task.
+* **Purpose:** It assesses how well the learned representations can be used for downstream tasks with a minimal amount of additional training. Good performance indicates that the embeddings are linearly separable and capture meaningful information.
+* **Metrics:** Accuracy, Precision, Recall, F1-Score (calculated for the linear classifier).
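A linear probe can be illustrated with a small, dependency-free sketch. The evaluation code used for this model card is not shown, so everything here is an assumption: the probe is a binary logistic regression trained by plain gradient descent, the names `train_linear_probe` and `predict` are hypothetical, and two synthetic Gaussian blobs stand in for frozen embeddings.

```python
import math
import random


def train_linear_probe(features, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe; the features are inputs only, never updated."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus target
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Class 1 if the linear score is positive, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Synthetic stand-in for frozen embeddings: two Gaussian blobs in 4 dimensions.
rng = random.Random(0)


def blob(center, count):
    return [[rng.gauss(c, 1.0) for c in center] for _ in range(count)]


train_x = blob([2, 2, 2, 2], 100) + blob([-2, -2, -2, -2], 100)
train_y = [1] * 100 + [0] * 100
test_x = blob([2, 2, 2, 2], 50) + blob([-2, -2, -2, -2], 50)
test_y = [1] * 50 + [0] * 50

w, b = train_linear_probe(train_x, train_y)
accuracy = sum(predict(w, b, x) == y for x, y in zip(test_x, test_y)) / len(test_y)
```

Only `w` and `b` are learned, which is what makes the resulting accuracy a measure of the embeddings rather than of the classifier.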
+#### 2. K-Nearest Neighbors (KNN) Evaluation
+* **What it is:** This test also evaluates the quality of the model's embeddings. Instead of training a new classifier, it uses the K-Nearest Neighbors algorithm directly on the embeddings to make predictions. For a given data point, its class is determined by the majority class among its 'k' closest neighbors in the embedding space.
+* **Purpose:** It assesses the local structure and similarity relationships within the embedding space. Good KNN performance suggests that similar items are close to each other in the learned representation.
+* **Metrics:** Accuracy, Precision, Recall, F1-Score (calculated for the KNN classifier).
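The majority-vote rule described above can be sketched in a few lines. The value of k and the distance metric used for the reported results are not stated in the card, so `k=3` and Euclidean distance are assumptions, and the two-dimensional points are toy stand-ins for real embeddings.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=5):
    """Majority class among the k training embeddings closest to `query`."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D "embeddings" forming two well-separated groups.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
train_y = ["Gray Matter"] * 3 + ["White Matter"] * 3

near_first_group = knn_predict(train_x, train_y, (0.3, 0.3), k=3)
near_second_group = knn_predict(train_x, train_y, (4.5, 4.7), k=3)
```

No parameters are fitted, so the result depends entirely on which points the embedding space places close together.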
+#### 3. Clustering
+* **What it is:** This set of tests evaluates how well the model's embeddings can naturally group similar items together without predefined labels (unsupervised). Algorithms like K-Means are often used to partition the data points based on their embeddings.
+* **Purpose:** It assesses the intrinsic structure and separability of the learned representations into meaningful groups.
+* **Common Metrics:**
+    * **Silhouette Score:** Measures how similar an object is to its own cluster compared to other clusters. Ranges from -1 to 1 (higher is better).
+    * **Adjusted Mutual Information (AMI):** Measures the agreement between true labels (if available) and clustering assignments, adjusted for chance. Scores near 0 indicate chance-level agreement and 1 indicates a perfect match; slightly negative values are possible (higher is better).
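To make the silhouette score concrete, here is a small from-scratch sketch using Euclidean distance. It follows the standard definition (for each point, `a` is the mean distance to its own cluster, `b` is the lowest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`), but it is not the evaluation code behind the reported 0.267. The function name and toy points are illustrative, and at least two clusters are required.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (requires at least 2 clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # Mean distance to the other members of the same cluster.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # Lowest mean distance to the members of any other cluster.
        b = min(sum(math.dist(point, q) for q in other) / len(other)
                for name, other in clusters.items() if name != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
well_separated = silhouette_score(points, [0, 0, 1, 1])  # clusters match the geometry
mixed_up = silhouette_score(points, [0, 1, 0, 1])        # clusters cut across it
```

The first labeling scores close to 1 and the second scores below 0, which is the contrast the metric is meant to capture.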
+#### 4. Robustness
+* **What it is:** This is a general category of tests designed to measure how well a model maintains its performance when faced with various challenges or changes in the input data.
+* **Purpose:** It assesses the model's stability and reliability under non-ideal conditions.
+* **Examples of Challenges:** This can include noisy data, adversarial attacks (inputs intentionally designed to fool the model), out-of-distribution samples (data different from what the model was trained on), or other perturbations.
+* **Common Metrics:** Often a "Robustness Score" is reported, which could be an accuracy, F1-score, or other relevant metric evaluated on the challenged dataset. The specific calculation depends on the nature of the robustness test. (Higher is generally better).
+---
 ## How to Get Started with the Model
 ```
 ---
 **Acknowledgements:**