Update README.md
README.md
CHANGED
|
@@ -250,7 +250,57 @@ if __name__ == "__main__":
|
|
| 250 |
**Robustness Score:** 0.574
|
| 251 |
|
| 252 |
**Overall Performance Score:** 0.646
|
| 253 |
-
|
| 254 |
|
| 255 |
## Ethical Considerations
|
| 256 |
|
|
|
|
| 250 |
**Robustness Score:** 0.574
|
| 251 |
|
| 252 |
**Overall Performance Score:** 0.646
|
| 253 |
+
|
| 254 |
+
### Model Evaluation Radar Chart
|
| 255 |
+
|
| 256 |
+
The radar chart provides a visual comparison of multiple models across several performance metrics. Each axis extending from the center represents a different metric. The farther a model's line is from the center along a particular axis, the better its score for that specific metric (assuming higher is better for the metric).
|
| 257 |
+
|
| 258 |
+
**How to Interpret:**
|
| 259 |
+
|
| 260 |
+
* **Axes:** Each spoke of the radar represents a distinct evaluation metric.
|
| 261 |
+
* **Lines/Polygons:** Each colored line (forming a polygon) represents a different model.
|
| 262 |
+
* **Performance:** A point on an axis closer to the outer edge indicates a higher score for that metric.
|
| 263 |
+
* **Overall Comparison:** Comparing the shapes and sizes of the polygons gives a quick visual sense of each model's strengths and weaknesses relative to the others. A larger polygon generally suggests better all-around performance on the displayed metrics, but the enclosed area also depends on how each axis is scaled and on the order in which the axes are arranged, so compare individual axes before drawing conclusions.
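
To make the "larger polygon" reading concrete, here is a minimal sketch of the geometry behind a radar chart. It is not taken from the plotting script: `radar_vertices` and `polygon_area` are hypothetical helper names, and the scores are made-up values in the range 0 to 1. It places each score on an evenly spaced spoke and measures the enclosed area with the shoelace formula.

```python
import math


def radar_vertices(scores):
    """Place each score on its own evenly spaced spoke and return (x, y) points."""
    n = len(scores)
    return [
        (r * math.cos(2 * math.pi * i / n), r * math.sin(2 * math.pi * i / n))
        for i, r in enumerate(scores)
    ]


def polygon_area(points):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


# Hypothetical scores: the same four values, arranged on the axes in two orders.
alternating = [1.0, 0.2, 1.0, 0.2]
grouped = [1.0, 1.0, 0.2, 0.2]

area_alternating = polygon_area(radar_vertices(alternating))
area_grouped = polygon_area(radar_vertices(grouped))
print(f"alternating: {area_alternating:.2f}, grouped: {area_grouped:.2f}")
```

The two arrangements contain identical scores yet enclose different areas (about 0.40 versus 0.72), which is why polygon size is best treated as a rough visual cue, not as a score in its own right.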
|
| 264 |
+
|
| 265 |
+
---
|
| 266 |
+
|
| 267 |
+
### Evaluation Tests Displayed:
|
| 268 |
+
|
| 269 |
+
The chart displays results from several standard evaluation tests when their metrics are present in the `evaluation_results.json` files. The script also automatically discovers and plots other custom numeric metrics found under the `"components"` key of those files.
|
| 270 |
+
|
| 271 |
+
#### 1. Linear Probe
|
| 272 |
+
|
| 273 |
+
* **What it is:** This test evaluates the quality of the model's learned features (embeddings). A simple linear classifier is trained on top of these frozen features to perform a classification task.
|
| 274 |
+
* **Purpose:** It assesses how well the learned representations can be used for downstream tasks with a minimal amount of additional training. Good performance indicates that the embeddings are linearly separable and capture meaningful information.
|
| 275 |
+
* **Common Metrics:** Accuracy, Precision, Recall, F1-Score (calculated for the linear classifier).
|
| 276 |
+
|
| 277 |
+
#### 2. K-Nearest Neighbors (KNN) Evaluation
|
| 278 |
+
|
| 279 |
+
* **What it is:** This test also evaluates the quality of the model's embeddings. Instead of training a new classifier, it uses the K-Nearest Neighbors algorithm directly on the embeddings to make predictions. For a given data point, its class is determined by the majority class among its 'k' closest neighbors in the embedding space.
|
| 280 |
+
* **Purpose:** It assesses the local structure and similarity relationships within the embedding space. Good KNN performance suggests that similar items are close to each other in the learned representation.
|
| 281 |
+
* **Common Metrics:** Accuracy, Precision, Recall, F1-Score (calculated for the KNN classifier).
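
The majority-vote rule described above can be sketched in a few lines of plain Python. This is an illustration only, not the evaluation code behind the chart: `knn_predict` is a hypothetical helper, the embeddings are toy 2-D points, and Euclidean distance is an assumption (the actual evaluation may use cosine similarity or another measure).

```python
import math
from collections import Counter


def knn_predict(train_embeddings, train_labels, query, k=3):
    """Return the majority label among the k training points closest to `query`."""
    # Pair every training point's distance to the query with its label.
    by_distance = sorted(
        zip((math.dist(query, e) for e in train_embeddings), train_labels),
        key=lambda pair: pair[0],
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # On a tie, Counter keeps the label that appears first among the neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy embeddings forming two well-separated groups.
embeddings = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(embeddings, labels, (0.2, 0.2)))
print(knn_predict(embeddings, labels, (5.5, 5.5)))
```

Accuracy and the other listed metrics then follow by comparing such predictions against held-out true labels.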
|
| 282 |
+
|
| 283 |
+
#### 3. Clustering
|
| 284 |
+
|
| 285 |
+
* **What it is:** This set of tests evaluates how well the model's embeddings can naturally group similar items together without predefined labels (unsupervised). Algorithms like K-Means are often used to partition the data points based on their embeddings.
|
| 286 |
+
* **Purpose:** It assesses the intrinsic structure and separability of the learned representations into meaningful groups.
|
| 287 |
+
* **Common Metrics:**
|
| 288 |
+
* **Silhouette Score:** Measures how similar an object is to its own cluster compared to other clusters. Ranges from -1 to 1 (higher is better).
|
| 289 |
+
* **Adjusted Mutual Information (AMI):** Measures the agreement between true labels (if available) and clustering assignments, adjusted for chance. It equals 1 for identical partitions, is around 0 for random assignments, and can be slightly negative (higher is better).
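
As a rough illustration of how the Silhouette Score is computed, the sketch below implements the standard definition in plain Python: for each point, `a` is the mean distance to the other members of its own cluster, `b` is the lowest mean distance to any other cluster, and the point's score is `(b - a) / max(a, b)`. The function name and the toy data are assumptions for illustration, and Euclidean distance is assumed; the evaluation itself may rely on a library implementation.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # By convention, a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # Mean distance to the other members (the point's distance to itself is 0).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # Lowest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight groups far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])   # labels match the groups
bad = silhouette_score(points, [0, 1, 0, 1])    # labels cut across the groups
print(f"good: {good:.3f}, bad: {bad:.3f}")
```

A labeling that matches the geometry scores close to 1, while one that cuts across the groups scores below 0.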
|
| 290 |
+
|
| 291 |
+
#### 4. Robustness
|
| 292 |
+
|
| 293 |
+
* **What it is:** This is a general category of tests designed to measure how well a model maintains its performance when faced with various challenges or changes in the input data.
|
| 294 |
+
* **Purpose:** It assesses the model's stability and reliability under non-ideal conditions.
|
| 295 |
+
* **Examples of Challenges:** This can include noisy data, adversarial attacks (inputs intentionally designed to fool the model), out-of-distribution samples (data different from what the model was trained on), or other perturbations.
|
| 296 |
+
* **Common Metrics:** A "Robustness Score" is often reported. It may be accuracy, F1-score, or another relevant metric evaluated on the perturbed dataset; the exact calculation depends on the nature of the robustness test. Higher is generally better.
|
| 297 |
+
|
| 298 |
+
---
|
| 299 |
+
|
| 300 |
+
**Custom Metrics:**
|
| 301 |
+
The radar chart will also display any other top-level numeric metrics or metrics nested one level deep within dictionaries found under the `"components"` key in your `evaluation_results.json` files. The names for these custom metrics on the chart will be derived from their keys in the JSON file (e.g., a key `inference_time_ms` would appear as `Inference time ms`).
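
The discovery and naming rules can be sketched as follows. This is one plausible reading of the description, not the script's actual code: `discover_metrics` and `label_for` are hypothetical names, the sample JSON is invented, and joining the parent and child keys for nested metrics is an assumption, since the text only specifies the label for a flat key.

```python
import json
from numbers import Real


def label_for(key):
    """Turn a JSON key such as 'inference_time_ms' into 'Inference time ms'."""
    return key.replace("_", " ").capitalize()


def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def discover_metrics(results):
    """Collect numeric values under "components", at most one level deep."""
    metrics = {}
    for key, value in results.get("components", {}).items():
        if is_number(value):
            metrics[label_for(key)] = float(value)
        elif isinstance(value, dict):
            for sub_key, sub_value in value.items():
                if is_number(sub_value):
                    # Assumption: nested labels combine parent and child keys.
                    metrics[label_for(f"{key}_{sub_key}")] = float(sub_value)
    return metrics


# Invented sample standing in for the contents of an evaluation_results.json file.
example = json.loads("""
{
  "components": {
    "inference_time_ms": 12.5,
    "knn": {"accuracy": 0.81, "note": "k=5"},
    "converged": true
  }
}
""")

print(discover_metrics(example))
```

Non-numeric entries such as strings and booleans are skipped, so only values that can be drawn on an axis reach the chart.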
|
| 302 |
+
|
| 303 |
+
|
| 304 |
|
| 305 |
## Ethical Considerations
|
| 306 |
|