codybum committed on
Commit 5cb9330 · verified · 1 Parent(s): 9158ca2

Update README.md

Files changed (1)
  1. README.md +51 -1
README.md CHANGED
@@ -250,7 +250,57 @@ if __name__ == "__main__":
  **Robustness Score:** 0.574
 
  **Overall Performance Score:** 0.646
-
+
+ ### Model Evaluation Radar Chart
+
+ The radar chart compares multiple models across several performance metrics. Each axis extending from the center represents one metric. The farther a model's line lies from the center along an axis, the better its score on that metric (assuming higher is better for the metric).
+
+ **How to Interpret:**
+
+ * **Axes:** Each spoke of the radar represents a distinct evaluation metric.
+ * **Lines/Polygons:** Each colored line (forming a polygon) represents a different model.
+ * **Performance:** A point closer to the outer edge of an axis indicates a higher score for that metric.
+ * **Overall Comparison:** Comparing the shapes and sizes of the polygons gives a quick visual sense of each model's strengths and weaknesses relative to the others. A larger polygon generally suggests better all-around performance on the displayed metrics, although the enclosed area also depends on the order in which the axes are drawn.
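
To make the polygon-size comparison concrete, here is a minimal sketch that computes the area enclosed by one model's polygon. It assumes evenly spaced axes and scores already scaled to a common range; the function name `radar_polygon_area` and the sample scores are hypothetical and are not taken from the plotting script.

```python
import math


def radar_polygon_area(scores):
    """Area enclosed by a radar polygon whose axes are evenly spaced.

    `scores` holds one model's values in the order the axes are drawn.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("a radar polygon needs at least three axes")
    # Each pair of neighbouring axes spans a triangle with area
    # 0.5 * r_i * r_next * sin(angle between the axes).
    wedge = math.sin(2 * math.pi / n) / 2
    return wedge * sum(scores[i] * scores[(i + 1) % n] for i in range(n))


# Hypothetical scores for two models on four metrics, each scaled to [0, 1].
model_a = [0.8, 0.7, 0.9, 0.6]
model_b = [0.5, 0.6, 0.4, 0.7]
area_a = radar_polygon_area(model_a)
area_b = radar_polygon_area(model_b)
print(f"model_a area: {area_a:.3f}, model_b area: {area_b:.3f}")
```

Because the area is built from products of neighbouring scores, reordering the axes changes it, so polygon size is best read as a rough visual cue rather than a ranking.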
+
+ ---
+
+ ### Evaluation Tests Displayed:
+
+ The chart displays results from several standard evaluation tests when their metrics are present in the `evaluation_results.json` files. The script also automatically discovers and plots other custom numeric metrics found in the `"components"` section of those JSON files.
+
+ #### 1. Linear Probe
+
+ * **What it is:** This test evaluates the quality of the model's learned features (embeddings). A simple linear classifier is trained on top of these frozen features to perform a classification task.
+ * **Purpose:** It assesses how well the learned representations support downstream tasks with minimal additional training. Good performance indicates that the classes are close to linearly separable in the embedding space and that the embeddings capture meaningful information.
+ * **Common Metrics:** Accuracy, Precision, Recall, F1-Score (calculated for the linear classifier).
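
The linear-probe idea can be sketched in a few lines. This is only an illustration under stated assumptions: the embeddings are toy 2-D points invented for the example, and the linear classifier is a plain perceptron chosen for brevity (an actual probe would more likely use logistic regression). The names `train_linear_probe`, `probe_predict`, and `accuracy` are hypothetical.

```python
def train_linear_probe(features, labels, max_epochs=100):
    """Fit a binary perceptron on fixed (frozen) feature vectors."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(features, labels):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            error = y - (1 if score > 0 else 0)
            if error:
                # Only the probe's parameters change; the features never do.
                weights = [w + error * v for w, v in zip(weights, x)]
                bias += error
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def probe_predict(weights, bias, x):
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy "embeddings" for two classes (made up for this sketch).
train_x = [(0.0, 0.2), (0.3, 0.1), (0.1, 0.4), (2.9, 3.0), (3.1, 2.8), (3.0, 3.2)]
train_y = [0, 0, 0, 1, 1, 1]
test_x = [(0.2, 0.2), (3.0, 3.0)]
test_y = [0, 1]

weights, bias = train_linear_probe(train_x, train_y)
train_acc = accuracy(train_y, [probe_predict(weights, bias, x) for x in train_x])
test_acc = accuracy(test_y, [probe_predict(weights, bias, x) for x in test_x])
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```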
+
+ #### 2. K-Nearest Neighbors (KNN) Evaluation
+
+ * **What it is:** This test also evaluates the quality of the model's embeddings. Instead of training a new classifier, it uses the K-Nearest Neighbors algorithm directly on the embeddings to make predictions. For a given data point, its class is determined by the majority class among its `k` closest neighbors in the embedding space.
+ * **Purpose:** It assesses the local structure and similarity relationships within the embedding space. Good KNN performance suggests that similar items are close to each other in the learned representation.
+ * **Common Metrics:** Accuracy, Precision, Recall, F1-Score (calculated for the KNN classifier).
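
The majority-vote rule described above might look like the following sketch. It assumes Euclidean distance and toy 2-D embeddings; `knn_predict` and the sample data are hypothetical, and the evaluation code behind the chart may use a different distance or tie-breaking rule.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    neighbors = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    # On a tie, Counter keeps insertion order, so the closer neighbor wins.
    return votes.most_common(1)[0][0]


# Toy embeddings with string labels (made up for this sketch).
train_x = [(0.0, 0.2), (0.3, 0.1), (0.1, 0.4), (2.9, 3.0), (3.1, 2.8), (3.0, 3.2)]
train_y = ["a", "a", "a", "b", "b", "b"]

pred_near_a = knn_predict(train_x, train_y, (0.2, 0.3))
pred_near_b = knn_predict(train_x, train_y, (2.8, 3.1))
print(pred_near_a, pred_near_b)
```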
+
+ #### 3. Clustering
+
+ * **What it is:** This set of tests evaluates how well the model's embeddings naturally group similar items together without predefined labels (unsupervised). Algorithms such as K-Means are often used to partition the data points based on their embeddings.
+ * **Purpose:** It assesses the intrinsic structure and separability of the learned representations into meaningful groups.
+ * **Common Metrics:**
+     * **Silhouette Score:** Measures how similar an object is to its own cluster compared to other clusters. Ranges from -1 to 1 (higher is better).
+     * **Adjusted Mutual Information (AMI):** Measures the agreement between ground-truth labels (which it requires) and clustering assignments, adjusted for chance. The maximum is 1 for perfect agreement, values near 0 indicate chance-level agreement, and slightly negative values are possible (higher is better).
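
As a worked illustration of the Silhouette Score, the sketch below computes the mean silhouette coefficient from scratch with Euclidean distance. The function name matches the concept rather than any particular library, the sample points are invented, and singleton clusters are scored as 0 by the usual convention.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(point, others):
        return sum(math.dist(point, o) for o in others) / len(others)

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so divide by len - 1).
        a = sum(math.dist(point, o) for o in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            mean_dist(point, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


points = [(0, 0), (1, 0), (10, 0), (11, 0)]
good = silhouette_score(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters that mix distant points
print(f"good: {good:.3f}, bad: {bad:.3f}")
```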
+
+ #### 4. Robustness
+
+ * **What it is:** This is a general category of tests that measure how well a model maintains its performance when the input data is degraded or changed.
+ * **Purpose:** It assesses the model's stability and reliability under non-ideal conditions.
+ * **Examples of Challenges:** These can include noisy data, adversarial attacks (inputs intentionally designed to fool the model), out-of-distribution samples (data different from what the model was trained on), or other perturbations.
+ * **Common Metrics:** Often a "Robustness Score" is reported, which could be an accuracy, F1-score, or other relevant metric evaluated on the challenged dataset. The specific calculation depends on the nature of the robustness test. Higher is generally better.
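
The text above does not say how the reported Robustness Score is calculated, so the following is only one possible reading: accuracy measured on inputs perturbed with Gaussian noise. Everything here is an assumption made for illustration, including the names `add_gaussian_noise`, `robustness_score`, and `toy_predict`, the noise model, and the sample data.

```python
import random


def accuracy(predict, inputs, labels):
    return sum(predict(x) == y for x, y in zip(inputs, labels)) / len(labels)


def add_gaussian_noise(inputs, sigma, seed=0):
    """Return copies of the inputs with zero-mean Gaussian noise added."""
    rng = random.Random(seed)  # seeded so repeated runs give the same noise
    return [tuple(v + rng.gauss(0.0, sigma) for v in x) for x in inputs]


def robustness_score(predict, inputs, labels, sigma, seed=0):
    """Accuracy on noise-perturbed inputs (one possible definition)."""
    return accuracy(predict, add_gaussian_noise(inputs, sigma, seed), labels)


def toy_predict(x):
    # Hypothetical stand-in for a trained model: a fixed linear rule.
    return 1 if x[0] + x[1] > 3.0 else 0


inputs = [(0.0, 0.2), (0.3, 0.1), (0.1, 0.4), (2.9, 3.0), (3.1, 2.8), (3.0, 3.2)]
labels = [0, 0, 0, 1, 1, 1]

clean_acc = accuracy(toy_predict, inputs, labels)
noisy_acc = robustness_score(toy_predict, inputs, labels, sigma=5.0)
print(f"clean accuracy: {clean_acc:.2f}, accuracy under noise: {noisy_acc:.2f}")
```

Reporting the clean accuracy alongside the perturbed one makes the size of the drop visible, which is usually more informative than either number alone.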
+
+ ---
+
+ **Custom Metrics:**
+ The radar chart also displays any other numeric metrics found under the `"components"` key in your `evaluation_results.json` files, whether they sit at the top level of `"components"` or are nested one level deep inside a dictionary. The names shown on the chart are derived from the JSON keys (e.g., a key `inference_time_ms` would appear as `Inference time ms`).
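
A minimal sketch of this discovery step is shown below. It is not the script's actual code: the function names, the sample JSON, and in particular the way a nested metric's label is built by joining the parent and child keys are assumptions. Only the `inference_time_ms` to `Inference time ms` mapping comes from the description above.

```python
import json
from numbers import Real


def is_numeric(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def display_label(key):
    """Turn a JSON key such as 'inference_time_ms' into 'Inference time ms'."""
    return key.replace("_", " ").capitalize()


def collect_component_metrics(results):
    """Gather numeric metrics under "components", at most one level deep."""
    metrics = {}
    for key, value in results.get("components", {}).items():
        if is_numeric(value):
            metrics[display_label(key)] = float(value)
        elif isinstance(value, dict):
            for sub_key, sub_value in value.items():
                if is_numeric(sub_value):
                    # Assumed naming scheme: parent and child keys joined.
                    metrics[display_label(f"{key}_{sub_key}")] = float(sub_value)
    return metrics


# Hypothetical file contents, inlined so the sketch is self-contained.
example = json.loads(
    '{"components": {"inference_time_ms": 12.5,'
    ' "knn": {"accuracy": 0.81, "notes": "k=5"},'
    ' "converged": true}}'
)
metrics = collect_component_metrics(example)
print(metrics)
```

Non-numeric values such as strings and booleans are skipped, so only values that can be placed on a radar axis are collected.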
+
+
 
  ## Ethical Considerations