Fix: Clarify sample sizes in BFCL table, consistent irrelevance score notation

README.md

```diff
@@ -35,7 +35,7 @@ model-index:
       name: Multiple Sequential Calls
       verified: false
     - type: accuracy
-      value: 90
+      value: 90
       name: Irrelevance Detection
       verified: false
 pipeline_tag: text-generation
```

```diff
@@ -55,12 +55,12 @@ Part of the MIMI Model Family by [Mimi Tech AI](https://mimitechai.com).
 
 | Category | MIMI Pro V1 | Base Qwen3-4B | Notes |
 |---|---|---|---|
-| Simple Python | 60.8% (
-| Simple Java | 21.0% (
-| Multiple (Sequential) |
-| Parallel | 2.0% (
-| Irrelevance |
-| Live Simple | — | **90.0%** | Base only |
+| Simple Python | 60.8% (400 tests) | **80.0%** (20 tests) | Base outperforms |
+| Simple Java | 21.0% (100 tests) | **60.0%** (20 tests) | Base outperforms |
+| Multiple (Sequential) | 57.5% (200 tests) | **75.0%** (20 tests) | Base outperforms |
+| Parallel | 2.0% (200 tests) | **75.0%** (20 tests) | Fine-tune degraded |
+| Irrelevance | 90% (20 tests) | **100%** (20 tests) | Both strong |
+| Live Simple | — | **90.0%** (20 tests) | Base only |
```

> ⚠️ **Important Context:** The previously reported "97.7% accuracy" was a **training validation metric** (token-level accuracy on the training/eval split), not a standardized benchmark score. The table above shows actual BFCL V4 results. We are working on a full official evaluation.
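
The sample sizes added in this commit matter for interpreting the table: scores measured on only 20 tests carry wide statistical uncertainty. As a minimal illustration (plain Python, not part of this repository), a Wilson score interval shows how loose a 20-test estimate is compared with a 400-test one:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Base Qwen3-4B "Simple Python": 80.0% on 20 tests -> 16/20 correct
lo, hi = wilson_interval(16, 20)
print(f"80.0% on  20 tests: 95% CI ≈ [{lo:.1%}, {hi:.1%}]")  # roughly 58%–92%

# MIMI Pro V1 "Simple Python": 60.8% on 400 tests -> ~243/400 correct
lo, hi = wilson_interval(243, 400)
print(f"60.8% on 400 tests: 95% CI ≈ [{lo:.1%}, {hi:.1%}]")  # roughly 56%–65%
```

With only 20 samples, an observed 80.0% is consistent with a true accuracy anywhere from the high 50s to the low 90s, so the per-category comparisons above should be read as indicative rather than conclusive.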