Fix: Clarify sample sizes in BFCL table, consistent irrelevance score notation

README.md

```diff
@@ -35,7 +35,7 @@ model-index:
       name: Multiple Sequential Calls
       verified: false
     - type: accuracy
-      value: 90
+      value: 90
       name: Irrelevance Detection
       verified: false
 pipeline_tag: text-generation
```

```diff
@@ -55,12 +55,12 @@ Part of the MIMI Model Family by [Mimi Tech AI](https://mimitechai.com).
 
 | Category | MIMI Pro V1 | Base Qwen3-4B | Notes |
 |---|---|---|---|
-| Simple Python | 60.8% (
-| Simple Java | 21.0% (
-| Multiple (Sequential) |
-| Parallel | 2.0% (
-| Irrelevance |
-| Live Simple | — | **90.0%** | Base only |
+| Simple Python | 60.8% (400 tests) | **80.0%** (20 tests) | Base outperforms |
+| Simple Java | 21.0% (100 tests) | **60.0%** (20 tests) | Base outperforms |
+| Multiple (Sequential) | 57.5% (200 tests) | **75.0%** (20 tests) | Base outperforms |
+| Parallel | 2.0% (200 tests) | **75.0%** (20 tests) | Fine-tune degraded |
+| Irrelevance | 90% (20 tests) | **100%** (20 tests) | Both strong |
+| Live Simple | — | **90.0%** (20 tests) | Base only |
```

> ⚠️ **Important Context:** The previously reported "97.7% accuracy" was a **training validation metric** (token-level accuracy on the training/eval split), not a standardized benchmark score. The table above shows actual BFCL V4 results. We are working on a full official evaluation.
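
The sample sizes added in this commit matter for interpreting the table: scores measured on only 20 tests carry wide statistical uncertainty. As a minimal illustration (plain Python, not part of this repository), a Wilson score interval shows how loose a 20-test estimate is compared with a 400-test one:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Base Qwen3-4B "Simple Python": 80.0% on 20 tests -> 16/20 correct
lo, hi = wilson_interval(16, 20)
print(f"80.0% on  20 tests: 95% CI ≈ [{lo:.1%}, {hi:.1%}]")  # roughly 58%–92%

# MIMI Pro V1 "Simple Python": 60.8% on 400 tests -> ~243/400 correct
lo, hi = wilson_interval(243, 400)
print(f"60.8% on 400 tests: 95% CI ≈ [{lo:.1%}, {hi:.1%}]")  # roughly 56%–65%
```

With only 20 samples, an observed 80.0% is consistent with a true accuracy anywhere from the high 50s to the low 90s, so the per-category comparisons above should be read as indicative rather than conclusive.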