MimiTechAI committed on
Commit 207ef19 · verified · 1 Parent(s): 29c5291

Fix: Clarify sample sizes in BFCL table, consistent irrelevance score notation

Files changed (1)
  1. README.md +7 -7
README.md CHANGED
@@ -35,7 +35,7 @@ model-index:
       name: Multiple Sequential Calls
       verified: false
     - type: accuracy
-      value: 90.0
+      value: 90
       name: Irrelevance Detection
       verified: false
 pipeline_tag: text-generation
@@ -55,12 +55,12 @@ Part of the MIMI Model Family by [Mimi Tech AI](https://mimitechai.com).
 
 | Category | MIMI Pro V1 | Base Qwen3-4B | Notes |
 |---|---|---|---|
-| Simple Python | 60.8% (full 400) | **80.0%** | Base outperforms |
-| Simple Java | 21.0% (full 100) | **60.0%** | Base outperforms |
-| Multiple (Sequential) | **57.5%** (full 200) | 75.0% | Base outperforms |
-| Parallel | 2.0% (full 200) | **75.0%** | Fine-tune degraded |
-| Irrelevance | ~90% | **100%** | Both strong |
-| Live Simple | — | **90.0%** | Base only |
+| Simple Python | 60.8% (400 tests) | **80.0%** (20 tests) | Base outperforms |
+| Simple Java | 21.0% (100 tests) | **60.0%** (20 tests) | Base outperforms |
+| Multiple (Sequential) | 57.5% (200 tests) | **75.0%** (20 tests) | Base outperforms |
+| Parallel | 2.0% (200 tests) | **75.0%** (20 tests) | Fine-tune degraded |
+| Irrelevance | 90% (20 tests) | **100%** (20 tests) | Both strong |
+| Live Simple | — | **90.0%** (20 tests) | Base only |
 
 > ⚠️ **Important Context:** The previously reported "97.7% accuracy" was a **training validation metric** (token-level accuracy on the training/eval split), not a standardized benchmark score. The table above shows actual BFCL V4 results. We are working on a full official evaluation.
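
The sample sizes this commit adds matter for any aggregate score: a category measured on 20 tests carries far less evidence than one measured on 400. As a minimal sketch (using the MIMI Pro V1 numbers from the table above; the `weighted_accuracy` helper is hypothetical, not part of any BFCL tooling), a sample-size-weighted average could look like:

```python
def weighted_accuracy(results):
    """Aggregate per-category (accuracy, n_tests) pairs into one
    sample-size-weighted overall score."""
    total_correct = sum(acc * n for acc, n in results)
    total_tests = sum(n for _, n in results)
    return total_correct / total_tests

# MIMI Pro V1 rows from the table (Live Simple omitted: no score reported)
mimi_pro = [
    (0.608, 400),  # Simple Python
    (0.210, 100),  # Simple Java
    (0.575, 200),  # Multiple (Sequential)
    (0.020, 200),  # Parallel
    (0.900, 20),   # Irrelevance
]
print(f"{weighted_accuracy(mimi_pro):.1%}")  # → 43.6%
```

Because the base model's per-category figures rest on only 20 tests each, a weighted aggregate for it would be dominated by noise, which is exactly why reporting the counts alongside the percentages is useful.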