Update Model Card: Add BFCL V4 scores, transparent benchmarking, V2 roadmap



README.md +77 -69


@@ -20,75 +20,89 @@ model-index:
   - name: MIMI Pro
     results:
       - task:
-          type: text-generation
-          name: Tool/Function Calling
         metrics:
           - type: accuracy
-            value: 97.66
-            name: Token Accuracy
           - type: accuracy
-            value: 97.29
-            name: Eval Accuracy
-          - type: loss
-            value: 0.084
-            name: Training Loss
-library_name: transformers
 pipeline_tag: text-generation
 ---
 # MIMI Pro
-<p align="center">
-  <img src="https://img.shields.io/badge/MIMI-Pro-black?style=for-the-badge&labelColor=000000" alt="MIMI Pro"/>
-  <img src="https://img.shields.io/badge/Accuracy-97.7%25-brightgreen?style=for-the-badge" alt="Accuracy"/>
-  <img src="https://img.shields.io/badge/Size-2.3GB-orange?style=for-the-badge" alt="Size"/>
-  <img src="https://img.shields.io/badge/Runs_In-Browser-purple?style=for-the-badge" alt="Browser"/>
-  <img src="https://img.shields.io/badge/Cloud-Zero-red?style=for-the-badge" alt="Zero Cloud"/>
-</p>
-**MIMI Pro** is a 4-billion parameter AI agent model optimized for **structured tool calling and autonomous task execution** — designed to run entirely on-device, in the browser, with zero cloud dependencies.
-Part of the **MIMI Model Family** by [Mimi Tech AI](https://mimitechai.com).
-> 💡 MIMI Pro achieves **97.7% tool-calling accuracy** while running completely locally. Your data never leaves your device.
 ## Performance
 | Metric | Value |
-|--------|-------|
-| **Token Accuracy** | 97.66% |
-| **Eval Accuracy** | 97.29% |
-| **Training Loss** | 0.084 |
-| **Parameters** | 4.02 Billion |
-| **Quantized Size** | 2.3 GB (Q4_K_M) |
-| **Training Time** | 46 minutes |
-| **Training Hardware** | NVIDIA DGX Spark (Grace Blackwell) |
 ## Architecture
-MIMI Pro is built on the [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) architecture, fine-tuned with LoRA (rank=64, alpha=128) on 1,610 curated tool-calling examples using [Unsloth](https://github.com/unslothai/unsloth) on NVIDIA DGX Spark.
 **Key Design Decisions:**
-- **ChatML format** with `<think>` reasoning blocks for chain-of-thought
-- **19 tool types** covering web search, code execution, file operations, browser automation, and deep research
-- **Multi-step chains** — the model plans and executes sequences of tools autonomously
-- **Error recovery** — trained on failure cases to self-correct
 ## Supported Tools
 | Category | Tools |
-|----------|-------|
-| 🌐 **Web** | `web_search`, `browse_url`, `browser_action` |
-| 💻 **Code** | `execute_python`, `create_file`, `edit_file` |
-| 🔬 **Research** | `deep_research`, `generate_document` |
-| 📁 **System** | `read_file`, `list_directory`, `run_terminal` |
-| 🧠 **Reasoning** | Multi-step orchestration, error recovery |
 ## Quick Start
 ### Browser (wllama/WebAssembly)
-```typescript
 import { Wllama } from '@wllama/wllama';
 const wllama = new Wllama(wasmPaths);
@@ -124,34 +138,20 @@ output = llm.create_chat_completion(messages=[
 ## Output Format
-MIMI Pro generates structured tool calls:
-```xml
-<tool_call>
-{"name": "web_search", "arguments": {"query": "latest AI news March 2026", "num_results": 5}}
-</tool_call>
-```
-Multi-tool chains for complex tasks:
-```xml
-<tool_call>
-{"name": "web_search", "arguments": {"query": "NVIDIA DGX Spark specifications"}}
-</tool_call>
-<tool_call>
-{"name": "browse_url", "arguments": {"url": "https://nvidia.com/dgx-spark"}}
-</tool_call>
 ```
 ## The MIMI Model Family
 | Model | Parameters | Size | Target Device | Status |
-|-------|-----------|------|---------------|--------|
-| **MIMI Nano** | 0.6B | ~400 MB | Any device, IoT | 🔜 Coming |
-| **MIMI Small** | 1.7B | ~1.0 GB | Mobile & tablets | 🔜 Coming |
-| **MIMI Pro** | 4.02B | 2.3 GB | Desktop & laptop | ✅ **Available** |
-| **MIMI Max** | 8B | ~4.5 GB | Workstations | 🔜 Coming |
 All models share the same tool-calling format, are quantized to GGUF Q4_K_M, and run in the browser via WebAssembly.
@@ -178,19 +178,27 @@ hardware: NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified memory)
 ## Why MIMI?
-- **🔒 Privacy First** — Your data never leaves your device. Period.
-- **💰 Zero Cost** — No API keys, no subscriptions, no per-token billing.
-- **⚡ Fast** — Runs at native speed via WebAssembly, no server round-trips.
-- **🌍 Works Offline** — Once downloaded, no internet required.
-- **🔧 Tool Native** — Purpose-built for autonomous tool calling, not retrofitted.
 ## Limitations
-- Optimized for tool calling — for general chat, use the base model directly.
 - Context window: 4,096 tokens (training config). Base architecture supports 32K.
 - Requires ~3 GB RAM for inference in browser.
 - Q4_K_M quantization trades minimal quality for 3.5x size reduction.
 ## About Mimi Tech AI
 [Mimi Tech AI](https://mimitechai.com) builds on-device AI — no cloud, no data leaks, full user control.

   - name: MIMI Pro
     results:
       - task:
+          type: function-calling
+          name: Tool Calling
+        dataset:
+          type: gorilla-llm/Berkeley-Function-Calling-Leaderboard
+          name: BFCL V4
         metrics:
           - type: accuracy
+            value: 60.8
+            name: Simple Function Calling (Python)
+            verified: false
           - type: accuracy
+            value: 57.5
+            name: Multiple Sequential Calls
+            verified: false
+          - type: accuracy
+            value: 90.0
+            name: Irrelevance Detection
+            verified: false
 pipeline_tag: text-generation
 ---
 # MIMI Pro
+MIMI Pro is a 4-billion parameter AI agent model optimized for structured tool calling and autonomous task execution — designed to run entirely on-device, in the browser, with zero cloud dependencies.
+Part of the MIMI Model Family by [Mimi Tech AI](https://mimitechai.com).
+> **🔬 V1 — Experimental Release.** This model is fine-tuned for the MIMI Agent's custom tool-calling format. For standard tool calling, the base [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) may perform equally well or better with native `<tool_call>` prompting. V2 with official BFCL scores and Qwen3-native format support is in development.
 ## Performance
+### BFCL V4 Benchmark (Partial — Single-Turn, 20 samples/category)
+| Category | MIMI Pro V1 | Base Qwen3-4B | Notes |
+|---|---|---|---|
+| Simple Python | 60.8% (full 400) | **80.0%** | Base outperforms |
+| Simple Java | 21.0% (full 100) | **60.0%** | Base outperforms |
+| Multiple (Sequential) | 57.5% (full 200) | **75.0%** | Base outperforms |
+| Parallel | 2.0% (full 200) | **75.0%** | Fine-tune degraded |
+| Irrelevance | ~90% | **100%** | Both strong |
+| Live Simple | — | **90.0%** | Base only |
+> ⚠️ **Important Context:** The previously reported "97.7% accuracy" was a **training validation metric** (token-level accuracy on the training/eval split), not a standardized benchmark score. The table above shows actual BFCL V4 results. We are working on a full official evaluation.
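The gap to the base model can be read straight off the table. The sketch below is only a convenience for that arithmetic: the scores are copied from the table, the `gaps` helper is a hypothetical name, and the "~90%" Irrelevance entry is treated as 90.0. Because the heading describes a partial run (20 samples per category) while several MIMI Pro entries are marked as full-set results, the differences are indicative rather than a controlled comparison.

```python
# BFCL V4 scores copied from the table above, in percent:
# (MIMI Pro V1, base Qwen3-4B). "Live Simple" is omitted (base only).
scores = {
    "Simple Python": (60.8, 80.0),
    "Simple Java": (21.0, 60.0),
    "Multiple (Sequential)": (57.5, 75.0),
    "Parallel": (2.0, 75.0),
    "Irrelevance": (90.0, 100.0),  # the table reports "~90%" for MIMI Pro
}


def gaps(table):
    """Return the base-minus-fine-tune gap in percentage points per category."""
    return {name: round(base - tuned, 1) for name, (tuned, base) in table.items()}


category_gaps = gaps(scores)
worst = max(category_gaps, key=category_gaps.get)

# Print categories from largest to smallest regression.
for name, gap in sorted(category_gaps.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {gap:5.1f} pp behind base")
print("largest regression:", worst)
```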
+### Training Metrics (Internal)
 | Metric | Value |
+|---|---|
+| Training Token Accuracy | 97.66% |
+| Eval Token Accuracy | 97.29% |
+| Training Loss | 0.084 |
+| Parameters | 4.02 Billion |
+| Quantized Size | 2.3 GB (Q4_K_M) |
 ## Architecture
+MIMI Pro is built on [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B), fine-tuned with LoRA (rank=64, alpha=128) on 1,610 curated tool-calling examples using [Unsloth](https://github.com/unslothai/unsloth) on NVIDIA DGX Spark.
 **Key Design Decisions:**
+- Custom tool-calling format optimized for the MIMI Agent browser environment
+- 19 tool types covering web search, code execution, file operations, and browser automation
+- Trained on NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified memory)
+**Known Limitations of V1:**
+- Fine-tuning with aggressive hyperparameters (LoRA r=64, 3 epochs, LR 2e-4) caused some capability degradation vs. the base model, particularly for parallel tool calling
+- The custom `{"tool": ..., "parameters": ...}` format diverges from Qwen3's native `<tool_call>` format
+- V2 will address these issues with conservative fine-tuning and Qwen3-native format support
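The format divergence noted above is mechanical: V1 emits `{"tool": ..., "parameters": ...}`, while the `<tool_call>` form shown in the earlier revision of this card wraps `{"name": ..., "arguments": ...}` in tags. A minimal sketch of a converter follows, assuming that tag layout. `to_native_tool_call` is a hypothetical helper, not part of any MIMI or Qwen tooling, and it does not rename individual arguments (the earlier example used `num_results` where V1 uses `limit`).

```python
import json


def to_native_tool_call(custom: str) -> str:
    """Convert one V1-style call string into a <tool_call> block.

    Hypothetical helper: only the two top-level keys are renamed;
    argument names inside "parameters" are passed through unchanged.
    """
    call = json.loads(custom)
    native = {"name": call["tool"], "arguments": call.get("parameters", {})}
    return "<tool_call>\n" + json.dumps(native) + "\n</tool_call>"


v1_output = '{"tool": "web_search", "parameters": {"query": "NVIDIA DGX Spark specifications"}}'
converted = to_native_tool_call(v1_output)
print(converted)
```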
 ## Supported Tools
 | Category | Tools |
+|---|---|
+| 🌐 Web | web_search, browse_url, browser_action |
+| 💻 Code | execute_python, create_file, edit_file |
+| 🔬 Research | deep_research, generate_document |
+| 📁 System | read_file, list_directory, run_terminal |
+| 🧠 Reasoning | Multi-step orchestration |
 ## Quick Start
 ### Browser (wllama/WebAssembly)
+```javascript
 import { Wllama } from '@wllama/wllama';
 const wllama = new Wllama(wasmPaths);
 ## Output Format
+MIMI Pro V1 uses a custom format (V2 will support Qwen3-native `<tool_call>` format):
+```json
+{"tool": "web_search", "parameters": {"query": "latest AI news March 2026", "limit": 5}}
 ```
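On the consuming side, a host application would typically parse and validate this JSON before dispatching anything. The following is a minimal sketch under stated assumptions: the model returns exactly one JSON object per call, and the allow-list is taken from the Supported Tools table (the card mentions 19 tool types, so the set here is incomplete). `KNOWN_TOOLS` and `parse_tool_call` are illustrative names, not part of the MIMI Agent.

```python
import json

# Tool names copied from the "Supported Tools" table; the card mentions 19
# tool types in total, so this set is illustrative rather than exhaustive.
KNOWN_TOOLS = {
    "web_search", "browse_url", "browser_action",
    "execute_python", "create_file", "edit_file",
    "deep_research", "generate_document",
    "read_file", "list_directory", "run_terminal",
}


def parse_tool_call(raw: str) -> tuple[str, dict]:
    """Parse one V1-style call and return (tool_name, parameters).

    Raises ValueError for malformed JSON, unknown tools, or bad parameters.
    """
    try:
        call = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("expected a JSON object")
    tool = call.get("tool")
    params = call.get("parameters", {})
    if not isinstance(tool, str) or tool not in KNOWN_TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    if not isinstance(params, dict):
        raise ValueError("'parameters' must be an object")
    return tool, params


example = '{"tool": "web_search", "parameters": {"query": "latest AI news March 2026", "limit": 5}}'
tool, params = parse_tool_call(example)
print(tool, params)
```

Rejecting unknown tool names before dispatch keeps a malformed or hallucinated call from reaching the executor.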
 ## The MIMI Model Family
 | Model | Parameters | Size | Target Device | Status |
+|---|---|---|---|---|
+| MIMI Nano | 0.6B | ~400 MB | Any device, IoT | 🔜 Coming |
+| MIMI Small | 1.7B | ~1.0 GB | Mobile & tablets | 🔜 Coming |
+| **MIMI Pro** | **4.02B** | **2.3 GB** | **Desktop & laptop** | **✅ Available** |
+| MIMI Max | 8B | ~4.5 GB | Workstations | 🔜 Coming |
 All models share the same tool-calling format, are quantized to GGUF Q4_K_M, and run in the browser via WebAssembly.
 ## Why MIMI?
+- 🔒 **Privacy First** — Your data never leaves your device. Period.
+- 💰 **Zero Cost** — No API keys, no subscriptions, no per-token billing.
+- ⚡ **Fast** — Runs at native speed via WebAssembly, no server round-trips.
+- 🌍 **Works Offline** — Once downloaded, no internet required.
+- 🔧 **Tool Native** — Purpose-built for autonomous tool calling.
 ## Limitations
+- V1 uses a custom tool-calling format (not Qwen3-native `<tool_call>`)
+- Parallel tool calling (multiple simultaneous calls) is severely degraded vs. the base model (2.0% vs. 75.0% on BFCL V4 Parallel)
 - Context window: 4,096 tokens (training config). Base architecture supports 32K.
 - Requires ~3 GB RAM for inference in browser.
 - Q4_K_M quantization trades minimal quality for 3.5x size reduction.
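The 3.5x figure is consistent with the numbers reported elsewhere in this card if the unquantized baseline is taken to be 16-bit weights. The check below is a back-of-the-envelope sketch under that assumption, with 1 GB counted as 10^9 bytes; neither assumption is stated in the card.

```python
# Figures from the card.
params = 4.02e9          # parameter count
q4_size_gb = 2.3         # reported Q4_K_M file size

# Assumption: the baseline stores each weight in 16 bits (2 bytes).
fp16_size_gb = params * 2 / 1e9

reduction = fp16_size_gb / q4_size_gb
bits_per_weight = q4_size_gb * 1e9 * 8 / params

print(f"16-bit size:      {fp16_size_gb:.2f} GB")
print(f"size reduction:   {reduction:.2f}x")
print(f"effective bits/w: {bits_per_weight:.2f}")
```

The effective bits per weight comes out above 4 because the file also carries quantization scales and some higher-precision tensors.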
+## Roadmap
+- [x] **V1** — Custom format, 19 tools, browser-optimized (current release)
+- [ ] **V2** — Qwen3-native `<tool_call>` format, official BFCL V4 scores, conservative fine-tuning
+- [ ] **Model Family** — Nano (0.6B), Small (1.7B), Max (8B) releases
+- [ ] **Multi-Turn** — Agentic conversation chains with tool result feedback
 ## About Mimi Tech AI
 [Mimi Tech AI](https://mimitechai.com) builds on-device AI — no cloud, no data leaks, full user control.