Add HumanEval benchmark results (pass@1: 3.0%)
README.md CHANGED
@@ -27,7 +27,7 @@ model-index:
       type: openai_humaneval
     metrics:
     - type: pass@1
-      value:
+      value: 3.0
       name: pass@1
 ---
 
@@ -121,7 +121,19 @@ python -m mlx_lm.generate --model devstral-mlx --prompt "Write a function that..
 
 ## Evaluation
 
-
+| Benchmark | Metric | Score | Notes |
+|-----------|--------|-------|-------|
+| HumanEval | pass@1 | 3.0% (5/164) | Low score expected — model fine-tuned on conversational coding (multi-turn dialogs with tool use), not bare function completion |
+
+**Why the low HumanEval score?**
+
+This model was trained on real AI coding conversations with:
+- Multi-turn dialog context
+- Tool calls and results
+- Natural language explanations
+- User-assistant interaction patterns
+
+HumanEval tests **bare function completion** without dialog context, which is a different task. The model is optimized for conversational coding assistance, not standalone code generation.
 
 ## Limitations
 
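For context on how the pass@1 figure in this commit is computed: with a single completion per problem, pass@1 is simply the fraction of problems solved (5/164 ≈ 3.0%). A minimal sketch of the general unbiased pass@k estimator from the HumanEval paper follows; this is an illustration, not this repo's actual evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n-c, k) / C(n, k), i.e. the probability that at least one
    of k completions drawn from n sampled ones is correct, given that
    c of the n samples pass the tests."""
    if n - c < k:
        # Fewer incorrect samples than k: a correct one is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With one completion per problem, pass@1 reduces to the raw solve rate:
# 5 of 164 HumanEval problems solved.
print(round(100 * 5 / 164, 1))  # 3.0
```

In per-problem use, the estimator is evaluated for each task and averaged across the benchmark; with n = 1 it degenerates to the solve rate shown above.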