Add HumanEval benchmark results (pass@1: 3.0%)
README.md CHANGED
@@ -27,7 +27,7 @@ model-index:
       type: openai_humaneval
     metrics:
     - type: pass@1
-      value:
+      value: 3.0
       name: pass@1
 ---
 
@@ -121,7 +121,19 @@ python -m mlx_lm.generate --model devstral-mlx --prompt "Write a function that..
 
 ## Evaluation
 
-
+| Benchmark | Metric | Score | Notes |
+|-----------|--------|-------|-------|
+| HumanEval | pass@1 | 3.0% (5/164) | Low score expected — model fine-tuned on conversational coding (multi-turn dialogs with tool use), not bare function completion |
+
+**Why the low HumanEval score?**
+
+This model was trained on real AI coding conversations with:
+- Multi-turn dialog context
+- Tool calls and results
+- Natural language explanations
+- User-assistant interaction patterns
+
+HumanEval tests **bare function completion** without dialog context, which is a different task. The model is optimized for conversational coding assistance, not standalone code generation.
 
 ## Limitations
 
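For context on how the pass@1 figure in this commit is computed: with a single completion per problem, pass@1 is simply the fraction of problems solved (5/164 ≈ 3.0%). A minimal sketch of the general unbiased pass@k estimator from the HumanEval paper follows; this is an illustration, not this repo's actual evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n-c, k) / C(n, k), i.e. the probability that at least one
    of k completions drawn from n sampled ones is correct, given that
    c of the n samples pass the tests."""
    if n - c < k:
        # Fewer incorrect samples than k: a correct one is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With one completion per problem, pass@1 reduces to the raw solve rate:
# 5 of 164 HumanEval problems solved.
print(round(100 * 5 / 164, 1))  # 3.0
```

In per-problem use, the estimator is evaluated for each task and averaged across the benchmark; with n = 1 it degenerates to the solve rate shown above.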