wtfsayo committed · verified
Commit 96dc054 · 1 Parent(s): fe1c616

Add HumanEval benchmark results (pass@1: 3.0%)

Files changed (1):
1. README.md +14 -2
README.md CHANGED

@@ -27,7 +27,7 @@ model-index:
       type: openai_humaneval
     metrics:
     - type: pass@1
-     value: TBD
+     value: 3.0
      name: pass@1
 ---

@@ -121,7 +121,19 @@ python -m mlx_lm.generate --model devstral-mlx --prompt "Write a function that..

 ## Evaluation

-HumanEval pass@1 results will be added after benchmarking completes.
+| Benchmark | Metric | Score | Notes |
+|-----------|--------|-------|-------|
+| HumanEval | pass@1 | 3.0% (5/164) | Low score expected — model fine-tuned on conversational coding (multi-turn dialogs with tool use), not bare function completion |
+
+**Why the low HumanEval score?**
+
+This model was trained on real AI coding conversations with:
+- Multi-turn dialog context
+- Tool calls and results
+- Natural language explanations
+- User-assistant interaction patterns
+
+HumanEval tests **bare function completion** without dialog context, which is a different task. The model is optimized for conversational coding assistance, not standalone code generation.

 ## Limitations
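For reference, the "pass@1: 3.0% (5/164)" figure in the diff follows from the standard HumanEval pass@k estimator (1 - C(n-c, k)/C(n, k)); with a single sample per task it reduces to the fraction of the 164 tasks whose sample passes the unit tests. A minimal sketch (the `pass_at_k` helper name is illustrative, not part of this repo):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples drawn per task and c = samples that passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per task, pass@1 is just the per-task pass rate:
# 5 solved tasks and 159 unsolved ones out of 164.
results = [1] * 5 + [0] * 159
score = sum(pass_at_k(1, c, 1) for c in results) / len(results)
print(f"pass@1 = {score:.1%}")  # prints "pass@1 = 3.0%"
```

With n = 1 the combinatorial term degenerates to 0 or 1 per task, which is why the table can report the score directly as 5/164.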