legolasyiu committed on
Commit
d242dbb
·
verified ·
1 Parent(s): 15732dd

Update README.md

Files changed (1)
  1. README.md +18 -7
README.md CHANGED
@@ -60,13 +60,24 @@ Debugged vibecoder dataset
 
 ## Benchmark
 
- | Tasks | Version | Filter | n-shot | Metric | VibeCoder-20b-0.02-Debugger | gpt-oss-20 | Qwen 3 235B |
- |-----------------|---------|------------------|--------|--------------|-----------------------------|------------|-------------|
- | gsm8k_cot | 3 | flexible-extract | 3 | exact_match | 0.8452+(0.7667) | 0.78 | 0.82 |
- | humaneval | 1 | create_test | 0 | exact_match ↑ | 0.933+( 0.8) | 0.73 | 0.92 |
- | mmlu_college_biology| 1 | create_test | 0 | exact_match ↑ | 1.0 | | |
- | mmlu_high_school_computer_science| 1 | create_test | 0 | exact_match ↑ | 1.0+(0.9) | | |
- |computer_security| 1 | none | 2 | acc |0.8528+(0.700)| | |
+ ### 📊 Model Evaluation Results
+
+ | Tasks | Version | Filter | n-shot | Metric | VibeCoder-20b-0.02-Debugger | gpt-oss-20 | Qwen 3 235B |
+ |-----------------------------------|----------|------------------|--------|----------------|-----------------------------|-------------|--------------|
+ | gsm8k_cot | 3 | flexible-extract | 3 | exact_match ↑ | 0.8452 (+0.7667) | 0.78 | 0.82 |
+ | humaneval | 1 | create_test | 0 | exact_match ↑ | 0.933 (+0.8) | 0.73 | 0.92 |
+ | mmlu_college_biology | 1 | create_test | 0 | exact_match | 1.000 (+) | ||
+ | mmlu_high_school_computer_science | 1 | create_test | 0 | exact_match ↑ | 1.000 (+0.9) | — | — |
+ | computer_security | 1 | none | 2 | acc ↑ | 0.8528 (+0.700) | — | — |
+ | college_computer_science | 1 | none | 2 | acc ↑ | 0.8528 (+0.700) | — | — |
+
+ ---
+
+ **Notes:**
+ - The `(+value)` indicates delta over baseline evaluation.
+ - Metrics marked with `↑` denote that higher is better.
+ - Dashes (`—`) indicate results not yet reported or evaluated.
+
 
 ## Example Usage
 