Upload 2 files

Files changed (3)

.gitattributes CHANGED

@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+benchmark_comparison.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -72,6 +72,22 @@ Reasoning samples are wrapped with `<think>…</think>` tags and upsampled 10×
 Results from [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness):
+### Comparison with Peer Models
+![Benchmark Comparison](benchmark_comparison.png)
+> `< 10%` entries are displayed as `<10%` in the chart.
+| Benchmark | Arcade-3B | Gemma-2-2B | Llama-2-7B | Qwen1.5-1.8B | OpenLLaMA-v2-3B |
+|-----------|-----------|------------|------------|--------------|-----------------|
+| MMLU | **52.9%** | 52.4% | 45.3% | 46.8% | 41.0% |
+| GSM8K | **62.9%** | 50.9% | 14.6% | 37.8% | < 10% |
+| HumanEval | **41.5%** | 32.3% | 12.8% | 27.4% | < 10% |
+| ARC-Challenge | 52.6% | **53.1%** | 46.2% | 41.2% | 34.2% |
+| ARC-Easy | 74.4% | **75.9%** | 75.3% | 66.8% | 68.1% |
+### Arcade-3B Detailed Scores
 | Benchmark | Few-shot | Metric | Score | ± |
 |-----------|----------|--------|-------|---|
 | GSM8K | 5 | flexible-extract / exact_match | **0.6293** | 0.0133 |
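
Reviewer note: the `±` column in the detailed table can be sanity-checked against the reported GSM8K score. The sketch below assumes the standard error is the usual binomial standard error of a mean accuracy, and that the score was computed on the 1,319-problem GSM8K test split. Neither assumption is stated in the README, and `binomial_stderr` is a hypothetical helper written for this check, not part of lm-evaluation-harness.

```python
import math


def binomial_stderr(accuracy: float, n_samples: int) -> float:
    """Standard error of a mean accuracy over n independent pass/fail items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# Assumption: the score was measured on the GSM8K test split (1,319 problems).
GSM8K_TEST_SIZE = 1319

# Score taken from the "Arcade-3B Detailed Scores" table in the README diff.
gsm8k_score = 0.6293

gsm8k_stderr = binomial_stderr(gsm8k_score, GSM8K_TEST_SIZE)
print(f"{gsm8k_stderr:.4f}")  # -> 0.0133
```

Under those assumptions the result agrees with the `0.0133` shown in the table, and the 62.9% GSM8K entry in the comparison table is the same score rounded to one decimal place.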

benchmark_comparison.png ADDED