Upload 2 files
- .gitattributes +1 -0
- README.md +16 -0
- benchmark_comparison.png +3 -0
.gitattributes
CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+benchmark_comparison.png filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -72,6 +72,22 @@ Reasoning samples are wrapped with `<think>…</think>` tags and upsampled 10×
 
 Results from [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness):
 
+### Comparison with Peer Models
+
+
+
+> `< 10%` entries are displayed as `<10%` in the chart.
+
+| Benchmark | Arcade-3B | Gemma-2-2B | Llama-2-7B | Qwen1.5-1.8B | OpenLLaMA-v2-3B |
+|-----------|-----------|------------|------------|--------------|-----------------|
+| MMLU | **52.9%** | 52.4% | 45.3% | 46.8% | 41.0% |
+| GSM8K | **62.9%** | 50.9% | 14.6% | 37.8% | < 10% |
+| HumanEval | **41.5%** | 32.3% | 12.8% | 27.4% | < 10% |
+| ARC-Challenge | 52.6% | **53.1%** | 46.2% | 41.2% | 34.2% |
+| ARC-Easy | 74.4% | **75.9%** | 75.3% | 66.8% | 68.1% |
+
+### Arcade-3B Detailed Scores
+
 | Benchmark | Few-shot | Metric | Score | ± |
 |-----------|----------|--------|-------|---|
 | GSM8K | 5 | flexible-extract / exact_match | **0.6293** | 0.0133 |
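Note on the two GSM8K figures: the detailed table reports the score as a fraction (0.6293) while the comparison table reports it as a percentage (62.9%). The sketch below shows how the two relate. It assumes the `±` column is a standard error, which the diff does not state, and uses a normal approximation for a rough 95% interval. The function name `to_percent_with_ci` is hypothetical and is not part of lm-evaluation-harness.

```python
def to_percent_with_ci(score: float, stderr: float, z: float = 1.96):
    """Convert a fractional score to percent with a normal-approximation interval.

    Assumption: `stderr` is the standard error of `score` (the README's "±"
    column is treated as such here; the diff itself does not say so).
    """
    low = (score - z * stderr) * 100
    high = (score + z * stderr) * 100
    # Round to one decimal place, matching the comparison table's precision.
    return round(score * 100, 1), round(low, 1), round(high, 1)


# Values taken from the GSM8K row of the detailed scores table.
gsm8k_pct, gsm8k_low, gsm8k_high = to_percent_with_ci(0.6293, 0.0133)
print(f"GSM8K: {gsm8k_pct}% (approx. 95% interval {gsm8k_low}% to {gsm8k_high}%)")
```

Under these assumptions the rounded value agrees with the 62.9% shown for Arcade-3B in the comparison table.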
benchmark_comparison.png
ADDED