Eclipse-Senpai committed on
Commit 7f614be · verified · 1 Parent(s): 6ec88ae

Replace variants table with SmolLM2-style base-vs-instruct benchmark table

Files changed (1)
  1. README.md +12 -15
README.md CHANGED
@@ -58,13 +58,6 @@ KeyLM is a compact decoder-only transformer built on the standard small-model re
  | Precision | bfloat16 |
  | Training tokens | ~18B |

- ### Model variants
-
- | Variant | Type | Chat template | IFEval (4-metric avg) | Use for |
- |---|---|---|---|---|
- | [KeyLM-75M](https://huggingface.co/Eclipse-Senpai/KeyLM-75M) | Base (pretrained) | No | — | Fine-tuning, text completion |
- | [KeyLM-75M-Instruct](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct) | Instruction-tuned | Yes | 17.85 | Chat, instruction following |
-
  GGUF builds for `llama.cpp`, LM Studio, and Ollama are available at [KeyLM-75M-Instruct-GGUF](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct-GGUF).

  ## How to Use
@@ -108,17 +101,21 @@ This is where KeyLM is competitive. All rows are evaluated with `lm_eval` (`ifev

  KeyLM beats the original SmolLM-135M-Instruct at roughly half the size and a fraction of the training data. SmolLM2-135M-Instruct, a far more heavily trained model, remains ahead.

- ### Knowledge and reasoning
+ ### Base vs Instruct

- On zero-shot multiple-choice benchmarks (`lm_eval`; accuracy, with length-normalized accuracy for ARC and HellaSwag) KeyLM is modest but above random on basic commonsense, and at chance on knowledge-heavy tasks. This is expected at 75M parameters and 18B tokens.
+ The base and instruction-tuned checkpoints across all benchmarks. Commonsense and knowledge tasks are zero-shot via `lm_eval` (accuracy; ARC and HellaSwag length-normalized); IFEval is the 4-metric average. Bold marks the stronger version per row.

- | Model | MMLU | ARC (avg) | HellaSwag | PIQA | WinoGrande | OpenBookQA |
- |---|---|---|---|---|---|---|
- | KeyLM-75M (base) | 23.0 | 29.9 | 29.7 | 60.0 | 48.4 | 25.0 |
- | **KeyLM-75M-Instruct** | **24.0** | **30.8** | **31.0** | **61.3** | **48.3** | **25.0** |
- | Random baseline | 25.0 | 25.0 | 25.0 | 50.0 | 50.0 | 25.0 |
+ | Benchmark | KeyLM-75M (base) | KeyLM-75M-Instruct | Random |
+ |---|---|---|---|
+ | IFEval (4-metric avg) | | **17.85** | |
+ | MMLU | 23.0 | **24.0** | 25.0 |
+ | ARC (avg) | 29.9 | **30.8** | 25.0 |
+ | HellaSwag | 29.7 | **31.0** | 25.0 |
+ | PIQA | 60.0 | **61.3** | 50.0 |
+ | WinoGrande | **48.4** | 48.3 | 50.0 |
+ | OpenBookQA | 25.0 | 25.0 | 25.0 |

- Base and instruct track each other closely, so instruction tuning leaves knowledge and reasoning roughly unchanged. PIQA and ARC-easy land clearly above chance, while MMLU sits at the random baseline.
+ Instruction tuning leaves knowledge and reasoning roughly unchanged; its real effect is the instruction-following ability IFEval captures. Both versions sit modestly above random on basic commonsense and at chance on MMLU.

  ## Training
