---
license: mit
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- benchmark
- reasoning
- multi-step
- evaluation
- llm-evaluation
- goodhart
- execution-vs-understanding
- consensus
- multi-model
size_categories:
- 1K<n<10K
---
# Goodhart Gap Benchmark
**Detecting the gap between understanding and execution in language models**
## Overview
The Goodhart Gap Benchmark tests whether language models can correctly *execute* multi-step reasoning tasks that they can correctly *explain*. Named after Goodhart's Law ("When a measure becomes a target, it ceases to be a good measure"), this benchmark reveals a critical failure mode: models that understand procedures but fail to execute them.
## Data Sources
This benchmark combines two data sources:
### 1. CGRT Consensus Dataset (Primary)
**Source**: [Adam1010/cgrt-consensus-5model](https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model)
| Metric | Value |
|--------|-------|
| Total problems | 61,678 |
| Models queried | 5 (Claude, GPT-4, Gemini, DeepSeek, Qwen) |
| API cost | ~$1,000 |
| Disagreement cases | 8,050 |
| Contested (strongest) | 1,556 |
Each problem includes full reasoning traces from 5 frontier models, enabling analysis of where execution diverges despite similar understanding.
### 2. Programmatic Multi-Domain Problems
| Metric | Value |
|--------|-------|
| Total problems | 101 |
| Domains | 12 |
| Cost | $0 (generated) |
## Dataset Files
| File | Description | Count |
|------|-------------|-------|
| `combined_test.jsonl` | Main evaluation set (contested + programmatic) | 1,657 |
| `goodhart_disagreements.jsonl` | All disagreement cases from consensus | 8,050 |
| `goodhart_contested.jsonl` | Strongest Goodhart Gap cases | 1,556 |
| `test.jsonl` | Programmatic problems only | 101 |
## Key Finding
**Models that consistently show chain-of-thought reasoning execute correctly; models that jump straight to an answer fail.**
| Model | Financial Domain Pass Rate | Behavior |
|-------|----------------------------|----------|
| Claude 3.5 Haiku | 100% | Always shows work |
| Claude Sonnet 4 | 30% | Sometimes skips work |
| gpt-4o | 30% | Sometimes skips work |
| gpt-4o-mini | 0% | Usually skips work |
The financial domain (compound interest + tax) is the strongest Goodhart Gap detector.
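To make the failure mode concrete, here is a hedged sketch of a generator for financial problems in this style (hypothetical code, not the benchmark's actual generator; field names follow the programmatic data format shown below):

```python
import random

def make_financial_problem(seed: int = 0) -> dict:
    """Sketch of a compound-interest-plus-tax problem (hypothetical generator)."""
    rng = random.Random(seed)
    principal = rng.choice([1000, 2000, 5000])
    rate = rng.choice([0.05, 0.10])   # annual compound interest rate
    years = rng.choice([2, 3])
    tax = 0.20                        # flat tax on the interest earned

    gross = principal * (1 + rate) ** years
    net = principal + (gross - principal) * (1 - tax)

    return {
        "id": f"financial_{seed:02d}",
        "domain": "financial",
        "problem": (
            f"You invest ${principal} at {rate:.0%} annual compound interest "
            f"for {years} years, then pay {tax:.0%} tax on the interest earned. "
            "How much money do you have after tax?"
        ),
        "correct_answer": f"{net:.2f}",
        "source": "programmatic",
        "difficulty": "medium",
        "steps": 3,
    }

print(make_financial_problem(1)["problem"])
```

Each problem chains three dependent steps (compound growth, interest extraction, tax), so skipping intermediate work tends to produce a plausible but wrong number.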
## Data Format
### Consensus-derived examples
```json
{
"id": "consensus_12345",
"domain": "math_consensus",
"problem": "A store sells apples for $2 each...",
"correct_answer": "15",
"source": "cgrt-consensus-5model",
"consensus_tier": "contested",
"model_responses": {
"claude": {"answer": "15", "response": "Step 1..."},
"codex": {"answer": "14", "response": "First..."},
"gemini": {"answer": "15", "response": "Let me..."},
"deepseek": {"answer": "16", "response": "..."},
"qwen": {"answer": "15", "response": "..."}
},
"difficulty": "hard"
}
```
### Programmatic examples
```json
{
"id": "math_discount_01",
"domain": "math_discount",
"problem": "A product costs $25 and is on 20% sale...",
"correct_answer": "15",
"explanation": "25 × 0.8 = 20.0, then 20.0 - 5 = 15.0",
"source": "programmatic",
"difficulty": "easy",
"steps": 2
}
```
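Because `correct_answer` is stored as a string, a scorer has to normalize numeric formats before comparing. A minimal checker sketch (the tolerance and normalization rules here are assumptions, not the benchmark's official scoring):

```python
def is_correct(model_answer: str, correct_answer: str, tol: float = 1e-6) -> bool:
    """Compare answers numerically when both parse as numbers, else as text."""
    a = model_answer.strip().replace(",", "").lstrip("$")
    b = correct_answer.strip().replace(",", "").lstrip("$")
    try:
        return abs(float(a) - float(b)) <= tol   # numerical domains
    except ValueError:
        return a.lower() == b.lower()            # non-numerical domains

assert is_correct("$15.00", "15")
```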
## Consensus Tiers
| Tier | Description | Count |
|------|-------------|-------|
| **Gold** | All 5 models agree | 51,174 |
| **Silver** | 4/5 models agree | 5,766 |
| **Bronze** | 3/5 models agree | 3,182 |
| **Contested** | No majority (strongest Goodhart Gap) | 1,556 |
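A tier label can be recomputed directly from the per-model answers in `model_responses`; a sketch, assuming the answer strings are already normalized:

```python
from collections import Counter

def consensus_tier(model_responses: dict) -> str:
    """Map 5-model answer agreement onto the tier labels above."""
    counts = Counter(r["answer"] for r in model_responses.values())
    largest = counts.most_common(1)[0][1]   # size of the biggest agreeing group
    return {5: "gold", 4: "silver", 3: "bronze"}.get(largest, "contested")

# The consensus-derived example above (3/5 models answered "15") is bronze:
example = {
    "claude": {"answer": "15"}, "codex": {"answer": "14"},
    "gemini": {"answer": "15"}, "deepseek": {"answer": "16"},
    "qwen": {"answer": "15"},
}
print(consensus_tier(example))  # -> bronze
```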
## Domains
### From Consensus Data
- Math word problems (GSM8K-style)
- Multi-step arithmetic
- Rate/ratio problems
### Programmatic Domains
| Domain | Count | Type |
|--------|-------|------|
| math_discount | 15 | Numerical |
| time | 13 | Numerical |
| financial | 10 | Numerical |
| logic | 8 | Numerical |
| recipe | 7 | Numerical |
| scheduling | 7 | Numerical |
| units | 7 | Numerical |
| spatial | 7 | Non-numerical |
| procedural | 6 | Non-numerical |
| text | 7 | Non-numerical |
| sequence | 7 | Non-numerical |
| causal | 7 | Non-numerical |
## Usage
### Quick Evaluation
```bash
# Evaluate on combined test set
python evaluate.py --provider anthropic --model claude-3-5-haiku-latest --dataset combined_test.jsonl
# Evaluate on contested only (hardest)
python evaluate.py --provider openai --model gpt-4o --dataset goodhart_contested.jsonl
```
### Python API
```python
import json
# Load combined test set
with open('data/combined_test.jsonl') as f:
problems = [json.loads(line) for line in f]
# Analyze consensus examples with model responses
for p in problems:
if p.get('source') == 'cgrt-consensus-5model':
# Has full model reasoning traces
for model, data in p['model_responses'].items():
print(f"{model}: {data['answer']}")
```
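Building on that loop, a sketch that scores each of the five models against `correct_answer` (exact string matching here is a simplification; see the normalization caveat above):

```python
from collections import defaultdict
import json

correct, total = defaultdict(int), defaultdict(int)
with open('data/combined_test.jsonl') as f:
    for line in f:
        p = json.loads(line)
        if p.get('source') != 'cgrt-consensus-5model':
            continue
        for model, data in p['model_responses'].items():
            total[model] += 1
            correct[model] += data['answer'] == p['correct_answer']

for model in sorted(total):
    print(f"{model}: {correct[model] / total[model]:.1%}")
```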
### With HuggingFace Datasets
```python
from datasets import load_dataset
dataset = load_dataset("Adam1010/goodhart-gap-benchmark")
```
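From there, the contested subset (the strongest Goodhart Gap cases) can be selected with the standard `datasets` filter API, assuming the default split is named `train`:

```python
contested = dataset["train"].filter(
    lambda ex: ex.get("consensus_tier") == "contested"
)
print(len(contested))
```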
## Leaderboard
| Model | Provider | Pass Rate | Notes |
|-------|----------|-----------|-------|
| Claude 3.5 Haiku | Anthropic | 93% | Shows work consistently |
| Claude Sonnet 4 | Anthropic | 79% | |
| gpt-4o | OpenAI | 57% | |
| gpt-4o-mini | OpenAI | 36% | |
## Why This Matters
### For AI Safety
- Models that explain a procedure correctly but execute it incorrectly are harder to detect
- Gap between capability benchmarks and deployment readiness
- Critical for agentic AI systems
### For Training
- Disagreement cases reveal where models need improvement
- Chain-of-thought consistency matters more than raw capability
- Smaller models (Haiku) can outperform larger ones through reliable execution
## Citation
```bibtex
@dataset{goodhart_gap_benchmark_2026,
title={Goodhart Gap Benchmark: Detecting the Gap Between Understanding and Execution in LLMs},
author={Adam Kruger},
year={2026},
url={https://huggingface.co/datasets/Adam1010/goodhart-gap-benchmark},
note={Built on cgrt-consensus-5model dataset}
}
```
## Related Datasets
- [Adam1010/cgrt-consensus-5model](https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model) - Source consensus data
## License
MIT License - free for research and commercial use.
## Acknowledgments
- CGRT (Consensus-Guided Recursive Training) research
- 5-model consensus data collection (~$1,000 in API calls)
- Goodhart's Law and its application to AI evaluation