---
license: mit
task_categories:
  - question-answering
  - text-generation
language:
  - en
tags:
  - benchmark
  - reasoning
  - multi-step
  - evaluation
  - llm-evaluation
  - goodhart
  - execution-vs-understanding
  - consensus
  - multi-model
size_categories:
  - 1K<n<10K
---

# Goodhart Gap Benchmark

**Detecting the gap between understanding and execution in language models**

## Overview

The Goodhart Gap Benchmark tests whether language models can correctly *execute* multi-step reasoning tasks that they can correctly *explain*. Named after Goodhart's Law ("When a measure becomes a target, it ceases to be a good measure"), the benchmark targets a critical failure mode: a model that explains a procedure correctly yet fails to execute it.

## Data Sources

This benchmark combines two data sources:

### 1. CGRT Consensus Dataset (Primary)
**Source**: [Adam1010/cgrt-consensus-5model](https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model)

| Metric | Value |
|--------|-------|
| Total problems | 61,678 |
| Models queried | 5 (Claude, GPT-4, Gemini, DeepSeek, Qwen) |
| API cost | ~$1,000 |
| Disagreement cases | 8,050 |
| Contested (strongest) | 1,556 |

Each problem includes full reasoning traces from 5 frontier models, enabling analysis of where execution diverges despite similar understanding.

### 2. Programmatic Multi-Domain Problems
| Metric | Value |
|--------|-------|
| Total problems | 101 |
| Domains | 12 |
| Cost | $0 (generated) |

## Dataset Files

| File | Description | Count |
|------|-------------|-------|
| `combined_test.jsonl` | Main evaluation set (contested + programmatic) | 1,657 |
| `goodhart_disagreements.jsonl` | All disagreement cases from consensus | 8,050 |
| `goodhart_contested.jsonl` | Strongest Goodhart Gap cases | 1,556 |
| `test.jsonl` | Programmatic problems only | 101 |

## Key Finding

**Models that consistently show chain-of-thought execute correctly; models that give quick answers fail.**

| Model | Financial Pass Rate | Behavior |
|-------|---------------------|----------|
| Claude 3.5 Haiku | 100% | Always shows work |
| Claude Sonnet 4 | 30% | Sometimes skips work |
| gpt-4o | 30% | Sometimes skips work |
| gpt-4o-mini | 0% | Usually skips work |

The financial domain (compound interest + tax) is the strongest Goodhart Gap detector.

## Data Format

### Consensus-derived examples
```json
{
  "id": "consensus_12345",
  "domain": "math_consensus",
  "problem": "A store sells apples for $2 each...",
  "correct_answer": "15",
  "source": "cgrt-consensus-5model",
  "consensus_tier": "contested",
  "model_responses": {
    "claude": {"answer": "15", "response": "Step 1..."},
    "codex": {"answer": "14", "response": "First..."},
    "gemini": {"answer": "15", "response": "Let me..."},
    "deepseek": {"answer": "16", "response": "..."},
    "qwen": {"answer": "15", "response": "..."}
  },
  "difficulty": "hard"
}
```

### Programmatic examples
```json
{
  "id": "math_discount_01",
  "domain": "math_discount",
  "problem": "A product costs $25 and is on 20% sale...",
  "correct_answer": "15",
  "explanation": "25 × 0.8 = 20.0, then 20.0 - 5 = 15.0",
  "source": "programmatic",
  "difficulty": "easy",
  "steps": 2
}
```
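Because the `explanation` field spells out each intermediate step, a programmatic record can be sanity-checked in a few lines. A sketch for the example above (the full problem text is elided, so the `- 5` step is taken from the explanation field, not the prompt):

```python
# Reproduce the two steps from the explanation field:
# 25 × 0.8 = 20.0, then 20.0 - 5 = 15.0
after_discount = 25 * 0.8      # step 1: apply the 20% sale
result = after_discount - 5    # step 2: subtract 5, per the explanation
assert f"{result:g}" == "15"   # matches the correct_answer field
```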

## Consensus Tiers

| Tier | Description | Count |
|------|-------------|-------|
| **Gold** | All 5 models agree | 51,174 |
| **Silver** | 4/5 models agree | 5,766 |
| **Bronze** | 3/5 models agree | 3,182 |
| **Contested** | No majority (strongest Goodhart Gap) | 1,556 |
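Under the definitions above, a record's tier follows from the size of the largest group of agreeing answers. A hedged sketch (`consensus_tier` is a hypothetical helper, not part of the dataset tooling; it assumes answers are compared as exact strings):

```python
from collections import Counter

def consensus_tier(answers):
    """Map five model answers to a tier per the table above:
    5 agreeing -> gold, 4 -> silver, 3 -> bronze, else contested."""
    largest = Counter(answers).most_common(1)[0][1]  # size of the biggest agreeing group
    return {5: "gold", 4: "silver", 3: "bronze"}.get(largest, "contested")

consensus_tier(["15", "15", "15", "14", "16"])  # "bronze": 3 of 5 agree
```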

## Domains

### From Consensus Data
- Math word problems (GSM8K-style)
- Multi-step arithmetic
- Rate/ratio problems

### Programmatic Domains
| Domain | Count | Type |
|--------|-------|------|
| math_discount | 15 | Numerical |
| time | 13 | Numerical |
| financial | 10 | Numerical |
| logic | 8 | Numerical |
| recipe | 7 | Numerical |
| scheduling | 7 | Numerical |
| units | 7 | Numerical |
| spatial | 7 | Non-numerical |
| procedural | 6 | Non-numerical |
| text | 7 | Non-numerical |
| sequence | 7 | Non-numerical |
| causal | 7 | Non-numerical |

## Usage

### Quick Evaluation
```bash
# Evaluate on combined test set
python evaluate.py --provider anthropic --model claude-3-5-haiku-latest --dataset combined_test.jsonl

# Evaluate on contested only (hardest)
python evaluate.py --provider openai --model gpt-4o --dataset goodhart_contested.jsonl
```

### Python API
```python
import json

# Load combined test set
with open('data/combined_test.jsonl') as f:
    problems = [json.loads(line) for line in f]

# Analyze consensus examples with model responses
for p in problems:
    if p.get('source') == 'cgrt-consensus-5model':
        # Has full model reasoning traces
        for model, data in p['model_responses'].items():
            print(f"{model}: {data['answer']}")
```

### With HuggingFace Datasets
```python
from datasets import load_dataset

dataset = load_dataset("Adam1010/goodhart-gap-benchmark")
```

## Leaderboard

| Model | Provider | Pass Rate | Notes |
|-------|----------|-----------|-------|
| Claude 3.5 Haiku | Anthropic | 93% | Shows work consistently |
| Claude Sonnet 4 | Anthropic | 79% | |
| gpt-4o | OpenAI | 57% | |
| gpt-4o-mini | OpenAI | 36% | |

## Why This Matters

### For AI Safety
- Models explaining correctly but executing incorrectly are harder to detect
- Gap between capability benchmarks and deployment readiness
- Critical for agentic AI systems

### For Training
- Disagreement cases reveal where models need improvement
- Chain-of-thought consistency matters more than raw capability
- Smaller models (Haiku) can outperform larger ones through reliable execution

## Citation

```bibtex
@dataset{goodhart_gap_benchmark_2026,
  title={Goodhart Gap Benchmark: Detecting the Gap Between Understanding and Execution in LLMs},
  author={Adam Kruger},
  year={2026},
  url={https://huggingface.co/datasets/Adam1010/goodhart-gap-benchmark},
  note={Built on cgrt-consensus-5model dataset}
}
```

## Related Datasets

- [Adam1010/cgrt-consensus-5model](https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model) - Source consensus data

## License

MIT License - free for research and commercial use.

## Acknowledgments

- CGRT (Consensus-Guided Recursive Training) research
- 5-model consensus data collection (~$1,000 in API calls)
- Goodhart's Law and its application to AI evaluation