# Goodhart Gap Benchmark

Detecting the gap between understanding and execution in language models.

## Overview

The Goodhart Gap Benchmark tests whether language models can correctly execute multi-step reasoning tasks that they can correctly explain. Named after Goodhart's Law ("when a measure becomes a target, it ceases to be a good measure"), the benchmark probes a critical failure mode: models that can explain a procedure but fail to execute it.
## Data Sources

This benchmark combines two data sources:

### 1. CGRT Consensus Dataset (Primary)

Source: `Adam1010/cgrt-consensus-5model`
| Metric | Value |
|---|---|
| Total problems | 61,678 |
| Models queried | 5 (Claude, GPT-4, Gemini, DeepSeek, Qwen) |
| API cost | ~$1,000 |
| Disagreement cases | 8,050 |
| Contested (strongest) | 1,556 |
Each problem includes full reasoning traces from 5 frontier models, enabling analysis of where execution diverges despite similar understanding.
### 2. Programmatic Multi-Domain Problems
| Metric | Value |
|---|---|
| Total problems | 101 |
| Domains | 12 |
| Cost | $0 (generated) |
## Dataset Files

| File | Description | Count |
|---|---|---|
| `combined_test.jsonl` | Main evaluation set (contested + programmatic) | 1,657 |
| `goodhart_disagreements.jsonl` | All disagreement cases from consensus | 8,050 |
| `goodhart_contested.jsonl` | Strongest Goodhart Gap cases | 1,556 |
| `test.jsonl` | Programmatic problems only | 101 |
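The counts line up: `combined_test.jsonl` is the contested consensus cases plus the programmatic problems (1,556 + 101 = 1,657). A minimal sketch of rebuilding it, assuming the files sit under a local `data/` directory:

```python
import json

def read_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

# Concatenate the two source files into the combined evaluation set.
contested = read_jsonl("data/goodhart_contested.jsonl")
programmatic = read_jsonl("data/test.jsonl")
combined = contested + programmatic
assert len(combined) == 1_556 + 101  # 1,657 problems
```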
## Key Finding

Models that consistently show chain-of-thought reasoning execute correctly; models that jump straight to an answer fail.

| Model | Financial Domain Pass Rate | Behavior |
|---|---|---|
| Claude 3.5 Haiku | 100% | Always shows work |
| Claude Sonnet 4 | 30% | Sometimes skips work |
| gpt-4o | 30% | Sometimes skips work |
| gpt-4o-mini | 0% | Usually skips work |
The financial domain (compound interest + tax) is the strongest Goodhart Gap detector.
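For illustration, here is a hypothetical generator in the style of that financial domain (compound interest followed by tax). The benchmark's actual templates are not reproduced here, so the wording, parameter ranges, and function name are assumptions:

```python
import random

def make_financial_problem(seed):
    """Sketch of a compound-interest-plus-tax problem generator (illustrative)."""
    rng = random.Random(seed)
    principal = rng.choice([1000, 2000, 5000])
    rate = rng.choice([0.05, 0.10])   # annual compound interest rate
    years = rng.choice([2, 3])
    tax = rng.choice([0.15, 0.20])    # tax on the interest earned

    # Three execution steps: compound, isolate interest, apply tax.
    amount = principal * (1 + rate) ** years
    interest = amount - principal
    after_tax = principal + interest * (1 - tax)

    problem = (
        f"You invest ${principal} at {rate:.0%} annual compound interest "
        f"for {years} years, then pay {tax:.0%} tax on the interest earned. "
        f"How much do you have after tax?"
    )
    return {"problem": problem, "correct_answer": round(after_tax, 2), "steps": 3}

print(make_financial_problem(0))
```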
## Data Format

### Consensus-derived examples

```json
{
  "id": "consensus_12345",
  "domain": "math_consensus",
  "problem": "A store sells apples for $2 each...",
  "correct_answer": "15",
  "source": "cgrt-consensus-5model",
  "consensus_tier": "contested",
  "model_responses": {
    "claude": {"answer": "15", "response": "Step 1..."},
    "codex": {"answer": "14", "response": "First..."},
    "gemini": {"answer": "15", "response": "Let me..."},
    "deepseek": {"answer": "16", "response": "..."},
    "qwen": {"answer": "15", "response": "..."}
  },
  "difficulty": "hard"
}
```
### Programmatic examples

```json
{
  "id": "math_discount_01",
  "domain": "math_discount",
  "problem": "A product costs $25 and is on 20% sale...",
  "correct_answer": "15",
  "explanation": "25 × 0.8 = 20.0, then 20.0 - 5 = 15.0",
  "source": "programmatic",
  "difficulty": "easy",
  "steps": 2
}
```
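A small sanity check over both formats can catch malformed records. This sketch derives the required fields from the two examples above (the field sets are assumptions based on those records, not a published schema):

```python
import json

# Fields common to both record formats, per the examples above.
REQUIRED = {"id", "domain", "problem", "correct_answer", "source", "difficulty"}

def validate(record):
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"{record.get('id')}: missing fields {missing}")
    if record["source"] == "cgrt-consensus-5model":
        # Consensus records carry per-model traces and a tier label.
        assert "model_responses" in record and "consensus_tier" in record
    elif record["source"] == "programmatic":
        # Programmatic records carry a worked explanation and a step count.
        assert "explanation" in record and "steps" in record

with open("data/combined_test.jsonl") as f:
    for line in f:
        validate(json.loads(line))
```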
## Consensus Tiers
| Tier | Description | Count |
|---|---|---|
| Gold | All 5 models agree | 51,174 |
| Silver | 4/5 models agree | 5,766 |
| Bronze | 3/5 models agree | 3,182 |
| Contested | No majority (strongest Goodhart Gap) | 1,556 |
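The tiers follow directly from how many of the five answers agree. A sketch of the assignment is below; note that the sample consensus record above is labeled "contested" even though three of its answers match, so the benchmark's own tiering likely normalizes answers or applies extra criteria, and this should be read as an approximation:

```python
from collections import Counter

def consensus_tier(answers):
    """Map five model answers to a tier per the table above (approximate)."""
    top = Counter(answers).most_common(1)[0][1]  # size of largest agreeing group
    if top == 5:
        return "gold"
    if top == 4:
        return "silver"
    if top == 3:
        return "bronze"
    return "contested"  # no 3-of-5 majority

print(consensus_tier(["8", "7", "8", "9", "6"]))  # contested
```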
## Domains

### From Consensus Data
- Math word problems (GSM8K-style)
- Multi-step arithmetic
- Rate/ratio problems
### Programmatic Domains
| Domain | Count | Type |
|---|---|---|
| math_discount | 15 | Numerical |
| time | 13 | Numerical |
| financial | 10 | Numerical |
| logic | 8 | Numerical |
| recipe | 7 | Numerical |
| scheduling | 7 | Numerical |
| units | 7 | Numerical |
| spatial | 7 | Non-numerical |
| procedural | 6 | Non-numerical |
| text | 7 | Non-numerical |
| sequence | 7 | Non-numerical |
| causal | 7 | Non-numerical |
## Usage

### Quick Evaluation

```bash
# Evaluate on the combined test set
python evaluate.py --provider anthropic --model claude-3-5-haiku-latest --dataset combined_test.jsonl

# Evaluate on contested cases only (hardest)
python evaluate.py --provider openai --model gpt-4o --dataset goodhart_contested.jsonl
```
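`evaluate.py` ships with the benchmark repo and its internals are not shown here; roughly, it must query the model on each problem and compare an extracted answer against `correct_answer`. Below is a minimal sketch of such a loop for the OpenAI provider only; the prompting and the naive last-number extraction are assumptions, not the script's actual logic:

```python
import json
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade(model, dataset_path):
    """Pass rate of `model` on a JSONL dataset (illustrative, not evaluate.py)."""
    correct = total = 0
    with open(dataset_path) as f:
        for line in f:
            p = json.loads(line)
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": p["problem"]}],
            )
            text = resp.choices[0].message.content
            # Naive extraction: take the last number in the response.
            nums = re.findall(r"-?\d+(?:\.\d+)?", text)
            correct += bool(nums) and nums[-1] == str(p["correct_answer"])
            total += 1
    return correct / total

print(grade("gpt-4o-mini", "data/test.jsonl"))
```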
### Python API

```python
import json

# Load combined test set
with open('data/combined_test.jsonl') as f:
    problems = [json.loads(line) for line in f]

# Analyze consensus examples with model responses
for p in problems:
    if p.get('source') == 'cgrt-consensus-5model':
        # Has full model reasoning traces
        for model, data in p['model_responses'].items():
            print(f"{model}: {data['answer']}")
```
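Because the consensus records already store each model's final answer, per-model accuracy on these traces can be computed offline, with no API calls. A sketch building on the `problems` list loaded above:

```python
from collections import defaultdict

# Per-model accuracy from stored traces; string comparison is naive and
# may need answer normalization in practice.
hits, seen = defaultdict(int), defaultdict(int)
for p in problems:
    if p.get("source") != "cgrt-consensus-5model":
        continue
    for model, data in p["model_responses"].items():
        seen[model] += 1
        hits[model] += data["answer"] == p["correct_answer"]

for model in sorted(seen):
    print(f"{model}: {hits[model] / seen[model]:.1%}")
```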
### With HuggingFace Datasets

```python
from datasets import load_dataset

dataset = load_dataset("Adam1010/goodhart-gap-benchmark")
```
## Leaderboard

| Model | Provider | Pass Rate | Notes |
|---|---|---|---|
| Claude 3.5 Haiku | Anthropic | 93% | Shows work consistently |
| Claude Sonnet 4 | Anthropic | 79% | |
| gpt-4o | OpenAI | 57% | |
| gpt-4o-mini | OpenAI | 36% | |
## Why This Matters

### For AI Safety
- Models explaining correctly but executing incorrectly are harder to detect
- Gap between capability benchmarks and deployment readiness
- Critical for agentic AI systems
### For Training
- Disagreement cases reveal where models need improvement
- Chain-of-thought consistency matters more than raw capability
- Smaller models (Haiku) can outperform larger ones through reliable execution
## Citation

```bibtex
@dataset{goodhart_gap_benchmark_2026,
  title={Goodhart Gap Benchmark: Detecting the Gap Between Understanding and Execution in LLMs},
  author={Adam Kruger},
  year={2026},
  url={https://huggingface.co/datasets/Adam1010/goodhart-gap-benchmark},
  note={Built on cgrt-consensus-5model dataset}
}
```
## Related Datasets
- Adam1010/cgrt-consensus-5model - Source consensus data
## License
MIT License - free for research and commercial use.
## Acknowledgments

- CGRT (Consensus-Guided Recursive Training) research
- 5-model consensus data collection (~$1,000 in API calls)
- Goodhart's Law and its application to AI evaluation