|
|
--- |
|
|
license: mit |
|
|
task_categories: |
|
|
- question-answering |
|
|
- text-generation |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- benchmark |
|
|
- reasoning |
|
|
- multi-step |
|
|
- evaluation |
|
|
- llm-evaluation |
|
|
- goodhart |
|
|
- execution-vs-understanding |
|
|
- consensus |
|
|
- multi-model |
|
|
size_categories: |
|
|
- 1K<n<10K |
|
|
--- |
|
|
|
|
|
# Goodhart Gap Benchmark |
|
|
|
|
|
**Detecting the gap between understanding and execution in language models** |
|
|
|
|
|
## Overview |
|
|
|
|
|
The Goodhart Gap Benchmark tests whether language models can correctly *execute* multi-step reasoning tasks that they can correctly *explain*. Named after Goodhart's Law ("When a measure becomes a target, it ceases to be a good measure"), the benchmark targets a critical failure mode: a model that understands a procedure but fails to carry it out.
|
|
|
|
|
## Data Sources |
|
|
|
|
|
This benchmark combines two data sources: |
|
|
|
|
|
### 1. CGRT Consensus Dataset (Primary) |
|
|
**Source**: [Adam1010/cgrt-consensus-5model](https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model) |
|
|
|
|
|
| Metric | Value | |
|
|
|--------|-------| |
|
|
| Total problems | 61,678 | |
|
|
| Models queried | 5 (Claude, GPT-4, Gemini, DeepSeek, Qwen) | |
|
|
| API cost | ~$1,000 | |
|
|
| Disagreement cases | 8,050 | |
|
|
| Contested (strongest) | 1,556 | |
|
|
|
|
|
Each problem includes full reasoning traces from 5 frontier models, enabling analysis of where execution diverges despite similar understanding. |
|
|
|
|
|
### 2. Programmatic Multi-Domain Problems |
|
|
| Metric | Value | |
|
|
|--------|-------| |
|
|
| Total problems | 101 | |
|
|
| Domains | 12 | |
|
|
| Cost | $0 (generated) | |
|
|
|
|
|
## Dataset Files |
|
|
|
|
|
| File | Description | Count | |
|
|
|------|-------------|-------| |
|
|
| `combined_test.jsonl` | Main evaluation set (contested + programmatic) | 1,657 | |
|
|
| `goodhart_disagreements.jsonl` | All disagreement cases from consensus | 8,050 | |
|
|
| `goodhart_contested.jsonl` | Strongest Goodhart Gap cases | 1,556 | |
|
|
| `test.jsonl` | Programmatic problems only | 101 | |
|
|
|
|
|
## Key Finding |
|
|
|
|
|
**Models that consistently show chain-of-thought execute correctly; models that give quick answers fail.** |
|
|
|
|
|
| Model | Financial Domain | Behavior | |
|
|
|-------|------------------|----------| |
|
|
| Claude 3.5 Haiku | 100% | Always shows work | |
|
|
| Claude Sonnet 4 | 30% | Sometimes skips work | |
|
|
| gpt-4o | 30% | Sometimes skips work | |
|
|
| gpt-4o-mini | 0% | Usually skips work | |
|
|
|
|
|
The financial domain (compound interest + tax) is the strongest Goodhart Gap detector. |
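The kind of two-step computation this domain requires (compound interest first, then tax on the gains) can be sketched as follows; the numbers and the function name are illustrative, not taken from the dataset:

```python
# Hypothetical financial-domain problem: $1,000 at 5% annual interest,
# compounded yearly for 3 years, then 20% tax on the gains.
# All values are illustrative, not dataset items.

def financial_answer(principal, rate, years, tax_rate):
    """Compound the principal, then tax only the gains."""
    final = principal * (1 + rate) ** years   # step 1: compound interest
    gains = final - principal
    after_tax = final - gains * tax_rate      # step 2: tax the gains, not the balance
    return round(after_tax, 2)

print(financial_answer(1000, 0.05, 3, 0.20))  # -> 1126.1
```

A model that answers without working through the intermediate step can easily tax the full balance instead of the gains, which is the kind of execution slip this domain surfaces.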
|
|
|
|
|
## Data Format |
|
|
|
|
|
### Consensus-derived examples |
|
|
```json |
|
|
{ |
|
|
"id": "consensus_12345", |
|
|
"domain": "math_consensus", |
|
|
"problem": "A store sells apples for $2 each...", |
|
|
"correct_answer": "15", |
|
|
"source": "cgrt-consensus-5model", |
|
|
"consensus_tier": "contested", |
|
|
"model_responses": { |
|
|
"claude": {"answer": "15", "response": "Step 1..."}, |
|
|
"codex": {"answer": "14", "response": "First..."}, |
|
|
"gemini": {"answer": "15", "response": "Let me..."}, |
|
|
"deepseek": {"answer": "16", "response": "..."}, |
|
|
"qwen": {"answer": "15", "response": "..."} |
|
|
}, |
|
|
"difficulty": "hard" |
|
|
} |
|
|
``` |
|
|
|
|
|
### Programmatic examples |
|
|
```json |
|
|
{ |
|
|
"id": "math_discount_01", |
|
|
"domain": "math_discount", |
|
|
"problem": "A product costs $25 and is on 20% sale...", |
|
|
"correct_answer": "15", |
|
|
"explanation": "25 × 0.8 = 20.0, then 20.0 - 5 = 15.0", |
|
|
"source": "programmatic", |
|
|
"difficulty": "easy", |
|
|
"steps": 2 |
|
|
} |
|
|
``` |
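Problems in this format are cheap to generate deterministically. A minimal generator sketch, using the field names shown above; the wording, number ranges, and function name are hypothetical, not the dataset's actual generator:

```python
import random

def make_discount_problem(idx, rng):
    """Generate one two-step discount problem in the card's programmatic format.
    Wording and number ranges are illustrative only."""
    price = rng.randrange(10, 100)
    pct = rng.choice([10, 20, 25, 50])
    coupon = rng.randrange(1, 10)
    answer = price * (100 - pct) / 100 - coupon  # step 1: discount, step 2: coupon
    return {
        "id": f"math_discount_{idx:02d}",
        "domain": "math_discount",
        "problem": (f"A product costs ${price} and is on sale at {pct}% off. "
                    f"You also have a ${coupon} coupon. What do you pay?"),
        "correct_answer": str(answer),
        "source": "programmatic",
        "difficulty": "easy",
        "steps": 2,
    }

p = make_discount_problem(1, random.Random(0))
```

Because the answer is computed alongside the problem text, grading is exact and costs nothing, which is how the programmatic split reaches $0.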
|
|
|
|
|
## Consensus Tiers |
|
|
|
|
|
| Tier | Description | Count | |
|
|
|------|-------------|-------| |
|
|
| **Gold** | All 5 models agree | 51,174 | |
|
|
| **Silver** | 4/5 models agree | 5,766 | |
|
|
| **Bronze** | 3/5 models agree | 3,182 | |
|
|
| **Contested** | No majority (strongest Goodhart Gap) | 1,556 | |
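The tier of a problem follows directly from the size of the largest agreeing group among the five answers. A minimal sketch, assuming answers are compared as trimmed strings (the dataset's own normalization may differ):

```python
from collections import Counter

def consensus_tier(answers):
    """Map 5 model answers to a tier by the size of the largest agreeing group."""
    top = Counter(a.strip() for a in answers).most_common(1)[0][1]
    return {5: "gold", 4: "silver", 3: "bronze"}.get(top, "contested")

consensus_tier(["15", "14", "15", "16", "15"])  # 3/5 agree -> "bronze"
```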
|
|
|
|
|
## Domains |
|
|
|
|
|
### From Consensus Data |
|
|
- Math word problems (GSM8K-style) |
|
|
- Multi-step arithmetic |
|
|
- Rate/ratio problems |
|
|
|
|
|
### Programmatic Domains |
|
|
| Domain | Count | Type | |
|
|
|--------|-------|------| |
|
|
| math_discount | 15 | Numerical | |
|
|
| time | 13 | Numerical | |
|
|
| financial | 10 | Numerical | |
|
|
| logic | 8 | Numerical | |
|
|
| recipe | 7 | Numerical | |
|
|
| scheduling | 7 | Numerical | |
|
|
| units | 7 | Numerical | |
|
|
| spatial | 7 | Non-numerical | |
|
|
| procedural | 6 | Non-numerical | |
|
|
| text | 7 | Non-numerical | |
|
|
| sequence | 7 | Non-numerical | |
|
|
| causal | 7 | Non-numerical | |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Quick Evaluation |
|
|
```bash |
|
|
# Evaluate on combined test set |
|
|
python evaluate.py --provider anthropic --model claude-3-5-haiku-latest --dataset combined_test.jsonl |
|
|
|
|
|
# Evaluate on contested only (hardest) |
|
|
python evaluate.py --provider openai --model gpt-4o --dataset goodhart_contested.jsonl |
|
|
``` |
|
|
|
|
|
### Python API |
|
|
```python |
|
|
import json |
|
|
|
|
|
# Load combined test set |
|
|
with open('data/combined_test.jsonl') as f: |
|
|
problems = [json.loads(line) for line in f] |
|
|
|
|
|
# Analyze consensus examples with model responses |
|
|
for p in problems: |
|
|
if p.get('source') == 'cgrt-consensus-5model': |
|
|
# Has full model reasoning traces |
|
|
for model, data in p['model_responses'].items(): |
|
|
print(f"{model}: {data['answer']}") |
|
|
``` |
|
|
|
|
|
### With HuggingFace Datasets |
|
|
```python |
|
|
from datasets import load_dataset |
|
|
|
|
|
dataset = load_dataset("Adam1010/goodhart-gap-benchmark") |
|
|
``` |
|
|
|
|
|
## Leaderboard |
|
|
|
|
|
| Model | Provider | Pass Rate | Notes | |
|
|
|-------|----------|-----------|-------| |
|
|
| Claude 3.5 Haiku | Anthropic | 93% | Shows work consistently | |
|
|
| Claude Sonnet 4 | Anthropic | 79% | | |
|
|
| gpt-4o | OpenAI | 57% | | |
|
|
| gpt-4o-mini | OpenAI | 36% | | |
|
|
|
|
|
## Why This Matters |
|
|
|
|
|
### For AI Safety |
|
|
- Models that explain correctly but execute incorrectly are harder to detect
|
|
- Gap between capability benchmarks and deployment readiness |
|
|
- Critical for agentic AI systems |
|
|
|
|
|
### For Training |
|
|
- Disagreement cases reveal where models need improvement |
|
|
- Chain-of-thought consistency matters more than raw capability |
|
|
- Smaller models (Haiku) can outperform larger ones through reliable execution |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@dataset{goodhart_gap_benchmark_2026, |
|
|
title={Goodhart Gap Benchmark: Detecting the Gap Between Understanding and Execution in LLMs}, |
|
|
author={Adam Kruger}, |
|
|
year={2026}, |
|
|
url={https://huggingface.co/datasets/Adam1010/goodhart-gap-benchmark}, |
|
|
note={Built on cgrt-consensus-5model dataset} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Related Datasets |
|
|
|
|
|
- [Adam1010/cgrt-consensus-5model](https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model) - Source consensus data |
|
|
|
|
|
## License |
|
|
|
|
|
MIT License - free for research and commercial use. |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- CGRT (Consensus-Guided Recursive Training) research |
|
|
- 5-model consensus data collection (~$1,000 in API calls)
|
|
- Goodhart's Law and its application to AI evaluation |
|
|
|