# Goodhart Gap Benchmark

Detecting the gap between understanding and execution in language models.

## Overview

The Goodhart Gap Benchmark tests whether language models can correctly execute multi-step reasoning tasks that they can correctly explain. Named after Goodhart's Law ("when a measure becomes a target, it ceases to be a good measure"), the benchmark probes a critical failure mode: models that can explain a procedure but fail to execute it.
## Data Sources

This benchmark combines two data sources:

### 1. CGRT Consensus Dataset (Primary)

Source: `Adam1010/cgrt-consensus-5model`
| Metric | Value |
|---|---|
| Total problems | 61,678 |
| Models queried | 5 (Claude, GPT-4, Gemini, DeepSeek, Qwen) |
| API cost | ~$1,000 |
| Disagreement cases | 8,050 |
| Contested (strongest) | 1,556 |
Each problem includes full reasoning traces from 5 frontier models, enabling analysis of where execution diverges despite similar understanding.
### 2. Programmatic Multi-Domain Problems
| Metric | Value |
|---|---|
| Total problems | 101 |
| Domains | 12 |
| Cost | $0 (generated) |
## Dataset Files

| File | Description | Count |
|---|---|---|
| `combined_test.jsonl` | Main evaluation set (contested + programmatic) | 1,657 |
| `goodhart_disagreements.jsonl` | All disagreement cases from consensus | 8,050 |
| `goodhart_contested.jsonl` | Strongest Goodhart Gap cases | 1,556 |
| `test.jsonl` | Programmatic problems only | 101 |
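The counts line up: `combined_test.jsonl` is the contested consensus cases plus the programmatic problems (1,556 + 101 = 1,657). A minimal sketch of rebuilding it, assuming the files sit under a local `data/` directory:

```python
import json

def read_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

# Concatenate the two source files into the combined evaluation set.
contested = read_jsonl("data/goodhart_contested.jsonl")
programmatic = read_jsonl("data/test.jsonl")
combined = contested + programmatic
assert len(combined) == 1_556 + 101  # 1,657 problems
```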
## Key Finding

Models that consistently show chain-of-thought reasoning execute correctly; models that jump straight to an answer fail.

| Model | Financial Domain Pass Rate | Behavior |
|---|---|---|
| Claude 3.5 Haiku | 100% | Always shows work |
| Claude Sonnet 4 | 30% | Sometimes skips work |
| gpt-4o | 30% | Sometimes skips work |
| gpt-4o-mini | 0% | Usually skips work |
The financial domain (compound interest + tax) is the strongest Goodhart Gap detector.
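For illustration, here is a hypothetical generator in the style of that financial domain (compound interest followed by tax). The benchmark's actual templates are not reproduced here, so the wording, parameter ranges, and function name are assumptions:

```python
import random

def make_financial_problem(seed):
    """Sketch of a compound-interest-plus-tax problem generator (illustrative)."""
    rng = random.Random(seed)
    principal = rng.choice([1000, 2000, 5000])
    rate = rng.choice([0.05, 0.10])   # annual compound interest rate
    years = rng.choice([2, 3])
    tax = rng.choice([0.15, 0.20])    # tax on the interest earned

    # Three execution steps: compound, isolate interest, apply tax.
    amount = principal * (1 + rate) ** years
    interest = amount - principal
    after_tax = principal + interest * (1 - tax)

    problem = (
        f"You invest ${principal} at {rate:.0%} annual compound interest "
        f"for {years} years, then pay {tax:.0%} tax on the interest earned. "
        f"How much do you have after tax?"
    )
    return {"problem": problem, "correct_answer": round(after_tax, 2), "steps": 3}

print(make_financial_problem(0))
```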
## Data Format

### Consensus-derived examples

```json
{
  "id": "consensus_12345",
  "domain": "math_consensus",
  "problem": "A store sells apples for $2 each...",
  "correct_answer": "15",
  "source": "cgrt-consensus-5model",
  "consensus_tier": "contested",
  "model_responses": {
    "claude": {"answer": "15", "response": "Step 1..."},
    "codex": {"answer": "14", "response": "First..."},
    "gemini": {"answer": "15", "response": "Let me..."},
    "deepseek": {"answer": "16", "response": "..."},
    "qwen": {"answer": "15", "response": "..."}
  },
  "difficulty": "hard"
}
```
### Programmatic examples

```json
{
  "id": "math_discount_01",
  "domain": "math_discount",
  "problem": "A product costs $25 and is on 20% sale...",
  "correct_answer": "15",
  "explanation": "25 × 0.8 = 20.0, then 20.0 - 5 = 15.0",
  "source": "programmatic",
  "difficulty": "easy",
  "steps": 2
}
```
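A small sanity check over both formats can catch malformed records. This sketch derives the required fields from the two examples above (the field sets are assumptions based on those records, not a published schema):

```python
import json

# Fields common to both record formats, per the examples above.
REQUIRED = {"id", "domain", "problem", "correct_answer", "source", "difficulty"}

def validate(record):
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"{record.get('id')}: missing fields {missing}")
    if record["source"] == "cgrt-consensus-5model":
        # Consensus records carry per-model traces and a tier label.
        assert "model_responses" in record and "consensus_tier" in record
    elif record["source"] == "programmatic":
        # Programmatic records carry a worked explanation and a step count.
        assert "explanation" in record and "steps" in record

with open("data/combined_test.jsonl") as f:
    for line in f:
        validate(json.loads(line))
```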
## Consensus Tiers
| Tier | Description | Count |
|---|---|---|
| Gold | All 5 models agree | 51,174 |
| Silver | 4/5 models agree | 5,766 |
| Bronze | 3/5 models agree | 3,182 |
| Contested | No majority (strongest Goodhart Gap) | 1,556 |
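The tiers follow directly from how many of the five answers agree. A sketch of the assignment is below; note that the sample consensus record above is labeled "contested" even though three of its answers match, so the benchmark's own tiering likely normalizes answers or applies extra criteria, and this should be read as an approximation:

```python
from collections import Counter

def consensus_tier(answers):
    """Map five model answers to a tier per the table above (approximate)."""
    top = Counter(answers).most_common(1)[0][1]  # size of largest agreeing group
    if top == 5:
        return "gold"
    if top == 4:
        return "silver"
    if top == 3:
        return "bronze"
    return "contested"  # no 3-of-5 majority

print(consensus_tier(["8", "7", "8", "9", "6"]))  # contested
```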
## Domains

### From Consensus Data
- Math word problems (GSM8K-style)
- Multi-step arithmetic
- Rate/ratio problems
### Programmatic Domains
| Domain | Count | Type |
|---|---|---|
| math_discount | 15 | Numerical |
| time | 13 | Numerical |
| financial | 10 | Numerical |
| logic | 8 | Numerical |
| recipe | 7 | Numerical |
| scheduling | 7 | Numerical |
| units | 7 | Numerical |
| spatial | 7 | Non-numerical |
| procedural | 6 | Non-numerical |
| text | 7 | Non-numerical |
| sequence | 7 | Non-numerical |
| causal | 7 | Non-numerical |
## Usage

### Quick Evaluation

```bash
# Evaluate on the combined test set
python evaluate.py --provider anthropic --model claude-3-5-haiku-latest --dataset combined_test.jsonl

# Evaluate on contested cases only (hardest)
python evaluate.py --provider openai --model gpt-4o --dataset goodhart_contested.jsonl
```
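`evaluate.py` ships with the benchmark repo and its internals are not shown here; roughly, it must query the model on each problem and compare an extracted answer against `correct_answer`. Below is a minimal sketch of such a loop for the OpenAI provider only; the prompting and the naive last-number extraction are assumptions, not the script's actual logic:

```python
import json
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade(model, dataset_path):
    """Pass rate of `model` on a JSONL dataset (illustrative, not evaluate.py)."""
    correct = total = 0
    with open(dataset_path) as f:
        for line in f:
            p = json.loads(line)
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": p["problem"]}],
            )
            text = resp.choices[0].message.content
            # Naive extraction: take the last number in the response.
            nums = re.findall(r"-?\d+(?:\.\d+)?", text)
            correct += bool(nums) and nums[-1] == str(p["correct_answer"])
            total += 1
    return correct / total

print(grade("gpt-4o-mini", "data/test.jsonl"))
```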
### Python API

```python
import json

# Load combined test set
with open('data/combined_test.jsonl') as f:
    problems = [json.loads(line) for line in f]

# Analyze consensus examples with model responses
for p in problems:
    if p.get('source') == 'cgrt-consensus-5model':
        # Has full model reasoning traces
        for model, data in p['model_responses'].items():
            print(f"{model}: {data['answer']}")
```
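Because the consensus records already store each model's final answer, per-model accuracy on these traces can be computed offline, with no API calls. A sketch building on the `problems` list loaded above:

```python
from collections import defaultdict

# Per-model accuracy from stored traces; string comparison is naive and
# may need answer normalization in practice.
hits, seen = defaultdict(int), defaultdict(int)
for p in problems:
    if p.get("source") != "cgrt-consensus-5model":
        continue
    for model, data in p["model_responses"].items():
        seen[model] += 1
        hits[model] += data["answer"] == p["correct_answer"]

for model in sorted(seen):
    print(f"{model}: {hits[model] / seen[model]:.1%}")
```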
### With HuggingFace Datasets

```python
from datasets import load_dataset

dataset = load_dataset("Adam1010/goodhart-gap-benchmark")
```
## Leaderboard

| Model | Provider | Pass Rate | Notes |
|---|---|---|---|
| Claude 3.5 Haiku | Anthropic | 93% | Shows work consistently |
| Claude Sonnet 4 | Anthropic | 79% | |
| gpt-4o | OpenAI | 57% | |
| gpt-4o-mini | OpenAI | 36% | |
## Why This Matters

### For AI Safety
- Models explaining correctly but executing incorrectly are harder to detect
- Gap between capability benchmarks and deployment readiness
- Critical for agentic AI systems
### For Training
- Disagreement cases reveal where models need improvement
- Chain-of-thought consistency matters more than raw capability
- Smaller models (Haiku) can outperform larger ones through reliable execution
## Citation

```bibtex
@dataset{goodhart_gap_benchmark_2026,
  title={Goodhart Gap Benchmark: Detecting the Gap Between Understanding and Execution in LLMs},
  author={Adam Kruger},
  year={2026},
  url={https://huggingface.co/datasets/Adam1010/goodhart-gap-benchmark},
  note={Built on cgrt-consensus-5model dataset}
}
```
## Related Datasets
- Adam1010/cgrt-consensus-5model - Source consensus data
## License
MIT License - free for research and commercial use.
## Acknowledgments

- CGRT (Consensus-Guided Recursive Training) research
- 5-model consensus data collection (~$1,000 in API calls)
- Goodhart's Law and its application to AI evaluation