# Goodhart Gap Benchmark

*Detecting the gap between understanding and execution in language models*

## Overview

The Goodhart Gap Benchmark tests whether language models can correctly execute multi-step reasoning tasks that they can correctly explain. Named after Goodhart's Law ("When a measure becomes a target, it ceases to be a good measure"), this benchmark reveals a critical failure mode: models that understand procedures but fail to execute them.

## Data Sources

This benchmark combines two data sources:

### 1. CGRT Consensus Dataset (Primary)

Source: `Adam1010/cgrt-consensus-5model`

| Metric | Value |
|---|---|
| Total problems | 61,678 |
| Models queried | 5 (Claude, GPT-4, Gemini, DeepSeek, Qwen) |
| API cost | ~$1,000 |
| Disagreement cases | 8,050 |
| Contested (strongest) | 1,556 |

Each problem includes full reasoning traces from 5 frontier models, enabling analysis of where execution diverges despite similar understanding.

### 2. Programmatic Multi-Domain Problems

| Metric | Value |
|---|---|
| Total problems | 101 |
| Domains | 12 |
| Cost | $0 (generated) |

## Dataset Files

| File | Description | Count |
|---|---|---|
| `combined_test.jsonl` | Main evaluation set (contested + programmatic) | 1,657 |
| `goodhart_disagreements.jsonl` | All disagreement cases from consensus | 8,050 |
| `goodhart_contested.jsonl` | Strongest Goodhart Gap cases | 1,556 |
| `test.jsonl` | Programmatic problems only | 101 |

## Key Finding

Models that consistently show chain-of-thought reasoning execute correctly; models that jump straight to an answer without showing work fail.

| Model | Financial Domain Pass Rate | Behavior |
|---|---|---|
| Claude 3.5 Haiku | 100% | Always shows work |
| Claude Sonnet 4 | 30% | Sometimes skips work |
| gpt-4o | 30% | Sometimes skips work |
| gpt-4o-mini | 0% | Usually skips work |

The financial domain (compound interest + tax) is the strongest Goodhart Gap detector.
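
To make concrete why this domain stresses execution, here is a minimal worked sketch of a compound-interest-plus-tax calculation in the style of these items. The numbers are hypothetical, not drawn from the dataset:

```python
# Hypothetical financial-domain problem: each step is simple, but the final
# answer is only correct if every intermediate value is carried through.
principal = 1000.0   # hypothetical starting amount
rate = 0.05          # 5% annual interest, compounded yearly
years = 3
tax_rate = 0.20      # 20% tax on the interest earned

# Step 1: compound the principal.
final_amount = principal * (1 + rate) ** years   # 1157.625

# Step 2: isolate the interest earned.
interest = final_amount - principal              # 157.625

# Step 3: tax applies to the interest only, not the principal.
after_tax = principal + interest * (1 - tax_rate)

print(round(after_tax, 2))  # 1126.1
```

A model that reaches for a single memorized formula can plausibly tax the full amount or compound after taxing; both errors produce answers that look reasonable, which is exactly the failure mode this benchmark is designed to surface.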

## Data Format

### Consensus-derived examples

```json
{
  "id": "consensus_12345",
  "domain": "math_consensus",
  "problem": "A store sells apples for $2 each...",
  "correct_answer": "15",
  "source": "cgrt-consensus-5model",
  "consensus_tier": "contested",
  "model_responses": {
    "claude": {"answer": "15", "response": "Step 1..."},
    "codex": {"answer": "14", "response": "First..."},
    "gemini": {"answer": "15", "response": "Let me..."},
    "deepseek": {"answer": "16", "response": "..."},
    "qwen": {"answer": "15", "response": "..."}
  },
  "difficulty": "hard"
}
```
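
Because every consensus-derived record carries all five model answers, disagreement can be measured directly from the record. A minimal sketch using the abbreviated example above; field names follow the schema, and the analysis itself is illustrative rather than part of the benchmark tooling:

```python
from collections import Counter

# Abbreviated version of the consensus-derived record shown above.
record = {
    "correct_answer": "15",
    "model_responses": {
        "claude":   {"answer": "15"},
        "codex":    {"answer": "14"},
        "gemini":   {"answer": "15"},
        "deepseek": {"answer": "16"},
        "qwen":     {"answer": "15"},
    },
}

# Tally the answers and find the largest agreeing group.
answers = [r["answer"] for r in record["model_responses"].values()]
majority_answer, votes = Counter(answers).most_common(1)[0]

print(f"majority answer: {majority_answer} ({votes}/5 models)")
print("dissenting models:",
      [m for m, r in record["model_responses"].items()
       if r["answer"] != majority_answer])
```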

### Programmatic examples

```json
{
  "id": "math_discount_01",
  "domain": "math_discount",
  "problem": "A product costs $25 and is on 20% sale...",
  "correct_answer": "15",
  "explanation": "25 × 0.8 = 20.0, then 20.0 - 5 = 15.0",
  "source": "programmatic",
  "difficulty": "easy",
  "steps": 2
}
```
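
The published generator is not shown here, but records in this schema can be produced in a few lines of Python. A hypothetical minimal sketch; the problem template and value ranges are assumptions, not the dataset's actual generator:

```python
import json
import random

def make_discount_problem(idx: int) -> dict:
    """Produce one hypothetical math_discount record in the schema above."""
    price = random.choice([20, 25, 40, 50])   # assumed value ranges
    pct = random.choice([10, 20, 25])
    coupon = random.choice([2, 5])
    discounted = price * (1 - pct / 100)      # step 1: apply the sale
    answer = discounted - coupon              # step 2: apply the coupon
    return {
        "id": f"math_discount_{idx:02d}",
        "domain": "math_discount",
        "problem": (f"A product costs ${price} and is on {pct}% sale. "
                    f"You also use a ${coupon} coupon. What do you pay?"),
        "correct_answer": f"{answer:g}",
        "explanation": (f"{price} × {1 - pct / 100:g} = {discounted:g}, "
                        f"then {discounted:g} - {coupon} = {answer:g}"),
        "source": "programmatic",
        "difficulty": "easy",
        "steps": 2,
    }

print(json.dumps(make_discount_problem(1), indent=2, ensure_ascii=False))
```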

## Consensus Tiers

| Tier | Description | Count |
|---|---|---|
| Gold | All 5 models agree | 51,174 |
| Silver | 4/5 models agree | 5,766 |
| Bronze | 3/5 models agree | 3,182 |
| Contested | No majority (strongest Goodhart Gap) | 1,556 |
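
Given the five answers for a problem, the tier follows mechanically from the size of the largest agreeing group. A minimal sketch, assuming exact string comparison between answers:

```python
from collections import Counter

def consensus_tier(answers: list[str]) -> str:
    """Map five model answers to a tier by the size of the largest agreeing group."""
    top_votes = Counter(answers).most_common(1)[0][1]
    return {5: "gold", 4: "silver", 3: "bronze"}.get(top_votes, "contested")

print(consensus_tier(["15"] * 5))                      # gold
print(consensus_tier(["15", "14", "15", "16", "15"]))  # bronze (3/5 agree)
print(consensus_tier(["15", "14", "16", "17", "15"]))  # contested (no majority)
```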

## Domains

### From Consensus Data

- Math word problems (GSM8K-style)
- Multi-step arithmetic
- Rate/ratio problems

### Programmatic Domains

| Domain | Count | Type |
|---|---|---|
| math_discount | 15 | Numerical |
| time | 13 | Numerical |
| financial | 10 | Numerical |
| logic | 8 | Numerical |
| recipe | 7 | Numerical |
| scheduling | 7 | Numerical |
| units | 7 | Numerical |
| spatial | 7 | Non-numerical |
| procedural | 6 | Non-numerical |
| text | 7 | Non-numerical |
| sequence | 7 | Non-numerical |
| causal | 7 | Non-numerical |

## Usage

### Quick Evaluation

```bash
# Evaluate on combined test set
python evaluate.py --provider anthropic --model claude-3-5-haiku-latest --dataset combined_test.jsonl

# Evaluate on contested only (hardest)
python evaluate.py --provider openai --model gpt-4o --dataset goodhart_contested.jsonl
```

### Python API

```python
import json

# Load combined test set
with open('data/combined_test.jsonl') as f:
    problems = [json.loads(line) for line in f]

# Analyze consensus examples with model responses
for p in problems:
    if p.get('source') == 'cgrt-consensus-5model':
        # Has full model reasoning traces
        for model, data in p['model_responses'].items():
            print(f"{model}: {data['answer']}")
```

### With HuggingFace Datasets

```python
from datasets import load_dataset

dataset = load_dataset("Adam1010/goodhart-gap-benchmark")
```
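
To inspect a record (the `"train"` split name is an assumption about how the dataset is published):

```python
example = dataset["train"][0]          # assumed default split name
print(example["domain"], example["correct_answer"])
```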

## Leaderboard

| Model | Provider | Pass Rate | Notes |
|---|---|---|---|
| Claude 3.5 Haiku | Anthropic | 93% | Shows work consistently |
| Claude Sonnet 4 | Anthropic | 79% | |
| gpt-4o | OpenAI | 57% | |
| gpt-4o-mini | OpenAI | 36% | |

## Why This Matters

### For AI Safety

- Models that explain correctly but execute incorrectly are harder to detect
- The gap between capability benchmarks and deployment readiness is often invisible to standard evaluations
- Reliable execution is critical for agentic AI systems

### For Training

- Disagreement cases reveal where models need improvement
- Chain-of-thought consistency matters more than raw capability
- Smaller models (Haiku) can outperform larger ones through reliable execution

## Citation

```bibtex
@dataset{goodhart_gap_benchmark_2026,
  title={Goodhart Gap Benchmark: Detecting the Gap Between Understanding and Execution in LLMs},
  author={Adam Kruger},
  year={2026},
  url={https://huggingface.co/datasets/Adam1010/goodhart-gap-benchmark},
  note={Built on cgrt-consensus-5model dataset}
}
```

## License

MIT License - free for research and commercial use.

## Acknowledgments

- CGRT (Consensus-Guided Recursive Training) research
- 5-model consensus data collection (~$1,000 in API calls)
- Goodhart's Law and its application to AI evaluation