---
license: mit
task_categories:
  - question-answering
  - text-generation
language:
  - en
tags:
  - benchmark
  - reasoning
  - multi-step
  - evaluation
  - llm-evaluation
  - goodhart
  - execution-vs-understanding
  - consensus
  - multi-model
size_categories:
  - 1K<n<10K
---

# Goodhart Gap Benchmark

**Detecting the gap between understanding and execution in language models**

## Overview

The Goodhart Gap Benchmark tests whether language models can correctly *execute* multi-step reasoning tasks that they can correctly *explain*. Named after Goodhart's Law ("When a measure becomes a target, it ceases to be a good measure"), this benchmark reveals a critical failure mode: models that understand procedures but fail to execute them.

## Data Sources

This benchmark combines two data sources:

### 1. CGRT Consensus Dataset (Primary)
**Source**: [Adam1010/cgrt-consensus-5model](https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model)

| Metric | Value |
|--------|-------|
| Total problems | 61,678 |
| Models queried | 5 (Claude, GPT-4, Gemini, DeepSeek, Qwen) |
| API cost | ~$1,000 |
| Disagreement cases | 8,050 |
| Contested (strongest) | 1,556 |

Each problem includes full reasoning traces from 5 frontier models, enabling analysis of where execution diverges despite similar understanding.

### 2. Programmatic Multi-Domain Problems
| Metric | Value |
|--------|-------|
| Total problems | 101 |
| Domains | 12 |
| Cost | $0 (generated) |

## Dataset Files

| File | Description | Count |
|------|-------------|-------|
| `combined_test.jsonl` | Main evaluation set (contested + programmatic) | 1,657 |
| `goodhart_disagreements.jsonl` | All disagreement cases from consensus | 8,050 |
| `goodhart_contested.jsonl` | Strongest Goodhart Gap cases | 1,556 |
| `test.jsonl` | Programmatic problems only | 101 |

## Key Finding

**Models that consistently show their chain-of-thought tend to execute correctly; models that jump to a quick answer tend to fail.**

| Model | Financial Domain Accuracy | Behavior |
|-------|------------------|----------|
| Claude 3.5 Haiku | 100% | Always shows work |
| Claude Sonnet 4 | 30% | Sometimes skips work |
| gpt-4o | 30% | Sometimes skips work |
| gpt-4o-mini | 0% | Usually skips work |

The financial domain (compound interest + tax) is the strongest Goodhart Gap detector.
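To make the shape of such a problem concrete, here is a minimal sketch of a two-step "compound interest, then tax" calculation. The principal, rate, term, and tax rate are hypothetical values chosen for illustration, not taken from the dataset. The function name `after_tax_balance` and the rule that tax applies only to the interest earned are likewise assumptions.

```python
def after_tax_balance(principal, annual_rate, years, tax_rate):
    """Compound annually, then tax only the interest earned (assumed rule)."""
    # Step 1: compound growth over the full term
    gross = principal * (1 + annual_rate) ** years
    # Step 2: tax is charged on the interest, not on the principal
    interest = gross - principal
    tax = interest * tax_rate
    return round(gross - tax, 2)


# Hypothetical inputs: $1,000 at 5% for 2 years, 20% tax on interest
print(after_tax_balance(1000, 0.05, 2, 0.20))
```

A model can describe both steps correctly and still slip when chaining them, for example by taxing the full balance instead of the interest.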

## Data Format

### Consensus-derived examples
```json
{
  "id": "consensus_12345",
  "domain": "math_consensus",
  "problem": "A store sells apples for $2 each...",
  "correct_answer": "15",
  "source": "cgrt-consensus-5model",
  "consensus_tier": "contested",
  "model_responses": {
    "claude": {"answer": "15", "response": "Step 1..."},
    "codex": {"answer": "14", "response": "First..."},
    "gemini": {"answer": "15", "response": "Let me..."},
    "deepseek": {"answer": "16", "response": "..."},
    "qwen": {"answer": "14", "response": "..."}
  },
  "difficulty": "hard"
}
```
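Given a record in this shape, one simple way to see where execution diverges is to list the models whose final answer differs from `correct_answer`. The sketch below assumes answers are compared as trimmed strings. The helper name `dissenting_models` is hypothetical and not part of the benchmark code.

```python
def dissenting_models(record):
    """Return sorted names of models whose answer differs from correct_answer."""
    expected = str(record["correct_answer"]).strip()
    return sorted(
        name
        for name, data in record["model_responses"].items()
        if str(data.get("answer", "")).strip() != expected
    )


# Trimmed-down record using the field names shown above
example = {
    "correct_answer": "15",
    "model_responses": {
        "claude": {"answer": "15"},
        "codex": {"answer": "14"},
        "deepseek": {"answer": "16"},
    },
}
print(dissenting_models(example))  # ['codex', 'deepseek']
```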

### Programmatic examples
```json
{
  "id": "math_discount_01",
  "domain": "math_discount",
  "problem": "A product costs $25 and is on 20% sale...",
  "correct_answer": "15",
  "explanation": "25 × 0.8 = 20.0, then 20.0 - 5 = 15.0",
  "source": "programmatic",
  "difficulty": "easy",
  "steps": 2
}
```

## Consensus Tiers

| Tier | Description | Count |
|------|-------------|-------|
| **Gold** | All 5 models agree | 51,174 |
| **Silver** | 4/5 models agree | 5,766 |
| **Bronze** | 3/5 models agree | 3,182 |
| **Contested** | No majority (strongest Goodhart Gap) | 1,556 |
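The tier rule in the table can be sketched as a function of the largest group of agreeing answers. This is an illustration of the rule as stated, not the code used to build the dataset. The function name `consensus_tier`, the lowercase tier labels other than `contested`, and exact string matching of answers are all assumptions.

```python
from collections import Counter


def consensus_tier(answers):
    """Map five model answers to a tier by the size of the largest agreeing group."""
    if len(answers) != 5:
        raise ValueError("expected exactly 5 answers")
    # Size of the most common answer group (5, 4, 3, 2, or 1)
    top_count = Counter(a.strip() for a in answers).most_common(1)[0][1]
    # Fewer than 3 agreeing answers means no majority
    return {5: "gold", 4: "silver", 3: "bronze"}.get(top_count, "contested")


print(consensus_tier(["15", "15", "15", "15", "14"]))  # silver
print(consensus_tier(["15", "14", "15", "16", "14"]))  # contested
```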

## Domains

### From Consensus Data
- Math word problems (GSM8K-style)
- Multi-step arithmetic
- Rate/ratio problems

### Programmatic Domains
| Domain | Count | Type |
|--------|-------|------|
| math_discount | 15 | Numerical |
| time | 13 | Numerical |
| financial | 10 | Numerical |
| logic | 8 | Numerical |
| recipe | 7 | Numerical |
| scheduling | 7 | Numerical |
| units | 7 | Numerical |
| spatial | 7 | Non-numerical |
| procedural | 6 | Non-numerical |
| text | 7 | Non-numerical |
| sequence | 7 | Non-numerical |
| causal | 7 | Non-numerical |

## Usage

### Quick Evaluation
```bash
# Evaluate on combined test set
python evaluate.py --provider anthropic --model claude-3-5-haiku-latest --dataset combined_test.jsonl

# Evaluate on contested only (hardest)
python evaluate.py --provider openai --model gpt-4o --dataset goodhart_contested.jsonl
```

### Python API
```python
import json

# Load combined test set (one JSON object per line; blank lines are skipped)
with open('data/combined_test.jsonl', encoding='utf-8') as f:
    problems = [json.loads(line) for line in f if line.strip()]

# Analyze consensus examples with model responses
for p in problems:
    if p.get('source') == 'cgrt-consensus-5model':
        # Has full model reasoning traces
        for model, data in p['model_responses'].items():
            print(f"{model}: {data['answer']}")
```
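Once the problems are loaded, a per-domain pass rate can be computed from your own model outputs. The sketch below is not the logic of `evaluate.py`, whose matching rules are not documented here. The helpers `answers_match` and `pass_rate_by_domain`, the numeric tolerance, and the `predictions` mapping (problem `id` to predicted answer string) are assumptions for illustration.

```python
from collections import defaultdict


def answers_match(predicted, expected, tol=1e-6):
    """Compare numerically when both parse as floats, else as case-insensitive text."""
    try:
        return abs(float(predicted) - float(expected)) <= tol
    except (TypeError, ValueError):
        return str(predicted).strip().lower() == str(expected).strip().lower()


def pass_rate_by_domain(problems, predictions):
    """Return {domain: fraction correct}; a missing prediction counts as a failure."""
    totals, passed = defaultdict(int), defaultdict(int)
    for p in problems:
        totals[p["domain"]] += 1
        if answers_match(predictions.get(p["id"]), p["correct_answer"]):
            passed[p["domain"]] += 1
    return {domain: passed[domain] / totals[domain] for domain in totals}


# Tiny hypothetical sample, not taken from the dataset
sample = [
    {"id": "a", "domain": "financial", "correct_answer": "15"},
    {"id": "b", "domain": "financial", "correct_answer": "20"},
    {"id": "c", "domain": "text", "correct_answer": "Left"},
]
print(pass_rate_by_domain(sample, {"a": "15.0", "b": "21", "c": "left"}))
```

Numeric comparison lets `"15.0"` match `"15"`, which matters because `correct_answer` is stored as a string.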

### With HuggingFace Datasets
```python
from datasets import load_dataset

dataset = load_dataset("Adam1010/goodhart-gap-benchmark")
```

## Leaderboard

| Model | Provider | Pass Rate | Notes |
|-------|----------|-----------|-------|
| Claude 3.5 Haiku | Anthropic | 93% | Shows work consistently |
| Claude Sonnet 4 | Anthropic | 79% | |
| gpt-4o | OpenAI | 57% | |
| gpt-4o-mini | OpenAI | 36% | |

## Why This Matters

### For AI Safety
- Models that explain correctly but execute incorrectly are harder to detect
- Capability benchmarks can overstate deployment readiness
- Reliable execution is critical for agentic AI systems

### For Training
- Disagreement cases reveal where models need improvement
- In these results, chain-of-thought consistency matters more than raw capability
- Smaller models (Haiku) can outperform larger ones through reliable execution

## Citation

```bibtex
@dataset{goodhart_gap_benchmark_2026,
  title={Goodhart Gap Benchmark: Detecting the Gap Between Understanding and Execution in LLMs},
  author={Adam Kruger},
  year={2026},
  url={https://huggingface.co/datasets/Adam1010/goodhart-gap-benchmark},
  note={Built on cgrt-consensus-5model dataset}
}
```

## Related Datasets

- [Adam1010/cgrt-consensus-5model](https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model) - Source consensus data

## License

MIT License - free for research and commercial use.

## Acknowledgments

- CGRT (Consensus-Guided Recursive Training) research
- 5-model consensus data collection (~$1,000 in API calls)
- Goodhart's Law and its application to AI evaluation