---
license: mit
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- benchmark
- reasoning
- multi-step
- evaluation
- llm-evaluation
- goodhart
- execution-vs-understanding
- consensus
- multi-model
size_categories:
- 1K<n<10K
---
# Goodhart Gap Benchmark
**Detecting the gap between understanding and execution in language models**
## Overview
The Goodhart Gap Benchmark tests whether language models can correctly *execute* multi-step reasoning tasks that they can correctly *explain*. Named after Goodhart's Law ("When a measure becomes a target, it ceases to be a good measure"), this benchmark reveals a critical failure mode: models that understand procedures but fail to execute them.
## Data Sources
This benchmark combines two data sources:
### 1. CGRT Consensus Dataset (Primary)
**Source**: [Adam1010/cgrt-consensus-5model](https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model)
| Metric | Value |
|--------|-------|
| Total problems | 61,678 |
| Models queried | 5 (Claude, GPT-4, Gemini, DeepSeek, Qwen) |
| API cost | ~$1,000 |
| Disagreement cases | 8,050 |
| Contested (strongest) | 1,556 |
Each problem includes full reasoning traces from 5 frontier models, enabling analysis of where execution diverges despite similar understanding.
### 2. Programmatic Multi-Domain Problems
| Metric | Value |
|--------|-------|
| Total problems | 101 |
| Domains | 12 |
| Cost | $0 (generated) |
## Dataset Files
| File | Description | Count |
|------|-------------|-------|
| `combined_test.jsonl` | Main evaluation set (contested + programmatic) | 1,657 |
| `goodhart_disagreements.jsonl` | All disagreement cases from consensus | 8,050 |
| `goodhart_contested.jsonl` | Strongest Goodhart Gap cases | 1,556 |
| `test.jsonl` | Programmatic problems only | 101 |
## Key Finding
**Models that consistently show chain-of-thought reasoning execute correctly; models that jump straight to an answer fail.**
| Model | Financial Domain Pass Rate | Behavior |
|-------|----------------------------|----------|
| Claude 3.5 Haiku | 100% | Always shows work |
| Claude Sonnet 4 | 30% | Sometimes skips work |
| gpt-4o | 30% | Sometimes skips work |
| gpt-4o-mini | 0% | Usually skips work |
The financial domain (compound interest + tax) is the strongest Goodhart Gap detector.
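To make the failure mode concrete, here is a hedged sketch of a generator for financial problems in this style (hypothetical code, not the benchmark's actual generator; field names follow the programmatic data format shown below):

```python
import random

def make_financial_problem(seed: int = 0) -> dict:
    """Sketch of a compound-interest-plus-tax problem (hypothetical generator)."""
    rng = random.Random(seed)
    principal = rng.choice([1000, 2000, 5000])
    rate = rng.choice([0.05, 0.10])   # annual compound interest rate
    years = rng.choice([2, 3])
    tax = 0.20                        # flat tax on the interest earned

    gross = principal * (1 + rate) ** years
    net = principal + (gross - principal) * (1 - tax)

    return {
        "id": f"financial_{seed:02d}",
        "domain": "financial",
        "problem": (
            f"You invest ${principal} at {rate:.0%} annual compound interest "
            f"for {years} years, then pay {tax:.0%} tax on the interest earned. "
            "How much money do you have after tax?"
        ),
        "correct_answer": f"{net:.2f}",
        "source": "programmatic",
        "difficulty": "medium",
        "steps": 3,
    }

print(make_financial_problem(1)["problem"])
```

Each problem chains three dependent steps (compound growth, interest extraction, tax), so skipping intermediate work tends to produce a plausible but wrong number.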
## Data Format
### Consensus-derived examples
```json
{
"id": "consensus_12345",
"domain": "math_consensus",
"problem": "A store sells apples for $2 each...",
"correct_answer": "15",
"source": "cgrt-consensus-5model",
"consensus_tier": "contested",
"model_responses": {
"claude": {"answer": "15", "response": "Step 1..."},
"codex": {"answer": "14", "response": "First..."},
"gemini": {"answer": "15", "response": "Let me..."},
"deepseek": {"answer": "16", "response": "..."},
"qwen": {"answer": "15", "response": "..."}
},
"difficulty": "hard"
}
```
### Programmatic examples
```json
{
"id": "math_discount_01",
"domain": "math_discount",
"problem": "A product costs $25 and is on 20% sale...",
"correct_answer": "15",
"explanation": "25 × 0.8 = 20.0, then 20.0 - 5 = 15.0",
"source": "programmatic",
"difficulty": "easy",
"steps": 2
}
```
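Because `correct_answer` is stored as a string, a scorer has to normalize numeric formats before comparing. A minimal checker sketch (the tolerance and normalization rules here are assumptions, not the benchmark's official scoring):

```python
def is_correct(model_answer: str, correct_answer: str, tol: float = 1e-6) -> bool:
    """Compare answers numerically when both parse as numbers, else as text."""
    a = model_answer.strip().replace(",", "").lstrip("$")
    b = correct_answer.strip().replace(",", "").lstrip("$")
    try:
        return abs(float(a) - float(b)) <= tol   # numerical domains
    except ValueError:
        return a.lower() == b.lower()            # non-numerical domains

assert is_correct("$15.00", "15")
```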
## Consensus Tiers
| Tier | Description | Count |
|------|-------------|-------|
| **Gold** | All 5 models agree | 51,174 |
| **Silver** | 4/5 models agree | 5,766 |
| **Bronze** | 3/5 models agree | 3,182 |
| **Contested** | No majority (strongest Goodhart Gap) | 1,556 |
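A tier label can be recomputed directly from the per-model answers in `model_responses`; a sketch, assuming the answer strings are already normalized:

```python
from collections import Counter

def consensus_tier(model_responses: dict) -> str:
    """Map 5-model answer agreement onto the tier labels above."""
    counts = Counter(r["answer"] for r in model_responses.values())
    largest = counts.most_common(1)[0][1]   # size of the biggest agreeing group
    return {5: "gold", 4: "silver", 3: "bronze"}.get(largest, "contested")

# The consensus-derived example above (3/5 models answered "15") is bronze:
example = {
    "claude": {"answer": "15"}, "codex": {"answer": "14"},
    "gemini": {"answer": "15"}, "deepseek": {"answer": "16"},
    "qwen": {"answer": "15"},
}
print(consensus_tier(example))  # -> bronze
```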
## Domains
### From Consensus Data
- Math word problems (GSM8K-style)
- Multi-step arithmetic
- Rate/ratio problems
### Programmatic Domains
| Domain | Count | Type |
|--------|-------|------|
| math_discount | 15 | Numerical |
| time | 13 | Numerical |
| financial | 10 | Numerical |
| logic | 8 | Numerical |
| recipe | 7 | Numerical |
| scheduling | 7 | Numerical |
| units | 7 | Numerical |
| spatial | 7 | Non-numerical |
| procedural | 6 | Non-numerical |
| text | 7 | Non-numerical |
| sequence | 7 | Non-numerical |
| causal | 7 | Non-numerical |
## Usage
### Quick Evaluation
```bash
# Evaluate on combined test set
python evaluate.py --provider anthropic --model claude-3-5-haiku-latest --dataset combined_test.jsonl
# Evaluate on contested only (hardest)
python evaluate.py --provider openai --model gpt-4o --dataset goodhart_contested.jsonl
```
### Python API
```python
import json
# Load combined test set
with open('data/combined_test.jsonl') as f:
problems = [json.loads(line) for line in f]
# Analyze consensus examples with model responses
for p in problems:
if p.get('source') == 'cgrt-consensus-5model':
# Has full model reasoning traces
for model, data in p['model_responses'].items():
print(f"{model}: {data['answer']}")
```
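Building on that loop, a sketch that scores each of the five models against `correct_answer` (exact string matching here is a simplification; see the normalization caveat above):

```python
from collections import defaultdict
import json

correct, total = defaultdict(int), defaultdict(int)
with open('data/combined_test.jsonl') as f:
    for line in f:
        p = json.loads(line)
        if p.get('source') != 'cgrt-consensus-5model':
            continue
        for model, data in p['model_responses'].items():
            total[model] += 1
            correct[model] += data['answer'] == p['correct_answer']

for model in sorted(total):
    print(f"{model}: {correct[model] / total[model]:.1%}")
```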
### With HuggingFace Datasets
```python
from datasets import load_dataset
dataset = load_dataset("Adam1010/goodhart-gap-benchmark")
```
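From there, the contested subset (the strongest Goodhart Gap cases) can be selected with the standard `datasets` filter API, assuming the default split is named `train`:

```python
contested = dataset["train"].filter(
    lambda ex: ex.get("consensus_tier") == "contested"
)
print(len(contested))
```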
## Leaderboard
| Model | Provider | Pass Rate | Notes |
|-------|----------|-----------|-------|
| Claude 3.5 Haiku | Anthropic | 93% | Shows work consistently |
| Claude Sonnet 4 | Anthropic | 79% | |
| gpt-4o | OpenAI | 57% | |
| gpt-4o-mini | OpenAI | 36% | |
## Why This Matters
### For AI Safety
- Models that explain a procedure correctly but execute it incorrectly are harder to detect
- Gap between capability benchmarks and deployment readiness
- Critical for agentic AI systems
### For Training
- Disagreement cases reveal where models need improvement
- Chain-of-thought consistency matters more than raw capability
- Smaller models (Haiku) can outperform larger ones through reliable execution
## Citation
```bibtex
@dataset{goodhart_gap_benchmark_2026,
title={Goodhart Gap Benchmark: Detecting the Gap Between Understanding and Execution in LLMs},
author={Adam Kruger},
year={2026},
url={https://huggingface.co/datasets/Adam1010/goodhart-gap-benchmark},
note={Built on cgrt-consensus-5model dataset}
}
```
## Related Datasets
- [Adam1010/cgrt-consensus-5model](https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model) - Source consensus data
## License
MIT License - free for research and commercial use.
## Acknowledgments
- CGRT (Consensus-Guided Recursive Training) research
- 5-model consensus data collection (~$1,000 in API calls)
- Goodhart's Law and its application to AI evaluation