|
|
--- |
|
|
license: mit |
|
|
task_categories: |
|
|
- question-answering |
|
|
- text-generation |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- benchmark |
|
|
- reasoning |
|
|
- multi-step |
|
|
- evaluation |
|
|
- llm-evaluation |
|
|
- goodhart |
|
|
- execution-vs-understanding |
|
|
- consensus |
|
|
- multi-model |
|
|
size_categories: |
|
|
- 1K<n<10K |
|
|
--- |
|
|
|
|
|
# Goodhart Gap Benchmark |
|
|
|
|
|
**Detecting the gap between understanding and execution in language models** |
|
|
|
|
|
## Overview |
|
|
|
|
|
The Goodhart Gap Benchmark tests whether language models can correctly *execute* multi-step reasoning tasks that they can correctly *explain*. Named after Goodhart's Law ("When a measure becomes a target, it ceases to be a good measure"), the benchmark targets a critical failure mode: a model that understands a procedure but fails to carry it out.
|
|
|
|
|
## Data Sources |
|
|
|
|
|
This benchmark combines two data sources: |
|
|
|
|
|
### 1. CGRT Consensus Dataset (Primary) |
|
|
**Source**: [Adam1010/cgrt-consensus-5model](https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model) |
|
|
|
|
|
| Metric | Value | |
|
|
|--------|-------| |
|
|
| Total problems | 61,678 | |
|
|
| Models queried | 5 (Claude, GPT-4, Gemini, DeepSeek, Qwen) | |
|
|
| API cost | ~$1,000 | |
|
|
| Disagreement cases | 8,050 | |
|
|
| Contested (strongest) | 1,556 | |
|
|
|
|
|
Each problem includes full reasoning traces from 5 frontier models, enabling analysis of where execution diverges despite similar understanding. |
|
|
|
|
|
### 2. Programmatic Multi-Domain Problems |
|
|
| Metric | Value | |
|
|
|--------|-------| |
|
|
| Total problems | 101 | |
|
|
| Domains | 12 | |
|
|
| Cost | $0 (generated) | |
|
|
|
|
|
## Dataset Files |
|
|
|
|
|
| File | Description | Count | |
|
|
|------|-------------|-------| |
|
|
| `combined_test.jsonl` | Main evaluation set (contested + programmatic) | 1,657 | |
|
|
| `goodhart_disagreements.jsonl` | All disagreement cases from consensus | 8,050 | |
|
|
| `goodhart_contested.jsonl` | Strongest Goodhart Gap cases | 1,556 | |
|
|
| `test.jsonl` | Programmatic problems only | 101 | |
|
|
|
|
|
## Key Finding |
|
|
|
|
|
**Models that consistently show chain-of-thought execute correctly; models that give quick answers fail.** |
|
|
|
|
|
| Model | Financial Domain | Behavior | |
|
|
|-------|------------------|----------| |
|
|
| Claude 3.5 Haiku | 100% | Always shows work | |
|
|
| Claude Sonnet 4 | 30% | Sometimes skips work | |
|
|
| gpt-4o | 30% | Sometimes skips work | |
|
|
| gpt-4o-mini | 0% | Usually skips work | |
|
|
|
|
|
The financial domain (compound interest + tax) is the strongest Goodhart Gap detector. |
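The kind of two-step computation this domain requires (compound interest first, then tax on the gains) can be sketched as follows; the numbers and the function name are illustrative, not taken from the dataset:

```python
# Hypothetical financial-domain problem: $1,000 at 5% annual interest,
# compounded yearly for 3 years, then 20% tax on the gains.
# All values are illustrative, not dataset items.

def financial_answer(principal, rate, years, tax_rate):
    """Compound the principal, then tax only the gains."""
    final = principal * (1 + rate) ** years   # step 1: compound interest
    gains = final - principal
    after_tax = final - gains * tax_rate      # step 2: tax the gains, not the balance
    return round(after_tax, 2)

print(financial_answer(1000, 0.05, 3, 0.20))  # -> 1126.1
```

A model that answers without working through the intermediate step can easily tax the full balance instead of the gains, which is the kind of execution slip this domain surfaces.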
|
|
|
|
|
## Data Format |
|
|
|
|
|
### Consensus-derived examples |
|
|
```json |
|
|
{ |
|
|
"id": "consensus_12345", |
|
|
"domain": "math_consensus", |
|
|
"problem": "A store sells apples for $2 each...", |
|
|
"correct_answer": "15", |
|
|
"source": "cgrt-consensus-5model", |
|
|
"consensus_tier": "contested", |
|
|
"model_responses": { |
|
|
"claude": {"answer": "15", "response": "Step 1..."}, |
|
|
"codex": {"answer": "14", "response": "First..."}, |
|
|
"gemini": {"answer": "15", "response": "Let me..."}, |
|
|
"deepseek": {"answer": "16", "response": "..."}, |
|
|
"qwen": {"answer": "15", "response": "..."} |
|
|
}, |
|
|
"difficulty": "hard" |
|
|
} |
|
|
``` |
|
|
|
|
|
### Programmatic examples |
|
|
```json |
|
|
{ |
|
|
"id": "math_discount_01", |
|
|
"domain": "math_discount", |
|
|
"problem": "A product costs $25 and is on 20% sale...", |
|
|
"correct_answer": "15", |
|
|
"explanation": "25 × 0.8 = 20.0, then 20.0 - 5 = 15.0", |
|
|
"source": "programmatic", |
|
|
"difficulty": "easy", |
|
|
"steps": 2 |
|
|
} |
|
|
``` |
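Problems in this format are cheap to generate deterministically. A minimal generator sketch, using the field names shown above; the wording, number ranges, and function name are hypothetical, not the dataset's actual generator:

```python
import random

def make_discount_problem(idx, rng):
    """Generate one two-step discount problem in the card's programmatic format.
    Wording and number ranges are illustrative only."""
    price = rng.randrange(10, 100)
    pct = rng.choice([10, 20, 25, 50])
    coupon = rng.randrange(1, 10)
    answer = price * (100 - pct) / 100 - coupon  # step 1: discount, step 2: coupon
    return {
        "id": f"math_discount_{idx:02d}",
        "domain": "math_discount",
        "problem": (f"A product costs ${price} and is on sale at {pct}% off. "
                    f"You also have a ${coupon} coupon. What do you pay?"),
        "correct_answer": str(answer),
        "source": "programmatic",
        "difficulty": "easy",
        "steps": 2,
    }

p = make_discount_problem(1, random.Random(0))
```

Because the answer is computed alongside the problem text, grading is exact and costs nothing, which is how the programmatic split reaches $0.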
|
|
|
|
|
## Consensus Tiers |
|
|
|
|
|
| Tier | Description | Count | |
|
|
|------|-------------|-------| |
|
|
| **Gold** | All 5 models agree | 51,174 | |
|
|
| **Silver** | 4/5 models agree | 5,766 | |
|
|
| **Bronze** | 3/5 models agree | 3,182 | |
|
|
| **Contested** | No majority (strongest Goodhart Gap) | 1,556 | |
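The tier of a problem follows directly from the size of the largest agreeing group among the five answers. A minimal sketch, assuming answers are compared as trimmed strings (the dataset's own normalization may differ):

```python
from collections import Counter

def consensus_tier(answers):
    """Map 5 model answers to a tier by the size of the largest agreeing group."""
    top = Counter(a.strip() for a in answers).most_common(1)[0][1]
    return {5: "gold", 4: "silver", 3: "bronze"}.get(top, "contested")

consensus_tier(["15", "14", "15", "16", "15"])  # 3/5 agree -> "bronze"
```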
|
|
|
|
|
## Domains |
|
|
|
|
|
### From Consensus Data |
|
|
- Math word problems (GSM8K-style) |
|
|
- Multi-step arithmetic |
|
|
- Rate/ratio problems |
|
|
|
|
|
### Programmatic Domains |
|
|
| Domain | Count | Type | |
|
|
|--------|-------|------| |
|
|
| math_discount | 15 | Numerical | |
|
|
| time | 13 | Numerical | |
|
|
| financial | 10 | Numerical | |
|
|
| logic | 8 | Numerical | |
|
|
| recipe | 7 | Numerical | |
|
|
| scheduling | 7 | Numerical | |
|
|
| units | 7 | Numerical | |
|
|
| spatial | 7 | Non-numerical | |
|
|
| procedural | 6 | Non-numerical | |
|
|
| text | 7 | Non-numerical | |
|
|
| sequence | 7 | Non-numerical | |
|
|
| causal | 7 | Non-numerical | |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Quick Evaluation |
|
|
```bash |
|
|
# Evaluate on combined test set |
|
|
python evaluate.py --provider anthropic --model claude-3-5-haiku-latest --dataset combined_test.jsonl |
|
|
|
|
|
# Evaluate on contested only (hardest) |
|
|
python evaluate.py --provider openai --model gpt-4o --dataset goodhart_contested.jsonl |
|
|
``` |
|
|
|
|
|
### Python API |
|
|
```python |
|
|
import json |
|
|
|
|
|
# Load combined test set |
|
|
with open('data/combined_test.jsonl') as f: |
|
|
problems = [json.loads(line) for line in f] |
|
|
|
|
|
# Analyze consensus examples with model responses |
|
|
for p in problems: |
|
|
if p.get('source') == 'cgrt-consensus-5model': |
|
|
# Has full model reasoning traces |
|
|
for model, data in p['model_responses'].items(): |
|
|
print(f"{model}: {data['answer']}") |
|
|
``` |
|
|
|
|
|
### With HuggingFace Datasets |
|
|
```python |
|
|
from datasets import load_dataset |
|
|
|
|
|
dataset = load_dataset("Adam1010/goodhart-gap-benchmark") |
|
|
``` |
|
|
|
|
|
## Leaderboard |
|
|
|
|
|
| Model | Provider | Pass Rate | Notes | |
|
|
|-------|----------|-----------|-------| |
|
|
| Claude 3.5 Haiku | Anthropic | 93% | Shows work consistently | |
|
|
| Claude Sonnet 4 | Anthropic | 79% | | |
|
|
| gpt-4o | OpenAI | 57% | | |
|
|
| gpt-4o-mini | OpenAI | 36% | | |
|
|
|
|
|
## Why This Matters |
|
|
|
|
|
### For AI Safety |
|
|
- Models that explain correctly but execute incorrectly are harder to detect
|
|
- Gap between capability benchmarks and deployment readiness |
|
|
- Critical for agentic AI systems |
|
|
|
|
|
### For Training |
|
|
- Disagreement cases reveal where models need improvement |
|
|
- Chain-of-thought consistency matters more than raw capability |
|
|
- Smaller models (Haiku) can outperform larger ones through reliable execution |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@dataset{goodhart_gap_benchmark_2026, |
|
|
title={Goodhart Gap Benchmark: Detecting the Gap Between Understanding and Execution in LLMs}, |
|
|
author={Adam Kruger}, |
|
|
year={2026}, |
|
|
url={https://huggingface.co/datasets/Adam1010/goodhart-gap-benchmark}, |
|
|
note={Built on cgrt-consensus-5model dataset} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Related Datasets |
|
|
|
|
|
- [Adam1010/cgrt-consensus-5model](https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model) - Source consensus data |
|
|
|
|
|
## License |
|
|
|
|
|
MIT License - free for research and commercial use. |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- CGRT (Consensus-Guided Recursive Training) research |
|
|
- 5-model consensus data collection (~$1,000 in API calls)
|
|
- Goodhart's Law and its application to AI evaluation |
|
|
|