v2.0: Combined with cgrt-consensus-5model data (8,050 disagreements, 1,556 contested)
Browse files- .gitattributes +3 -0
- README.md +121 -133
- create_combined_dataset.py +209 -0
- data/combined_summary.json +17 -0
- data/combined_test.jsonl +3 -0
- data/goodhart_contested.jsonl +3 -0
- data/goodhart_disagreements.jsonl +3 -0
.gitattributes
CHANGED
|
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
|
|
|
|
|
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
data/combined_test.jsonl filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
data/goodhart_contested.jsonl filter=lfs diff=lfs merge=lfs -text
|
| 38 |
+
data/goodhart_disagreements.jsonl filter=lfs diff=lfs merge=lfs -text
|
README.md
CHANGED
|
@@ -13,8 +13,10 @@ tags:
|
|
| 13 |
- llm-evaluation
|
| 14 |
- goodhart
|
| 15 |
- execution-vs-understanding
|
|
|
|
|
|
|
| 16 |
size_categories:
|
| 17 |
-
- n<
|
| 18 |
---
|
| 19 |
|
| 20 |
# Goodhart Gap Benchmark
|
|
@@ -25,182 +27,174 @@ size_categories:
|
|
| 25 |
|
| 26 |
The Goodhart Gap Benchmark tests whether language models can correctly *execute* multi-step reasoning tasks that they can correctly *explain*. Named after Goodhart's Law ("When a measure becomes a target, it ceases to be a good measure"), this benchmark reveals a critical failure mode: models that understand procedures but fail to execute them.
|
| 27 |
|
| 28 |
-
##
|
| 29 |
-
|
| 30 |
-
In our testing of 15+ models:
|
| 31 |
-
- **gpt-4o**: 57% pass rate (fails on financial, scheduling, units)
|
| 32 |
-
- **gpt-4o-mini**: 36% pass rate
|
| 33 |
-
- **Claude 3.5 Haiku**: 93% pass rate
|
| 34 |
-
- **Llama 3.1 70B**: Fails the canonical discount calculation despite correct explanation
|
| 35 |
-
|
| 36 |
-
## The Canonical Example
|
| 37 |
-
|
| 38 |
-
**Problem**: "If a shirt costs $25 and is on 20% sale, and you have a $5 coupon, what do you pay?"
|
| 39 |
|
| 40 |
-
|
| 41 |
|
| 42 |
-
|
|
|
|
| 43 |
|
| 44 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 45 |
|
| 46 |
-
|
| 47 |
|
|
|
|
| 48 |
| Metric | Value |
|
| 49 |
|--------|-------|
|
| 50 |
| Total problems | 101 |
|
| 51 |
| Domains | 12 |
|
| 52 |
-
|
|
| 53 |
-
| Steps per problem | 2-6 |
|
| 54 |
|
| 55 |
-
|
| 56 |
|
| 57 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 58 |
|
| 59 |
-
|
| 60 |
-
|--------|-------|-------------|
|
| 61 |
-
| math_discount | 15 | Discounts, coupons, taxes, markups |
|
| 62 |
-
| time | 13 | Duration arithmetic, travel times |
|
| 63 |
-
| financial | 10 | Interest, taxes, commissions |
|
| 64 |
-
| logic | 8 | Ordering, deduction, set operations |
|
| 65 |
-
| recipe | 7 | Scaling, unit conversion |
|
| 66 |
-
| scheduling | 7 | Task dependencies, work rates |
|
| 67 |
-
| units | 7 | Unit conversion with operations |
|
| 68 |
-
|
| 69 |
-
**Non-Numerical Domains (34 problems)**
|
| 70 |
|
| 71 |
-
|
| 72 |
-
|--------|-------|-------------|
|
| 73 |
-
| spatial | 7 | Direction tracking, grid navigation, relative positions |
|
| 74 |
-
| procedural | 6 | State machines, undo/redo, procedure following |
|
| 75 |
-
| text | 7 | String manipulation, encoding, word operations |
|
| 76 |
-
| sequence | 7 | Pattern recognition (letters, symbols, words) |
|
| 77 |
-
| causal | 7 | Cause-effect chains, counterfactuals, necessary/sufficient |
|
| 78 |
|
| 79 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 80 |
|
| 81 |
-
|
| 82 |
-
|------------|-------|-------------|
|
| 83 |
-
| Easy | 28 | 2 steps, straightforward |
|
| 84 |
-
| Medium | 32 | 2-3 steps, some complexity |
|
| 85 |
-
| Hard | 7 | 3-4 steps, multiple operations |
|
| 86 |
|
| 87 |
## Data Format
|
| 88 |
|
| 89 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 90 |
|
|
|
|
| 91 |
```json
|
| 92 |
{
|
| 93 |
"id": "math_discount_01",
|
| 94 |
"domain": "math_discount",
|
| 95 |
-
"problem": "A product costs $25 and is on 20% sale
|
| 96 |
"correct_answer": "15",
|
| 97 |
"explanation": "25 × 0.8 = 20.0, then 20.0 - 5 = 15.0",
|
| 98 |
-
"
|
| 99 |
"difficulty": "easy",
|
| 100 |
"steps": 2
|
| 101 |
}
|
| 102 |
```
|
| 103 |
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
|
| 107 |
-
|
| 108 |
-
|
|
| 109 |
-
|
|
| 110 |
-
|
|
| 111 |
-
|
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 116 |
|
| 117 |
## Usage
|
| 118 |
|
| 119 |
### Quick Evaluation
|
| 120 |
-
|
| 121 |
```bash
|
| 122 |
-
#
|
| 123 |
-
|
| 124 |
|
| 125 |
-
# Evaluate
|
| 126 |
-
python evaluate.py --provider openai --model gpt-4o
|
| 127 |
-
|
| 128 |
-
# Evaluate Claude model
|
| 129 |
-
python evaluate.py --provider anthropic --model claude-3-5-haiku-latest -v
|
| 130 |
-
|
| 131 |
-
# Evaluate local Ollama model
|
| 132 |
-
python evaluate.py --provider ollama --model llama3.1:8b -v
|
| 133 |
```
|
| 134 |
|
| 135 |
### Python API
|
| 136 |
-
|
| 137 |
```python
|
| 138 |
import json
|
| 139 |
|
| 140 |
-
# Load
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
# Validate response against expected
|
| 151 |
```
|
| 152 |
|
| 153 |
### With HuggingFace Datasets
|
| 154 |
-
|
| 155 |
```python
|
| 156 |
from datasets import load_dataset
|
| 157 |
|
| 158 |
-
dataset = load_dataset("
|
| 159 |
-
|
| 160 |
-
for example in dataset['test']:
|
| 161 |
-
print(example['problem'])
|
| 162 |
-
print(f"Expected: {example['correct_answer']}")
|
| 163 |
```
|
| 164 |
|
| 165 |
-
## Evaluation Criteria
|
| 166 |
-
|
| 167 |
-
A response is considered correct if:
|
| 168 |
-
1. **Numeric answers**: The expected number appears in the response (with tolerance for rounding)
|
| 169 |
-
2. **Time answers**: The expected time appears in any reasonable format (e.g., "4:45 PM", "4:45pm", "16:45")
|
| 170 |
-
3. **Yes/no answers**: The response clearly indicates yes, no, or "cannot determine"
|
| 171 |
-
4. **Ordering answers**: Items appear in the correct sequence
|
| 172 |
-
|
| 173 |
## Leaderboard
|
| 174 |
|
| 175 |
-
| Model | Provider | Pass Rate |
|
| 176 |
-
|
| 177 |
-
| Claude 3.5 Haiku | Anthropic | 93% |
|
| 178 |
-
| Claude Sonnet 4 | Anthropic | 79% |
|
| 179 |
-
| gpt-4o | OpenAI | 57% |
|
| 180 |
-
| gpt-4o-mini | OpenAI | 36% |
|
| 181 |
-
| Qwen 2.5 72B | Alibaba | TBD | - |
|
| 182 |
-
| Llama 3.1 70B | Meta | TBD | - |
|
| 183 |
-
|
| 184 |
-
*Submit your results via PR to add to the leaderboard*
|
| 185 |
|
| 186 |
## Why This Matters
|
| 187 |
|
| 188 |
### For AI Safety
|
| 189 |
-
Models
|
| 190 |
-
-
|
| 191 |
-
-
|
| 192 |
-
- A gap between capability benchmarks and deployment readiness
|
| 193 |
-
|
| 194 |
-
### For Model Selection
|
| 195 |
-
Not all models are equal for multi-step reasoning:
|
| 196 |
-
- Model family matters more than size
|
| 197 |
-
- Distilled models often lose this capability
|
| 198 |
-
- Test execution, not just explanation
|
| 199 |
|
| 200 |
### For Training
|
| 201 |
-
|
| 202 |
-
-
|
| 203 |
-
-
|
| 204 |
|
| 205 |
## Citation
|
| 206 |
|
|
@@ -209,27 +203,21 @@ The gap appears to be a training problem:
|
|
| 209 |
title={Goodhart Gap Benchmark: Detecting the Gap Between Understanding and Execution in LLMs},
|
| 210 |
author={Adam Kruger},
|
| 211 |
year={2026},
|
| 212 |
-
url={https://huggingface.co/datasets/Adam1010/goodhart-gap-benchmark}
|
|
|
|
| 213 |
}
|
| 214 |
```
|
| 215 |
|
| 216 |
-
##
|
| 217 |
|
| 218 |
-
|
| 219 |
|
| 220 |
-
##
|
| 221 |
-
|
| 222 |
-
We welcome contributions:
|
| 223 |
-
- New test cases in underrepresented domains
|
| 224 |
-
- Results from additional models
|
| 225 |
-
- Improved validators
|
| 226 |
-
- Translations to other languages
|
| 227 |
|
| 228 |
-
|
| 229 |
|
| 230 |
## Acknowledgments
|
| 231 |
|
| 232 |
-
|
|
|
|
| 233 |
- Goodhart's Law and its application to AI evaluation
|
| 234 |
-
- Work on multi-step reasoning in LLMs
|
| 235 |
-
- The distinction between System 1 and System 2 thinking
|
|
|
|
| 13 |
- llm-evaluation
|
| 14 |
- goodhart
|
| 15 |
- execution-vs-understanding
|
| 16 |
+
- consensus
|
| 17 |
+
- multi-model
|
| 18 |
size_categories:
|
| 19 |
+
- 1K<n<10K
|
| 20 |
---
|
| 21 |
|
| 22 |
# Goodhart Gap Benchmark
|
|
|
|
| 27 |
|
| 28 |
The Goodhart Gap Benchmark tests whether language models can correctly *execute* multi-step reasoning tasks that they can correctly *explain*. Named after Goodhart's Law ("When a measure becomes a target, it ceases to be a good measure"), this benchmark reveals a critical failure mode: models that understand procedures but fail to execute them.
|
| 29 |
|
| 30 |
+
## Data Sources
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 31 |
|
| 32 |
+
This benchmark combines two data sources:
|
| 33 |
|
| 34 |
+
### 1. CGRT Consensus Dataset (Primary)
|
| 35 |
+
**Source**: [Adam1010/cgrt-consensus-5model](https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model)
|
| 36 |
|
| 37 |
+
| Metric | Value |
|
| 38 |
+
|--------|-------|
|
| 39 |
+
| Total problems | 61,678 |
|
| 40 |
+
| Models queried | 5 (Claude, GPT-4 — stored under the `codex` key, Gemini, DeepSeek, Qwen) |
|
| 41 |
+
| API cost | ~$1,000 |
|
| 42 |
+
| Disagreement cases | 8,050 |
|
| 43 |
+
| Contested (strongest) | 1,556 |
|
| 44 |
|
| 45 |
+
Each problem includes full reasoning traces from 5 frontier models, enabling analysis of where execution diverges despite similar understanding.
|
| 46 |
|
| 47 |
+
### 2. Programmatic Multi-Domain Problems
|
| 48 |
| Metric | Value |
|
| 49 |
|--------|-------|
|
| 50 |
| Total problems | 101 |
|
| 51 |
| Domains | 12 |
|
| 52 |
+
| Cost | $0 (generated) |
|
|
|
|
| 53 |
|
| 54 |
+
## Dataset Files
|
| 55 |
|
| 56 |
+
| File | Description | Count |
|
| 57 |
+
|------|-------------|-------|
|
| 58 |
+
| `combined_test.jsonl` | Main evaluation set (contested + programmatic) | 1,657 |
|
| 59 |
+
| `goodhart_disagreements.jsonl` | All disagreement cases from consensus | 8,050 |
|
| 60 |
+
| `goodhart_contested.jsonl` | Strongest Goodhart Gap cases | 1,556 |
|
| 61 |
+
| `test.jsonl` | Programmatic problems only | 101 |
|
| 62 |
|
| 63 |
+
## Key Finding
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 64 |
|
| 65 |
+
**Models that consistently show chain-of-thought execute correctly; models that give quick answers fail.**
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 66 |
|
| 67 |
+
| Model | Financial Domain | Behavior |
|
| 68 |
+
|-------|------------------|----------|
|
| 69 |
+
| Claude 3.5 Haiku | 100% | Always shows work |
|
| 70 |
+
| Claude Sonnet 4 | 30% | Sometimes skips work |
|
| 71 |
+
| gpt-4o | 30% | Sometimes skips work |
|
| 72 |
+
| gpt-4o-mini | 0% | Usually skips work |
|
| 73 |
|
| 74 |
+
The financial domain (compound interest + tax) is the strongest Goodhart Gap detector.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 75 |
|
| 76 |
## Data Format
|
| 77 |
|
| 78 |
+
### Consensus-derived examples
|
| 79 |
+
```json
|
| 80 |
+
{
|
| 81 |
+
"id": "consensus_12345",
|
| 82 |
+
"domain": "math_consensus",
|
| 83 |
+
"problem": "A store sells apples for $2 each...",
|
| 84 |
+
"correct_answer": "15",
|
| 85 |
+
"source": "cgrt-consensus-5model",
|
| 86 |
+
"consensus_tier": "contested",
|
| 87 |
+
"model_responses": {
|
| 88 |
+
"claude": {"answer": "15", "response": "Step 1..."},
|
| 89 |
+
"codex": {"answer": "14", "response": "First..."},
|
| 90 |
+
"gemini": {"answer": "15", "response": "Let me..."},
|
| 91 |
+
"deepseek": {"answer": "16", "response": "..."},
|
| 92 |
+
"qwen": {"answer": "15", "response": "..."}
|
| 93 |
+
},
|
| 94 |
+
"difficulty": "hard"
|
| 95 |
+
}
|
| 96 |
+
```
|
| 97 |
|
| 98 |
+
### Programmatic examples
|
| 99 |
```json
|
| 100 |
{
|
| 101 |
"id": "math_discount_01",
|
| 102 |
"domain": "math_discount",
|
| 103 |
+
"problem": "A product costs $25 and is on 20% sale...",
|
| 104 |
"correct_answer": "15",
|
| 105 |
"explanation": "25 × 0.8 = 20.0, then 20.0 - 5 = 15.0",
|
| 106 |
+
"source": "programmatic",
|
| 107 |
"difficulty": "easy",
|
| 108 |
"steps": 2
|
| 109 |
}
|
| 110 |
```
|
| 111 |
|
| 112 |
+
## Consensus Tiers
|
| 113 |
+
|
| 114 |
+
| Tier | Description | Count |
|
| 115 |
+
|------|-------------|-------|
|
| 116 |
+
| **Gold** | All 5 models agree | 51,174 |
|
| 117 |
+
| **Silver** | 4/5 models agree | 5,766 |
|
| 118 |
+
| **Bronze** | 3/5 models agree | 3,182 |
|
| 119 |
+
| **Contested** | No majority (strongest Goodhart Gap) | 1,556 |
|
| 120 |
+
|
| 121 |
+
## Domains
|
| 122 |
+
|
| 123 |
+
### From Consensus Data
|
| 124 |
+
- Math word problems (GSM8K-style)
|
| 125 |
+
- Multi-step arithmetic
|
| 126 |
+
- Rate/ratio problems
|
| 127 |
+
|
| 128 |
+
### Programmatic Domains
|
| 129 |
+
| Domain | Count | Type |
|
| 130 |
+
|--------|-------|------|
|
| 131 |
+
| math_discount | 15 | Numerical |
|
| 132 |
+
| time | 13 | Numerical |
|
| 133 |
+
| financial | 10 | Numerical |
|
| 134 |
+
| logic | 8 | Numerical |
|
| 135 |
+
| recipe | 7 | Numerical |
|
| 136 |
+
| scheduling | 7 | Numerical |
|
| 137 |
+
| units | 7 | Numerical |
|
| 138 |
+
| spatial | 7 | Non-numerical |
|
| 139 |
+
| procedural | 6 | Non-numerical |
|
| 140 |
+
| text | 7 | Non-numerical |
|
| 141 |
+
| sequence | 7 | Non-numerical |
|
| 142 |
+
| causal | 7 | Non-numerical |
|
| 143 |
|
| 144 |
## Usage
|
| 145 |
|
| 146 |
### Quick Evaluation
|
|
|
|
| 147 |
```bash
|
| 148 |
+
# Evaluate on combined test set
|
| 149 |
+
python evaluate.py --provider anthropic --model claude-3-5-haiku-latest --dataset combined_test.jsonl
|
| 150 |
|
| 151 |
+
# Evaluate on contested only (hardest)
|
| 152 |
+
python evaluate.py --provider openai --model gpt-4o --dataset goodhart_contested.jsonl
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 153 |
```
|
| 154 |
|
| 155 |
### Python API
|
|
|
|
| 156 |
```python
|
| 157 |
import json
|
| 158 |
|
| 159 |
+
# Load combined test set
|
| 160 |
+
with open('data/combined_test.jsonl') as f:
|
| 161 |
+
problems = [json.loads(line) for line in f]
|
| 162 |
+
|
| 163 |
+
# Analyze consensus examples with model responses
|
| 164 |
+
for p in problems:
|
| 165 |
+
if p.get('source') == 'cgrt-consensus-5model':
|
| 166 |
+
# Has full model reasoning traces
|
| 167 |
+
for model, data in p['model_responses'].items():
|
| 168 |
+
print(f"{model}: {data['answer']}")
|
|
|
|
| 169 |
```
|
| 170 |
|
| 171 |
### With HuggingFace Datasets
|
|
|
|
| 172 |
```python
|
| 173 |
from datasets import load_dataset
|
| 174 |
|
| 175 |
+
dataset = load_dataset("Adam1010/goodhart-gap-benchmark")
|
|
|
|
|
|
|
|
|
|
|
|
|
| 176 |
```
|
| 177 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 178 |
## Leaderboard
|
| 179 |
|
| 180 |
+
| Model | Provider | Pass Rate | Notes |
|
| 181 |
+
|-------|----------|-----------|-------|
|
| 182 |
+
| Claude 3.5 Haiku | Anthropic | 93% | Shows work consistently |
|
| 183 |
+
| Claude Sonnet 4 | Anthropic | 79% | |
|
| 184 |
+
| gpt-4o | OpenAI | 57% | |
|
| 185 |
+
| gpt-4o-mini | OpenAI | 36% | |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 186 |
|
| 187 |
## Why This Matters
|
| 188 |
|
| 189 |
### For AI Safety
|
| 190 |
+
- Models explaining correctly but executing incorrectly are harder to detect
|
| 191 |
+
- Gap between capability benchmarks and deployment readiness
|
| 192 |
+
- Critical for agentic AI systems
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 193 |
|
| 194 |
### For Training
|
| 195 |
+
- Disagreement cases reveal where models need improvement
|
| 196 |
+
- Chain-of-thought consistency matters more than raw capability
|
| 197 |
+
- Smaller models (Haiku) can outperform larger ones through reliable execution
|
| 198 |
|
| 199 |
## Citation
|
| 200 |
|
|
|
|
| 203 |
title={Goodhart Gap Benchmark: Detecting the Gap Between Understanding and Execution in LLMs},
|
| 204 |
author={Adam Kruger},
|
| 205 |
year={2026},
|
| 206 |
+
url={https://huggingface.co/datasets/Adam1010/goodhart-gap-benchmark},
|
| 207 |
+
note={Built on cgrt-consensus-5model dataset}
|
| 208 |
}
|
| 209 |
```
|
| 210 |
|
| 211 |
+
## Related Datasets
|
| 212 |
|
| 213 |
+
- [Adam1010/cgrt-consensus-5model](https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model) - Source consensus data
|
| 214 |
|
| 215 |
+
## License
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 216 |
|
| 217 |
+
MIT License - free for research and commercial use.
|
| 218 |
|
| 219 |
## Acknowledgments
|
| 220 |
|
| 221 |
+
- CGRT (Consensus-Guided Recursive Training) research
|
| 222 |
+
- 5-model consensus data collection (~$1000 in API calls)
|
| 223 |
- Goodhart's Law and its application to AI evaluation
|
|
|
|
|
|
create_combined_dataset.py
ADDED
|
@@ -0,0 +1,209 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Create Combined Goodhart Gap Benchmark
|
| 4 |
+
|
| 5 |
+
Combines:
|
| 6 |
+
1. cgrt-consensus-5model data (61,678 problems, ~$1000 in API calls)
|
| 7 |
+
2. Programmatic multi-domain problems (101 problems)
|
| 8 |
+
|
| 9 |
+
Focus: Disagreement cases where models show the "Goodhart Gap" -
|
| 10 |
+
understanding procedures but failing execution.
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
import json
|
| 14 |
+
from pathlib import Path
|
| 15 |
+
from collections import defaultdict
|
| 16 |
+
|
| 17 |
+
# Paths
# NOTE(review): CONSENSUS_DATA is an absolute, machine-specific path, so this
# script only runs on the original author's machine — consider promoting it to
# a CLI argument or environment variable.
CONSENSUS_DATA = Path("/home/adam/Mojo/Research/experiments/tau-bench/cgrt/data/consensus_cli_labels_enriched.jsonl")
# Programmatic problems live in the repo, relative to the working directory.
PROGRAMMATIC_DATA = Path("data/test.jsonl")
# All generated artifacts (disagreements, contested, combined, summary) go here.
OUTPUT_DIR = Path("data")
|
| 21 |
+
|
| 22 |
+
def load_consensus_data(path=None):
    """Load the cgrt-consensus-5model dataset from a JSONL file.

    Args:
        path: Optional override for the source file; defaults to the
            module-level ``CONSENSUS_DATA`` path.

    Returns:
        list[dict]: One dict per JSONL record, in file order.
    """
    # Only touch the module-level default when no override is given, so the
    # function stays usable (and testable) with an explicit path.
    source = CONSENSUS_DATA if path is None else Path(path)
    with open(source, encoding="utf-8") as f:
        # Skip blank lines (e.g. a trailing newline) so json.loads never
        # receives an empty string.
        return [json.loads(line) for line in f if line.strip()]
|
| 29 |
+
|
| 30 |
+
def load_programmatic_data(path=None):
    """Load our programmatic test problems from a JSONL file.

    Args:
        path: Optional override for the source file; defaults to the
            module-level ``PROGRAMMATIC_DATA`` path.

    Returns:
        list[dict]: One dict per JSONL record, in file order.
    """
    # Only touch the module-level default when no override is given, so the
    # function stays usable (and testable) with an explicit path.
    source = PROGRAMMATIC_DATA if path is None else Path(path)
    with open(source, encoding="utf-8") as f:
        # Skip blank lines so a trailing newline cannot crash json.loads.
        return [json.loads(line) for line in f if line.strip()]
|
| 37 |
+
|
| 38 |
+
def extract_goodhart_examples(consensus_data):
    """
    Extract examples that demonstrate the Goodhart Gap:
    - Models that show work but get wrong answers
    - Disagreement between models despite similar reasoning
    - Contested problems where execution differs
    """
    models = ('claude', 'codex', 'gemini', 'deepseek', 'qwen')
    examples = []

    for record in consensus_data:
        # Only disagreement cases are interesting; records missing the flag
        # are treated as agreeing and skipped.
        if record.get('all_agree', True):
            continue

        # Per-model answer/response pairs, kept only when both fields exist.
        responses = {
            m: {'answer': record[f'{m}_answer'], 'response': record[f'{m}_response']}
            for m in models
            if f'{m}_answer' in record and f'{m}_response' in record
        }

        # Require at least 3 models so disagreement is meaningful.
        if len(responses) < 3:
            continue

        # Distinct non-empty answers characterize the disagreement.
        distinct_answers = {r['answer'] for r in responses.values() if r['answer']}

        examples.append({
            'id': f"consensus_{record['idx']}",
            'source': 'cgrt-consensus-5model',
            'question': record['question'],
            'majority_answer': record.get('majority_answer', ''),
            'agreement_score': record.get('agreement_score', 0),
            'consensus_tier': record.get('consensus_tier', 'unknown'),
            'num_unique_answers': len(distinct_answers),
            'model_responses': responses,
            'outlier_models': record.get('outlier_models', []),
            'difficulty_signal': record.get('difficulty_signal', 0),
            'goodhart_type': classify_goodhart_type(responses, record),
        })

    return examples
|
| 87 |
+
|
| 88 |
+
def classify_goodhart_type(answers, item):
    """Classify the type of Goodhart Gap exhibited.

    Args:
        answers: Mapping of model name -> {'answer': str, 'response': str}.
        item: Raw consensus record; only its 'consensus_tier' field is read.

    Returns:
        str: One of 'agreement', 'execution_divergence', 'partial_agreement',
        'minor_disagreement', or 'calculation_error'.
    """
    # Fix: the original built a `responses` list here that was never used;
    # only the set of distinct non-empty answers matters.
    answer_set = {a['answer'] for a in answers.values() if a['answer']}

    # All models converged — shouldn't happen in the disagreement set, but
    # guard against it rather than mislabel.
    if len(answer_set) == 1:
        return 'agreement'

    # Map consensus tier to a gap category; 'contested' (no majority) is the
    # strongest execution-divergence signal.
    tier = item.get('consensus_tier', '')
    if tier == 'contested':
        return 'execution_divergence'
    elif tier == 'bronze':
        return 'partial_agreement'
    elif tier == 'silver':
        return 'minor_disagreement'
    else:
        return 'calculation_error'
|
| 108 |
+
|
| 109 |
+
def create_combined_dataset():
    """Create the combined benchmark dataset.

    Pipeline: load the consensus and programmatic source files, extract
    disagreement cases, then write four artifacts into OUTPUT_DIR:
      1. goodhart_disagreements.jsonl - every disagreement case
      2. goodhart_contested.jsonl     - contested-tier cases only
      3. combined_test.jsonl          - contested + programmatic eval set
      4. combined_summary.json        - summary statistics

    Returns:
        dict: the summary statistics written to combined_summary.json.
    """

    print("Loading consensus data...")
    consensus_data = load_consensus_data()
    print(f" Loaded {len(consensus_data)} consensus problems")

    print("\nLoading programmatic data...")
    programmatic_data = load_programmatic_data()
    print(f" Loaded {len(programmatic_data)} programmatic problems")

    print("\nExtracting Goodhart Gap examples...")
    goodhart_examples = extract_goodhart_examples(consensus_data)
    print(f" Found {len(goodhart_examples)} disagreement cases")

    # Categorize by tier (tier value originates in the consensus records;
    # extract_goodhart_examples defaults missing tiers to 'unknown').
    by_tier = defaultdict(list)
    for ex in goodhart_examples:
        by_tier[ex['consensus_tier']].append(ex)

    print("\n By tier:")
    for tier, examples in sorted(by_tier.items()):
        print(f" {tier}: {len(examples)}")

    # Create output datasets
    OUTPUT_DIR.mkdir(exist_ok=True)

    # 1. Full disagreement dataset
    print("\nWriting full disagreement dataset...")
    with open(OUTPUT_DIR / "goodhart_disagreements.jsonl", 'w') as f:
        for ex in goodhart_examples:
            f.write(json.dumps(ex) + '\n')
    print(f" Wrote {len(goodhart_examples)} examples to goodhart_disagreements.jsonl")

    # 2. Contested subset (strongest Goodhart Gap cases)
    contested = by_tier.get('contested', [])
    print(f"\nWriting contested subset ({len(contested)} examples)...")
    with open(OUTPUT_DIR / "goodhart_contested.jsonl", 'w') as f:
        for ex in contested:
            f.write(json.dumps(ex) + '\n')

    # 3. Combined test set (contested + programmatic)
    print("\nCreating combined test set...")
    combined = []

    # Add contested examples (reformatted for evaluation)
    # NOTE(review): 'correct_answer' is the 5-model majority answer, which may
    # be '' for contested items with no majority — confirm downstream
    # evaluators handle an empty expected answer.
    for ex in contested:
        combined.append({
            'id': ex['id'],
            'domain': 'math_consensus',
            'problem': ex['question'],
            'correct_answer': ex['majority_answer'],
            'source': 'cgrt-consensus-5model',
            'consensus_tier': ex['consensus_tier'],
            'model_responses': ex['model_responses'],
            'difficulty': 'hard',
            'steps': 3  # Estimate
        })

    # Add programmatic examples
    # NOTE(review): mutates the loaded records in place; harmless here because
    # programmatic_data is not reused afterwards, but worth knowing.
    for ex in programmatic_data:
        ex['source'] = 'programmatic'
        combined.append(ex)

    with open(OUTPUT_DIR / "combined_test.jsonl", 'w') as f:
        for ex in combined:
            f.write(json.dumps(ex) + '\n')
    print(f" Wrote {len(combined)} examples to combined_test.jsonl")

    # 4. Summary statistics
    summary = {
        'total_consensus_problems': len(consensus_data),
        'total_disagreements': len(goodhart_examples),
        'contested_count': len(contested),
        'programmatic_count': len(programmatic_data),
        'combined_test_count': len(combined),
        'by_tier': {k: len(v) for k, v in by_tier.items()},
        'sources': {
            'cgrt-consensus-5model': 'https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model',
            'programmatic': 'Python-generated multi-domain problems'
        },
        'cost_estimate': '$1000+ in API calls for consensus data'
    }

    with open(OUTPUT_DIR / "combined_summary.json", 'w') as f:
        json.dump(summary, f, indent=2)

    print("\n" + "="*50)
    print("COMBINED DATASET SUMMARY")
    print("="*50)
    print(f"Consensus source problems: {len(consensus_data)}")
    print(f"Disagreement cases: {len(goodhart_examples)}")
    print(f"Contested (strongest): {len(contested)}")
    print(f"Programmatic problems: {len(programmatic_data)}")
    print(f"Combined test set: {len(combined)}")
    print("="*50)

    return summary
|
| 207 |
+
|
| 208 |
+
# Script entry point: regenerate all combined-dataset artifacts in OUTPUT_DIR.
if __name__ == "__main__":
    create_combined_dataset()
|
data/combined_summary.json
ADDED
|
@@ -0,0 +1,17 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"total_consensus_problems": 61678,
|
| 3 |
+
"total_disagreements": 8050,
|
| 4 |
+
"contested_count": 1556,
|
| 5 |
+
"programmatic_count": 101,
|
| 6 |
+
"combined_test_count": 1657,
|
| 7 |
+
"by_tier": {
|
| 8 |
+
"silver": 3345,
|
| 9 |
+
"bronze": 3149,
|
| 10 |
+
"contested": 1556
|
| 11 |
+
},
|
| 12 |
+
"sources": {
|
| 13 |
+
"cgrt-consensus-5model": "https://huggingface.co/datasets/Adam1010/cgrt-consensus-5model",
|
| 14 |
+
"programmatic": "Python-generated multi-domain problems"
|
| 15 |
+
},
|
| 16 |
+
"cost_estimate": "$1000+ in API calls for consensus data"
|
| 17 |
+
}
|
data/combined_test.jsonl
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:83f932367f73f47956f5b5a90814f26d7e59f19cc6f508f9f5fb4fd75c7f962f
|
| 3 |
+
size 16890961
|
data/goodhart_contested.jsonl
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f76d0ea04bb9a2eb4242edaa90bb49c37f7880d0cc91375b0fc57fdaca8b4dab
|
| 3 |
+
size 17000548
|
data/goodhart_disagreements.jsonl
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:342c69e26ac79b0e6274b084b2b9977d5b0a61aa93d339c14e9a38222b94315b
|
| 3 |
+
size 57061950
|