Adam1010/goodhart-gap-benchmark

English
benchmark
reasoning
multi-step
evaluation
llm-evaluation
goodhart
execution-vs-understanding
consensus
multi-model
goodhart-gap-benchmark / results
114 kB
  • 1 contributor
History: 1 commit
Adam1010
v1.1: Financial domain audit - confirms Goodhart Gap hypothesis
b684ab3 verified 20 days ago
Files (each last changed 20 days ago in the v1.1 commit listed above):
  • claude-3-5-haiku-latest_20260103_182323_results.jsonl (1.31 kB)
  • claude-3-5-haiku-latest_20260103_182323_summary.json (289 Bytes)
  • claude-3-5-haiku-latest_20260103_184241_results.jsonl (33.5 kB)
  • claude-3-5-haiku-latest_20260103_184241_summary.json (1.34 kB)
  • claude-sonnet-4-20250514_20260103_184954_results.jsonl (27.8 kB)
  • claude-sonnet-4-20250514_20260103_184954_summary.json (1.33 kB)
  • gpt-4o-mini_20260103_184617_results.jsonl (22.9 kB)
  • gpt-4o-mini_20260103_184617_summary.json (1.36 kB)
  • gpt-4o_20260103_184426_results.jsonl (23 kB)
  • gpt-4o_20260103_184426_summary.json (1.32 kB)