Adam1010/goodhart-gap-benchmark
Tags: English, benchmark, reasoning, multi-step, evaluation, llm-evaluation, goodhart, execution-vs-understanding, consensus, multi-model
License: mit
Branch: main
Path: goodhart-gap-benchmark/results
Size: 114 kB
1 contributor
History: 1 commit
Latest commit: b684ab3 (verified), by Adam1010, 20 days ago
v1.1: Financial domain audit - confirms Goodhart Gap hypothesis
File                                                    Size
claude-3-5-haiku-latest_20260103_182323_results.jsonl   1.31 kB
claude-3-5-haiku-latest_20260103_182323_summary.json    289 Bytes
claude-3-5-haiku-latest_20260103_184241_results.jsonl   33.5 kB
claude-3-5-haiku-latest_20260103_184241_summary.json    1.34 kB
claude-sonnet-4-20250514_20260103_184954_results.jsonl  27.8 kB
claude-sonnet-4-20250514_20260103_184954_summary.json   1.33 kB
gpt-4o-mini_20260103_184617_results.jsonl               22.9 kB
gpt-4o-mini_20260103_184617_summary.json                1.36 kB
gpt-4o_20260103_184426_results.jsonl                    23 kB
gpt-4o_20260103_184426_summary.json                     1.32 kB

All files were last changed 20 days ago in the commit "v1.1: Financial domain audit - confirms Goodhart Gap hypothesis".