rvienne/layton-eval
Note: Dataset containing the layton-eval riddles.
Note: Dataset containing everything needed to compute the PPI-based benchmark score.
Note: Final benchmark results for several frontier models.